For limited domain voices, designing what to record is easier, even without an automated process. To a first approximation, the objective is to include in the database at least one occurrence of each word in the domain in each desired prosodic context. Hand design can often be adequate for such domains.
Our initial attempt to do this automatically for new domains was to take a large sample of the intended output and greedily select sentences that maximized word coverage. To some extent this works, but it obviously does not take the phonetic coverage of the database into account, and so word joins may be poor. Instead, we can take each sentence, generate the diphones required to synthesize it, and then greedily select the sentences with the best diphone coverage. We also investigated selecting for demisyllables rather than diphones to see what sort of coverage was necessary.
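The selection step is essentially a greedy set cover. The following is a minimal sketch of that step, assuming the phone sequences have already been produced by a front end such as Festival's text analysis; the function names and data layout are ours, purely for illustration.

```python
from typing import Dict, List, Set

def diphones(phones: List[str]) -> Set[str]:
    """All adjacent phone pairs in a phone sequence, e.g. p-a, a-t."""
    return {f"{a}-{b}" for a, b in zip(phones, phones[1:])}

def greedy_select(utterances: Dict[str, List[str]]) -> List[str]:
    """Greedily pick utterances until no utterance adds new diphone types.

    `utterances` maps an utterance id to its phone sequence (assumed to
    come from the synthesizer's own front end, so the selected prompts
    match what the voice will actually be asked to say).
    """
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(utterances)
    while remaining:
        # Pick the utterance contributing the most uncovered diphones.
        best = max(remaining, key=lambda u: len(diphones(remaining[u]) - covered))
        gain = diphones(remaining[best]) - covered
        if not gain:
            break  # everything the text can offer is already covered
        covered |= gain
        selected.append(best)
        del remaining[best]
    return selected
```

The same loop covers words or demisyllables instead of diphones by swapping the unit extraction function; the greedy criterion is unchanged.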
As a test, we took the text of Lewis Carroll's ``Alice's Adventures in Wonderland'' [5] to see how many sentences are needed under these various criteria. The Gutenberg version of Alice (alice29.txt) contains 26,457 words, from which Festival finds 1,918 utterances. The following table shows the number of utterances needed to satisfy each criterion, as selected by a simple greedy algorithm.
There are, of course, many ways to describe and define demisyllables. Here we take the onset to be the initial consonantal gestures (if any) together with an initial portion of the vocalic nucleus, and the coda to be the remainder of the vocalic portion together with the final consonantal gestures (if present). Syllable affixes were not treated distinctly from the coda. Thus, the units can be written as onset cluster - vowel and vowel - coda cluster, respectively (a sketch of this split follows the table below). The demisyllable inventory and feature set are based on [8], with some simplifications.
Optimize for     utts    % total
Words             979    51%
Diphones          196    10.2%
Demisyllables     312    16.2%
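To make the demisyllable definition above concrete, here is an illustrative split of a syllable's phones into its two halves. The vowel set and phone representation are assumptions for the example only; the actual inventory follows [8].

```python
from typing import List, Tuple

VOWELS = {"aa", "ae", "ah", "ao", "eh", "ih", "iy", "uw"}  # illustrative subset

def demisyllables(syllable: List[str]) -> Tuple[str, str]:
    """Split a syllable's phones into onset+vowel and vowel+coda halves.

    The vowel appears in both halves, mirroring the split of the vocalic
    nucleus described above; any affixes are folded into the coda half.
    Assumes the syllable contains exactly one vowel from VOWELS.
    """
    nucleus_idx = next(i for i, p in enumerate(syllable) if p in VOWELS)
    vowel = syllable[nucleus_idx]
    onset_half = "-".join(syllable[:nucleus_idx] + [vowel])     # onset cluster - vowel
    coda_half = "-".join([vowel] + syllable[nucleus_idx + 1:])  # vowel - coda cluster
    return onset_half, coda_half

# e.g. demisyllables(["s", "t", "r", "iy", "t"]) -> ("s-t-r-iy", "iy-t")
```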
We can easily add other factors to the units we are optimizing for, be it lexical stress, pitch and/or metrical accents, position in phrase, etc. As pointed out in [13], covering all possible features in all contexts is prohibitive: each added feature multiplies the amount of data required to systematically cover the space.
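A rough calculation shows why this growth is prohibitive. The numbers below are assumptions for illustration, not the actual inventory used in this work.

```python
# Illustrative only: each feature multiplies the number of unit types
# that each need at least one (ideally several) recorded occurrences.
diphone_types = 1500  # order of magnitude for English diphones
features = {
    "lexical stress":  2,   # stressed / unstressed
    "pitch accent":    2,   # accented / unaccented
    "phrase position": 3,   # initial / medial / final
}

space = diphone_types
for name, n_values in features.items():
    space *= n_values
    print(f"after adding {name}: {space:,} unit types to cover")
# 1,500 -> 3,000 -> 6,000 -> 18,000 types from just three binary/ternary
# features; realistic feature sets grow the space far faster than any
# reasonable amount of recorded speech can fill it.
```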
Another way to find the right data to record is to define the space and then explicitly create a comprehensive database by design. Simple diphone databases are the prototypical example of this: we define the phone set, determine which diphones can appear in the language, and then carefully design a database to ensure it has one example of each of the defined token types (e.g. [11]). This direction seems feasible for smaller inventories, but as the combination of features grows we have to make more and more decisions about pruning the space, as collecting everything would be a monumental task - let alone the post-processing steps necessary.
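The enumeration step of such a design can be sketched as below. The phone set is a toy assumption; a real design starts from the language's full inventory and filters out phonotactically impossible pairs, and the hard, hand-crafted part is designing the carrier words the units are recorded in (e.g. [11]).

```python
from itertools import product

# Toy phone set for illustration only.
PHONES = ["p", "t", "k", "a", "i", "u"]

def diphone_wishlist(phones):
    """Every ordered phone pair is a token type the database must cover."""
    return [f"{a}-{b}" for a, b in product(phones, repeat=2)]

wishlist = diphone_wishlist(PHONES)
print(len(wishlist), "diphone types to design prompts for")  # 36 here

# Each type is then embedded in a carrier word or phrase for recording;
# with realistic phone sets and added features, this wishlist quickly
# outgrows what can be exhaustively designed and recorded.
```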
In these two methodologies - define features and greedily select from data, and define features and expertly design the data - two distinct aspects are missing.