As we wish the result to be unencumbered by copyright restrictions, we started with a set of 19 out-of-copyright novels from Project Gutenberg [9], a total of 1.2 million words. If synthesized, the audio would take just under four days to play.
As a starting cluster tree, we used the CMU Communicator domain voice. KAL, the voice used for that system (built from recordings of an author of this paper), speaks with a standard US English accent. We used a version of the limited domain corpus consisting of around 900 utterances and built a standard unit selection (clunits) voice from it. That voice performs well on Communicator-like utterances, but is not particularly good at general synthesis.
We then synthesized the complete 1.2-million-word corpus, counting the usage of each cluster in the clunits cluster tree. As we did not need to generate the actual audio, we stopped once the required unit types were selected, skipping the final selection and waveform synthesis; even so, this pass takes several hours.
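For illustration, this counting step amounts to the following sketch (in Python), where unit_types_of is a hypothetical stand-in for the synthesis pass that performs text analysis and unit-type selection but stops before waveform generation:

\begin{verbatim}
from collections import Counter

def count_cluster_usage(corpus_sentences, unit_types_of):
    # Tally how often each leaf of the clunits cluster tree would be
    # used when synthesizing the whole corpus. unit_types_of is a
    # placeholder: it returns the cluster (unit-type) labels that the
    # synthesizer would select for the given sentence.
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(unit_types_of(sentence))
    return counts
\end{verbatim}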
Because we wish to use the result as prompts in recording, and hence want the utterances to be relatively easy to say, we then pruned the complete text database to give more reasonable sentences from which to select. All candidate sentences were between five and twenty words in length.
Initial tests showed that the scoring technique does not cater properly for repeated words: because items are added to the tree only when the best utterance is selected, not during scoring, repeated words are doubly scored. We therefore added the restriction that candidate sentences contain no repeated words, which is not an unreasonable condition anyway.
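Together, the pruning criteria reduce to a simple predicate over candidate sentences, sketched below; the regular-expression tokenization is an assumption for illustration (in practice the synthesis front end defines the word tokens):

\begin{verbatim}
import re

def is_candidate(sentence):
    # Keep prompts that are easy to read aloud: between five and
    # twenty words, with no repeated words (checked case-insensitively).
    words = re.findall(r"[a-z']+", sentence.lower())
    if not 5 <= len(words) <= 20:
        return False
    return len(words) == len(set(words))
\end{verbatim}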
The result of these restrictions is a total of 34,796 utterances from which we select candidates. The candidate search is also computationally expensive; although we do not do full unit selection synthesis, we still perform a significant part of that process. Also, because the database is large, we cannot store all the synthesized utterance structures and must recalculate them on each pass. Thus, the initial pass to find the best utterance takes over an hour on our current setup. The selection algorithm runs repeatedly over the data, on each cycle selecting the utterance that contributes the most units to the cluster tree and ignoring utterances that add nothing. As the process proceeds, the cycles get faster.
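The selection itself is a greedy, set-cover-style loop, sketched below under the same assumptions as above; units_of again stands in for the partial unit selection pass, which, as noted, must be recomputed on each cycle because the utterance structures are too large to cache:

\begin{verbatim}
def greedy_select(candidates, units_of):
    # Greedy coverage: on each cycle, score every remaining candidate
    # by how many not-yet-covered unit types it would add, select the
    # best, and drop candidates that add nothing. Coverage only grows,
    # so dropped candidates can never become useful again, and each
    # cycle runs over a smaller pool.
    covered, selected, pool = set(), [], list(candidates)
    while pool:
        scored = [(len(units_of(s) - covered), s) for s in pool]
        best_gain, best = max(scored)
        if best_gain == 0:   # no remaining utterance contributes
            break
        selected.append(best)
        covered |= units_of(best)
        pool = [s for g, s in scored if g > 0 and s is not best]
    return selected
\end{verbatim}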
In our test, the first, i.e. most contributive, utterance is

``Allow me to interpret this interesting silence.''

which is from Austen's ``Emma.'' This sentence contains six different vowels and nine different consonants. We were also amused to find ``Humpty Dumpty sat on a wall.'' included in the set of selected utterances, from ``Through the Looking Glass.''
After several days of computing, we ended up with a list of 241 sentences. On inspection, some of these sentences were very unusual, and some were hard to say, so we removed them, even though we are aware this affects the distribution of units we want in the database. To some extent, unusual sentences are expected: as our selection process tries to maximize coverage, misspellings and unusual text (like letters spelled out) will have a higher score, for example ``Deed you A I N T'' from ``Huckleberry Finn.'' After hand-checking the complete list and removing difficult-to-say or unnatural utterances, we were left with a total of 221 utterances.
This number is smaller than we expected, though it is perhaps not surprising, given that this effectively selects one example of each cluster. In order to get more examples, we ran the selection again, excluding the sentences from the first selection list, and so generated a second list of 146 utterances. This process could be repeated a number of times.
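This repetition can be sketched as follows, under the same assumptions as the earlier code: each pass restarts coverage from scratch but excludes the sentences chosen in earlier passes.

\begin{verbatim}
def iterative_selection(candidates, units_of, n_lists=2):
    # Generate successive prompt lists; n_lists=2 corresponds to the
    # first and second lists described above.
    remaining, lists = list(candidates), []
    for _ in range(n_lists):
        chosen = greedy_select(remaining, units_of)
        lists.append(chosen)
        picked = set(chosen)
        remaining = [s for s in remaining if s not in picked]
    return lists
\end{verbatim}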