Since one major part of this project is to investigate the best output voice quality, we want significant control over the synthetic voice output. In our initial design, we opted for the simplest solution that would give us a working system: our very first version simply used a diphone synthesizer. Diphone quality, however, is adequate only for highly motivated listeners, and is particularly unsuitable for our target groups, the elderly and non-native speakers with limited abilities in English.
Once we had a basic system running with a relatively stable language generation component, we built a limited-domain synthesizer using the techniques described in [9]. That is, we built a specific synthetic voice explicitly designed for the type of output we required. To do this, we programmatically constructed all the phrases and templates that the language generation system could output. We then filled in the bus stop names, bus numbers, times, etc., generating a list of around 12,500 sentences. We synthesized these to phoneme strings and greedily selected the utterances with the best diphone coverage, which produced a list of 202 utterances. We then removed these from the full list and greedily selected a second set. Repeating this process three times generated a diphone-rich prompt set (for our domain) of 600 prompts. These were recorded, and a voice was automatically built using the FestVox [10] build process, including phonetic labeling with a Sphinx acoustic model trained on this data.
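The greedy selection step described above can be sketched as follows. This is a minimal illustration, not the actual build script: the function names, the utterance representation (phoneme sequences keyed by id), and the tie-breaking behavior are our assumptions. Each round picks the utterance covering the most still-uncovered diphones until coverage is exhausted, then repeats on the remaining pool.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs (diphones) in an utterance."""
    return {(phonemes[i], phonemes[i + 1]) for i in range(len(phonemes) - 1)}

def greedy_select(utterances, n_rounds=3):
    """Greedily pick utterances maximizing new-diphone coverage.

    `utterances` maps an utterance id to its phoneme sequence (hypothetical
    representation). The selection is repeated n_rounds times on the
    remaining pool, mirroring the three passes described in the text.
    """
    pool = dict(utterances)
    prompts = []
    for _ in range(n_rounds):
        # diphones still uncovered in this round, over the remaining pool
        uncovered = set().union(*(diphones(p) for p in pool.values()))
        while uncovered and pool:
            # pick the utterance covering the most still-uncovered diphones
            best = max(pool, key=lambda u: len(diphones(pool[u]) & uncovered))
            gain = diphones(pool[best]) & uncovered
            if not gain:
                break
            uncovered -= gain
            prompts.append(best)
            del pool[best]
    return prompts
```

Run on the full sentence list, each round yields a coverage-rich subset of a size comparable to the 202 utterances mentioned above; three rounds give the final prompt set.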
The result is a high-quality voice that works well for the sentences generated by our system. It does not, however, handle all bus stop names at present (or at least not consistently well); we are currently working with a subset of the 15,000 names. A better name-specific selection technique would potentially offer more consistent coverage.