Audio files are 16-bit Microsoft .WAV at (mostly) 16 kHz sampling.
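As a quick way to confirm that a recording matches the format described above, the following minimal sketch reads the header of a WAV file with Python's standard `wave` module. The function names `describe_wav` and `make_silence` are hypothetical helpers introduced for illustration only; they are not part of Festival.

```python
import io
import wave


def describe_wav(source):
    """Return (bits_per_sample, sample_rate_hz, channels) for a WAV file.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getsampwidth() * 8, wav.getframerate(), wav.getnchannels()


def make_silence(seconds=0.1, rate=16000):
    """Build an in-memory 16-bit mono WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    bits, rate, channels = describe_wav(make_silence())
    print(bits, rate, channels)
```

Passing a file path instead of the in-memory buffer checks a real recording in the same way.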
Festival is designed to allow new voices and languages to be added easily and consistently. In many cases no new C++ code is required. The document Building Voices in Festival includes instructions and scripts to aid the building of new voices. The site http://www.festvox.org/ hosts the document, scripts and example databases.
One major part is building a diphone database (if that route is taken). This involves collecting all phone-phone transitions in the language. For example, in (US) English this is done by recording 1348 nonsense words. We synthesize prompts (if we have an existing synthesizer in the target language), which are spoken by our target speaker. After recording, these are autoaligned (using cross-language aligning if necessary) and a voice is automatically built. Although autoaligning is usually good, it sometimes fails, requiring some amount of hand correction. For this voice, recording took one morning, aligning about an hour, and hand correction two hours. This technique has also been used for building synthesis in other languages, including Greek, Polish, Basque, Spanish, various English dialects, Swedish and German.
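To make the idea of "all phone-phone transitions" concrete, here is a minimal sketch that enumerates the diphones for a toy phone set. The three-phone inventory, the silence symbol `pau` and the `left-right` naming scheme are assumptions made for this example. A real inventory is larger and usually drops transitions that cannot occur in the language, which is why the count of nonsense words is not simply the square of the phone set size.

```python
from itertools import product

SILENCE = "pau"  # assumed name for the silence unit


def diphone_inventory(phones):
    """List every ordered phone-phone pair, including transitions
    into and out of silence, but not silence-to-silence."""
    units = [SILENCE] + list(phones)
    return [
        (left, right)
        for left, right in product(units, repeat=2)
        if (left, right) != (SILENCE, SILENCE)
    ]


def diphone_name(left, right):
    """Name a diphone as 'left-right' (a hypothetical naming scheme)."""
    return f"{left}-{right}"


if __name__ == "__main__":
    toy_phones = ["t", "aa", "s"]  # illustrative phone set, not a real one
    inventory = diphone_inventory(toy_phones)
    print(len(inventory))
    print([diphone_name(left, right) for left, right in inventory][:5])
```

Each pair in the list would then be embedded in a nonsense word for the speaker to record.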
Initially, simple rule models can be written for phrasing, duration and intonation. These can be improved on with suitable data by building statistical models. Of course, high quality, natural, controllable prosody is still a research issue, but simple forms should be possible in most languages.
Depending on time and data availability we can build more complex models.
The task of producing a pronunciation given a word varies in complexity from language to language. In Spanish the task is mostly trivial, but in Japanese, kanji characters often have several readings, and choosing between them may require quite high-level linguistic and pragmatic information.
In many languages, although there is a historical relationship between the alphabetic written form and the pronunciation, that relationship isn't so obvious. English and German are good examples. To synthesize these languages a lexicon (a list of words and pronunciations) is required. But any list of words, no matter how big, will not contain all the words that appear in text, so a method for pronouncing out-of-vocabulary words is necessary. Although such rule systems can be written by hand, it is a slow and skilled process. We have developed a fully trainable method for producing letter-to-sound models from lexicons. We have successfully used it for various dialects of English, French and German. See Black, Lenzo and Pagel 1998 (html or postscript) for technical details, and the Festival manual for instructions.
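The lookup-then-fallback pattern described above can be sketched as follows. This is only an illustration: the tiny lexicon, the phone symbols and the one-letter-to-phones table are invented for the example, and the table is a crude stand-in for a letter-to-sound model, not the trained method from the cited paper.

```python
# Toy lexicon: word -> list of phones (entries are illustrative only).
LEXICON = {
    "cat": ["k", "ae", "t"],
    "the": ["dh", "ax"],
}

# Hypothetical stand-in for a letter-to-sound model: each letter maps
# to a fixed phone sequence, ignoring context entirely.
LETTER_RULES = {
    "a": ["ae"],
    "b": ["b"],
    "c": ["k"],
    "t": ["t"],
}


def pronounce(word, lexicon=LEXICON, letter_rules=LETTER_RULES):
    """Return the lexicon entry if there is one, otherwise fall back to
    letter-by-letter rules for out-of-vocabulary words."""
    word = word.lower()
    if word in lexicon:
        return list(lexicon[word])
    phones = []
    for letter in word:
        # Letters with no rule are skipped in this sketch.
        phones.extend(letter_rules.get(letter, []))
    return phones


if __name__ == "__main__":
    print(pronounce("cat"))  # found in the lexicon
    print(pronounce("tab"))  # out of vocabulary, uses the fallback rules
```

A context-free table like this fails quickly on English or German, which is the motivation for learning the letter-to-sound mapping from a lexicon instead of writing it by hand.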
Text isn't as easy to say as one might think: numbers, symbols and abbreviations are common and need to be expanded into words if they are to be spoken. In English, for example, numbers are pronounced differently depending on their type. The digits 1998 are pronounced as one thousand (and) nine hundred and ninety-eight if it is a quantity; nineteen ninety-eight if a year; and one nine nine eight if a phone number or part number. Statistical methods can be trained to choose between these. In other languages number pronunciation may be affected by the gender, case, tense etc. of the item being counted, even though no textual indication is given.
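The three readings of 1998 given above can be reproduced with a short sketch. The function names and the `kind` labels are hypothetical, and the type of the token is passed in by hand here; deciding the type from context is the hard part that a trained classifier would handle. The year reader treats a four-digit year as two pairs, so years such as 2000 would need extra cases.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]


def two_digits(n):
    """Words for an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def as_quantity(n):
    """Words for an integer in the range 0..9999, read as a quantity."""
    if n < 100:
        return two_digits(n)
    thousands, rest = divmod(n, 1000)
    hundreds, tail = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if tail:
        parts.append("and " + two_digits(tail))
    return " ".join(parts)


def as_year(n):
    """Read a four-digit year as two pairs (not suitable for 2000-2009)."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def as_digits(token):
    """Read each digit separately, as in a phone or part number."""
    return " ".join(ONES[int(ch)] for ch in token)


def expand(token, kind):
    """Expand a digit string given its type: 'quantity', 'year' or 'digits'."""
    if kind == "quantity":
        return as_quantity(int(token))
    if kind == "year":
        return as_year(int(token))
    if kind == "digits":
        return as_digits(token)
    raise ValueError(f"unknown kind: {kind}")


if __name__ == "__main__":
    for kind in ("quantity", "year", "digits"):
        print(kind, "->", expand("1998", kind))
```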
Festival offers a flexible rule-based and trainable system for building text analysis front ends. A much more detailed discussion of text analysis for synthesis (and language processing in general) was the subject of a project at the Johns Hopkins Summer Workshop 99 and is detailed here.