The FestVox [1] document is a growing, publicly available resource that contains tools, data, and text about building complete synthetic voices in English and other languages. It covers everything from building text analyzers, lexicons, and prosodic models to various waveform synthesis techniques, including general unit selection and diphones. Within this framework, we have collected a considerable amount of data and refined the process. While building good, characteristic voices cannot be reduced to a simple recipe, we hope this will be a starting point for those interested in speech science and technology, and will provide a common basis for the comparison of various diphone synthesis techniques over the same data sets.
Although diphone-based synthesis is only one of a number of techniques documented in FestVox, we believe it is, at present, the most reliable and resource-effective method for building new voices for general text-to-speech synthesizers. A diphone here is two connected half-phones, where a "phone" may in fact be any segment, including a traditional phoneme, allophone, or consonant cluster. We carefully construct examples of each phone-phone transition in our phoneset so as to capture all the implied sequential articulatory transitions, even those that may not be phonotactically valid (like [ZH-NG]).
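As a rough illustration of what "each phone-phone transition" implies for the size of the inventory, the sketch below enumerates every ordered pair of phones from a small phoneset. The four-phone toy phoneset and the function name diphone_inventory are assumptions made for illustration only; they are not the phoneset or tools used for the KAL recordings.

```python
from itertools import product


def diphone_inventory(phones):
    """Return every ordered phone-phone pair as a 'left-right' label.

    No phonotactic filtering is applied, so pairs that never occur in
    natural speech (such as zh-ng) are kept alongside common ones.
    """
    # product(..., repeat=2) yields all ordered pairs, including a
    # phone followed by itself.
    return [f"{left}-{right}" for left, right in product(phones, repeat=2)]


# Hypothetical toy phoneset, chosen only to keep the output small.
toy_phones = ["zh", "ng", "aa", "t"]
inventory = diphone_inventory(toy_phones)

print(len(inventory))        # N phones give N * N diphones
print("zh-ng" in inventory)  # the phonotactically invalid pair is included
```

Because the inventory grows with the square of the phoneset size, even a modest phoneset leads to a large number of prompts to record.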
To fully exercise our techniques, we are collecting ten sets of diphones, at 32 kHz, with a simultaneous electroglottograph (EGG) signal, from a single speaker of American English (KAL) at varying speaking rates. These databases are being released publicly with an open license, so that anyone who wishes can replicate our findings, study the voice, teach about synthesis, or build their own by comparison. We, and others, have also used these techniques on other voices and other languages. Four sets have been recorded so far, and the bulk of this paper relates to our experience with what has been recorded and the tools that have been created to help, including a recording session management tool called pointyclicky.