Once the recordings have been made and labelled, building the diphone synthesizer itself is mostly automatic. The first stage is to extract the pitchmarks from the EGG signal, if an electroglottograph was used at recording time. To do this, we use the pitchmark program that is part of the Edinburgh Speech Tools, on which Festival is built. For diphone databases for which no EGG signal is provided, we can extract the pitchmarks from the waveform files directly, but the result is typically not as good as that obtained from the EGG signal. For all voiced sections of the speech, we position the pitchmarks at the peak of each pitch period. For unvoiced sections, we introduce ``fake'' pitchmarks evenly spaced through those sections. As our signal processing techniques for pitch and duration modification depend on pitch-synchronous analysis, getting the pitchmarks right is very important to the final quality of the synthesizer.
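The filling of unvoiced sections can be illustrated with a minimal Python sketch. It is not the pitchmark program from the Edinburgh Speech Tools; the function name fill_unvoiced and the parameters max_gap and fill_period (with their default values) are assumptions made for illustration. It treats any gap between consecutive pitchmarks that is longer than max_gap as unvoiced and spaces ``fake'' marks evenly across it.

```python
def fill_unvoiced(pitchmarks, end_time, max_gap=0.02, fill_period=0.01):
    """Return pitchmarks (in seconds) with evenly spaced fake marks added.

    Hypothetical helper: any gap longer than max_gap is assumed to be
    unvoiced and is filled with marks roughly fill_period apart.
    """
    filled = []
    prev = 0.0
    real_marks = sorted(pitchmarks)
    # Treat the end of the file as a final boundary, but not as a pitchmark.
    for t in real_marks + [end_time]:
        gap = t - prev
        if gap > max_gap:
            # Number of intervals the gap is divided into (at least one).
            n = max(1, int(round(gap / fill_period)))
            step = gap / n
            # Interior points only; the endpoints are real marks or boundaries.
            filled.extend(prev + k * step for k in range(1, n))
        if t != end_time or t in real_marks:
            filled.append(t)
        prev = t
    return filled


marks = fill_unvoiced([0.10, 0.108, 0.116], end_time=0.15)
```

Dividing each gap into a whole number of equal intervals keeps the fake marks evenly spaced, at the cost of a spacing that only approximates fill_period.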
Although we try hard to ensure that the audio quality remains constant throughout the whole recording, it is unusual for the whole set to be recorded perfectly in a single sitting, and we have found that slight differences in power occur between sittings, due to the position of the microphone as well as the speaker delivering with different vocal effort. To combat this, we include a simple power normalization phase. As different phones have different inherent power, we cannot simply normalize everything to the same level; instead, we calculate the mean RMS power over all vowels in each nonsense word, then find the mean over all the files, and calculate a modification factor for each word that is used in the normalization.
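The normalization procedure can be sketched as follows. This is a minimal illustration, not the code used to build the voices: the function names, the representation of vowel labels as (start, end) sample index pairs, and the choice of target divided by per-word value as the modification factor are all assumptions.

```python
import math


def rms(samples):
    """Root mean square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def vowel_rms(samples, vowel_spans):
    """Mean RMS over the vowel segments of one nonsense word.

    vowel_spans is a list of (start, end) sample indices (assumed format).
    """
    values = [rms(samples[a:b]) for a, b in vowel_spans]
    return sum(values) / len(values)


def normalization_factors(words):
    """Map each word name to a gain factor.

    words maps a name to (samples, vowel_spans). The target level is the
    mean of the per-word vowel RMS values over all files; the factor is
    assumed to be target / per-word value.
    """
    per_word = {name: vowel_rms(s, spans) for name, (s, spans) in words.items()}
    target = sum(per_word.values()) / len(per_word)
    return {name: target / level for name, level in per_word.items()}


def apply_factor(samples, factor):
    """Scale every sample of a word by its gain factor."""
    return [s * factor for s in samples]


words = {
    "quiet": ([1.0, -1.0] * 50, [(10, 40)]),
    "loud": ([3.0, -3.0] * 50, [(20, 60)]),
}
factors = normalization_factors(words)
```

Because only the vowels are measured, each word is scaled as a whole and the relative power of its different phones is left unchanged.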
After power normalization, we extract LPC parameters pitch synchronously.
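As an illustration of pitch-synchronous LPC extraction, the sketch below analyses one frame per pitchmark, with each frame running from the previous mark to the next. The text does not specify the analysis method; the autocorrelation method with the Levinson-Durbin recursion, the Hann window, and all function names here are assumptions chosen for the example.

```python
import math


def autocorrelation(frame, order):
    """Autocorrelation of a frame for lags 0..order."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion.

    Returns (a, err) where a[0] == 1 and x[n] is predicted as
    -sum(a[j] * x[n - j]) for j = 1..order.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def pitch_synchronous_lpc(samples, marks, order=16):
    """Return a list of (mark, coefficients), one entry per pitchmark.

    marks are sample indices; each frame spans the two pitch periods
    around its mark and is Hann windowed before analysis.
    """
    bounds = [0] + list(marks) + [len(samples)]
    result = []
    for i in range(1, len(bounds) - 1):
        frame = samples[bounds[i - 1]:bounds[i + 1]]
        n = len(frame)
        if n <= order:  # too short to analyse at this order
            continue
        windowed = [frame[t] * (0.5 - 0.5 * math.cos(2 * math.pi * t / (n - 1)))
                    for t in range(n)]
        a, _ = levinson(autocorrelation(windowed, order), order)
        result.append((bounds[i], a))
    return result
```

Centring each frame on a pitchmark means the coefficients track the pitch periods themselves rather than a fixed frame rate, which is what allows pitch and duration to be modified period by period later.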