We use a simple, effective technique based on [6] to segment the recorded prompts: dynamic time warping (DTW) aligns the pre-generated, labelled, synthesized diphone prompts to the new recordings. This is the same technique we use to align data for limited domain synthesis. It has worked quite well for initial passes, even when the source and target languages differ: American English has been used to generate first-pass alignments for both Croatian and Nepali.
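The alignment step can be illustrated with a minimal sketch. It is not the implementation used here: it assumes each utterance has already been converted to a sequence of frame-level feature vectors (for example cepstral coefficients) at a fixed frame shift, uses plain Euclidean frame distance, and applies no path constraints. The function names `dtw_path` and `map_boundaries` and the 5 ms frame shift are hypothetical choices made for illustration.

```python
import numpy as np

def dtw_path(ref, new):
    """Return the minimum-cost DTW path as (ref_frame, new_frame) pairs.

    ref, new: arrays of shape (frames, dims), the synthesized prompt
    and the new recording respectively (an assumed representation).
    """
    n, m = len(ref), len(new)
    # Euclidean distance between every reference frame and every new frame.
    dist = np.linalg.norm(ref[:, None, :] - new[None, :, :], axis=2)
    # Accumulated cost with a padded border so the recursion needs no special cases.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Trace back from the end of both sequences to the start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    return path[::-1]

def map_boundaries(path, ref_boundaries, frame_shift=0.005):
    """Carry label boundaries (seconds) from the reference onto the new recording."""
    first_match = {}
    for r, q in path:
        # Keep the first new-recording frame aligned with each reference frame.
        first_match.setdefault(r, q)
    last_ref = max(first_match)
    mapped = []
    for t in ref_boundaries:
        r = min(int(round(t / frame_shift)), last_ref)
        mapped.append(first_match[r] * frame_shift)
    return mapped
```

Because the phone labels of the synthesized prompt are known exactly, each boundary only needs to be carried across the warping path; no recognition is involved.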
To confirm the accuracy of this technique, we ran it on some earlier databases that had been hand labelled.
voices | type | RMSE | stddev |
KED-KED | self | 14.77ms | 17.08 |
MWM-KED | US-US | 27.23ms | 28.95 |
GSW-KED | UK-US | 25.25ms | 23.92 |
KED-WHY | US-Kor | 28.34ms | 27.52 |
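An RMSE figure of this kind can be computed along the following lines. This is a sketch under stated assumptions: the automatic and hand-placed boundaries are paired one-to-one and expressed in milliseconds, and the function name `boundary_rmse` is hypothetical. The text does not say how the stddev column was derived, so it is not reproduced here.

```python
import math

def boundary_rmse(auto_ms, hand_ms):
    """Root mean squared difference between paired boundary times (ms)."""
    if len(auto_ms) != len(hand_ms) or not auto_ms:
        raise ValueError("need two non-empty boundary lists of equal length")
    # Squared difference for each automatic/hand-labelled boundary pair.
    squared = [(a - h) ** 2 for a, h in zip(auto_ms, hand_ms)]
    return math.sqrt(sum(squared) / len(squared))
```

For example, `boundary_rmse([10, 20, 30], [10, 24, 27])` averages the squared differences 0, 16 and 9 before taking the square root.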
KED is a US English voice collected at the University of Edinburgh; MWM was collected at the Oregon Graduate Institute. A KED voice built directly from this fully automatic labelling was certainly understandable, though not as good as the hand-labelled version. GSW is a British English voice, yet we used it to label US English (with reasonable mappings for phone names). Again, the results were reasonable. The last example used KED, the US English voice, to align against WHY, a Korean voice. Although these are different languages, the results were perfectly usable.
There is a phonetic mismatch between English and Korean: English uses aspiration of stops in free variation as allophones, whereas Korean has a phonemic distinction between aspirated and unaspirated stops. Even so, most of the labelling is correct. A small number of labels are completely wrong, having aligned to a lip smack or to some other noise or artifact. This success should not be surprising, because the labelling task is very constrained: we know exactly which phones are present, so alignment should be trivial. In fact, if it is not trivial, there is likely a problem with the recording.
We do not pretend that this is perfect, but the alignments are very good as a first pass. We usually still hand check all the labelling and move boundaries as required to get the best performance, but that level of fine tuning was also needed in the days when we relied solely on hand labelling. In those days, the initial hand labelling was always somewhat rough and prone to error. This technique has allowed us to remove that stage, producing initial results in minutes rather than days.