From Tables 1-2 and Tables 4-6, we observe that the notion of equality or no preference (denoted "=" in Table 7) occupies the first position. This indicates that listeners perceived the speech produced by the different synthesizers as either equally good or equally bad. However, when we look at the choice of unit size, the results in Tables 1-3 indicate that speech synthesized with syllable-sized units is preferred over speech synthesized with the other unit sizes. The results in Table 4 indicate that the diphone performs better than the phone, while the results in Tables 5-6 indicate that the half phone performs better than both the phone and the diphone. Table 7 summarizes the results of the AB-test as percentages (number of times a unit is favored / total utterances * 100).
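The percentage used in Table 7 can be illustrated with a minimal sketch. The function name `preference_percentages` and the example judgments are hypothetical and are not taken from the listening test; only the formula (times favored / total utterances * 100) comes from the text.

```python
from collections import Counter


def preference_percentages(judgments):
    """Return the percentage of utterances on which each label was chosen.

    `judgments` holds one label per judged utterance: either the name of
    the favored unit or "=" for no preference (labels are illustrative).
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one judgment is required")
    # number of times a unit is favored / total utterances * 100
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up judgments for one syllable-vs-diphone comparison (not real data).
example = ["syllable", "=", "=", "diphone", "syllable", "=", "=", "syllable"]
result = preference_percentages(example)
for label, pct in sorted(result.items()):
    print(f"{label}: {pct:.1f}%")
```

With this definition the percentages for one comparison always sum to 100, so the "=" share is directly comparable with the shares of the two competing units.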
It should be noted that the syllables as well as the diphones in the test sentences were covered by the speech database, though this will not be true in general. However, the prompt-list used for building the speech database was derived from a text corpus covering a wide range of subjects, including literature, dialog, novels, philosophy and short stories, while the 24 sentences used for testing were taken from a news bulletin describing global events in the middle of March 2003. The context from which the test sentences were derived was thus unrelated to the prompt-list used to generate the speech database.
Larger units such as syllables might assimilate prosodic and acoustic information better and have fewer discontinuities in the synthesized speech, resulting in better performance than the other units. Units such as diphones have performed better than phones because they preserve the phone-to-phone transitions. However, the differences are small because, with optimal coupling, the joins are moved inside the preceding units even in the case of phones.
Smaller units such as half phones involve a larger number of joins, which could lead to the impression that they produce more discontinuous speech. Nevertheless, the results in Tables 5-6 indicate that the half phone synthesizers perform better than the diphone and phone synthesizers. To join two consecutive units we use optimal coupling [9]. The better performance of half phones could be attributed to their vast coverage, which increases the chance of finding an optimal sub-segment with the required acoustic features.
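The idea behind optimal coupling can be sketched as a search for the pair of frames, one near the end of the left unit and one near the start of the right unit, at which the two units are acoustically closest. The sketch below is only an illustration under stated assumptions: the function name `optimal_coupling`, the per-frame feature vectors, the Euclidean distance and the search window are all hypothetical choices, not details of the method in [9] or of the synthesizers evaluated here.

```python
import math


def optimal_coupling(left_frames, right_frames, window=3):
    """Find the join point with the smallest frame-to-frame distance.

    Assumptions (illustrative only): each unit is a list of per-frame
    feature vectors, the distance is Euclidean, and the search covers the
    last `window` frames of the left unit and the first `window` frames
    of the right unit.

    Returns (i, j, cost): keep the left unit up to frame i and start the
    right unit at frame j.
    """
    if not left_frames or not right_frames:
        raise ValueError("both units need at least one frame")
    left_start = max(0, len(left_frames) - window)
    right_end = min(len(right_frames), window)
    best = None
    for i in range(left_start, len(left_frames)):
        for j in range(right_end):
            cost = math.dist(left_frames[i], right_frames[j])
            if best is None or cost < best[2]:
                best = (i, j, cost)
    return best


# Toy two-dimensional features (made-up values, not real speech data).
left = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
right = [[2.0, 2.1], [3.0, 3.0], [4.0, 4.0]]
i, j, cost = optimal_coupling(left, right)
print(f"cut left unit at frame {i}, start right unit at frame {j}")
```

In this toy case the best join lies one frame before the nominal end of the left unit, which is the sense in which the join is moved inside the unit rather than fixed at its boundary.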
A larger unit such as the syllable seems to be an appropriate choice for syllabic languages such as Hindi, and seems to be a better representation for the Indian language scripts. But the larger the unit, the lower the coverage, which has to be dealt with. Given an arbitrary text, we found that the syllable coverage of this Hindi database was around 84% and the diphone coverage was 88%. With a more careful selection of the prompt-list we believe that it is possible to cover most of the frequently occurring syllables in Hindi, but some back-off method is still required.
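A coverage figure of this kind can be computed as the share of units in a text that are present in the database inventory. The sketch below assumes that coverage is counted per occurrence (token) rather than per distinct unit, which the text does not specify; the function name `unit_coverage` and the romanized toy units are hypothetical.

```python
def unit_coverage(text_units, inventory):
    """Return (coverage percentage, list of uncovered units).

    Assumption (illustrative only): coverage is counted over every
    occurrence of a unit in the text, not over distinct unit types.
    """
    units = list(text_units)
    if not units:
        raise ValueError("the text must contain at least one unit")
    missing = [u for u in units if u not in inventory]
    covered = len(units) - len(missing)
    return 100.0 * covered / len(units), missing


# Toy inventory and text (made-up units, not drawn from the Hindi database).
inventory = {"ka", "ma", "la", "ra"}
text = ["ka", "ma", "la", "shri", "ra"]
percent, missing = unit_coverage(text, inventory)
print(f"coverage: {percent:.1f}%, uncovered units: {missing}")
```

The uncovered units returned by such a check are the cases that a back-off method would have to handle.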