This algorithm has a number of advantages over other selection-based synthesis techniques. First, the cluster method based on acoustic distances avoids the problem of estimating weights in a feature-based target distance measure as described in [7], but still allows unit clusters to be sensitive to general prosodic and phonetic distinctions. It also neatly finesses the problem of variability in sparseness of units: the tree building algorithm only splits a cluster when there are a significant number of units and enough identifiable variation to make the split worthwhile. The second advantage over [7] is that no target cost measurement need be done at synthesis time, as the tree has effectively pre-calculated the ``target cost'' (in this case simply the distance from the cluster center). This makes for more efficient synthesis, as many distance measurements no longer need to be done.
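The splitting condition mentioned above can be illustrated with a minimal sketch. It is not the implementation used here: the function names, the use of mean pairwise Euclidean distance as the impurity, and the thresholds `min_size` and `min_gain` are all assumptions made for illustration.

```python
from itertools import combinations


def impurity(vectors):
    """Mean pairwise Euclidean distance between members (0.0 for fewer than 2)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    total = sum(
        sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5 for u, v in pairs
    )
    return total / len(pairs)


def worthwhile_split(units, question, min_size=2, min_gain=0.1):
    """Decide whether splitting a cluster on `question` is worthwhile.

    units:    list of (features_dict, acoustic_vector) pairs (hypothetical layout).
    question: predicate over a features_dict, e.g. "is the unit stressed?".
    min_size, min_gain: illustrative thresholds, not values from the paper.
    """
    yes = [vec for feats, vec in units if question(feats)]
    no = [vec for feats, vec in units if not question(feats)]
    # Refuse the split when either side would be too sparse.
    if len(yes) < min_size or len(no) < min_size:
        return False
    parent = impurity([vec for _, vec in units])
    # Size-weighted impurity of the two child clusters.
    children = (len(yes) * impurity(yes) + len(no) * impurity(no)) / len(units)
    # Refuse the split when it does not reduce impurity enough.
    return parent - children >= min_gain


units = [
    ({"stressed": True}, (0.0, 0.0)),
    ({"stressed": True}, (0.0, 1.0)),
    ({"stressed": False}, (10.0, 10.0)),
    ({"stressed": False}, (10.0, 11.0)),
]
print(worthwhile_split(units, lambda f: f["stressed"]))
```

Under these assumptions, a rarely observed unit type is left in one coarse cluster, while a frequent type with clear acoustic variation is split further, which is how the tree adapts to sparseness.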
Although this method removes the need to estimate the target feature weights, which in [7] are used in estimating acoustic distance, there are still many other places in the model where parameters need to be estimated, particularly the acoustic cost and the continuity cost. Any frame-based distance measure will not easily capture ``discontinuity errors'' perceived as bad joins between units. This probably makes it difficult to find automatic training methods to measure the quality of the synthesis produced.
Donovan and Woodland [5] use a similar clustering method, but the method described here differs in that instead of a single example being chosen from the cluster, all the members are used so that continuity costs may take part in the criteria for selection of the best units.
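To illustrate why keeping all cluster members matters, the following is a minimal sketch of a Viterbi-style search over candidate units. It assumes each candidate carries a pre-computed target cost (its distance from the cluster center) and that a join cost function is supplied; the names `select_units` and `toy_join_cost`, the data layout, and the toy costs are hypothetical and not taken from the implementation.

```python
def select_units(candidates, join_cost):
    """Pick one unit per target segment, minimising target plus join costs.

    candidates: non-empty list with one entry per target segment; each entry is
                a non-empty list of (unit_id, target_cost) pairs, i.e. all
                members of the cluster chosen for that segment.
    join_cost:  function (prev_unit_id, next_unit_id) -> float.
    Returns (total_cost, [unit_id, ...]).
    """
    # best[i][j] = (cost of best path ending at candidate j of segment i, backpointer)
    best = [[(tc, None) for _, tc in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for uid, tc in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev_id, uid), k)
                for k, (prev_id, _) in enumerate(candidates[i - 1])
            )
            row.append((cost + tc, back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j][0])
        j = best[i][j][1]
    path.reverse()
    return total, path


def toy_join_cost(prev_id, next_id):
    # Assumption for the example: a2 and b2 were adjacent in the database,
    # so joining them is free; every other join costs 1.0.
    return 0.0 if (prev_id, next_id) == ("a2", "b2") else 1.0


candidates = [
    [("a1", 0.1), ("a2", 0.5)],  # a1 is closest to its cluster center
    [("b1", 0.2), ("b2", 0.3)],
]
total, path = select_units(candidates, toy_join_cost)
print(path)
```

In this toy case the search prefers `a2` followed by `b2`, even though `a1` and `b1` lie closer to their cluster centers, because the cheap join outweighs the slightly higher target costs. Choosing a single representative per cluster would rule out such a trade-off.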
In [5], HMMs are used instead of a direct frame-based measure for acoustic distance. The advantage in using an HMM is that different states can be used for different parts of the unit. Our model is equivalent to a single state HMM and so may not capture transient information in the unit. We intend to investigate the use of HMMs as representations of units as this should lead to a better unit distance score.
Other selection algorithms use clustering, though not always in the way presented here. As stated, the cluster method presented here is most similar to [5]. Sagisaka et al. [9] also cluster units, but only using phonetic information; they combine units to form longer, ``non-uniform'' units based on the distribution found in the database. Campbell and Black [3] also use similar phonetic-based clustering and further cluster the units based on prosodic features, but still resort to a weighted feature target distance for ultimate selection.
It is difficult to give realistic comparisons of the quality of this method with that of others. Unit selection techniques are renowned for both their extremely high quality examples and their extremely low quality ones, and minimising the bad examples is a major priority. This technique does not yet remove all low quality examples, but does try to minimise them. Most examples lie in the middle of the quality spectrum, with mostly good selection but a few noticeable errors which detract from the overall acceptability of the utterance. The best examples, however, are nearly indistinguishable from natural utterances.
This cluster method is fully implemented as a waveform synthesis component using the Festival Speech Synthesis System [1].