To cluster the units, we first define an acoustic measure of the distance between two units of the same phone type. Expanding on [7], we use an acoustic vector which comprises Mel frequency cepstrum coefficients, F0, power, and the deltas of the cepstrum, F0 and power. The acoustic distance between two units is simply the average distance for the vectors of all the frames in the units plus X% of the frames in the previous units, which helps ensure that close units will have similar preceding contexts. More formally, we use a weighted Mahalanobis distance metric to define the acoustic distance Adist(U,V) between two units U and V of the same phoneme class as

$$
\mathrm{Adist}(U,V) =
\begin{cases}
\mathrm{Adist}(V,U) & \text{if } |V| > |U| \\[6pt]
\dfrac{WD \cdot |U|}{|V|} \displaystyle\sum_{i=1}^{|U|} \sum_{j=1}^{n} \dfrac{W_j \, \left| F_{ij}(U) - F_{(i|V|/|U|)j}(V) \right|}{SD_j \cdot n \cdot |U|} & \text{otherwise}
\end{cases}
$$

where $|U|$ is the number of frames in U, $F_{xy}(U)$ is parameter y of frame x of unit U, $SD_j$ is the standard deviation of parameter j, $W_j$ is the weight for parameter j, and $n$ is the number of parameters in the acoustic vector. This measure gives the mean weighted distance between units, with the shorter unit linearly interpolated to the longer unit. $WD$ is the duration penalty, which weights the difference between the two units' lengths.
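As an illustration, the distance described above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `adist` and its argument names are hypothetical, the per-parameter difference is taken as an absolute value, the shorter unit is aligned by picking the nearest lower frame index rather than by true interpolation, and the duration penalty is applied as the multiplicative factor WD·|U|/|V|. The extra frames taken from the preceding unit are not handled here.

```python
import numpy as np

def adist(u, v, sd, w, wd=1.0):
    """Acoustic distance between two units.

    u, v : arrays of shape (frames, params), one acoustic vector per frame
    sd   : per-parameter standard deviations, shape (params,)
    w    : per-parameter weights, shape (params,)
    wd   : duration penalty (hypothetical default of 1.0)
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sd = np.asarray(sd, dtype=float)
    w = np.asarray(w, dtype=float)

    # Make u the longer unit so the measure is symmetric in its arguments.
    if len(v) > len(u):
        u, v = v, u
    n_u, n_v = len(u), len(v)
    n_params = u.shape[1]

    # Assumption: map each frame of the longer unit onto a frame of the
    # shorter unit by index scaling (0-based), not true interpolation.
    idx = (np.arange(n_u) * n_v) // n_u

    # Weighted, variance-normalised absolute difference per frame and parameter.
    weighted = np.abs(u - v[idx]) * (w / sd)

    # Mean over frames and parameters, scaled by the duration penalty.
    mean_dist = weighted.sum() / (n_params * n_u)
    return (wd * n_u / n_v) * mean_dist
```

With a single parameter, unit weights and unit standard deviations, comparing a two-frame unit whose values are all 1.0 against a one-frame unit with value 0.0 gives a mean distance of 1.0, which the length ratio of 2 then doubles.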
This acoustic measure is used to define the impurity of a cluster of units as the mean acoustic distance between all members. The objective is to split clusters based on questions to produce a better classification of the units. A CART method [2] is used to build a decision tree whose questions best minimise the impurity of the sub-clusters at that point in the tree. A standard greedy algorithm is used for building the tree. This technique may not be globally optimal, but a full global search would be prohibitively expensive computationally. A minimum cluster size is specified (typically between 10 and 20 units).
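The greedy procedure can be sketched roughly as follows. This is an illustrative simplification rather than the CART implementation used in the experiments: the names `impurity`, `best_split` and `build_tree` are hypothetical, the distances are assumed to be precomputed in a matrix `dist`, questions are assumed to be binary predicates over a per-unit feature dictionary, and sub-cluster impurities are combined with a size-weighted mean, which is one plausible choice among several.

```python
from itertools import combinations

def impurity(members, dist):
    """Mean pairwise acoustic distance between all members of a cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return sum(dist[a][b] for a, b in pairs) / len(pairs)

def best_split(members, features, questions, dist, min_size):
    """Return (score, question name, yes, no) for the best question, or None."""
    best = None
    for name, predicate in questions:
        yes = [m for m in members if predicate(features[m])]
        no = [m for m in members if not predicate(features[m])]
        # Respect the minimum cluster size on both sides of the split.
        if len(yes) < min_size or len(no) < min_size:
            continue
        # Assumption: size-weighted mean impurity of the two sub-clusters.
        score = (len(yes) * impurity(yes, dist)
                 + len(no) * impurity(no, dist)) / len(members)
        if best is None or score < best[0]:
            best = (score, name, yes, no)
    return best

def build_tree(members, features, questions, dist, min_size=10):
    """Greedy top-down building: take the locally best question at each node."""
    split = best_split(members, features, questions, dist, min_size)
    # Stop when no admissible question reduces the impurity.
    if split is None or split[0] >= impurity(members, dist):
        return {"leaf": members}
    _, name, yes, no = split
    return {
        "question": name,
        "yes": build_tree(yes, features, questions, dist, min_size),
        "no": build_tree(no, features, questions, dist, min_size),
    }
```

Because each node commits to the locally best question and never revisits it, the cost stays manageable, at the price of possibly missing a better tree that a global search would find.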
Although the available questions are the same for each phone type, the tree building algorithm will select only the questions that are significant in partitioning that particular type. The features used for CART questions include only those features that are available for target phones during synthesis. In our experiments these were: previous and following phonetic context (both phonetic identity and phonetic features), prosodic context (pitch and duration, including that of previous and next units), stress, position in syllable, and position in phrase. Additional features were originally included, such as delta F0 between a phone and its preceding phone, but they did not appear as significant and were removed. Different features are significant for different phones; for example, lexical stress is used only in the phones schwa, i, a and n, while a feature representing pitch is only rarely used in unvoiced consonants.
The CART building algorithm implicitly deals with sparseness of units in that it will only split a cluster if there are sufficient examples and significant difference to warrant it.