Ananlada Chotimongkol and Alan W Black
Language Technologies Institute
Carnegie Mellon University
ananlada@.cs.cmu.edu, awb@cs.cmu.edu
ABSTRACT
Many languages have a non-obvious, but not unrelated, relationship between orthography and pronunciation. Traditional methods for automatic conversion from letters to phones involve hand-crafted letter-to-sound rules, but these require care and expertise to develop. This paper presents a letter-to-sound rule system for Thai, that is trained automatically from lexicons. A statistical model, decision trees, is used to predict phones from letters. Letters mappping to multi-phones are used to solve the problem of implicit vowels and final consonants propagation and pre- and post-processing techniques are used to handle the inversion of initial consonants and vowels. For tone prediction, hand-crafted rules are used instead since there is no ambiguation if the phonological composition is known. Combining the n-gram of phone model with the decision trees, we can achieve 68.76% word accuracy which is better than 65.15% word accuracy in the rule-based approach.
Finding the pronunciation of a word is important in
both text-to-speech and automatic speech recognition systems. Although
lexicons are the most reliable method for many languages, they will never
be complete due to neologisms, proper names, morphological variants etc.
Thus a fallback position is required that predicts the pronunciation from
the orthographic form.
For Thai, like many other languages, this problem is not trivial due to a weak relationship between letters and phones. Some letters produce different sounds in different context. For example "h" can be pronounced as /h/ in "hat" or can become silent in "hour". In Thai, "±" can be pronounced as /th/ in "Á³â±" or /d/ as in "Á³±»". The pronunciation of the letter in Thai is based on its phonological composition in the syllable (initial consonant, vowel and final consonant). Traditional letter-to-sound modules usually consist of two stages. First, the input word is divided into a sequence of syllables by matching the string of letter against hand-crafted rules of all possible syllabic structures. After that, each phonological composition is mapped to phone, and the rules for derived tone are then applied. This technique is used in [5] and [6] for Thai speech synthesis, and also in [2] and [4] for a similar problem romanization and soundex system respectively.
The problem of the rule based approach is in the first step, syllabification. There are many cases where more than one rule can be applied to the input string, due to the implicit vowels and final consonant propagation. In order to disambiguate between these rules, extensive linguistic knowledge must be used. However, the complicated rules may not be practical in the real system and also consume a lot of time and prone to error. The conventional approaches usually use context independent rules and resolve the ambiguity by always choosing the longest rule that match or the most frequently used one. This might cause the system to make the mistake in the case that the context is needed. Another technique is to generate all possible syllable sequences and then use the probabilistic model to choose the most probable one [2]. However, the number of all possible syllable sequences is very large for some words. Therefore, a searching technique must be used to reduce the search space.
Statistical modeling has been successfully applied to the problem of letter-to-sound rules in many languages such as English, French and German [1]. This approach can eliminate the time consuming step of rule writing, so the training process can be done almost fully automatically. Based on the techniques in [1], we use decision trees to predict phones based on letters and their context. However, some augmentations need to be done when we apply it for Thai. In those other languages, letters map to epsilon, a phone or occasionally two phones. For example, in English letters map to epsilon in consonant clusters and map to two phones for some letter such as "x", which often be pronounced as the phone combination of /k/ and /s/ as in "box". In Thai we found that letters may map to many more phones for implicit vowels and final consonants propagation. Another problem in aligning letters to phones is inversion of initial consonants and vowels, which occurs in syllables with leading vowels. To handle tones, hand-written rules are used instead of the decision trees. If phonological composition and a tone marker are known, applying the rules for predicting tone is straightforward. The decision tree makes a prediction based on the letter context alone. So we take into account the phone context by using the n-gram of phone model on the possible phone groups generated by decision trees.
|
|
44 consonants (42 currently in use) | 21 initial consonants |
17 consonant clusters | |
8 final consonants | |
19 vowels (in term of ASCII character) | 24 unique vowels (excluding ones that have implicit final consonants) |
4 tones | 5 tones |
1 silent marker | - |
Table 1: Summary of number of alphabet and phones in Thai.
Like English, multi-letter to phone mapping is needed for consonant clusters and multi-letter vowels, this is achieved through mapping some letters to epsilon. However, there are some characteristics of Thai pronunciation system that poses more difficulty in letter-to-sound rules.
In our statistical trained model, we use decision trees to predict phones from letters and their context. Decision trees are trained from a lexicon of pronunciations. Before training, we must align each letter to its corresponding zero or more phones. Then from these alignments, a decision tree is trained for each letter. The detail of the training process is given below.
In this step, we define all possible phones and multi-phones
for each letter. From the first problem in the previous section, /r/, /n/
and /a/ are considered as allowable phones for a letter Ã.
Letters map to epsilon is needed for consonant clusters and multi-letter
vowels. For example, »
in »Ã
is mapped to /pr/ while Ã
is mapped to epsilon. All tonal makers are mapped to epsilon since we will
use rules to predict tones after we predict all other phones. The detail
of tone prediction is described in section 3.3.
Mapping letter to multi-phones is used to solve implicit vowels and consonants propagation. A syllable boundary marker is also considered as a phone, as we use it to predict a tone in the next stage. For the case of implicit vowel, consider the word "ÇþÅ" /w-@:-|-r-a-| -p-o-n-|/, "Ç" is mapped to three phones /w-@:-|/ and "Ã" is mapped to /r-a-|/. For the last syllable, "¾" is mapped to two phones /o-ph/. The reason for the reverse order of the phone will be described in the next step. For the case of final consonant propagation, considered the word "¨ÑµØÃÑÊ" /c-a-t-|-t-u-|-r-a-t-|/, "µ" which functions as both final consonant and initial consonant of the following syllable is mapped to three phones /t-|-t /. We need one letter to map up to five phones when the final consonant that propagates also has implicit vowel. For example, "ÇÔ·ÂÒ" /w-i-t-|-t-a-|-j-a:-|/, "·" is mapped to /t-|-t-a-|/, since the final consonant of the first syllable propagates to become the initial consonant of the second syllable which also has implicit /a/ vowel.
Before aligning the letters to phones, some preprocessing
is needed. From the inversion of the initial consonant and the leading
vowel, we need to invert the order of the phones in corresponding to the
letters that generate them. For instance, the pronunciation of "à¡"
is changed to /e-k-|/. The first letter is easily mapped to the first sound
and so is the second letter. For the implicit /o/ vowel, since /o/ is also
a leading vowel, it is placed before the initial consonant to be consistent.
Therefore, "¾"
with implicit /o/ vowel is mapped to /o-ph/.
We find all possible alignments given the set of allowable
letter/phone group mappings and calculate the likelihood of each. Then
we use that information to score all possible alignments and select the
most likely as the actual alignment.
((p.name is Ò)
((n.name is #)
(((a 0) (n 1) (r 0) n)))
((n.name is Õ)
(((a 0) (n 0) (r 1) r)))
((p.name is ¡)
((n.name is Ã)
((n.n.name is Á)
(((a 0.672) (n 0.251) (r 0.077) n))))))
Figure 1: Decision tree for "Ã"
"p.name" means previous letter while "n.name" and "n.n.name" mean next letter and next next letter respectively. The first rule can be read as following: If the previous character is "Ò" and the next character is a word boundary marker, then "Ã" should be pronounced as /n/ with probability equal to 1 and should be pronounced as /a/ and /r/ with probability equal to 0. So given this context "Ã" should be pronounced as /n/.
3.3 Predicting Tones
For predicting tones, hand-written rules are used as no ambiguity exists, if the following components of each syllable are known; Initial consonant group (high, middle and low), The length of the vowel, Final consonant and A tone marker. The detail of the rules can be found in [7] and also in most of the Thai grammar books. The rules can cover most of the cases, except for the following 2 cases.
Our training set consists of 22,818 words from a Thai
pronunciation dictionary, LEXITRON, with 2,535 words, which is every tenth
word in the dictionary, held-out for testing.
|
|
|
||
|
|
|
||
Rule-based |
|
|
|
|
Decision Tree |
|
|
|
|
Decision Tree (without syllable boundaries) |
|
|
|
|
Table 2: Accuracy of the models. Exp1 is the word accuracy ignoring tones. Exp2 is word accuracy ignoring tones and the length of the vowels. Exp3 is word accuracy including tones and the length of the vowels.
In our initial model, we achieved 94.47% letter accuracy and 62.17% complete word accuracy. Letter accuracy is defined as the number of letters that are correctly converted to epsilon, a phone, or multi-phone. Word accuracy is defined as the number of words that all phones (once resplit) match exactly all the phones in the entry in the test data.
As many of the errors were caused by misplaced syllable boundaries, and assuming we can predict syllable boundaries from a sequences of phones, we retrained without them. This improved our scores to 95.25% letter accuracy and 66.26% word accuracy. The second model out-performs a previously existing rule-based model, which achieves 65.15% word accuracy on the same data.
The accuracy drops more in our model than the rule-based model when including the tone. This is due to the inconsistent of the length of the compound vowels in the dictionary. Some compound vowels are coded as short vowels, but the are intended to be pronounced as long vowels as reflect in tones. Even we can get the same vowel length as in the dictionary, we cannot get the same tone since it does not follow the rules.
Another frequent mistake is the length of the vowels. However, sometimes it is negligible since in some words both pronunciations are actually acceptable. The length of the vowels is sometimes depended on the context and also varies from dialects to dialects. For example, "äËé" in "ÃéͧäËé" /r-@:-ng-3| h-a:-j-2|/ is pronounced using a long vowel, /a:/, while in "àÊÒäËé" /s-a-w-4| h-a-j-2|/ is pronounced using a short vowel, /a/. Since we can ignore the variation in the length of the vowels as long as it does not change to meaning of the word, the word accuracy is higher about 10% on both models.
The above simple model uses decision trees to predict the most likely phone group based on the letter context alone. No account of previous (or following) phone predictions is taken into account. In an attempt to improve prediction, we built a second more complex model following [3] where the decision trees predict a probability distribution of possible phone groups and best path is selected using Viterbi search and n-grams. More formally we wish to find most probable string of phone groups given a string of letters
argmax P(p1,...pn | l1, ... ln) (1)
We can approximate this as
N
argmax Õ
P(pi | pi-1, p i-2, ...) P(l i)
(2)
i=1
The P(li) term in (2) is approximated unigram,
the results from the decision tree, which actually gives P(p | l i-3,...l
i+3) but can be inverted by dividing by the unigram probability of
p. P(p) is approximated by an n-gram. We tried varies n-gram, with Good-Turing
smoothing (sm1) and backoff (bk) and found the following results.
|
|
|
||
|
|
|
||
No ngram |
94.47%
|
68.71%
|
|
|
Unigram (raw) |
94.71%
|
70.29%
|
|
|
Unigram (sm1) |
94.71%
|
70.29%
|
|
|
Unigram (bk) |
94.71%
|
70.29%
|
|
|
Bigram (raw) |
95.37%
|
74.79%
|
|
|
Bigram (sm1) |
95.40%
|
74.79%
|
|
|
Bigram (bk) |
95.33%
|
74.04%
|
|
|
Trigram (raw) |
95.16%
|
73.33%
|
|
|
Trigram (sm1) |
95.08%
|
73.17%
|
|
|
Trigram (bk) |
95.60%
|
75.31%
|
|
|
Table 3: Letter accuracy and word accuracy of the n-gram models. Exp1, Exp2 and Exp3 are the same as in Table 2.
When we use equi-probable phones we get the same result as using the best predicted value from the decision tree. But we get an immediate gain with unigrams with appropriate probabilities, this gets substantially better for bigrams, though reduces for trigrams. As we increase the size of the context the n-gram representation gets larger such that even for trigrams we have a model which requires hundred of megabytes of memory to run. Even if we had more efficient representations of n-grams, it seems that bigrams offer the best compromise between accuracy and space.
Interestingly when we apply the tree plus n-gram model using the trees that do not predict syllable boundaries our overall word accuracy on bi-gram and tri-grams drops some 1%. This implies that the syllable boundary information is in fact useful in the prediction.
We have presented a statistically trained model, decision
tree, for Thai letter to sound rules, with augmentation for solving letter
to phone alignment problem in Thai. The results reveal that the decision
trees with n-gram of phones model out-performs the traditional rule-based
model. We also found that the bigrams model offers the best compromise
between accuracy and space.
There is some more work that can be done to improve the model. The problem which is still left unsolved is the inversion of phones across syllable boundaries. One way to handle this problem is to revert the position of the letters to make them correspond to the pronunciation, in the preprocessing process. The word "à¡ÉÁ" /k-a-1|s-e-m-4|/, will become "¡àÉÁ" which "¡" is now easily mapped to the first syllable and so as "àÉÁ" to the second syllable. However, this will eliminate another possible pronunciation /k-e-0|s-o-m-4|/, which "à¡" go to the first syllable and "ÉÁ" go to second syllable, that needs the original written form. Giving both written forms to the model, the normalized probabilities from the decision trees can be used to select the most probable one.
Using the decision tree to predict syllable boundary instead of phone is also worth a try. If a syllable boundary is obtained, the pronunciation of most syllables can be predicted accurately by using a rule.
We would like to express our gratitude to Dr. Virach
Sornlertlamvanich, the head of Software and Language Engineering Laboratory,
National Electronics and Computer Technology Center, Thailand, for the
Thai pronunciation dictionary, LEXITRON and the rule-based letter-to
sound rule. We also would like to thank Monthika Boriboon for the linguistic
discussion.
This work was supported in part by an NSF Combined Research and Curriculum Development grant, and by Space and Naval Warfare Systems Center, San Diego, under Grant No. N66001-99-1-8905. The opinions expressed in this work do not necessarily represent the opinions of these funding bodies.
[2] Charoenporn, T., Chotimongkol, A., and Sornlertlamvanich, V. "Automatic Romanization for Thai", in Proceeding of the 2 nd International Workshop on East-Asian Language Resources and Evaluation", Taipei, Taiwan, 1999.
[3] Jiang, L., Hon, H. and Huang, X. "Improvements on a Trainable Letter-to-Sound Converter", Eurospeech 97, Rhodes, Greece, 1997.
[4] Karoonboonyanan, T., Sornlertlamvanich, V. and Meknavin, S. "A Thai Soundex System for Spelling Correction", Proceeding of the National Language Processing Pacific Rim Symposium, pp. 633-636, Phuket, Thailand, 1997.
[5] Mittrapiyanuruk, P., Hansakunbuntheung, C., Tesprasit, V. and Sornlertlamvanich, V. "Issues in Thai Text-to-Speech Synthesis: The NECTEC Approach", Proceedings of NECTEC Annual Conference 2000, Bangkok Thailand, 2000.
[6] Taisetwatkul, S., Kanawaree, W. "Thai Speech Synthesizer" (in Thai), Senior Project for completion of B. Eng., Faculty of Engineering, Chulalongkorn University, Bangkok Thailand, 1996.
[7] Thonglo, K., Thai Grammar (in Thai), Bangkok,
Thailand 1952.