AND | AE N D |
ONE | W AH N |
TWO | T UW |
Of course, syllables include more context (adjacent phonemes influence the pronunciation) and are therefore more specific, but they also require more training data overall. Under the assumption that the pronunciation of a phoneme depends mainly on the current, the previous and the next phoneme, we define a triphone as a context-dependent phone, such as the N(AE,D) in AND. Considering even more context leads to polyphones.
monophone | N |
biphone | N(-1=AE), N(+1=D) |
triphone | N(-1=AE,+1=D) |
polyphone | N(-n=.., ... ,+m=..) |
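The naming scheme in the table above can be sketched in a few lines of Python. This is only an illustration, not part of JANUS: the function name `context_dependent_units`, its parameters `left` and `right`, and the use of `*` for a missing neighbour at the word boundary are assumptions made for this example.

```python
def context_dependent_units(phonemes, left=1, right=1, boundary="*"):
    """Label each phoneme with `left` preceding and `right` following phonemes.

    left=0, right=0 gives monophones, left=1, right=1 gives triphones,
    larger values give polyphones. Positions outside the word are filled
    with `boundary` (an assumption of this sketch).
    """
    offsets = list(range(-left, 0)) + list(range(1, right + 1))
    units = []
    for i, phone in enumerate(phonemes):
        context = []
        for offset in offsets:
            j = i + offset
            neighbour = phonemes[j] if 0 <= j < len(phonemes) else boundary
            context.append(f"{offset:+d}={neighbour}")
        # A unit without context is simply the monophone itself.
        units.append(f"{phone}({','.join(context)})" if context else phone)
    return units


print(context_dependent_units(["AE", "N", "D"], left=0, right=0))  # monophones
print(context_dependent_units(["AE", "N", "D"], left=1, right=0))  # left biphones
print(context_dependent_units(["AE", "N", "D"]))                   # triphones
```

The number of distinct units grows quickly with the context width, which is why wider contexts need more training data.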
Each of these very specific units occurs only rarely in the training data, and
many of them are so similar that they are better modeled together. Triphones
are therefore often clustered into generalized (or clustered) triphones.
Even once we have found units to represent words, the corresponding acoustic
features are not stationary, that is, their properties change over time. We
could look for approaches that model this dynamic behaviour directly, but we can
also split the units into quasi-stationary segments. A common approach is to use
sub-phones such as the begin, middle and end segments of a phone.
word | AND |
phonemes | AE N D |
triphones | AE(*,N) N(AE,D) D(N,*) |
subtriphones | AE(*,N)-b AE(*,N)-m AE(*,N)-e N(AE,D)-b N(AE,D)-m N(AE,D)-e D(N,*)-b D(N,*)-m D(N,*)-e |
As we will see, the possible subphone sequences and their probabilities can be modeled with Hidden Markov Models. We will call the smallest units that are modeled senones. What we also need is an estimate of how likely a certain acoustic feature is for a given senone.
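The expansion from a word to its sub-triphones, as listed for AND, can be sketched as follows. The dictionary holds only the three entries shown earlier, and the function names `triphones` and `subtriphones` are hypothetical. The sketch also assumes that the context at a word boundary is always `*`, so context across word boundaries is ignored.

```python
# Pronunciation dictionary with the three example entries from the text.
DICTIONARY = {
    "AND": ["AE", "N", "D"],
    "ONE": ["W", "AH", "N"],
    "TWO": ["T", "UW"],
}


def triphones(phonemes, boundary="*"):
    """Return units of the form PHONE(LEFT,RIGHT) for a phoneme sequence."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i]}({padded[i - 1]},{padded[i + 1]})"
        for i in range(1, len(padded) - 1)
    ]


def subtriphones(word, segments=("b", "m", "e")):
    """Split every triphone of `word` into begin, middle and end segments."""
    return [f"{tri}-{seg}" for tri in triphones(DICTIONARY[word]) for seg in segments]


print(triphones(DICTIONARY["AND"]))
print(subtriphones("AND"))
```

A word with n phonemes yields 3n sub-triphones in this scheme, so AND is represented by nine units.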
For the likelihood f(x|s) of a feature vector x given a senone s we can assume a certain distribution. A very simple approach is the assumption of a Gaussian distribution. With a mixture of Gaussians we can approximate any distribution, as long as we use enough Gaussians.
f(x|s) = SUM_over_c P(c|s) * f(x|c,s)
weighting factor | P(c|s) |
gaussian | f(x|c,s) = k(c) * exp(-0.5 * (x-m(c))' * K(c)^-1 * (x-m(c))) |
mean vector | m(c) |
covariance matrix | K(c) |
The weighting factors (called distributions in JANUS), the mean vectors and the covariance matrices (called codebooks in JANUS) are estimated during the training of the recognizer and used during the test (or search) to find the most likely hypothesis.
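To make the mixture formula concrete, here is a minimal sketch in plain Python. It is not the JANUS implementation. For brevity it assumes diagonal covariance matrices, so K(c) is given as a list of per-dimension variances and k(c) becomes 1 / sqrt((2*pi)^d * product of the variances). The function names and the parameter values are invented for illustration.

```python
import math


def gaussian_density(x, mean, variances):
    """f(x|c,s) for one Gaussian with diagonal covariance K(c) = diag(variances)."""
    dim = len(x)
    # Normalization constant k(c) for a diagonal covariance matrix.
    k = 1.0 / math.sqrt((2.0 * math.pi) ** dim * math.prod(variances))
    # (x-m)' * K^-1 * (x-m) reduces to a sum of scaled squared differences.
    distance = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances))
    return k * math.exp(-0.5 * distance)


def senone_likelihood(x, weights, means, variances):
    """f(x|s) = SUM_over_c P(c|s) * f(x|c,s)."""
    return sum(
        w * gaussian_density(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Hypothetical senone with two mixture components over 2-dimensional features.
weights = [0.6, 0.4]                    # P(c|s), must sum to 1
means = [[0.0, 0.0], [3.0, 3.0]]        # m(c)
variances = [[1.0, 1.0], [2.0, 0.5]]    # diagonal of K(c)

print(senone_likelihood([0.5, -0.2], weights, means, variances))
```

In practice such likelihoods are usually computed in the log domain to avoid numerical underflow for high-dimensional feature vectors.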