User Manual
Introduction to JANUS for Users
Overview: Automatic Speech Recognition

Speech Model


Language Model


Dictionary

Although we could have an acoustic representation for each word (you can do that for an application with only a couple of words), in most cases it is necessary to break the words into smaller pieces. These pieces are then the atoms of speech from which every possible word can be built. In practice these 'phones' are either based on syllables or on phonemes.

a phoneme dictionary
AND AE N D
MORE M AO R
ONE W AH N
THIS DH IH S
TWO T UW
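
The following is a minimal sketch of such a pronunciation dictionary in Python. The words and phoneme symbols are taken from the example above; the function name 'expand_utterance' is made up for illustration and is not part of JANUS.

    # Pronunciation dictionary from the example above.
    PHONEME_DICT = {
        "AND":  ["AE", "N", "D"],
        "MORE": ["M", "AO", "R"],
        "ONE":  ["W", "AH", "N"],
        "THIS": ["DH", "IH", "S"],
        "TWO":  ["T", "UW"],
    }

    def expand_utterance(words):
        """Replace every word by its phoneme sequence."""
        phones = []
        for word in words:
            phones.extend(PHONEME_DICT[word])
        return phones

    print(expand_utterance(["ONE", "AND", "TWO"]))
    # ['W', 'AH', 'N', 'AE', 'N', 'D', 'T', 'UW']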

Of course syllables include more context (adjacent phonemes influence the pronunciation) and are therefore more specific, but they also require more training data in total. Under the assumption that the pronunciation of a phoneme depends mainly on the current, the previous and the next phoneme, we define the term triphone: a context-dependent phone such as the N(AE,D) in AND. Considering even more context leads to polyphones.

phones
monophone N
biphone N(-1=AE), N(+1=D)
triphone N(-1=AE,+1=D)
polyphone N(-n=.., ... ,+m=..)
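
A short sketch of how triphone labels in the notation above can be derived from a phoneme sequence; the helper name 'to_triphones' is hypothetical, and boundaries are written as '*'.

    def to_triphones(phones):
        """Turn a phoneme sequence into triphone labels 'P(left,right)'."""
        triphones = []
        for i, p in enumerate(phones):
            left  = phones[i - 1] if i > 0 else "*"
            right = phones[i + 1] if i < len(phones) - 1 else "*"
            triphones.append(f"{p}({left},{right})")
        return triphones

    print(to_triphones(["AE", "N", "D"]))
    # ['AE(*,N)', 'N(AE,D)', 'D(N,*)']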

The frequency of those very specific models is rather small, and many of them are so similar that they would be better modeled together. Therefore triphones are often clustered into generalized (or clustered) triphones.
Even if we have found units to represent words, the corresponding acoustic features are not stationary, that is, they change their properties over time. We could think about approaches to model this dynamic behaviour, but we can also split the units further to obtain quasi-stationary segments. A common approach is to use sub-phones such as the begin, middle and end segments of phones.

    word         AND

    phonemes     AE       N        D

    triphones    AE(*,N)  N(AE,D)  D(N,*)

    subtriphones AE(*,N)-b AE(*,N)-m AE(*,N)-e
                 N(AE,D)-b N(AE,D)-m N(AE,D)-e
                 D(N,*)-b  D(N,*)-m  D(N,*)-e

As we will see, the possible sub-phone sequences and their probabilities can be modeled with Hidden Markov Models. We will call these smallest units that are modeled senones. What we also need is a prediction of how likely a certain acoustic feature is for a given senone.
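
Continuing the hypothetical helper above, a small sketch that expands the triphones of AND into the begin/middle/end sub-units shown in the table; the '-b', '-m', '-e' suffixes follow the notation used there.

    def to_subtriphones(triphones, parts=("b", "m", "e")):
        """Expand each triphone into its begin/middle/end segments."""
        return [f"{tri}-{part}" for tri in triphones for part in parts]

    print(to_subtriphones(["AE(*,N)", "N(AE,D)", "D(N,*)"]))
    # ['AE(*,N)-b', 'AE(*,N)-m', 'AE(*,N)-e', 'N(AE,D)-b', ..., 'D(N,*)-e']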

Acoustic Model

There are many different ways to model the likelihood of an acoustic feature 'x' given a senone 's'. Since we have to estimate the probability density function (pdf)
         f(x|s)
we can assume a certain distribution. A very simple approach is the assumption of a Gaussian distribution. With a mixture of Gaussians we can model any distribution as long as we have enough Gaussians.
         f(x|s) = SUM_over_c P(c|s) * f(x|c,s)

         weighting factor   P(c|s)
         gaussian           f(x|c,s) = k(c) * exp(-0.5 * (x-m(c))' * K(c)^-1 * (x-m(c)))
         mean vector        m(c)
         covariance matrix  K(c)
The weighting factors (in JANUS called distributions), the mean vectors and the covariance matrices (in JANUS called codebooks) are estimated during the training of the recognizer and are used during the test (or search) to find the most likely hypothesis.
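
The following is a minimal numpy sketch of the mixture density f(x|s) defined above, not the JANUS implementation. The variable names (weights, means, covs) stand in for what JANUS stores in distributions and codebooks, and the toy numbers are made up.

    import numpy as np

    def gaussian_pdf(x, mean, cov):
        """Multivariate Gaussian density f(x|c,s)."""
        d = len(mean)
        diff = x - mean
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
        return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

    def mixture_pdf(x, weights, means, covs):
        """f(x|s) = SUM_over_c P(c|s) * f(x|c,s)."""
        return sum(w * gaussian_pdf(x, m, k)
                   for w, m, k in zip(weights, means, covs))

    # Toy senone with two mixture components in two dimensions.
    weights = [0.6, 0.4]
    means   = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
    covs    = [np.eye(2), 0.5 * np.eye(2)]
    print(mixture_pdf(np.array([0.5, 0.5]), weights, means, covs))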

Hidden Markov Model (HMM)


Maintainer: maier@ira.uka.de