Polyphones
The first step towards training a context dependent recognizer is
collecting all the contexts that can be modeled with the given task.
This step can be done at any phase of the development, even at the
beginning before any parameters have been trained or initialized.
What Is a Polyphone
For Janus, a polyphone is something very similar to the pronunciation
part of a dictionary entry. A polyphone is a list of optionally tagged
phonemes plus some information about which of the phonemes is the
central phone. There should always be a central phone in a polyphone,
even if it seems possible in some cases to omit it. The central phone
is the one that is modeled within its context. Polyphones can look
like these examples:
{{H WB} E L {O WE}} -1 2
{A B D E F G H I} -3 4
{A} 0 0
{A B} 0 1
The first example above is a polyphone modeling the phoneme E of the
word HALLO. The H is tagged with WB (meaning word beginning) and the
O is tagged with WE (meaning word end). The integers -1 and 2 mean
that the context width goes from -1, i.e. one to the left to +2, i.e.
two to the right. The second example's context width is three to the
left and four to the right, which means that the E is the central phone,
A B D is its left context, and F G H I is its right context. The
third example shows a context-independent phoneme, and the last example
shows the phoneme A modeled as a biphone with one single right context B.
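To make the notation concrete, here is a minimal sketch of a polyphone as a tagged phone list plus context widths. This is plain Python for illustration only; the class and method names are made up and do not correspond to the actual Janus objects:

```python
# Illustrative sketch of a polyphone: a list of (phoneme, tags) pairs plus
# the left/right context widths. Names are hypothetical, not Janus API.
class Polyphone:
    def __init__(self, phones, left, right):
        self.phones = phones   # e.g. [("H", ["WB"]), ("E", []), ...]
        self.left = left       # e.g. -1: context extends one phone left
        self.right = right     # e.g.  2: context extends two phones right

    def central(self):
        # The central phone sits -left positions from the list start.
        return self.phones[-self.left][0]

# {{H WB} E L {O WE}} -1 2 : models the E of HALLO
p = Polyphone([("H", ["WB"]), ("E", []), ("L", []), ("O", ["WE"])], -1, 2)
print(p.central())  # -> E
```

The same representation covers the other examples: `Polyphone([("A", [])], 0, 0)` is the context-independent phoneme, and `Polyphone([("A", []), ("B", [])], 0, 1)` is the biphone.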
Context Width
While in the first example above, the context of the phone E covered
the entire word HALLO, this is usually neither wanted nor helpful.
For a short word like HALLO this is not a problem, but for longer
words like ENCYCLOPAEDIA, you wouldn't expect the pronunciation of the
last phone to depend very much on the first phone. And even if
this were the case, you most likely wouldn't have enough training examples
to estimate the acoustic model for such a wide context reliably.
Another disadvantage of using very wide contexts is the fact that
the recognizers tend to become very large. Even if you are using a
context of three to the left and three to the right (also called
septphones), you can easily get close to a million different acoustic
models, when your database is large enough and your dictionary contains
many variants. So in most cases you will want to limit your maximum
context width to something like two or three to both sides.
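A rough upper-bound calculation shows why wide contexts explode. The numbers below are illustrative only (50 phonemes, tags ignored); a real system only ever observes a small fraction of these combinations, but even that fraction can reach the million-model range mentioned above:

```python
# Theoretical number of distinct polyphones for a symmetric context,
# assuming 50 untagged phonemes (illustrative figure, not a real system).
phonemes = 50
for width in (1, 2, 3):          # context width per side
    span = 2 * width + 1         # total phones covered by the polyphone
    print("width", width, "->", phonemes ** span, "possible contexts")
```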
Cross-Word Contexts
For a couple of reasons, Janus allows cross-word contexts to go only
one phone into the neighboring word. If this constraint were loosened,
we'd end up with a much too complicated search algorithm. Another
constraint is that only the last or first phone of a word can be
modeled with a cross-word context. This means that the last phoneme
of the word HALLO can be modeled in different ways, depending on
the following word's first phone. So the final O of HALLO can be
modelled as follows:
successor    successor    polyphone
word         phone
---------------------------------------------------------
WORLD        W            {E L {O WE} {W WB}} -2 1
YOU          Y            {E L {O WE} {Y WB}} -2 1
THERE        T            {E L {O WE} {T WB}} -2 1
In these examples we've used a maximum context of two to the left and
two to the right; only the right context was reduced to one because of
the cross-word model constraint.
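The table above can be generated mechanically. The following sketch (a hypothetical helper, not Janus code) builds the polyphone string for the final O of HALLO given the successor word's first phone:

```python
def crossword_polyphone(successor_first_phone):
    # Left context within HALLO: E L; central phone: O, tagged WE
    # (word end); right context: one phone of the next word, tagged
    # WB (word beginning) - the cross-word constraint allows only one.
    phones = ["E", "L", "{O WE}", "{%s WB}" % successor_first_phone]
    return "{%s} -2 1" % " ".join(phones)

print(crossword_polyphone("W"))  # -> {E L {O WE} {W WB}} -2 1
```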
Collecting Polyphones
Polyphones are collected in a tree structure, the PTree object
in Janus. The tree is grown while processing all the training data.
The transcription text of every utterance is examined, optional
silences can be inserted between words, and optional alternative
pronunciation variants can be allowed. Then Janus extracts all
contexts and collects them in a PTree object. PTree objects are
part of a distribution tree (or actually any kind of tree that can
hold acoustic models). Sometimes even Janus-insiders use the term
"polyphone" for different things, namely once for a phoneme in
context, and another time for a subsegment of such a phoneme. As
you probably already know, Janus uses state-types that identify
the type of an HMM-state within a phoneme. Such state-types usually
are b, m, and e, indicating the beginning, middle
and end segment of a phone. Later, we will discuss the clustering of
context dependent models. Then we will see that it makes more sense
to cluster the subsegments of polyphones than to cluster entire
polyphones. Therefore, in Janus we usually build an extra clustering
decision tree for every sub-monophone. So, if we have 50 phonemes
in our recognizer and every phoneme is modeled by three HMM states,
then we'd end up with 150 decision trees. Currently (96/12/03),
in Janus, decision trees can only use questions about phonemes and
tags. Maybe some day we will also be able to use questions about
other things, like e.g. the state-type (i.e. the phone-sub-segment).
There are many different ways a decision tree in Janus can be
organized. We usually do it by putting all the subsegments
of the same HMM state-type into one tree, such that in the end we
have only three trees, one for each state-type. Technically, even this
is considered to be only one tree, but one that has three root nodes.
The root nodes' names are the same as the names of the HMM state-types.
Before we can start collecting polyphones, we build such a decision
tree, one that is context independent (i.e. only asks questions about
the central phone). Such a three-rooted tree has - in our example - 150
leaf nodes, which usually hold the information about which acoustic
model should be used to model the node. But now we can attach a
so-called PTree object to every leaf node. You can imagine this
to be a bucket that can hold polyphones. The polyphone collection
process then works like this:
- see some polyphone in the training sentence
- for each subsegment of the polyphone do
- starting at the corresponding root node descend the decision tree
by answering the questions of the tree
- when a leaf node is reached, put the polyphone into the attached
PTree-Bucket and ask a user-defined function what acoustic model is
to be used for this subsegment of the polyphone
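The steps above can be sketched with a toy yes/no decision tree. All names here are illustrative (the real PTree and tree objects are Janus-internal); the point is only the descend-then-bucket logic:

```python
# Toy decision-tree node: an internal node has a question and two
# children; a leaf has a default model index and a polyphone bucket.
class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question, self.yes, self.no = question, yes, no
        self.model = model
        self.bucket = {}   # polyphone -> model index (the "PTree bucket")

def collect(root, polyphone, choose_model):
    # Descend the tree by answering its questions; at the leaf, store
    # the polyphone with a model chosen by a user-defined function.
    node = root
    while node.question is not None:
        node = node.yes if node.question(polyphone) else node.no
    node.bucket[polyphone] = choose_model(polyphone, node.model)

# Context-independent tree: "0=A" asks whether the central phone is A.
leaf_a = Node(model=99)            # default model for A contexts
leaf_other = Node(model=62)
root = Node(question=lambda p: p[1] == "A", yes=leaf_a, no=leaf_other)

# Seeing the triphone (H, A, L) in training, a user function picks 200.
collect(root, ("H", "A", "L"), lambda p, default: 200)
print(leaf_a.bucket)  # -> {('H', 'A', 'L'): 200}
```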
When all the training data is processed, we end up with a decision tree
whose attached PTree-buckets are filled. The following figure illustrates
this procedure:
At the beginning we have an empty bucket attached to every
leaf node of the decision tree. On the right you can see a
part of a decision tree. There are three models displayed.
When the top question "0=A" is answered "yes", then we end
up at a leaf node which says that model-index (i.e. usually
distribution index) 99 is to be used to model the current
subsegment of the phone A. Likewise, 62 is used for B, and
121 is used for D.
After training the buckets contain polyphones together with
their corresponding model index that was computed by a user
defined Tcl-procedure. While before training every context of
the A was modeled with model 99, after training only those
contexts that are not in the bucket of A will still be modeled
with 99; the five explicitly listed contexts in the bucket
are modeled with their own models.
Clustering Polyphones
There are two different ways of clustering anything. One is
"divisive", the other is "agglomerative". The agglomerative version is
when we start with as many classes as we have models, one model for
each class. Then we keep merging classes until we are finished. The
divisive version starts with one class containing all models, and we
keep splitting one class into two (or more) until we are finished.
The agglomerative clustering has the advantage of not needing any
additional information and it is nearly optimal for the given
classes. But its main disadvantage is that it does not offer any good
way of handling unseen models. Whenever we'd get a model that we
haven't seen before, we wouldn't know which cluster it should belong
to. The divisive clustering algorithm can overcome this problem by
using class-splitting questions. This means that at every clustering
step we find out which way of splitting a class gives us the best
performance, then we perform the best split and remember the
"way-of-splitting". If the "way-of-splitting" is a question that has
a finite number of answers (usually just "yes" or "no") then we can
grow a decision tree. Every unseen model can be assigned to one of
the resulting classes by descending the decision tree, answering all
the questions for the unseen model.
In Janus we do the polyphone clustering by starting with a
distribution tree that has PTree-buckets attached to its
leaves. Then, for every leaf-node that has a bucket attached, we
compute the benefit of every allowed question, and remember the best.
Out of all best questions we take the very best and perform the split,
which means attaching two new leaf-nodes to the node that is being
split and giving an extra PTree-bucket to each of the new leaf-nodes.
Consider our example from above. Let's assume that we looked at all
possible ways of splitting the leaf nodes whose model index was 121,
62, and 99. And let's assume we found that the very best split
would be to split the 121-node using the question "is the right
neighboring phoneme a B" (in Janus the question's representation
would be "+1=B"). Then the resulting tree would look like this:
With every clustering step, we increase the number of leaf nodes by
one and increase the number of buckets by one. We can either do this
until every bucket contains only a non-divisible group of polyphones
(non-divisible because we don't have any question in our repertoire
that would split them), or we can stop splitting when we don't find any
splits that are good enough for our needs. The latter case usually
means defining a minimal count of training frames that we would like
to have for each bucket. This way we won't get any bucket that has
fewer training frames than this limit.
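This greedy split loop can be sketched as follows. The scoring function and data layout are toys (real systems score splits by likelihood gain on the training frames); only the find-best-split / split / repeat structure mirrors the procedure described above:

```python
def best_split(bucket, questions, score):
    # Try every allowed question on this bucket; keep the best split.
    best = None
    for q in questions:
        yes = {p: c for p, c in bucket.items() if q(p)}
        no = {p: c for p, c in bucket.items() if not q(p)}
        if not yes or not no:
            continue                      # question doesn't split anything
        gain = score(yes, no)
        if best is None or gain > best[0]:
            best = (gain, yes, no)
    return best

def cluster(buckets, questions, score, min_frames):
    # Greedily perform the globally best split until no bucket can be
    # split without a child falling below the training-frame limit.
    while True:
        candidates = []
        for i, b in enumerate(buckets):
            s = best_split(b, questions, score)
            if s and min(sum(s[1].values()), sum(s[2].values())) >= min_frames:
                candidates.append((s[0], i, s[1], s[2]))
        if not candidates:
            return buckets
        _, i, yes, no = max(candidates, key=lambda c: c[0])
        buckets[i:i + 1] = [yes, no]      # replace the bucket by its children

# Toy example: frame counts per polyphone, one question ("+1=B").
result = cluster([{"A+B": 10, "A+C": 10}], [lambda p: p.endswith("B")],
                 lambda yes, no: 1.0, min_frames=5)
print(len(result))  # -> 2
```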
Once we decide to stop splitting, we can throw away all the buckets and
live with a PTree-free model-clustering decision tree. We would then
train an extra acoustic model (usually this means an extra Gaussian
codebook) for every leaf node of the decision tree.
Two-Stage Clustering
When using Gaussian mixtures for acoustic models, it often makes
sense to use more mixture weight distributions than codebooks. In the
case of using only one codebook but many different distributions we
would have a clean semi-continuous recognizer. In the case where there
is the same number of distributions as codebooks, we would have a clean
fully continuous recognizer. The optimal recognizer is very often
somewhere in between.
Imagine we have built a context-dependent, clustered, but currently
still fully continuous recognizer. We gave an extra codebook and an extra
distribution to every leaf node of the decision tree. We can now get
the PTree-buckets back, yes, the ones that contained the remaining
polyphones that we decided not to split any more. We can attach these
buckets to the currently used tree and continue clustering as if
nothing had happened. This time we'd use a different stopping
criterion, because we want to cluster further (usually this means
using a smaller minimal frame count). When we decide not to continue
any more, we end up with more leaf nodes than we had before. We can
now assign an extra mixture weight distribution to every resulting
leaf node, and define this distribution over the codebook of the
original leaf node out of which it was split off. This way the number
of codebooks remains the same, but we get a bigger number of distributions.
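The bookkeeping effect of the second stage can be illustrated with plain dictionaries (purely hypothetical structures, not Janus objects): each new leaf gets its own mixture-weight distribution but points back at the codebook of the leaf it was split off from, so the codebook count stays constant while the distribution count grows:

```python
# Stage one result: one leaf with its own codebook and one distribution.
leaves = {"A-leaf": {"codebook": "cb_A", "dists": "dist_A"}}

def second_stage_split(leaves, leaf, new_names):
    # Splitting a leaf in stage two creates new leaves whose mixture
    # weight distributions all share the parent leaf's codebook.
    cb = leaves[leaf]["codebook"]
    for name in new_names:
        leaves[name] = {"codebook": cb, "dists": "dist_" + name}
    del leaves[leaf]

second_stage_split(leaves, "A-leaf", ["A+1=B", "A+1!=B"])
codebooks = {v["codebook"] for v in leaves.values()}
print(len(codebooks), len(leaves))  # -> 1 2
```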