Polyphones
The first step towards training a context dependent recognizer is
collecting all the contexts that can be modeled with the given task.
This step can be done at any phase of the development, even at the
beginning before any parameters have been trained or initialized.
What Is a Polyphone
For Janus, a polyphone is something very similar to the pronunciation
part of a dictionary entry. A polyphone is a list of optionally tagged
phonemes plus some information about which of the phonemes is the
central phone. There should always be a central phone in a polyphone,
even if it seems possible in some cases to omit it. The central phone
is the one that is modeled within its context. Polyphones can look
like these examples:
{{H WB} E L {O WE}} -1 2
{A B D E F G H I} -3 4
{A} 0 0
{A B} 0 1
The first example above is a polyphone modeling the phoneme E of the
word HALLO. The H is tagged with WB (meaning word beginning) and the
O is tagged with WE (meaning word end). The integers -1 and 2 mean
that the context width goes from -1, i.e. one to the left to +2, i.e.
two to the right. The second example's context width is three to the
left and four to the right, which means that the E is the central phone,
A B D is its left context, and F G H I is its right context. The
third example shows a context-independent phoneme, and the last example
shows the phoneme A modeled as a biphone with one single right context B.
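To make the notation concrete, here is a minimal sketch of a polyphone as a tagged phone list plus context widths. This is plain Python for illustration only; the class and method names are made up and do not correspond to the actual Janus objects:

```python
# Illustrative sketch of a polyphone: a list of (phoneme, tags) pairs plus
# the left/right context widths. Names are hypothetical, not Janus API.
class Polyphone:
    def __init__(self, phones, left, right):
        self.phones = phones   # e.g. [("H", ["WB"]), ("E", []), ...]
        self.left = left       # e.g. -1: context extends one phone left
        self.right = right     # e.g.  2: context extends two phones right

    def central(self):
        # The central phone sits -left positions from the list start.
        return self.phones[-self.left][0]

# {{H WB} E L {O WE}} -1 2 : models the E of HALLO
p = Polyphone([("H", ["WB"]), ("E", []), ("L", []), ("O", ["WE"])], -1, 2)
print(p.central())  # -> E
```

The same representation covers the other examples: `Polyphone([("A", [])], 0, 0)` is the context-independent phoneme, and `Polyphone([("A", []), ("B", [])], 0, 1)` is the biphone.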
Context Width
While in the first example above, the context of the phone E covered
the entire word HALLO, this is usually neither wanted nor helpful.
For a short word like HALLO this is not a problem, but for longer
words like ENCYCLOPAEDIA, you wouldn't expect the pronunciation of the
last phone to depend very much on the first phone. And even if
this were the case, you most likely wouldn't have enough training examples
to estimate the acoustic model for such a wide context reliably.
Another disadvantage of using very wide contexts is the fact that
the recognizers tend to become very large. Even if you are using a
context of three to the left and three to the right (also called
septphones), you can easily get close to a million different acoustic
models, when your database is large enough and your dictionary contains
many variants. So in most cases you will want to limit your maximum
context width to something like two or three to both sides.
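A rough upper-bound calculation shows why wide contexts explode. The numbers below are illustrative only (50 phonemes, tags ignored); a real system only ever observes a small fraction of these combinations, but even that fraction can reach the million-model range mentioned above:

```python
# Theoretical number of distinct polyphones for a symmetric context,
# assuming 50 untagged phonemes (illustrative figure, not a real system).
phonemes = 50
for width in (1, 2, 3):          # context width per side
    span = 2 * width + 1         # total phones covered by the polyphone
    print("width", width, "->", phonemes ** span, "possible contexts")
```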
Cross-Word Contexts
For a couple of reasons, Janus allows cross-word contexts to go only
one phone into the neighboring word. If this constraint were loosened,
we'd end up with a much too complicated search algorithm. Another
constraint is that only the last or first phone of a word can be
modeled with a cross-word context. This means that the last phoneme
of the word HALLO can be modeled in different ways, depending on
the following word's first phone. So the final O of HALLO can be
modelled as follows:
successor    successor    polyphone
word         phone
---------------------------------------------------------
WORLD        W            {E L {O WE} {W WB}} -2 1
YOU          Y            {E L {O WE} {Y WB}} -2 1
THERE        T            {E L {O WE} {T WB}} -2 1
In these examples we've used a maximum context of two to the left and
two to the right; only the right context was reduced to one because of
the cross-word model constraint.
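The table above can be generated mechanically. The following sketch (a hypothetical helper, not Janus code) builds the polyphone string for the final O of HALLO given the successor word's first phone:

```python
def crossword_polyphone(successor_first_phone):
    # Left context within HALLO: E L; central phone: O, tagged WE
    # (word end); right context: one phone of the next word, tagged
    # WB (word beginning) - the cross-word constraint allows only one.
    phones = ["E", "L", "{O WE}", "{%s WB}" % successor_first_phone]
    return "{%s} -2 1" % " ".join(phones)

print(crossword_polyphone("W"))  # -> {E L {O WE} {W WB}} -2 1
```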
Collecting Polyphones
Polyphones are collected in a tree structure, the PTree object
in Janus. The tree is grown while processing all the training data.
The transcription text of every utterance is examined, optional
silences can be inserted between words, and optional alternative
pronunciation variants can be allowed. Then Janus extracts all
contexts and collects them in a PTree object. PTree objects are
part of a distribution tree (or actually any kind of tree that can
hold acoustic models). Sometimes even Janus-insiders use the term
"polyphone" for different things, namely once for a phoneme in
context, and another time for a subsegment of such a phoneme. As
you probably already know, Janus uses state-types that identify
the type of an HMM-state within a phoneme. Such state-types usually
are b, m, and e, indicating the beginning, middle
and end segment of a phone. Later, we will discuss the clustering of
context dependent models. Then we will see that it makes more sense
to cluster the subsegments of polyphones than to cluster entire
polyphones. Therefore, in Janus we usually build an extra clustering
decision tree for every sub-monophone. So, if we have 50 phonemes
in our recognizer and every phoneme is modeled by three HMM states,
then we'd end up with 150 decision trees. Currently (96/12/03),
in Janus, decision trees can only use questions about phonemes and
tags. Maybe some day we will also be able to use questions about
other things, like e.g. the state-type (i.e. the phone-sub-segment).
There are many different ways a decision tree in Janus can be
organized. We usually do it by putting all the subsegments
of the same HMM state-type into one tree, such that in the end we
have only three trees, one for each state-type. Technically, even this
is considered to be only one tree, but one that has three root nodes.
The root nodes' names are the same as the names of the HMM state-types.
Before we can start collecting polyphones, we build such a decision
tree, one that is context independent (i.e. only asks questions about
the central phone). Such a three-rooted tree has - in our example - 150
leaf nodes, which usually hold the information about which acoustic
model should be used to model the node. But now we can attach a
so-called PTree object to every leaf node. You can imagine this
to be a bucket that can hold polyphones. The polyphone collection
process then works like this:
- see some polyphone in the training sentence
- for each subsegment of the polyphone do
- starting at the corresponding root node descend the decision tree
by answering the questions of the tree
- when a leaf node is reached, put the polyphone into the attached
PTree-Bucket and ask a user-defined function what acoustic model is
to be used for this subsegment of the polyphone
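The steps above can be sketched with a toy yes/no decision tree. All names here are illustrative (the real PTree and tree objects are Janus-internal); the point is only the descend-then-bucket logic:

```python
# Toy decision-tree node: an internal node has a question and two
# children; a leaf has a default model index and a polyphone bucket.
class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question, self.yes, self.no = question, yes, no
        self.model = model
        self.bucket = {}   # polyphone -> model index (the "PTree bucket")

def collect(root, polyphone, choose_model):
    # Descend the tree by answering its questions; at the leaf, store
    # the polyphone with a model chosen by a user-defined function.
    node = root
    while node.question is not None:
        node = node.yes if node.question(polyphone) else node.no
    node.bucket[polyphone] = choose_model(polyphone, node.model)

# Context-independent tree: "0=A" asks whether the central phone is A.
leaf_a = Node(model=99)            # default model for A contexts
leaf_other = Node(model=62)
root = Node(question=lambda p: p[1] == "A", yes=leaf_a, no=leaf_other)

# Seeing the triphone (H, A, L) in training, a user function picks 200.
collect(root, ("H", "A", "L"), lambda p, default: 200)
print(leaf_a.bucket)  # -> {('H', 'A', 'L'): 200}
```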
When all the training data is processed, we end up with a decision tree
whose attached PTree-buckets are filled. The following figure illustrates
this procedure:
At the beginning we have an empty bucket attached to every
leaf node of the decision tree. On the right you can see a
part of a decision tree. There are three models displayed.
When the top question "0=A" is answered "yes", then we end
up at a leaf node which says that model-index (i.e. usually
distribution index) 99 is to be used to model the current
subsegment of the phone A. Likewise, 62 is used for B, and
121 is used for D.
After training the buckets contain polyphones together with
their corresponding model index that was computed by a user
defined Tcl-procedure. While before training every context of
the A was modeled with model 99, after training only those
contexts that are not in the bucket of A will still be modeled
with 99; the five explicitly listed contexts in the bucket
are modeled with their own models.
Clustering Polyphones
There are two different ways of clustering anything. One is
"divisive", the other is "agglomerative". The agglomerative version is
when we start with as many classes as we have models, one model for
each class. Then we keep merging classes until we are finished. The
divisive version starts with one class containing all models, and we
keep splitting one class into two (or more) until we are finished.
The agglomerative clustering has the advantage of not needing any
additional information and it is nearly optimal for the given
classes. But its main disadvantage is that it does not offer any good
way of handling unseen models. Whenever we'd get a model that we
haven't seen before, we wouldn't know which cluster it should belong
to. The divisive clustering algorithm can overcome this problem by
using class-splitting questions. This means that at every clustering
step we find out which way of splitting a class gives us the best
performance, then we perform the best split and remember the
"way-of-splitting". If the "way-of-splitting" is a question that has
a finite number of answers (usually just "yes" or "no") then we can
grow a decision tree. Every unseen model can be assigned to one of
the resulting classes by descending the decision tree, answering all
the questions for the unseen model.
In Janus we do the polyphone clustering by starting with a
distribution tree that has PTree-buckets attached to its
leaves. Then, for every leaf-node that has a bucket attached, we
compute the benefit of every allowed question, and remember the best.
Out of all best questions we take the very best and perform the split,
which means attaching two new leaf-nodes to the node that is being
split and giving an extra PTree-bucket to each of the new leaf-nodes.
Consider our example from above. Let's assume that we looked at all
possible ways of splitting the leaf nodes whose model index was 121,
62, and 99. And let's assume we found that the very best split
would be to split the 121-node using the question "is the right
neighboring phoneme a B" (in Janus the question's representation
would be "+1=B"). Then the resulting tree would look like this:
With every clustering step, we increase the number of leaf nodes by
one and increase the number of buckets by one. We can either do this
until every bucket contains only a non-divisible group of polyphones
(non-divisible because we don't have any question in our repertoire
that would split them), or we can stop splitting when we don't find any
splits that are good enough for our needs. The latter case usually
means defining a minimal count of training frames that we would like
to have for each bucket. This way we won't get any bucket that has
fewer training frames than this limit.
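This greedy split loop can be sketched as follows. The scoring function and data layout are toys (real systems score splits by likelihood gain on the training frames); only the find-best-split / split / repeat structure mirrors the procedure described above:

```python
def best_split(bucket, questions, score):
    # Try every allowed question on this bucket; keep the best split.
    best = None
    for q in questions:
        yes = {p: c for p, c in bucket.items() if q(p)}
        no = {p: c for p, c in bucket.items() if not q(p)}
        if not yes or not no:
            continue                      # question doesn't split anything
        gain = score(yes, no)
        if best is None or gain > best[0]:
            best = (gain, yes, no)
    return best

def cluster(buckets, questions, score, min_frames):
    # Greedily perform the globally best split until no bucket can be
    # split without a child falling below the training-frame limit.
    while True:
        candidates = []
        for i, b in enumerate(buckets):
            s = best_split(b, questions, score)
            if s and min(sum(s[1].values()), sum(s[2].values())) >= min_frames:
                candidates.append((s[0], i, s[1], s[2]))
        if not candidates:
            return buckets
        _, i, yes, no = max(candidates, key=lambda c: c[0])
        buckets[i:i + 1] = [yes, no]      # replace the bucket by its children

# Toy example: frame counts per polyphone, one question ("+1=B").
result = cluster([{"A+B": 10, "A+C": 10}], [lambda p: p.endswith("B")],
                 lambda yes, no: 1.0, min_frames=5)
print(len(result))  # -> 2
```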
Once we decide to stop splitting, we can throw away all the buckets and
live with a PTree-free model-clustering decision tree. We would then
train an extra acoustic model (usually this means an extra Gaussian
codebook) for every leaf node of the decision tree.
Two-Stage Clustering
When using Gaussian mixtures for acoustic models, it often makes
sense to use more mixture weight distributions than codebooks. In the
case of using only one codebook but many different distributions we
would have a clean semi-continuous recognizer. In the case where there
is the same number of distributions as codebooks, we would have a clean
fully continuous recognizer. The optimal recognizer is very often
somewhere in between.
Imagine we have built a context-dependent, clustered, but currently
still fully continuous recognizer. We gave an extra codebook and an extra
distribution to every leaf node of the decision tree. We can now get
the PTree-buckets back, yes, the ones that contained the remaining
polyphones that we decided not to split any more. We can attach these
buckets to the currently used tree and continue clustering as if
nothing had happened. This time we'd use a different stopping
criterion, because we want to cluster further (usually this means
using a smaller minimal frame count). When we decide not to continue
any more, we end up with more leaf nodes than we had before. We can
now assign an extra mixture weight distribution to every resulting
leaf node, and define this distribution over the codebook of the
original leaf node out of which it was split off. This way the number
of codebooks remains the same, but we get a bigger number of distributions.
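The bookkeeping effect of the second stage can be illustrated with plain dictionaries (purely hypothetical structures, not Janus objects): each new leaf gets its own mixture-weight distribution but points back at the codebook of the leaf it was split off from, so the codebook count stays constant while the distribution count grows:

```python
# Stage one result: one leaf with its own codebook and one distribution.
leaves = {"A-leaf": {"codebook": "cb_A", "dists": "dist_A"}}

def second_stage_split(leaves, leaf, new_names):
    # Splitting a leaf in stage two creates new leaves whose mixture
    # weight distributions all share the parent leaf's codebook.
    cb = leaves[leaf]["codebook"]
    for name in new_names:
        leaves[name] = {"codebook": cb, "dists": "dist_" + name}
    del leaves[leaf]

second_stage_split(leaves, "A-leaf", ["A+1=B", "A+1!=B"])
codebooks = {v["codebook"] for v in leaves.values()}
print(len(codebooks), len(leaves))  # -> 1 2
```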