The Alignment Path Module - Overview
This module implements the object classes Path, PathItemList,
and PathItem. Paths are created either by a forced
alignment procedure or by reading
labels from a file. Paths are needed to write labels and
to accumulate training data. You may also want to inspect a path to see what is happening,
and to decide how to continue your work based on the shape and the features of the path.
In the design of the recognizer we decided to treat forced alignment as a method of
the Path object. It could equally well have been a method of the
HMM object, but we found the Path to be
more suitable.
Viterbi vs. Forward-Backward
There are different kinds of paths, depending on which alignment procedure created them.
A Viterbi path is created by the Viterbi algorithm. It has one information cell (PathItem)
per aligned frame. This cell contains the indices of the HMM state, the senone, the phoneme, and the word
that were aligned to this frame. It can also contain a local acoustic score and the best score
that was found for this frame. A forward-backward path is created by the forward-backward
algorithm. It holds the same information as a Viterbi path, plus three additional
entries: alpha (the forward probability), beta (the backward probability), and gamma (the
state occupation probability). Consult a tutorial on HMMs if these terms are unfamiliar.
Another difference between Viterbi and forward-backward is the 'thickness' of the path.
While Viterbi assigns exactly one HMM state to a frame, forward-backward can assign any
number of states, each accompanied by a gamma value indicating its relative probability
among all states of the frame. These state assignments are stored in lists (PathItemList).
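The difference in 'thickness' can be pictured with a small data-structure sketch. This is not the module's actual implementation: the Python class layout, the field names (state_x, senone_x, and so on), and the assumption that the gamma values of one frame sum to 1 are all hypothetical and serve only to illustrate the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PathItem:
    """One information cell (field names are hypothetical, for illustration)."""
    state_x: int = -1
    senone_x: int = -1
    phone_x: int = -1
    word_x: int = -1
    lscore: Optional[float] = None      # local acoustic score, if computed
    best_score: Optional[float] = None  # best score found for this frame
    alpha: Optional[float] = None       # forward-backward paths only
    beta: Optional[float] = None        # forward-backward paths only
    gamma: Optional[float] = None       # forward-backward paths only


@dataclass
class PathItemList:
    """All cells of one frame: one for Viterbi, any number for forward-backward."""
    items: List[PathItem] = field(default_factory=list)

    def gamma_sum(self) -> float:
        # Sum of the gamma values of this frame (cells without gamma are ignored).
        return sum(it.gamma for it in self.items if it.gamma is not None)


# A Viterbi frame carries exactly one state assignment.
viterbi_frame = PathItemList([PathItem(state_x=3, senone_x=17, phone_x=5, word_x=1)])

# A forward-backward frame may carry several, each weighted by its gamma.
fb_frame = PathItemList([
    PathItem(state_x=3, senone_x=17, phone_x=5, word_x=1, gamma=0.75),
    PathItem(state_x=4, senone_x=18, phone_x=5, word_x=1, gamma=0.25),
])
```

In this sketch a whole path would simply be a list of PathItemList objects, one per frame.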
The Life of a Path
A Path object is usually created from the Tcl script that is used for training. Unless
you are working with labels or doing something very special, there is no need for more
than one Path object. You call one of the alignment methods (viterbi or fwdBwd) of this
initially empty object, which fills it.
'Filled' means that for every frame index you can find one or more state, senone, phone,
and word indices. The filled path is then given to a SenoneSet object for
accumulation of training data. There, every cell's acoustic score is recomputed to
obtain the needed training information. The data stored in the path, and the memory
allocated for it, are not freed until an alignment method is called again. So after alignment, and
even after training data accumulation, the path is still filled and can be viewed or used
for creating labels or for any other purpose.
If you are interested in confidence measures, or need local acoustic scores along a path
for any other reason, you can add that information to a Path object by calling
the method that computes local scores (lscore). A puts will then also display the
local scores.
Guided Alignment Paths
Sometimes we have labels available from which we would like to use phoneme or word
boundaries (or other boundaries) to 'guide' the forced alignment, i.e. to let the alignment
find a path whose unit boundaries match those from the labels. We accomplish this
by filling a Viterbi path's senone, phone, or word index entries with the values we want
to see there. If we leave them uninitialized (i.e. all indices are -1), the alignment is
not guided at all. Any number other than -1 is treated
as the index of a speech unit that must be matched by the alignment. If the alignment
uses a different speech unit, it is punished according to three configurable
variables (wordMissPen, phoneMissPen, and senoneMissPen).
This way, we can use different levels of guidance: word-level (force only the word index to
match), phone-level (force only the phone index to match), senone-level (force only the senone index to
match, i.e. the only freedom left to the alignment is the choice among successive states
that are modelled by the same senone), and state-level (i.e. we don't do any alignment, because
we can already use the labels' path for whatever we need). These different levels can
be mixed. We can guide the alignment with a firmer hand on some segments of the utterance,
while giving it more freedom on others.
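The guidance rule can be sketched as a small penalty function. Only the -1 convention and the three variable names (wordMissPen, phoneMissPen, senoneMissPen) come from the text. That the penalties are added up and then added to a cost-like score, the numeric values, and the names Guide, MissPenalties, and guidance_penalty are assumptions made for illustration.

```python
from dataclasses import dataclass

UNGUIDED = -1  # per the text: an index of -1 means "do not guide here"


@dataclass
class Guide:
    """Indices taken from the labels for one frame (hypothetical layout)."""
    word_x: int = UNGUIDED
    phone_x: int = UNGUIDED
    senone_x: int = UNGUIDED


@dataclass
class MissPenalties:
    """Stand-ins for wordMissPen, phoneMissPen, senoneMissPen (values are made up)."""
    word_miss_pen: float = 100.0
    phone_miss_pen: float = 50.0
    senone_miss_pen: float = 10.0


def guidance_penalty(guide: Guide, word_x: int, phone_x: int, senone_x: int,
                     pens: MissPenalties) -> float:
    """Penalty for a candidate (word, phone, senone) assignment at one frame.

    Assumption: each violated constraint contributes its penalty additively.
    """
    penalty = 0.0
    if guide.word_x != UNGUIDED and guide.word_x != word_x:
        penalty += pens.word_miss_pen
    if guide.phone_x != UNGUIDED and guide.phone_x != phone_x:
        penalty += pens.phone_miss_pen
    if guide.senone_x != UNGUIDED and guide.senone_x != senone_x:
        penalty += pens.senone_miss_pen
    return penalty


pens = MissPenalties()
# Word-level guidance only: phone and senone remain free.
word_guide = Guide(word_x=7)
# Mixed guidance: word and phone are constrained, the senone is free.
mixed_guide = Guide(word_x=7, phone_x=2)
```

Mixing guidance levels along an utterance then amounts to using differently filled Guide entries for different frames.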
Frame Indices
When we talk about a path we mean (among other details) the assignment of acoustic units or unit boundaries
to frame indices.
However, there are different frame indices in use. When one module says "frameX", this doesn't necessarily mean the
same as when another module says "frameX". Basically there are four different ways to interpret a frameX
variable:
- file position index
- This is the position of a frame within a feature file. We don't necessarily have exactly
one feature file for each utterance: one
utterance could be split over multiple files, or one file could contain many utterances.
The only module that takes care of this is the feature module. No other module
knows anything about how an utterance is stored.
Be careful: never change the beginning of an utterance's
feature in a feature file. This can cause severe and hard-to-find problems when working
with labels, because no module other than the feature module keeps track of the
storage of features. Should it be necessary to modify the beginning positions of utterances,
make sure to also change the utterance ID, so that later confusion is impossible.
- physical index
- This is the index that is used by the feature
module after the feature has been read from a file or created. It does not necessarily
allow you to compute the position in the feature file where the frame is
located, because an utterance's feature could be taken out of the middle of a file.
In any case, the physical index is stable for a given utterance: physical frame
number n always refers to the same frame.
- feature index
- This is the index of the frame that is returned from or given to the feature module.
An utterance's first frame is always frame number zero. If there is any need to cut
one or more frames off the beginning or end of an utterance, this is done in the feature
module. Other modules (forced alignment, score computation, etc.) always assume that the first
frame's index is zero. This makes it necessary to store the number of cut-off frames in the
path object, so that two paths that were created with different numbers
of cut-off frames can be compared.
- path index
- This is the index that is used in the path object to address its arrays. A path
does not have to be defined over an entire utterance; it may
well be defined over a subsection only. We therefore also have to store from where
to where (in terms of feature indices) the path is defined.
Let's summarize the four points above mathematically as follows:
fileX  | the file position index as described above
physX  | the physical index as described above
featX  | the feature index as described above
pathX  | the path index as described above
beginX | the utterance's first frame in a file
skipN  | the number of ignored leading frames when reading
fromX  | the frame from where the path starts
Then we get:
fileX = physX + beginX
physX = featX + skipN
featX = pathX + fromX
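As a worked example, the three relations can be chained to convert a path index all the way to a file position. The function names below are hypothetical; only the relations themselves and the quantities fromX, skipN, and beginX are taken from the text.

```python
def path_to_feat(path_x: int, from_x: int) -> int:
    """featX = pathX + fromX"""
    return path_x + from_x


def feat_to_phys(feat_x: int, skip_n: int) -> int:
    """physX = featX + skipN"""
    return feat_x + skip_n


def phys_to_file(phys_x: int, begin_x: int) -> int:
    """fileX = physX + beginX"""
    return phys_x + begin_x


def path_to_phys(path_x: int, from_x: int, skip_n: int) -> int:
    """Physical index of a path cell; comparable across paths of one utterance."""
    return feat_to_phys(path_to_feat(path_x, from_x), skip_n)


def path_to_file(path_x: int, from_x: int, skip_n: int, begin_x: int) -> int:
    """Chain all three relations: fileX = pathX + fromX + skipN + beginX."""
    return phys_to_file(path_to_phys(path_x, from_x, skip_n), begin_x)


# Path cell 10 of a path that starts at feature frame 5, read with 3 leading
# frames skipped, from an utterance that begins at frame 200 of its file.
feat_x = path_to_feat(10, from_x=5)                           # 15
phys_x = feat_to_phys(feat_x, skip_n=3)                       # 18
file_x = path_to_file(10, from_x=5, skip_n=3, begin_x=200)    # 218

# Two paths created with different numbers of cut-off frames address the same
# frame exactly when their physical indices agree.
same_frame = path_to_phys(10, from_x=5, skip_n=3) == path_to_phys(18, from_x=0, skip_n=0)
```

This is why the number of cut-off frames and the start of the path are stored with the path: without skipN and fromX, two path indices could not be mapped to a common physical index.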
Further information about the module: