The Alignment Path Module - Overview

This module implements the object classes Path, PathItemList, and PathItem. Paths are created either by a forced alignment procedure or by reading labels from a file. Paths are needed to write labels and to accumulate training data. You may also want to inspect paths to see what is happening, and to use the shape and the features of a path to decide how to continue your work.

In the design of the recognizer we decided to regard forced alignment as a method of the Path object. We could just as well have made it a method of the HMM object, but found the Path to be more suitable.

Viterbi vs. Forward-Backward

There are different kinds of paths, depending on which alignment procedure created them. A Viterbi path is created by the Viterbi algorithm. It has one information cell (PathItem) per aligned frame. This cell contains indices for the HMM state, the senone, the phoneme, and the word that were aligned to this frame. It can also contain a local acoustic score and the best score that was found for this frame. A forward-backward path is created by the forward-backward algorithm. It holds the same information as a Viterbi path, plus three additional entries: alpha, beta, and gamma. Read a tutorial on HMMs if you don't know what they mean. Another difference between Viterbi and forward-backward is the 'thickness' of the path. While Viterbi assigns only one HMM state to a frame, forward-backward can assign any number of states, each accompanied by a gamma value indicating its relative probability among all states of the frame. These state assignments are stored in lists (PathItemList).
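
The difference in 'thickness' can be illustrated with a small sketch. It is written in Python rather than in the recognizer's own code, and the field names (state_x, senone_x, and so on) as well as the helper fill_gammas are hypothetical; only the contents of a cell are taken from the description above. The gamma computation uses the textbook definition (alpha times beta, normalized over the states of one frame), which is assumed rather than confirmed to be what the module does internally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PathItem:
    """One cell of a path. Field names are illustrative, not the module's."""
    state_x: int = -1
    senone_x: int = -1
    phone_x: int = -1
    word_x: int = -1
    alpha: Optional[float] = None  # forward probability (forward-backward only)
    beta: Optional[float] = None   # backward probability (forward-backward only)
    gamma: Optional[float] = None  # relative probability among the frame's states


def fill_gammas(items: List[PathItem]) -> None:
    """Set gamma = alpha * beta, normalized over all items of one frame."""
    total = sum(item.alpha * item.beta for item in items)
    for item in items:
        item.gamma = (item.alpha * item.beta) / total


# Viterbi path: exactly one item per frame.
viterbi_path = [[PathItem(state_x=0)], [PathItem(state_x=1)]]

# Forward-backward path: a frame may hold any number of items.
frame = [
    PathItem(state_x=0, alpha=0.2, beta=0.5),
    PathItem(state_x=1, alpha=0.3, beta=1.0),
]
fill_gammas(frame)
```

In this toy frame the two states receive gamma values of roughly 0.25 and 0.75, which sum to one over the frame.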

The Life of a Path

A Path object is usually created from the Tcl script that is used for training. Unless you are working with labels or doing something very special, there is no need for more than one Path object. You call one of this (initially empty) object's alignment methods (viterbi or fwdBwd), which fills the Path object. 'Filled' means that for every frame index you can find one or more state, senone, phone, and word indices. The filled path is then given to a SenoneSet object for accumulation of training data. There, every cell's acoustic score is recomputed to obtain the needed training information. The data stored in the path and the memory allocated for it are freed every time an alignment method is called. So after alignment, and even after training data accumulation, the path is still filled and can be viewed, used for creating labels, or used for anything else.

If you are interested in confidence measures, or need local acoustic scores along a path for any other reason, you can enrich a Path object with that information by calling the compute-local-scores method (lscore). A subsequent puts will then also display the local scores.

Guided Alignment Paths

Sometimes we have labels available from which we would like to use phoneme or word boundaries (or other boundaries) to 'guide' the forced alignment, i.e. to let the alignment find a path whose unit boundaries match those from the labels. We accomplish this by filling a Viterbi path's senone, phone, or word index entries with the values we want to find there. If we leave them uninitialized (i.e. the indices are all -1), this means that we don't want to guide the alignment at all. Any number different from -1 is treated as the index of a speech unit which must be matched by the alignment. If the alignment wants to use a different speech unit, it is punished according to three configurable variables (wordMissPen, phoneMissPen, and senoneMissPen). This way, we can use different levels of guidance: word-level (force only the word index to match), phone-level (force only the phone index to match), senone-level (force only the senone index to match, i.e. the only freedom left to the alignment is the state sequence within successive states that are modelled by the same senone), and state-level (i.e. we don't do any alignment, because we can already use the labels' path for whatever we need). These levels can be mixed: we can guide the alignment with a firmer hand on one segment of the utterance while giving it more freedom on another.
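
The guidance rule can be sketched as a small penalty function. This is a Python illustration, not the module's code: the three penalty names come from the text above, but their values, the MissPenalties container, the function guidance_penalty, and in particular the assumption that penalties from several mismatching levels are simply added, are all hypothetical.

```python
from dataclasses import dataclass

UNGUIDED = -1  # per the text, an index of -1 means "no guidance at this level"


@dataclass
class MissPenalties:
    # Names follow the configurable variables in the text; values are made up.
    wordMissPen: float = 100.0
    phoneMissPen: float = 50.0
    senoneMissPen: float = 10.0


def guidance_penalty(guide, candidate, pen):
    """Penalty for a candidate (word, phone, senone) triple against a guide triple.

    Assumption: the penalties of all mismatching guided levels are summed.
    """
    guide_word, guide_phone, guide_senone = guide
    cand_word, cand_phone, cand_senone = candidate
    total = 0.0
    if guide_word != UNGUIDED and guide_word != cand_word:
        total += pen.wordMissPen
    if guide_phone != UNGUIDED and guide_phone != cand_phone:
        total += pen.phoneMissPen
    if guide_senone != UNGUIDED and guide_senone != cand_senone:
        total += pen.senoneMissPen
    return total


pen = MissPenalties()
unguided = guidance_penalty((-1, -1, -1), (3, 7, 42), pen)   # no guidance at all
word_only = guidance_penalty((3, -1, -1), (4, 7, 42), pen)   # word-level guidance, word differs
```

Mixing levels along an utterance then amounts to using different guide triples for different frames: fully specified triples where the labels should be followed closely, and -1 entries where the alignment should be free.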

Frame Indices

When we talk about a path we mean (among other details) the assignment of acoustic units or unit boundaries to frame indices. However, there are different frame indices in use. When one module says "frameX", this doesn't necessarily mean the same as when another module says "frameX". Basically there are four different ways to interpret a frameX variable:

file position index: We don't necessarily have exactly one feature file for each utterance; one utterance could be split over multiple files, or one file could contain many utterances. The only module that takes care of this is the feature module. No other module knows anything about how an utterance is stored. Be careful: never change the beginning of an utterance's features in a feature file. This can cause severe and hard-to-find problems when working with labels, because no module other than the feature module cares about the storage of features. Should it be necessary to modify the beginning positions of utterances, make sure to also change the utterance ID, so that later confusion is impossible.
physical index: This is the index that is used by the feature module after the feature has been read from file or created. This does not necessarily mean that you can actually compute the position in the feature file where the frame is located, because an utterance's feature could be taken out of the middle of a file. In any case, the physical index is always the same for a given utterance: physical frame number n will always be the same frame.
feature index: This is the index of the frame that is returned from or given to the feature module. An utterance's first frame is always frame number zero. If there is any need to cut one or more frames off the beginning or end of an utterance, this is done in the feature module. Other modules (forced alignment, score computation, etc.) always assume that the first frame's index is zero. This makes it necessary to store the number of cut-off frames in the path object, to allow comparison between two paths that were created with different numbers of cut-off frames.
path index: This is the index that is used in the path object to address its arrays. Because a path does not have to be defined over an entire utterance but may well be defined over a subsection only, we also have to store from where to where (in terms of feature indices) the path is defined.

Let's summarize the above four points mathematically as follows:


fileX	the file position index as described above
physX	the physical index as described above
featX	the feature index as described above
pathX	the path index as described above
beginX	the utterance's first frame in a file
skipN	the number of ignored leading frames when reading
fromX	the frame from where the path starts

Then we get:


fileX = physX + beginX
physX = featX + skipN
featX = pathX + fromX
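
The three relations above can be written as a few conversion helpers. This is a minimal Python sketch: the variable names are those of the table above, while the FrameOffsets container and the function names are hypothetical. It also shows why the number of cut-off frames has to be stored with a path: two path indices from different paths refer to the same frame only if their physical indices agree.

```python
from dataclasses import dataclass


@dataclass
class FrameOffsets:
    beginX: int  # the utterance's first frame in a file
    skipN: int   # number of ignored leading frames when reading
    fromX: int   # the frame (feature index) from where the path starts


def path_to_feat(pathX: int, off: FrameOffsets) -> int:
    return pathX + off.fromX  # featX = pathX + fromX


def feat_to_phys(featX: int, off: FrameOffsets) -> int:
    return featX + off.skipN  # physX = featX + skipN


def phys_to_file(physX: int, off: FrameOffsets) -> int:
    return physX + off.beginX  # fileX = physX + beginX


def path_to_phys(pathX: int, off: FrameOffsets) -> int:
    return feat_to_phys(path_to_feat(pathX, off), off)


def same_frame(pathX_a: int, off_a: FrameOffsets,
               pathX_b: int, off_b: FrameOffsets) -> bool:
    """Two path cells refer to the same frame if their physical indices match."""
    return path_to_phys(pathX_a, off_a) == path_to_phys(pathX_b, off_b)


# Two paths over the same utterance, created with different numbers of
# cut-off frames and different starting points (values are made up).
path_a = FrameOffsets(beginX=1000, skipN=0, fromX=20)
path_b = FrameOffsets(beginX=1000, skipN=5, fromX=15)

file_pos = phys_to_file(path_to_phys(3, path_a), path_a)  # 3 + 20 + 0 + 1000
```

With these example values, path index 3 refers to the same physical frame (23) in both paths, even though the feature indices differ (23 and 18).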

Further information about the module: