Sphinx-3 s3.X Decoder (X=6)

Mosur K. Ravishankar (aka Ravi Mosur)
Sphinx Speech Group
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Introduction

Sphinx-3 is the successor to the Sphinx-II speech recognition system from Carnegie Mellon University. It includes both an acoustic trainer and various decoders, i.e., text recognition, phoneme recognition, N-best list generation, etc. In this document, "Sphinx-3" refers to any version of the Sphinx-3 decoder, and "s3.X" refers to the version available in this distribution. Notice that s3.X is in fact a branch from Sphinx-3, not a more recent release.

The s3.X decoder is a recent implementation for speech-to-text recognition, its main goal being speed improvements over the original Sphinx-3 decoder. It runs about 10 times faster than the latter on large vocabulary tasks. The following is a brief summary of its main features and limitations:

5-10x real-time recognition time on large vocabulary tasks
Limited to fully continuous acoustic models
Limited to 3 or 5-state left-to-right HMM topologies
Bigram or trigram language model
Batch-mode or live operation from pre-recorded speech

This package contains the following programs:

s3decode: The Sphinx-3 s3.2/s3.3/s3.X decoder processing cesptra files
s3gausubvq: Sub-vector clustered acoustic model building
s3livedecode: The Sphinx-3 s3.X decoder in live mode
s3livepretend: The Sphinx-3 s3.X decoder in batch mode
s3align: The Sphinx-3 aligner
s3allphone: The Sphinx-3 phoneme recognizer
s3astar: The Sphinx-3 N-best generator
s3dag: The Sphinx-3 application for best-path searching

This distribution has been prepared for Unix platforms. Port to MS Windows (MS Visual C++ 6.0 workspace and project files) has been provided.

This document is a brief user's manual for the above programs. It is not meant to be a detailed description of the decoding algorithm, or an in-depth tutorial on speech recognition technology. However, a set of Microsoft PowerPoint slides are available that give additional information about the decoder. Even though the slides refer to s3.2, keep in mind that the basic search structure remains te same in s3.X (where x=3,4 and 5)

The initial part of this document provides an overview of the decoder. It is followed by descriptions of the main input and output databases; i.e., the lexicon, language model, acoustic model, etc.

¤ Back to top of this section

Overview of the s3.X Decoder

The s3.X decoder is based on the conventional Viterbi search algorithm and beam search heuristics. It uses a lexical-tree search structure somewhat like the Sphinx-II decoder, but with some improvements for greater accuracy than the latter. It takes its input from pre-recorded speech in raw PCM format and writes its recognition results to output files.

Inputs

We first give a brief outline of the input and output characteristics of the decoder. More detailed information is available in later sections. The decoder needs the following inputs:

Lexical model: The lexical or pronunciation model contains pronunciations for all the words of interest to the decoder. Like most modern speech recognition systems, Sphinx-3 uses phonetic units to build word pronunciations. Currently, the pronunciation lexicon is almost entirely hand-crafted.

Acoustic model: Sphinx uses acoustic models based on statistical hidden Markov models (HMMs). The acoustic model is trained from acoustic training data using the Sphinx-3 trainer. The trainer is capable of building acoustic models with a wide range of structures, such as discrete, semi-continuous, or continuous. However, the s3.X decoder is only capable of handling continuous acoustic models.

Language model (LM): Sphinx-3 uses a conventional backoff bigram or trigram language model.

Speech input specification: This distribution contains four executable files, three of which perform recognition. s3livedecode decodes live speech, that is, speech incoming from your audio card. s3livepretend decodes in batch mode using a control file that describes the input to be decoded into text. s3decode decodes also uses a control file for batch mode processing. In the latter, the entire input to be processed must be available beforehand, i.e., the raw audio samples must have been preprocessed into cepstrum files. Also note that the decoder cannot handle arbitrary lengths of speech input. Each separate piece (or utterance) to be processed by the decoder must be no more than 300 sec. long. Typically, one uses a segmenter to chop up a cepstrum stream into manageable segments of up to 20 or 30 sec. duration.

Outputs

The decoder can produce two types of recognition output:

Recognition hypothesis: A single best recognition result (or hypothesis) for each utterance processed. It is a linear word sequence, with additional attributes such as their time segmentation and scores.

Word lattice: A word-graph of all possible candidate words recognized during the decoding of an utterance, including other attributes such as their time segmentation and acoustic likelihood scores.

In addition, the decoder also produces a detailed log to stdout/stderr that can be useful in debugging, gathering statistics, etc.

¤ Back to top of this section

Compiling s3.X

The current distribution has been set up for Unix platforms. The following steps are needed to compile the decoder:

./configure [--prefix=/my/install/directory]: the argument is optional. If not given, it will install s3.X under /usr/local, provided you have the proper permissions. This step is only necessary the first time you compile s3.X.
make clean: This should remove any old object files.
make: This compiles the libraries and example programs.
make install: This will install s3.X in the directory that you specified when you ran configure and also the provided models and documentation.

Note that the Makefiles are not foolproof; they do not eliminate the need for sometimes manually determining dependencies, especially upon updates to header files. When in doubt, first clean out the compilation directories entirely by running make distclean and start over.

¤ Back to top of this section

Running s3.X

Running the decoder is simply a matter of invoking the binary (i.e., s3decode, s3livedecode or s3livepretend), with a number of command line arguments specifying the various input files, as well as decoding configuration parameters. s3decode and s3livepretend require a control file, the directory where the audio files are available, and a file containing the configuration arguments. s3livedecode, which runs live, requires only the file with the arguments.

Invoking the binary without any argument produces a help message with short descriptions of all the configuration arguments.

Configuration Arguments Overview

This section gives a brief overview of the main configuration arguments. They are broken down into separate groups, based on whether they are the primary flags specifying input and output data, arguments for optional configuration, or for performance tuning.

Note that not all the available configuration arguments are covered below. There are a few additional and undocumented flags, intended mainly for debugging purposes.

Primary Flags

Many of the flags have reasonable defaults. The ones that a user minimally needs to provide are the input and output databases or files, which have been discussed above:

`-mdef`	Model definition input file
`-mean` `-var` `-mixw` `-tmat` `-subvq`	Acoustic model files
`-dict` `-fdict`	Main and filler lexicons
`-lm`	Language model binary dump file
`-fillpen` `-fillprob` `-silprob`	Filler word probabilities
`-hypseg`	Output hypotheses file

Additional Configuration Flags

It may often be necessary to provide additional parameters to obtain the right decoder configuration:

`-cmn` `-agc` `-varnorm` `-lowerf` `-upperf` `-nfilt` `-samprate`	Feature type configuration
`-cepdir`	Directory prefix for cepstrum files specified in the control file, ignored in `s3livedecode` and `s3livepretend`
`-ctloffset` `-ctlcount`	Selecting a portion of the control file to be processed
`-outlatdir` `-latext`	Directory, file-extension for word lattices output

Performance Tuning Flags

In yet other cases, it may be necessary to tune the following parameters to obtain the optimal computational efficiency or recognition accuracy:

`-beam` `-pbeam` `-wbeam` `-subvqbeam`	Beam pruning parameters
`-maxwpf` `-maxhistpf` `-maxhmmpf`	Absolute pruning parameters
`-ci_pbeam` `-ds`	Fast GMM Computation parameters
`-lw` `-wip`	Language weight, word insertion penalty
`-Nlextree`	Number of lexical tree instances

Decoder Operation

This section is a bit of a mish-mash; its contents probably belong in an FAQ section. But, hopefully, through this section a newcomer to Sphinx can get an idea of the structure, capabilities, and limitations of the s3.X decoder.

Initialization

The decoder is configured during the initialization step, and the configuration holds for the entire run. This means, for example, that the decoder does not dynamically reconfigure the acoustic models to adapt to the input. To choose another example, there is no mechanism in this decoder to switch language models from utterance to utterance, unlike in Sphinx-II. The main initialization steps are outlined below.

Log-Base Initialization. Sphinx performs all likelihood computations in the log-domain. Furthermore, for computational efficiency, the base of the logarithm is chosen such that the likelihoods can be maintained as 32-bit integer values. Thus, all the scores reported by the decoder are log-likelihood values in this peculiar log-base. The default base is typically 1.0003, and can be changed using the -logbase configuration argument. The main reason for modifying the log-base would be to control the length (duration) of an input utterance before the accumulated log-likelihood values overflow the 32-bit representation, causing the decoder to fail catastrophically. The log-base can be changed over a wide range without affecting the recognition.

Models Initialization. The lexical, acoustic, and language models specified via the configuration arguments are loaded during initialization. This set of models is used to decode all the utterances in the input. (The language model is actually only partly loaded, since s3.X uses a disk-based LM strategy.)

Effective Vocabulary. After the models are loaded, the effective vocabulary is determined. It is the set of words that the decoder is capable of recognizing. Recall that the decoder is initialized with three sources of words: the main and filler lexicon files, and the language model. The effective vocabulary is determined from them as follows:

Find the intersection of the words in the LM and the main pronunciation lexicon
Include all the alternative pronunciations to the set derived above (using the main lexicon)
Include all the filler words from the filler lexicon, but excluding the distinguished beginning and end of sentence words: <s> and </s>.

The effective vocabulary remains in effect throughout the batch run. It is not possible to add to or remove from this vocabulary dynamically, unlike in the Sphinx-II system.

Lexical Tree Construction. The decoder constructs lexical trees from the effective vocabulary described above. Separate trees are constructed for words in the main and filler lexicons. Furthermore, several copies may be instantiated for the two, depending on the -Nlextree configuration argument. Further details of the lexical tree construction are available on the PowerPoint slides.

Control File Processing

Following initialization, s3decode and s3livepretend processes the entries in the control file sequentially, one at a time. It is possible to process a contiguous subset of the control file, using the -ctloffset and -ctlcount flags, as mentioned earlier. There is no learning or adaptation capability as decoding progresses. Since s3livepretend behaves as if the files were being spoken at the time of processing, rearranging the order of the entries in the control file may affect the individual results, but this change may be imperceptible if the environment in which the files were recorded remains constant. The order of entries in the control file does not affect s3decode.

Pruning

Each entry in the control file, or utterance, is processed using the given input models, and using the Viterbi search algorithm. In order to constrain the active search space to computationally manageable limits, pruning is employed, which means that the less promising hypotheses are continually discarded during the recognition process. There are two kinds of pruning in s3.X, beam pruning and absolute pruning.

Beam Pruning. Each utterance is processed in a time-synchronous manner, one frame at a time. At each frame the decoder has a number of currently active HMMs to match with the next frame of input speech. But it first discards or deactivates those whose state likelihoods are below some threshold, relative to the best HMM state likelihood at that time. The threshold value is obtained by multiplying the best state likelihood by a fixed beamwidth. The beamwidth is a value between 0 and 1, the former permitting all HMMs to survive, and the latter permitting only the best scoring HMMs to survive.

Similar beam pruning is also used in a number of other situations in the decoder, e.g., to determine the candidate words recognized at any time, or to determine the component densities in a mixture Gaussian that are closest to a given speech feature vector. The various beamwidths have to be determined empirically and are set using configuration arguments.

Absolute Pruning. Even with beam pruning, the number of active entities can sometimes become computationally overwhelming. If there are a large number of HMMs that fall within the pruning threshold, the decoder will keep all of them active. However, when the number of active HMMs grows beyond certain limits, the chances of detecting the correct word among the many candidates are considerably reduced. Such situations can occur, for example, if the input speech is noisy or quite mismatched to the acoustic models. In such cases, there is no point in allowing the active search space to grow to arbitrary extents. It can be contained using pruning parameters that limit the absolute number of active entities at any instant. These parameters are also determined empirically, and set using configuration arguments.

(Added in s3.4) Fast GMM Computation

The computation of likelihood Gaussian distribution can be one of the dominating factor of the GMM computation. Tuning the following parameters can control the amount of time required.

-ci_pbeam: Enable a two-pass computation where CI models were computed first and the CD models were then computed. If this beam is used, only CD models, which correspond to CI models within the beam (relative to the max CI scores), are computed
-ds : Enable frame down-sampling. Only 1 another N frames were computed.

Output Generation

During recognition, the decoder builds an internal backpointer table data structure, from which the final outputs are generated. This table records all the candidate words recognized during decoding, and their attributes such as their time segmentation, acoustic and LM likelihoods, as well as their predecessor entries in the table. When an utterance has been fully processed, the best recognition hypothesis is extracted from this table. Optionally, the table is also converted into a word-lattice and written out to a file.

More information on the backpointer table is available in the PowerPoint slides.

Miscellaneous Issues

Role of <s> and </s>. The distinguished beginning-of-sentence and end-of-sentence tokens <s> and </s> are not in the effective vocabulary, and no part of the input speech is decoded into either of them. They are merely anchors at the ends of each utterance, and provide context for the LM. This is in contrast to earlier versions of Sphinx, which required some silence at either end of each speech utterance, to be decoded into these tokens.

Performance Tuning

To obtain the best recognition performance, it is necessary to select the appropriate front-end and feature type computation, train the various models, as well as tune the decoder configuration parameters. This section deals with the last issue. There are mainly two groups of parameters to be tuned, pertaining to pruning and LM. Unfortunately, there are no automatic methods for determining the values of these parameters; it is necessary to derive them by trial and error. Additionally, the following points should be kept in mind with regard to the pruning parameters:

The pruning parameters need to be tuned whenever the acoustic model is changed.

Changing the LM related parameters affects the effective pruning behaviour. The pruning parameters ought to be re-tuned after the former have been tuned (although this step is often skipped in practice).

For computational efficiency, the beamwidth parameter values should be as narrow as possible (values closer to 1.0 are narrower), and the absolute pruning parameter values should be as small as possible.

But, for recognition accuracy, the pruning parameters should be as relaxed as possible. (However, relaxing the beamwidth parameters too much can actually worsen recognition accuracy. The reasons for such perverse behaviour are not quite understood.)

Tuning the Pruning Behaviour

The pruning parameters are the following:

-beam: Determines which HMMs remain active at any given point (frame) during recognition. (Based on the best state score within each HMM.)
-pbeam: Determines which active HMM can transition to its successor in the lexical tree at any point. (Based on the exit state score of the source HMM.)
-wbeam: Determines which words are recognized at any frame during decoding. (Based on the exit state scores of leaf HMMs in the lexical trees.)
-maxhmmpf: Determines the number of HMMs (approx.) that can remain active at any frame.
-maxwpf: Controls the number of distinct words recognized at any given frame.
-maxhistpf: Controls the number of distinct word histories recorded in the backpointer table at any given frame.
-subvqbeam: For each senone and its underlying acoustic model, determines its active mixture components at any frame.

In order to determine the pruning parameter values empirically, it is first necessary to obtain a test set, i.e., a collection of test sentences not used in any training data. The test set should be sufficiently large to ensure statistically reliable results. For example, a large-vocabulary task might require a test set that includes a half-hour of speech, or more.

It is difficult to tune a handful of parameters simultaneously, especially when the input models are completely new. The following steps may be followed to deal with this complex problem.

To begin with, set the absolute pruning parameters to large values, making them essentially ineffective. Set both -beam and -pbeam to 1e-60, and -wbeam to 1e-30. Set -subvqbeam to a small value (e.g., the same as -beam). Run the decoder on the chosen test set and obtain accuracy results. (Use default values for the LM related parameters when tuning the pruning parameters for the first time.)

Repeat the decoder runs, varying -beam up and down, until the setting for best accuracy is identified. (Keep -pbeam the same as -beam every time.)

Now vary -wbeam up and down and identify its best possible setting (keeping -beam and -pbeam fixed at their most recently obtained value).

Repeat the above two steps, alternately optimizing -beam and -wbeam, until convergence. Note that during these iterations -pbeam should always be the same as -beam. (This step can be omitted if the accuracy attained after the first iteration is acceptable.)

Gradually increase -subvqbeam (i.e., towards 1.0 for a narrower setting), stopping when recognition accuracy begins to drop noticeably. Values near the default are reasonable. (This step is needed only if a sub-vector quantized model is available for speeding up acoustic model evaluation.)

Now gradually increase -pbeam (i.e., towards 1.0), stopping when recognition accuracy begins to drop noticeably. (This step is optional; it mainly optimizes the computational effort a little more.)

Reduce -maxhmmpf gradually until accuracy begins to be affected. Repeat the process with -maxwpf, and then with -maxhistpf. (However, in some situations, especially when the vocabulary size is small, it may not be necessary to tune these absolute pruning parameters.)

In practice, it may not always be possible to follow the above steps strictly. For example, considerations of computational cost might dictate that the absolute pruning parameters or the -subvqbeam parameter be tuned earlier in the sequence.

Tuning Language Model Related Parameters

The parameters needed to be tuned are the following:

-lw: The language weight.
-wip: The word insertion penalty.

Like the pruning parameters, the above two are tuned on a test set. Since the decoder is much more sensitive to the language weight, that is typically tuned first, using the default word insertion penalty. The latter is then tuned. It is usually not necessary to repeat the process.

Some Common Errors and Failure Modes

To be completed.

¤ Back to top of this section

Pronunciation Lexicon

Lexicon Structure
- Multiple Pronunciations
- Compound Words
Main and Filler Lexicons

Lexicon Structure

A pronunciation lexicon (or dictionary) file specifies word pronunciations. In Sphinx, pronunciations are specified as a linear sequence of phonemes. Each line in the file contains one pronunciation specification, except that any line that begins with a "#" character in the first column is treated as a comment and is ignored. Example dictionary for digits:

ZERO               Z IH R OW
ONE                W AH N
TWO                T UW
THREE              TH R IY
FOUR               F AO R
FIVE               F AY V
SIX                S IH K S
SEVEN              S EH V AX N
EIGHT              EY TD
NINE               N AY N

The lexicon is completely case-insensitive (unfortunately). For example, it's not possible to have two different entries Brown and brown in the dictionary.

Multiple Pronunciations

A word may have more than one pronunciation, each one on a separate line. They are distinguished by a unique parenthesized suffix for the word string. For example:

ACTUALLY          AE K CH AX W AX L IY
ACTUALLY(2nd)       AE K SH AX L IY
ACTUALLY(3rd)       AE K SH L IY

If a word has more than one pronunciation, its first appearance must be the unparenthesized form. For the rest, the parenthesized suffix may be any string, as long as it is unique for that word. There is no other significance to the order of the alternatives; each one is considered to be equally likely.

Compound Words

In Sphinx-3, the lexicon may also contain compound words. A compound word is usually a short phrase whose pronunciation happens to differ significantly from the mere concatenation of the pronunciations of its constituent words. Compound word tokens are formed by concatenating the component word strings with an underscore character; e.g.:

WANT_TO           W AA N AX

(The s3.X decoder, however, treats a compound word as just another word in the language, and does not do anything special with it.)

Main and Filler Lexicons

The Sphinx-3 decoders actually need two separate lexicons: a "regular" one containing the words in the language of interest, and also a filler or noise lexicon. The latter defines "words" not in the language. More specifically, it defines legal "words" that do not appear in the language model used by the decoder, but are nevertheless encountered in normal speech. This lexicon must include the silence word <sil>, as well as the special beginning-of-sentence and end-of-sentence tokens <s>, and </s>, respectively. All of them usually have the silence-phone SIL as their pronunciation. In addition, this lexicon may also contain "pronunciations" for other noise event words such as breath noise, "UM" and "UH" sounds made during spontaneous speech, etc.

¤ Back to top of this section

Introduction

Sphinx-3 is based on subphonetic acoustic models. First, the basic sounds in the language are classified into phonemes or phones. There are roughly 50 phones in the English language. For example, here is a pronunciation for the word LANDSAT:

L  AE  N  D  S  AE  TD

Phones are then further refined into context-dependent triphones, i.e., phones occurring in given left and right phonetic contexts. The reason is that the same phone within different contexts can have widely different acoustic manifestations, requiring separate acoustic models. For example, the two occurrences of the AE phone above have different contexts, only the first of which is nasal.

In contrast to triphones, a phone considered without any specific context is referred to as a context-independent phone or basephone. Note also that context-dependency gives rise to the notion of cross-word triphones. That is, the left context for the leftmost basephone of a word depends on what was the previous word spoken.

Phones are also distinguished according to their position within the word: beginning, end, internal, or single (abbreviated b, e, i and s, respectively). For example, in the word MINIMUM with the following pronunciation:

M IH N AX M AX M

the three occurrences of the phone M have three different position attributes. The s attribute applies if a word has just a single phone as its pronunciation.

For most applications, one builds acoustic models for triphones, qualified by the four position attributes. (This provides far greater modelling detail and accuracy than if one relies on just basephone models.) Each triphone is modelled by a hidden Markov model or HMM. Typically, 3 or 5 state HMMs are used, where each state has a statistical model for its underlying acoustics. But if we have 50 basephones, with 4 position qualifiers and 3-state HMMs, we end up with a total of 50³*4*3 distinct HMM states! Such a model set would be too large and impractical to train. To keep things manageable, HMM states are clustered into a much smaller number of groups. Each such group is called a senone (in Sphinx terminology), and all the states mapped into one senone share the same underlying statistical model. (The clustering of HMM states into senones is described in Mei-Yuh Hwang's PhD Thesis.)

Each triphone also has a state transition probability matrix that defines the topology of its HMM. Once again, to conserve resources, there is a considerable amount of sharing. Typically, there is one such matrix per basephone, and all triphones derived from the same parent basephone share its state transition matrix.

The information regarding triphones and mapping from triphone states to senones and transition matrices is captured in a model definition, or mdef input file.

Acoustic Features Computation

For various reasons, it is undesirable to build acoustic models directly in terms of the raw audio samples. Instead, the audio is processed to extract a vector of relevant features. All acoustic modelling is carried out in terms of such feature vectors.

In Sphinx, feature vector computation is a two-stage process. An off-line front-end module is first responsible for processing the raw audio sample stream into a cepstral stream, which can then be input to the Sphinx software. The input audio stream consists of 16-bit samples, at a sampling rate of 8 or 16 KHz depending on whether the input is narrow or wide-band speech. The input is windowed, resulting in frames of duration 25.625 ms. The number of samples in a frame depends on the sampling rate. The output is a stream of 13-dimensional real-valued cepstrum vectors. The frames overlap, thus resulting in a rate of 100 vectors/sec.

In the second stage, the Sphinx software (both trainer and decoder) internally converts the stream of cepstrum vectors into a feature stream. This process consists of the following steps:

An optional cepstrum mean-normalization (CMN) step, which itself includes an optional variance normalization (VN) step.

An optional automatic gain control (AGC) step, in which the signal power component of the cepstral vectors is normalized.

Feature vector generation. The final speech feature vector is created by typically augmenting the cepstrum vector (after CMN and AGC, if any) with one or more time derivatives. In s3.X, the feature vector in each frame is computed by concatenating first and second derivatives to the cepstrum vector, giving a 39-dimensional vector:

Acoustic Model Training

This refers to the computation of a (statistical) model for each senone in the model. As a very rough approximation, this process can be described by the following conceptual steps:

Obtain a corpus of training data. This may include thousands of sentences (or utterances, in Sphinx jargon), consisting of the spoken text and corresponding audio sample stream.

For each utterance, convert the audio data to a stream of feature vectors as described above.

For each utterance, convert the text into a linear sequence of triphone HMMs using the pronunciation lexicon. (This is usually called the sentence HMM.)

For each utterance, find the best state sequence or state alignment through the sentence HMM, for the corresponding feature vector sequence. For example, the figure below shows a single HMM with 3 states (using senones 0, 1, 2), and an utterance of 14-frames of feature vectors. The figure also shows a sample HMM-state (senone) sequence: each feature frame is labelled with a senone ID.

The best state sequence is one with the smallest mismatch between the input feature vectors and the labelled senones' underlying statistical models.

For each senone, gather all the frames in the training corpus that mapped to that senone in the above step, and build a suitable statistical model for the corresponding collection of feature vectors.

Note that there is a circularity in the above description. We wish to train the senone models, but in the penultimate step, we need the senone models to compute the best possible state alignment. This circularity is resolved by using the iterative Baum-Welch or forward-backward training algorithm. The algorithm begins with some initial set of models, which could be completely flat, for the senones. It then repeats the last two steps several times. Each iteration uses the model computed at the end of the previous iteration.

Although not mentioned above, the HMM state-transition probability matrices are also trained from the state alignments. Acoustic modelling is described in greater detail in the Sphinx-3 trainer module.

Model Structures

The acoustic models trained as described above can be of different degrees of sophistication. Two forms are commonly used:

continuous, and
semi-continuous or tied-mixture.

In a continuous model, each senone has its own, private mixture-Gaussian distribution that describes the statistics of its underlying speech feature space. In a semi-continuous model, all the senones share a single codebook of Gaussian distributions, but each senone has its own set of mixture weights applied to the codebook components. Sphinx-3 supports both models, and other, intermediate degrees of state-tying as well. (The s3.X decoder, however, can only handle continuous density acoustic models.)

Similarly, Sphinx-3 in general supports "arbitrary" HMM topologies, unlike Sphinx-II, which is restricted to a specific 5-state topology. However, for efficiency's sake, the s3.X decoder is hardwired to deal with only two types of HMM topologies: 3-state and 5-state, described briefly in hmm.h.

Sub-Vector Quantized Models

Continuous density acoustic models are computationally expensive to deal with, since they can contain hundreds of thousands of Gaussian densities that must be evaluated in each frame. To reduce this cost, one can use an approximate model that efficiently identifies the top scoring candidate densities in each Gaussian mixture in any given frame. The remaining densities can be ignored during that frame.

In Sphinx-3, such an approximate model is built by sub-vector quantizing the acoustic model densities. The utility that performs this conversion is included in this distribution and is called gausubvq, which stands for Gaussian Sub-Vector Quantization.

Note that if the original model consists of mixture Gaussians that only contain a few component densities (say, 4 or fewer per mixture), a sub-vector quantized model may not be effective in reducing the computational load.

Model Files

An acoustic model is represented by the following collection of files:

A model definition (or mdef) file. It defines the set of basephone and triphone HMMs, the mapping of each HMM state to a senone, and the mapping of each HMM to a state transition matrix.
Gaussian mean and variance (or mean and var) files. These files contain all the Gaussian codebooks in the model. The Gaussian means and corresponding variance vectors are separated into the two files.
A mixture weights (or mixw) file containing the Gaussian mixture weights for all the senones in the model.
A state transition matrix (or tmat) file containing all the HMM state transition topologies and their transition probabilities in the model.
An optional sub-vector quantized model (or subvq) file containing an approximation of the acoustic model, for efficient evaluation.

The mean, var, mixw, and tmat files are produced by the Sphinx-3 trainer, and their file formats should be documented there.

¤ Back to top of this section

Language Model

Introduction
Unigrams, Bigrams, Trigrams, LM Vocabulary
Pronunciation and Case Considerations
Binary LM File
Silence and Filler Words
Language Weight and Insertion Penalty

Introduction

The main language model (LM) used by the Sphinx decoder is a conventional bigram or trigram backoff language model. The CMU-Cambridge SLM toolkit is capable of generating such a model from LM training data. Its output is an ascii text file. But a large text LM file can be very slow to load into memory. To speed up this process, the LM must be compiled into a binary form. The code to convert from an ascii text file to the binary format is available at SourceForge in the CVS tree, in a module named share.

Unigrams, Bigrams, Trigrams, LM Vocabulary

A trigram LM primarily consists of the following:

Unigrams: The entire set of words in this LM, and their individual probabilities of occurrence in the language. The unigrams must include the special beginning-of-sentence and end-of-sentence tokens: <s>, and </s> respectively.

Bigrams: A bigram is mathematically P(word2 | word1). That is, the conditional probability that word2 immediately follows word1 in the language. An LM typically contains this information for some subset of the possible word pairs. That is, not all possible word1 word2 pairs need be covered by the bigrams.

Trigrams: Similar to a bigram, a trigram is P(word3 | word1, word2), or the conditional probability that word3 immediately follows a word1 word2 sequence in the language. Not all possible 3-word combinations need be covered by the trigrams.

The vocabulary of the LM is the set of words covered by the unigrams.

The LM probability of an entire sentence is the product of the individual word probabilities. For example, the LM probability of the sentence "HOW ARE YOU" is:

P(HOW | <s>) * 
P(ARE | <s>, HOW) * 
P(YOU | HOW, ARE) * 
P(</s> | ARE, YOU)

Pronunciation and Case Considerations

In Sphinx, the LM cannot distinguish between different pronunciations of the same word. For example, even though the lexicon might contain two different pronunciation entries for the word READ (present and past tense forms), the language model cannot distinguish between the two. Both pronunciations would inherit the same probability from the language model.

Secondly, the LM is case-insensitive. For example, it cannot contain two different tokens READ and read.

The reasons for the above restrictions are historical. Precise pronunciation and case information has rarely been present in LM training data. It would certainly be desirable to do away with the restrictions at some time in the future.

Binary LM File

The binary LM file (also referred to as the LM dump file) is more or less a disk image of the LM data structure constructed in memory. This data structure was originally designed during the Sphinx-II days, when efficient memory usage was the focus. In Sphinx-3, however, memory usage is no longer an issue since the binary file enables the decoder to use a disk-based LM strategy. That is, the LM binary file is no longer read entirely into memory. Rather, the portions required during decoding are read in on demand, and cached. For large vocabulary recognition, the memory resident portion is typically about 10-20% of the bigrams, and 5-10% of the trigrams.

Since the decoder uses a disk-based LM, it is necessary to have efficient access to the binary LM file. Thus, network access to an LM file at a remote location is not recommended. It is desirable to have the LM file be resident on the local machine.

The binary dump file can be created from the ascii form using the lm3g2dmp utility, which is part of the Sphinx-II distribution, and also available as standalone code, as mentioned before. (The header of the dump file itself contains a brief description of the file format.)

Silence and Filler Words

Language models typically do not cover acoustically significant events such as silence, breath-noise, UM or UH sounds made by a person hunting for the right phrase, etc. These are known generally as filler words, and are excluded from the LM vocabulary. The reason is that a language model training corpus, which is simply a lot of text, usually does not include such information.

Since the main trigram LM ignores silence and filler words, their "language model probability" has to be specified in a separate file, called the filler penalty file. The format of this file is very straightforward; each line contains one word and its probability, as in the following example:

++UH++      0.10792
++UM++      0.00866
++BREATH++  0.00147

The filler penalty file is not required. If it is present, it does not have to contain entries for every filler word. The decoder allows a default value to be specified for filler word probabilities (through the -fillprob configuration argument), and a default silence word probability (through the -silprob argument).

Like the main trigram LM, filler and silence word probabilities are obtained from appropriate training data. However, training them is considerably easier since they are merely unigram probabilities.

Filler words are invisible or transparent to the trigram language model. For example, the LM probability of the sentence "HAVE CAR <sil> WILL TRAVEL" is:

P(HAVE | <s>) * 
P(CAR | <s>, HAVE) * 
P(<sil>) *
P(WILL | HAVE, CAR) * 
P(TRAVEL | CAR, WILL) * 
P(</s> | WILL, TRAVEL)

Language Weight and Word Insertion Penalty

During recognition the decoder combines both acoustic likelihoods and language model probabilities into a single score in order to compare various hypotheses. This combination of the two is not just a straightforward product. In order to obtain optimal recognition accuracy, it is usually necessary to exponentiate the language model probability using a language weight before combining the result with the acoustic likelihood. (Since likelihood computations are actually carried out in the log-domain in the Sphinx decoder, the LM weight becomes a multiplicative factor applied to LM log-probabilities.)

The language weight parameter is typically obtained through trial and error. In the case of Sphinx, the optimum value for this parameter has usually ranged between 6 and 13, depending on the task at hand.

Similarly, though with lesser impact, it has also been found useful to include a word insertion penalty parameter which is a fixed penalty for each new word hypothesized by the decoder. It is effectively another multiplicative factor in the language model probability computation (before the application of the language weight). This parameter has usually ranged between 0.2 and 0.7, depending on the task.

¤ Back to top of this section

Speech Input Control File

The Sphinx-3 decoder processes entries listed in a control file. Each line in the control file identifies a separate utterance. A line has the following format (the brackets indicate a group of fields that is optional):

AudioFile [ StartFrame EndFrame UttID ]

AudioFile is the speech input file. In this distribution of s3.X, this file is in raw audio format. In all other versions of Sphinx-3, this file contains cepstrum data. The filename extension should be omitted from the specification. If this is the only field in the line, the entire file is processed as one utterance. In this case, an utterance ID string is automatically derived from the cepstrum filename, by stripping any leading directory name components from it. E.g.: if the control file contains the following entries:

/net/alf20/usr/rkm/SHARED/cep/nov94/h1_et_94/4t0/4t0c0201
/net/alf20/usr/rkm/SHARED/cep/nov94/h1_et_94/4t0/4t0c0202
/net/alf20/usr/rkm/SHARED/cep/nov94/h1_et_94/4t0/4t0c0203

three utterances are processed, with IDs 4t0c0201, 4t0c0202, and 4t0c0203, respectively.

If, on the other hand, a control file entry includes the StartFrame and EndFrame fields, only that portion of the cepstrum file is processed. This form of the control file is frequently used if the speech input can be arbitrarily long, such as an entire TV news show. There is one big cepstrum file, but it is processed in smaller chunks or segments. In this case, the final UttID field is the utterance ID string for the entry.

The utterance ID associated with a control file entry is used to identify all the output from the decoder for that utterance. For example, if the decoder is used to generate word lattice files, they are named using the utterance ID. Hence, each ID, whether automatically derived or explicitly specified, should be unique over the entire control file.

Any line in the control file beginning with a # character is a comment line, and is ignored.

¤ Back to top of this section

Recognition Hypothesis Output

The Sphinx-3 decoder produces a single recognition hypothesis for each utterance it processes. The hypotheses for all the utterances processed in a single run are written to a single output file, one line per utterance. The line format is as follows:

u S s T t A a L l sf wa wl wd sf wa sl wd ... nf

The S, T, A, and L fields are keywords and appear in the output as shown. The remaining fields are briefly described below:

u: the utterance ID
s: an acoustic score scaling done during acoustic likelihood computation (However, this field is 0 in the s3.X decoder output.)
t: the total score for this hypothesis
a: the total acoustic score for this hypothesis
l: the total language model score for this hypothesis

The l score field is followed by groups of four fields, one group for each successive word in the output hypothesis. The four fields are:

sf: Start frame for the word (its end frame is just before the start frame of the next word)
wa: Acoustic score for the word
wl: LM score for the word
wd: The word string itself.

The final field, nf, in each hypothesis line is the total number of frames in the utterance.

Note that all scores are log-likelihood values in the peculiar logbase used by the decoder. Secondly, the acoustic scores are scaled values; in each frame, the acoustic scores of all active senones are scaled such that the best senone has a log-likelihood of 0. Finally, the language model scores reported include the language weight and word-insertion penalty parameters.

Here is an example hypothesis file for three utterances.

¤ Back to top of this section

Word Lattice Output

Word Lattice Overview
Word Lattice File Format

Word Lattice Overview

During recognition the decoder maintains not just the single best hypothesis, but also a number of alternatives or candidates. For example, REED is a perfectly reasonable alternative to READ. The alternatives are useful in many ways: for instance, in N-best list generation. To facilitate such post-processing, the decoder can optionally produce a word lattice output for each input utterance. This output records all the candidate words recognized by the decoder at any point in time, and their main attributes such as time segmentation and acoustic likelihood scores.

The term "lattice" is used somewhat loosely. The word-lattice is really a directed acyclic graph or DAG. Each node of the DAG denotes a word instance that begins at a particular frame within the utterance. That is, it is a unique <word,start-time> pair. (However, there could be a number of end-times for this word instance. One of the features of a time-synchronous Viterbi search using beam pruning is that word candidates hypothesized by the decoder have a well-defined start-time, but a fuzzy range of end-times. This is because the start-time is primarily determined by Viterbi pruning, while the possible end-times are determined by beam pruning.)

There is a directed edge between two nodes in the DAG if the start-time of the destination node immediately follows one of the end times of the source node. That is, the two nodes can be adjacent in time. Thus, the edge determines one possible segmentation for the source node: beginning at the source's start-time and ending one frame before the destination's start-time. The edge also contains an acoustic likelihood for this particular segmentation of the source node.

Note: The beginning and end of sentence tokens, <s> and </s>, are not decoded as part of an utterance by the s3.X decoder. However, they have to be included in the word lattice file, for compatibility with the older Sphinx-3 decoder software. They are assigned 1-frame segmentations, with log-likelihood scores of 0. To accommodate them, the segmentations of adjacent nodes have to be "fudged" by 1 frame.

Word Lattice File Format

A word lattice file essentially contains the above information regarding the nodes and edges in the DAG. It is structured in several sections, as follows:

A comment section, listing important configuration arguments as comments
Frames section, specifying the number of frames in utterance
Nodes section, listing the nodes in the DAG
Initial and Final nodes (for <s> and </s>, respectively)
BestSegAscr section, a historical remnant now essentially empty
Edges section, listing the edges in the DAG

The file is formatted as follows. Note that any line in the file that begins with the # character in the first column is considered to be a comment.

# getcwd: <current-working-directory>
# -logbase <logbase-in-effect>
# -dict <main lexicon>
# -fdict <filler lexicon>
# ... (other arguments, written out as comment lines)
#
Frames <number-of-frames-in-utterance>
#
Nodes <number-of-nodes-in-DAG> (NODEID WORD STARTFRAME FIRST-ENDFRAME LAST-ENDFRAME)
<Node-ID> <Word-String> <Start-Time> <Earliest-End-time> <Latest-End-Time>
<Node-ID> <Word-String> <Start-Time> <Earliest-End-time> <Latest-End-Time>
<Node-ID> <Word-String> <Start-Time> <Earliest-End-time> <Latest-End-Time>
... (for all nodes in DAG)
#
Initial <Initial-Node-ID>
Final <Final-Node-ID>
#
BestSegAscr 0 (NODEID ENDFRAME ASCORE)
#
Edges (FROM-NODEID TO-NODEID ASCORE)
<Source-Node-ID> <Destination-Node-ID> <Acoustic Score>
<Source-Node-ID> <Destination-Node-ID> <Acoustic Score>
<Source-Node-ID> <Destination-Node-ID> <Acoustic Score>
... (for all edges in DAG)
End

Note that the node-ID values for DAG nodes are assigned sequentially, starting from 0. Furthermore, they are sorted in descending order of their earliest-end-time attribute.

Here is an example word lattice file.

¤ Back to top of this section

`approx_cont_mgau.c`	Fast Gaussian Distribution Computation
`agc.c`	Automatic gain control (on signal energy)
`ascr.c`	Senone acoustic scores
`beam.c`	Pruning beam widths
`bio.c`	Binary file I/O support
`cmn.c`	Cepstral mean normalization and variance normalization
`corpus.c`	Control file processing
`cont_mgau.c`	Mixture Gaussians (acoustic model)
`s3decode.c`	Main file for `s3decode`
`dict.c`	Pronunciation lexicon
`dict2pid.c`	Generation of triphones for the pronunciation dictionary
`feat.c`	Feature vectors computation
`fillpen.c`	Filler word probabilities
`gausubvq.c`	Standalone acoustic model sub-vector quantizer
`hmm.c`	HMM evaluation
`hyp.h`	Recognition hypotheses data type
`kb.h`	All knowledge bases, search structures used by decoder
`kbcore.c`	Collection of core knowledge bases
`lextree.c`	Lexical search tree
`s3live.c`	Live decoder functions
`lm.c`	Trigram language model
`logs3.c`	Support for log-likelihood operations
`main_live_example.c`	Main file for `s3livedecode` showing use of `live_decode_API.h`
`main_live_pretend.c`	Main file for `s3livepretend` showing use of `live_decode_API.h`
`mdef.c`	Acoustic model definition
`mllr.c`	transformation of mean based on a linear regression matrix.
`s3types.h`	Various data types, for ease of modification
`subvq.c`	Sub-vector quantized acoustic model
`tmat.c`	HMM transition matrices (topology definition)
`vector.c`	Vector operations, quantization, etc.
`vithist.c`	Backpointer table (Viterbi history)
`wid.c`	Mapping between LM and lexicon word IDs

¤ Back to top of this section

Maintained by Evandro B. Gouvêa and Arthur Chan

Last modified: Thu Jul 22 09:35:27 EDT 2004

Sphinx-3 s3.X Decoder (X=6)

Contents

Output Generation

Miscellaneous Issues