Sphinx-3 is the successor to the Sphinx-II speech recognition system from Carnegie Mellon University. It includes both an acoustic trainer and various decoders, e.g., for text recognition, phoneme recognition, N-best list generation, etc. In this document, "Sphinx-3" refers to any version of the Sphinx-3 decoder, and "s3.X" refers to the version available in this distribution. Note that s3.X is in fact a branch of Sphinx-3, not a more recent release.
The s3.X decoder is a recent implementation for speech-to-text recognition, its main goal being speed improvements over the original Sphinx-3 decoder. It runs about 10 times faster than the latter on large vocabulary tasks. The following is a brief summary of its main features and limitations:
This package contains the following programs:
s3decode
: The Sphinx-3 s3.2/s3.3/s3.X decoder, which processes cepstrum files.

s3gausubvq
: Sub-vector clustered acoustic model building.

s3livedecode
: The Sphinx-3 s3.X decoder in live mode.

s3livepretend
: The Sphinx-3 s3.X decoder in batch mode.

s3align
: The Sphinx-3 aligner.

s3allphone
: The Sphinx-3 phoneme recognizer.

s3astar
: The Sphinx-3 N-best generator.

s3dag
: The Sphinx-3 application for best-path searching.

This distribution has been prepared for Unix platforms. A port to MS Windows (MS Visual C++ 6.0 workspace and project files) is also provided.
This document is a brief user's manual for the above programs. It is not meant to be a detailed description of the decoding algorithm, or an in-depth tutorial on speech recognition technology. However, a set of Microsoft PowerPoint slides is available that gives additional information about the decoder. Even though the slides refer to s3.2, keep in mind that the basic search structure remains the same in s3.X (where X = 3, 4, and 5).
The initial part of this document provides an overview of the decoder. It is followed by descriptions of the main input and output databases, i.e., the lexicon, language model, acoustic model, etc.
The s3.X decoder is based on the conventional Viterbi search algorithm and beam search heuristics. It uses a lexical-tree search structure somewhat like the Sphinx-II decoder, but with some improvements for greater accuracy than the latter. It takes its input from pre-recorded speech in raw PCM format and writes its recognition results to output files.
We first give a brief outline of the input and output characteristics of the decoder. More detailed information is available in later sections. The decoder needs the following inputs:
s3livedecode decodes live speech, that is, speech coming in from your audio card. s3livepretend decodes in batch mode, using a control file that describes the input to be decoded into text. s3decode also uses a control file for batch-mode processing. In the latter case, the entire input to be processed must be available beforehand, i.e., the raw audio samples must have been preprocessed into cepstrum files. Also note that the decoder cannot handle arbitrary lengths of speech input. Each separate piece (or utterance) to be processed by the decoder must be no more than 300 sec. long. Typically, one uses a segmenter to chop up a cepstrum stream into manageable segments of up to 20 or 30 sec. duration.
The decoder can produce two types of recognition output: a single best recognition hypothesis for each utterance, and, optionally, a word lattice for each utterance. Both are described in detail in later sections.
In addition, the decoder also produces a detailed log to stdout/stderr that can be useful in debugging, gathering statistics, etc.
The current distribution has been set up for Unix platforms. The following steps are needed to compile the decoder:
./configure [--prefix=/my/install/directory]
: The argument is optional. If not given, it will install s3.X under /usr/local, provided you have the proper permissions. This step is only necessary the first time you compile s3.X.

make clean
: This should remove any old object files.

make
: This compiles the libraries and example programs.

make install
: This will install s3.X in the directory that you specified when you ran configure, along with the provided models and documentation.

Note that the Makefiles are not foolproof; they do not eliminate the need for sometimes manually determining dependencies, especially upon updates to header files. When in doubt, first clean out the compilation directories entirely by running make distclean and start over.
Running the decoder is simply a matter of invoking the binary (i.e., s3decode, s3livedecode or s3livepretend), with a number of command line arguments specifying the various input files, as well as decoding configuration parameters. s3decode and s3livepretend require a control file, the directory where the audio files are available, and a file containing the configuration arguments. s3livedecode, which runs live, requires only the file with the arguments.
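For example, a hypothetical batch-mode run might look like the following (the file and directory names are made up for this illustration; check the help message printed by the binary itself, as described below, for the exact argument order):

s3livepretend my.ctl /my/audio/dir my_args.cfg

Here my.ctl is the control file, /my/audio/dir is the directory containing the audio files, and my_args.cfg is the file of configuration arguments.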
Invoking the binary without any argument produces a help message with short descriptions of all the configuration arguments.
This section gives a brief overview of the main configuration arguments. They are broken down into separate groups, based on whether they are the primary flags specifying input and output data, arguments for optional configuration, or for performance tuning.
Note that not all the available configuration arguments are covered below. There are a few additional and undocumented flags, intended mainly for debugging purposes.
Many of the flags have reasonable defaults. The ones that a user minimally needs to provide are the input and output databases or files, which have been discussed above:
- Model definition input file
- Acoustic model files
- Main and filler lexicons
- Language model binary dump file
- Filler word probabilities
- Output hypotheses file
It may often be necessary to provide additional parameters to obtain the right decoder configuration:
- Feature type configuration
- Directory prefix for cepstrum files specified in the control file (ignored by s3livedecode and s3livepretend)
- Selection of a portion of the control file to be processed
- Directory and file extension for word lattice output
In yet other cases, it may be necessary to tune the following parameters to obtain the optimal computational efficiency or recognition accuracy:
- Beam pruning parameters
- Absolute pruning parameters
- Fast GMM computation parameters
- Language weight and word insertion penalty
- Number of lexical tree instances
This section is a bit of a mish-mash; its contents probably belong in an FAQ section. But, hopefully, through this section a newcomer to Sphinx can get an idea of the structure, capabilities, and limitations of the s3.X decoder.
The decoder is configured during the initialization step, and the configuration holds for the entire run. This means, for example, that the decoder does not dynamically reconfigure the acoustic models to adapt to the input. To choose another example, there is no mechanism in this decoder to switch language models from utterance to utterance, unlike in Sphinx-II. The main initialization steps are outlined below.
Log-Base Initialization. Sphinx performs all likelihood computations in the log domain. Furthermore, for computational efficiency, the base of the logarithm is chosen such that the likelihoods can be maintained as 32-bit integer values. Thus, all the scores reported by the decoder are log-likelihood values in this peculiar log-base. The default base is typically 1.0003, and can be changed using the -logbase configuration argument. The main reason for modifying the log-base would be to control the length (duration) of an input utterance before the accumulated log-likelihood values overflow the 32-bit representation, causing the decoder to fail catastrophically. The log-base can be changed over a wide range without affecting the recognition.
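As a rough illustration, converting an ordinary probability into this integer log domain can be sketched as follows. This is a minimal example assuming the default log-base of 1.0003; it is not the actual logs3.c code.

#include <math.h>

/* Convert a probability to the decoder's integer log domain. */
int prob_to_logs3(double p, double logbase)
{
    return (int)(log(p) / log(logbase));   /* i.e., log of p to the base 'logbase' */
}

/* For example, prob_to_logs3(1e-6, 1.0003) is roughly -46000. */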
Models Initialization. The lexical, acoustic, and language models specified via the configuration arguments are loaded during initialization. This set of models is used to decode all the utterances in the input. (The language model is actually only partly loaded, since s3.X uses a disk-based LM strategy.)
Effective Vocabulary. After the models are loaded, the effective vocabulary is determined. It is the set of words that the decoder is capable of recognizing. Recall that the decoder is initialized with three sources of words: the main and filler lexicon files, and the language model. The effective vocabulary is determined from them as follows:
Briefly, a word can be recognized only if it has a pronunciation in the main lexicon and is either covered by the language model or listed in the filler lexicon; the sentence delimiter tokens <s> and </s> are excluded. The effective vocabulary remains in effect throughout the batch run. It is not possible to add to or remove from this vocabulary dynamically, unlike in the Sphinx-II system.
Lexical Tree Construction. The decoder constructs lexical trees from the effective vocabulary described above. Separate trees are constructed for words in the main and filler lexicons. Furthermore, several copies may be instantiated for the two, depending on the -Nlextree configuration argument. Further details of the lexical tree construction are available on the PowerPoint slides.
Following initialization, s3decode and s3livepretend process the entries in the control file sequentially, one at a time. It is possible to process a contiguous subset of the control file, using the -ctloffset and -ctlcount flags, as mentioned earlier. There is no learning or adaptation capability as decoding progresses. Since s3livepretend behaves as if the files were being spoken at the time of processing, rearranging the order of the entries in the control file may affect the individual results, but this change may be imperceptible if the environment in which the files were recorded remains constant. The order of entries in the control file does not affect s3decode.
Each entry in the control file, or utterance, is processed using the given input models, and using the Viterbi search algorithm. In order to constrain the active search space to computationally manageable limits, pruning is employed, which means that the less promising hypotheses are continually discarded during the recognition process. There are two kinds of pruning in s3.X, beam pruning and absolute pruning.
Beam Pruning. Each utterance is processed in a time-synchronous manner, one frame at a time. At each frame the decoder has a number of currently active HMMs to match with the next frame of input speech. But it first discards or deactivates those whose state likelihoods are below some threshold, relative to the best HMM state likelihood at that time. The threshold value is obtained by multiplying the best state likelihood by a fixed beamwidth. The beamwidth is a value between 0 and 1, the former permitting all HMMs to survive, and the latter permitting only the best scoring HMMs to survive.
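In the integer log domain described earlier, multiplying by the beamwidth becomes an addition of its (negative) logarithm. A minimal sketch of this step follows, using hypothetical types and field names rather than the actual s3.X code:

#include <stddef.h>

/* Illustrative types only; not the actual s3.X definitions. */
typedef struct { int bestscore; int active; } hmm_t;

/* Deactivate HMMs whose best state score falls below the pruning
 * threshold.  logbeam is log(beamwidth), a negative number. */
void beam_prune(hmm_t *hmm, size_t n, int best_frame_score, int logbeam)
{
    int thresh = best_frame_score + logbeam;
    for (size_t i = 0; i < n; i++) {
        if (hmm[i].bestscore < thresh)
            hmm[i].active = 0;
    }
}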
Similar beam pruning is also used in a number of other situations in the decoder, e.g., to determine the candidate words recognized at any time, or to determine the component densities in a mixture Gaussian that are closest to a given speech feature vector. The various beamwidths have to be determined empirically and are set using configuration arguments.
Absolute Pruning. Even with beam pruning, the number of active entities can sometimes become computationally overwhelming. If there are a large number of HMMs that fall within the pruning threshold, the decoder will keep all of them active. However, when the number of active HMMs grows beyond certain limits, the chances of detecting the correct word among the many candidates are considerably reduced. Such situations can occur, for example, if the input speech is noisy or quite mismatched to the acoustic models. In such cases, there is no point in allowing the active search space to grow to arbitrary extents. It can be contained using pruning parameters that limit the absolute number of active entities at any instant. These parameters are also determined empirically, and set using configuration arguments.
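One simple way to realize such a limit is sketched below; this is only an illustration of the idea, not the actual mechanism used in s3.X:

#include <stdlib.h>

static int cmp_score_desc(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (y > x) - (y < x);              /* higher scores sort first */
}

/* If more HMMs survive beam pruning than maxhmmpf allows, keep only
 * the best-scoring maxhmmpf of them; the caller drops the rest. */
size_t absolute_prune(int *scores, size_t n_active, size_t maxhmmpf)
{
    if (n_active <= maxhmmpf)
        return n_active;
    qsort(scores, n_active, sizeof(int), cmp_score_desc);
    return maxhmmpf;
}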
The computation of Gaussian likelihoods can be one of the dominant costs of GMM evaluation. Tuning the following parameters controls the amount of time required:

-ci_pbeam
: Enables a two-pass computation in which context-independent (CI) models are evaluated first, followed by the context-dependent (CD) models. If this beam is used, only those CD models whose corresponding CI models fall within the beam (relative to the best CI score) are computed.

-ds
: Enables frame down-sampling; only one out of every N frames is fully computed.

During recognition, the decoder builds an internal backpointer table data structure, from which the final outputs are generated. This table records all the candidate words recognized during decoding, and their attributes such as their time segmentation, acoustic and LM likelihoods, as well as their predecessor entries in the table. When an utterance has been fully processed, the best recognition hypothesis is extracted from this table. Optionally, the table is also converted into a word lattice and written out to a file.
More information on the backpointer table is available in the PowerPoint slides.
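For orientation, one entry of such a table might be pictured as follows; the field names are purely illustrative and do not correspond to the actual definitions in vithist.c:

typedef struct bp_entry_s {
    int wid;          /* candidate word recognized */
    int sf, ef;       /* time segmentation (start and end frames) */
    int ascr, lscr;   /* acoustic and LM log-likelihoods */
    int prev;         /* index of the predecessor entry in the table */
} bp_entry_t;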
Role of <s> and </s>. The distinguished beginning-of-sentence and end-of-sentence tokens <s> and </s> are not in the effective vocabulary, and no part of the input speech is decoded into either of them. They are merely anchors at the ends of each utterance, and provide context for the LM. This is in contrast to earlier versions of Sphinx, which required some silence at either end of each speech utterance, to be decoded into these tokens.
To obtain the best recognition performance, it is necessary to select the appropriate front-end and feature type computation, train the various models, as well as tune the decoder configuration parameters. This section deals with the last issue. There are mainly two groups of parameters to be tuned, pertaining to pruning and LM. Unfortunately, there are no automatic methods for determining the values of these parameters; it is necessary to derive them by trial and error. Additionally, the following points should be kept in mind with regard to the pruning parameters:
The pruning parameters are the following:
-beam
: Determines which HMMs remain active at any given point (frame) during recognition. (Based on the best state score within each HMM.)

-pbeam
: Determines which active HMM can transition to its successor in the lexical tree at any point. (Based on the exit state score of the source HMM.)

-wbeam
: Determines which words are recognized at any frame during decoding. (Based on the exit state scores of leaf HMMs in the lexical trees.)

-maxhmmpf
: Determines the number of HMMs (approx.) that can remain active at any frame.

-maxwpf
: Controls the number of distinct words recognized at any given frame.

-maxhistpf
: Controls the number of distinct word histories recorded in the backpointer table at any given frame.

-subvqbeam
: For each senone and its underlying acoustic model, determines its active mixture components at any frame.

In order to determine the pruning parameter values empirically, it is first necessary to obtain a test set, i.e., a collection of test sentences not used in any training data. The test set should be sufficiently large to ensure statistically reliable results. For example, a large-vocabulary task might require a test set that includes a half-hour of speech, or more.
It is difficult to tune a handful of parameters simultaneously, especially when the input models are completely new. The following steps may be followed to deal with this complex problem.
1. Initially, set -beam and -pbeam to 1e-60, and -wbeam to 1e-30. Set -subvqbeam to a small value (e.g., the same as -beam). Run the decoder on the chosen test set and obtain accuracy results. (Use default values for the LM related parameters when tuning the pruning parameters for the first time.)

2. Vary -beam up and down, until the setting for best accuracy is identified. (Keep -pbeam the same as -beam every time.)

3. Vary -wbeam up and down and identify its best possible setting (keeping -beam and -pbeam fixed at their most recently obtained values).

4. Repeat the above two steps, alternately tuning -beam and -wbeam, until convergence. Note that during these iterations -pbeam should always be the same as -beam. (This step can be omitted if the accuracy attained after the first iteration is acceptable.)

5. Gradually narrow -subvqbeam (i.e., towards 1.0 for a narrower setting), stopping when recognition accuracy begins to drop noticeably. Values near the default are reasonable. (This step is needed only if a sub-vector quantized model is available for speeding up acoustic model evaluation.)

6. Gradually narrow -pbeam (i.e., towards 1.0), stopping when recognition accuracy begins to drop noticeably. (This step is optional; it mainly optimizes the computational effort a little more.)

7. Gradually tighten -maxhmmpf until accuracy begins to be affected. Repeat the process with -maxwpf, and then with -maxhistpf. (However, in some situations, especially when the vocabulary size is small, it may not be necessary to tune these absolute pruning parameters.)
In practice, it may not always be possible to follow the above steps strictly. For example, considerations of computational cost might dictate that the absolute pruning parameters or the -subvqbeam parameter be tuned earlier in the sequence.
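As a concrete starting point for step 1 above, the initial pruning settings could be placed in the configuration arguments file as follows (only the pruning-related flags are shown; all other arguments are left at their defaults):

-beam      1e-60
-pbeam     1e-60
-wbeam     1e-30
-subvqbeam 1e-60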
The parameters to be tuned are the following:

-lw
: The language weight.

-wip
: The word insertion penalty.
Like the pruning parameters, the above two are tuned on a test set. Since the decoder is much more sensitive to the language weight, that is typically tuned first, using the default word insertion penalty. The latter is then tuned. It is usually not necessary to repeat the process.
To be completed.
A pronunciation lexicon (or dictionary) file specifies word pronunciations. In Sphinx, pronunciations are specified as a linear sequence of phonemes. Each line in the file contains one pronunciation specification, except that any line that begins with a "#" character in the first column is treated as a comment and is ignored. Example dictionary for digits:
ZERO   Z IH R OW
ONE    W AH N
TWO    T UW
THREE  TH R IY
FOUR   F AO R
FIVE   F AY V
SIX    S IH K S
SEVEN  S EH V AX N
EIGHT  EY TD
NINE   N AY N
The lexicon is completely case-insensitive (unfortunately). For example, it is not possible to have two different entries Brown and brown in the dictionary.
A word may have more than one pronunciation, each one on a separate line. They are distinguished by a unique parenthesized suffix for the word string. For example:
ACTUALLY        AE K CH AX W AX L IY
ACTUALLY(2nd)   AE K SH AX L IY
ACTUALLY(3rd)   AE K SH L IY
If a word has more than one pronunciation, its first appearance must be the unparenthesized form. For the rest, the parenthesized suffix may be any string, as long as it is unique for that word. There is no other significance to the order of the alternatives; each one is considered to be equally likely.
In Sphinx-3, the lexicon may also contain compound words. A compound word is usually a short phrase whose pronunciation happens to differ significantly from the mere concatenation of the pronunciations of its constituent words. Compound word tokens are formed by concatenating the component word strings with an underscore character; e.g.:
WANT_TO W AA N AX
(The s3.X decoder, however, treats a compound word as just another word in the language, and does not do anything special with it.)
The Sphinx-3 decoders actually need two separate lexicons: a "regular" one containing the words in the language of interest, and also a filler or noise lexicon. The latter defines "words" not in the language. More specifically, it defines legal "words" that do not appear in the language model used by the decoder, but are nevertheless encountered in normal speech. This lexicon must include the silence word <sil>, as well as the special beginning-of-sentence and end-of-sentence tokens <s> and </s>. All of them usually have the silence phone SIL as their pronunciation. In addition, this lexicon may also contain "pronunciations" for other noise event words, such as breath noise, the "UM" and "UH" sounds made during spontaneous speech, etc.
Sphinx-3 is based on subphonetic acoustic models. First, the basic sounds in the language are classified into phonemes or phones. There are roughly 50 phones in the English language. For example, here is a pronunciation for the word LANDSAT:
L AE N D S AE TD
Phones are then further refined into context-dependent triphones, i.e., phones occurring in given left and right phonetic contexts. The reason is that the same phone within different contexts can have widely different acoustic manifestations, requiring separate acoustic models. For example, the two occurrences of the AE phone above have different contexts, only the first of which is nasal.
In contrast to triphones, a phone considered without any specific context is referred to as a context-independent phone or basephone. Note also that context-dependency gives rise to the notion of cross-word triphones. That is, the left context for the leftmost basephone of a word depends on what was the previous word spoken.
Phones are also distinguished according to their position within the word: beginning, end, internal, or single (abbreviated b, e, i and s, respectively). For example, in the word MINIMUM with the following pronunciation:
M IH N AX M AX M
the three occurrences of the phone M have three different position attributes. The s attribute applies if a word has just a single phone as its pronunciation.
For most applications, one builds acoustic models for triphones, qualified by the four position attributes. (This provides far greater modelling detail and accuracy than if one relies on just basephone models.) Each triphone is modelled by a hidden Markov model or HMM. Typically, 3 or 5 state HMMs are used, where each state has a statistical model for its underlying acoustics. But if we have 50 basephones, with 4 position qualifiers and 3-state HMMs, we end up with a total of 50^3 * 4 * 3 = 1,500,000 distinct HMM states! Such a model set would be too large and impractical to train. To keep things manageable, HMM states are clustered into a much smaller number of groups. Each such group is called a senone (in Sphinx terminology), and all the states mapped into one senone share the same underlying statistical model. (The clustering of HMM states into senones is described in Mei-Yuh Hwang's PhD thesis.)
Each triphone also has a state transition probability matrix that defines the topology of its HMM. Once again, to conserve resources, there is a considerable amount of sharing. Typically, there is one such matrix per basephone, and all triphones derived from the same parent basephone share its state transition matrix.
The information regarding triphones and mapping from triphone states to senones and transition matrices is captured in a model definition, or mdef input file.
For various reasons, it is undesirable to build acoustic models directly in terms of the raw audio samples. Instead, the audio is processed to extract a vector of relevant features. All acoustic modelling is carried out in terms of such feature vectors.
In Sphinx, feature vector computation is a two-stage process. An off-line front-end module is first responsible for processing the raw audio sample stream into a cepstral stream, which can then be input to the Sphinx software. The input audio stream consists of 16-bit samples, at a sampling rate of 8 or 16 KHz depending on whether the input is narrow or wide-band speech. The input is windowed, resulting in frames of duration 25.625 ms. The number of samples in a frame depends on the sampling rate. The output is a stream of 13-dimensional real-valued cepstrum vectors. The frames overlap, thus resulting in a rate of 100 vectors/sec.
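For concreteness, the window and frame-rate figures above imply the following sizes (assuming wide-band 16 kHz input):

16000 samples/sec * 0.025625 sec   = 410 samples per analysis window
16000 samples/sec / 100 frames/sec = 160 samples (10 ms) between successive frame starts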
In the second stage, the Sphinx software (both trainer and decoder) internally converts the stream of cepstrum vectors into a feature stream. This process consists of the following steps:
This refers to the computation of a (statistical) model for each senone in the model. As a very rough approximation, this process can be described by the following conceptual steps:
Note that there is a circularity in the above description. We wish to train the senone models, but in the penultimate step, we need the senone models to compute the best possible state alignment. This circularity is resolved by using the iterative Baum-Welch or forward-backward training algorithm. The algorithm begins with some initial set of models, which could be completely flat, for the senones. It then repeats the last two steps several times. Each iteration uses the model computed at the end of the previous iteration.
Although not mentioned above, the HMM state-transition probability matrices are also trained from the state alignments. Acoustic modelling is described in greater detail in the Sphinx-3 trainer module.
The acoustic models trained as described above can be of different degrees of sophistication. Two forms are commonly used:
In a continuous model, each senone has its own, private mixture-Gaussian distribution that describes the statistics of its underlying speech feature space. In a semi-continuous model, all the senones share a single codebook of Gaussian distributions, but each senone has its own set of mixture weights applied to the codebook components. Sphinx-3 supports both models, and other, intermediate degrees of state-tying as well. (The s3.X decoder, however, can only handle continuous density acoustic models.)
Similarly, Sphinx-3 in general supports "arbitrary" HMM topologies, unlike Sphinx-II, which is restricted to a specific 5-state topology. However, for efficiency's sake, the s3.X decoder is hardwired to deal with only two types of HMM topologies: 3-state and 5-state, described briefly in hmm.h.
Continuous density acoustic models are computationally expensive to deal with, since they can contain hundreds of thousands of Gaussian densities that must be evaluated in each frame. To reduce this cost, one can use an approximate model that efficiently identifies the top scoring candidate densities in each Gaussian mixture in any given frame. The remaining densities can be ignored during that frame.
In Sphinx-3, such an approximate model is built by sub-vector quantizing the acoustic model densities. The utility that performs this conversion is included in this distribution and is called gausubvq, which stands for Gaussian Sub-Vector Quantization.
Note that if the original model consists of mixture Gaussians that only contain a few component densities (say, 4 or fewer per mixture), a sub-vector quantized model may not be effective in reducing the computational load.
An acoustic model is represented by the following collection of files:
The mean, var, mixw, and tmat files are produced by the Sphinx-3 trainer, and their file formats should be documented there.
The main language model (LM) used by the Sphinx decoder is a conventional bigram or trigram backoff language model. The CMU-Cambridge SLM toolkit is capable of generating such a model from LM training data. Its output is an ascii text file. But a large text LM file can be very slow to load into memory. To speed up this process, the LM must be compiled into a binary form. The code to convert from an ascii text file to the binary format is available at SourceForge in the CVS tree, in a module named share.
A trigram LM primarily consists of unigram, bigram, and trigram probabilities, along with the distinguished beginning-of-sentence and end-of-sentence tokens <s> and </s>. The vocabulary of the LM is the set of words covered by the unigrams.
The LM probability of an entire sentence is the product of the individual word probabilities. For example, the LM probability of the sentence "HOW ARE YOU" is:
P(HOW | <s>) * P(ARE | <s>, HOW) * P(YOU | HOW, ARE) * P(</s> | ARE, YOU)
In Sphinx, the LM cannot distinguish between different pronunciations of the same word. For example, even though the lexicon might contain two different pronunciation entries for the word READ (present and past tense forms), the language model cannot distinguish between the two. Both pronunciations would inherit the same probability from the language model.

Secondly, the LM is case-insensitive. For example, it cannot contain two different tokens READ and read.
The reasons for the above restrictions are historical. Precise pronunciation and case information has rarely been present in LM training data. It would certainly be desirable to do away with the restrictions at some time in the future.
The binary LM file (also referred to as the LM dump file) is more or less a disk image of the LM data structure constructed in memory. This data structure was originally designed during the Sphinx-II days, when efficient memory usage was the focus. In Sphinx-3, however, memory usage is no longer an issue since the binary file enables the decoder to use a disk-based LM strategy. That is, the LM binary file is no longer read entirely into memory. Rather, the portions required during decoding are read in on demand, and cached. For large vocabulary recognition, the memory resident portion is typically about 10-20% of the bigrams, and 5-10% of the trigrams.
Since the decoder uses a disk-based LM, it is necessary to have efficient access to the binary LM file. Thus, network access to an LM file at a remote location is not recommended. It is desirable to have the LM file be resident on the local machine.
The binary dump file can be created from the ascii form using the lm3g2dmp utility, which is part of the Sphinx-II distribution, and is also available as standalone code, as mentioned before. (The header of the dump file itself contains a brief description of the file format.)
Language models typically do not cover acoustically significant events such as silence, breath-noise, UM or UH sounds made by a person hunting for the right phrase, etc. These are known generally as filler words, and are excluded from the LM vocabulary. The reason is that a language model training corpus, which is simply a lot of text, usually does not include such information.
Since the main trigram LM ignores silence and filler words, their "language model probability" has to be specified in a separate file, called the filler penalty file. The format of this file is very straightforward; each line contains one word and its probability, as in the following example:
++UH++      0.10792
++UM++      0.00866
++BREATH++  0.00147
The filler penalty file is not required. If it is present, it does not have to contain entries for every filler word. The decoder allows a default value to be specified for filler word probabilities (through the -fillprob configuration argument), and a default silence word probability (through the -silprob argument).
Like the main trigram LM, filler and silence word probabilities are obtained from appropriate training data. However, training them is considerably easier since they are merely unigram probabilities.
Filler words are invisible or transparent to the trigram language model. For example, the LM probability of the sentence "HAVE CAR <sil> WILL TRAVEL" is:
P(HAVE | <s>) * P(CAR | <s>, HAVE) * P(<sil>) * P(WILL | HAVE, CAR) * P(TRAVEL | CAR, WILL) * P(</s> | WILL, TRAVEL)
During recognition the decoder combines both acoustic likelihoods and language model probabilities into a single score in order to compare various hypotheses. This combination of the two is not just a straightforward product. In order to obtain optimal recognition accuracy, it is usually necessary to exponentiate the language model probability using a language weight before combining the result with the acoustic likelihood. (Since likelihood computations are actually carried out in the log-domain in the Sphinx decoder, the LM weight becomes a multiplicative factor applied to LM log-probabilities.)
The language weight parameter is typically obtained through trial and error. In the case of Sphinx, the optimum value for this parameter has usually ranged between 6 and 13, depending on the task at hand.
Similarly, though with lesser impact, it has also been found useful to include a word insertion penalty parameter which is a fixed penalty for each new word hypothesized by the decoder. It is effectively another multiplicative factor in the language model probability computation (before the application of the language weight). This parameter has usually ranged between 0.2 and 0.7, depending on the task.
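Putting the two paragraphs above together, the combined score of an n-word hypothesis W can be written schematically as follows. This is a conceptual summary of the description above, not a precise statement of the decoder's internal arithmetic:

score(W) = AcousticLikelihood(W) * [ P_LM(W) * wip^n ]^lw

or, in the log domain used by the decoder:

log score(W) = acoustic log-likelihood + lw * ( LM log-probability + n * log(wip) )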
The Sphinx-3 decoder processes entries listed in a control file. Each line in the control file identifies a separate utterance. A line has the following format (the brackets indicate a group of fields that is optional):
AudioFile [ StartFrame EndFrame UttID ]
AudioFile is the speech input file. In this distribution of s3.X, this file is in raw audio format. In all other versions of Sphinx-3, this file contains cepstrum data. The filename extension should be omitted from the specification. If this is the only field in the line, the entire file is processed as one utterance. In this case, an utterance ID string is automatically derived from the cepstrum filename, by stripping any leading directory name components from it. E.g.: if the control file contains the following entries:
/net/alf20/usr/rkm/SHARED/cep/nov94/h1_et_94/4t0/4t0c0201
/net/alf20/usr/rkm/SHARED/cep/nov94/h1_et_94/4t0/4t0c0202
/net/alf20/usr/rkm/SHARED/cep/nov94/h1_et_94/4t0/4t0c0203
three utterances are processed, with IDs 4t0c0201, 4t0c0202, and 4t0c0203, respectively.
If, on the other hand, a control file entry includes the StartFrame and EndFrame fields, only that portion of the cepstrum file is processed. This form of the control file is frequently used if the speech input can be arbitrarily long, such as an entire TV news show. There is one big cepstrum file, but it is processed in smaller chunks or segments. In this case, the final UttID field is the utterance ID string for the entry.
The utterance ID associated with a control file entry is used to identify all the output from the decoder for that utterance. For example, if the decoder is used to generate word lattice files, they are named using the utterance ID. Hence, each ID, whether automatically derived or explicitly specified, should be unique over the entire control file.
Any line in the control file beginning with a # character is a comment line, and is ignored.
The Sphinx-3 decoder produces a single recognition hypothesis for each utterance it processes. The hypotheses for all the utterances processed in a single run are written to a single output file, one line per utterance. The line format is as follows:
u S s T t A a L l sf wa wl wd sf wa wl wd ... nf
The S, T, A, and L fields are keywords and appear in the output as shown. The remaining fields are briefly described below:
The l score field is followed by groups of four fields, one group for each successive word in the output hypothesis. The four fields are:
The final field, nf, in each hypothesis line is the total number of frames in the utterance.
Note that all scores are log-likelihood values in the peculiar logbase used by the decoder. Secondly, the acoustic scores are scaled values; in each frame, the acoustic scores of all active senones are scaled such that the best senone has a log-likelihood of 0. Finally, the language model scores reported include the language weight and word-insertion penalty parameters.
Here is an example hypothesis file for three utterances.
During recognition the decoder maintains not just the single best hypothesis, but also a number of alternatives or candidates. For example, REED is a perfectly reasonable alternative to READ. The alternatives are useful in many ways: for instance, in N-best list generation. To facilitate such post-processing, the decoder can optionally produce a word lattice output for each input utterance. This output records all the candidate words recognized by the decoder at any point in time, and their main attributes such as time segmentation and acoustic likelihood scores.
The term "lattice" is used somewhat loosely. The word-lattice
is really a directed acyclic graph or DAG. Each
node of the DAG denotes a word instance that begins at a
particular frame within the utterance. That is, it is a unique
<word,start-time>
pair. (However, there could
be a number of end-times for this word instance. One of the
features of a time-synchronous Viterbi search using beam pruning
is that word candidates hypothesized by the decoder have a
well-defined start-time, but a fuzzy range of end-times. This is
because the start-time is primarily determined by Viterbi
pruning, while the possible end-times are determined by beam
pruning.)
There is a directed edge between two nodes in the DAG if the start-time of the destination node immediately follows one of the end times of the source node. That is, the two nodes can be adjacent in time. Thus, the edge determines one possible segmentation for the source node: beginning at the source's start-time and ending one frame before the destination's start-time. The edge also contains an acoustic likelihood for this particular segmentation of the source node.
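A rough picture of the node and edge records described above, using illustrative field names rather than the actual s3.X data structures:

typedef struct dagnode_s {
    int wid;          /* word ID */
    int sf;           /* start frame (unique per node) */
    int fef, lef;     /* earliest and latest end frames */
} dagnode_t;

typedef struct dagedge_s {
    int from, to;     /* node IDs: the destination starts one frame after
                         one of the source node's possible end frames */
    int ascr;         /* acoustic score for that segmentation of 'from' */
} dagedge_t;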
Note: The beginning and end of sentence tokens, <s> and </s>, are not decoded as part of an utterance by the s3.X decoder. However, they have to be included in the word lattice file, for compatibility with the older Sphinx-3 decoder software. They are assigned 1-frame segmentations, with log-likelihood scores of 0. To accommodate them, the segmentations of adjacent nodes have to be "fudged" by 1 frame.
A word lattice file essentially contains the above information regarding the nodes and edges in the DAG. It is structured in several sections, as follows:
- a Frames section, specifying the number of frames in the utterance
- a Nodes section, listing the nodes in the DAG
- the Initial and Final nodes (for <s> and </s>, respectively)
- a BestSegAscr section, a historical remnant that is now essentially empty
- an Edges section, listing the edges in the DAG

The file is formatted as follows. Note that any line in the file that begins with the # character in the first column is considered to be a comment.
# getcwd: <current-working-directory>
# -logbase <logbase-in-effect>
# -dict <main lexicon>
# -fdict <filler lexicon>
# ... (other arguments, written out as comment lines)
#
Frames <number-of-frames-in-utterance>
#
Nodes <number-of-nodes-in-DAG> (NODEID WORD STARTFRAME FIRST-ENDFRAME LAST-ENDFRAME)
<Node-ID> <Word-String> <Start-Time> <Earliest-End-time> <Latest-End-Time>
<Node-ID> <Word-String> <Start-Time> <Earliest-End-time> <Latest-End-Time>
<Node-ID> <Word-String> <Start-Time> <Earliest-End-time> <Latest-End-Time>
... (for all nodes in DAG)
#
Initial <Initial-Node-ID>
Final <Final-Node-ID>
#
BestSegAscr 0 (NODEID ENDFRAME ASCORE)
#
Edges (FROM-NODEID TO-NODEID ASCORE)
<Source-Node-ID> <Destination-Node-ID> <Acoustic Score>
<Source-Node-ID> <Destination-Node-ID> <Acoustic Score>
<Source-Node-ID> <Destination-Node-ID> <Acoustic Score>
... (for all edges in DAG)
End
Note that the node-ID values for DAG nodes are assigned sequentially, starting from 0. Furthermore, they are sorted in descending order of their earliest-end-time attribute.
Here is an example word lattice file.
In addition to the s3.X decoders s3decode, s3livedecode and s3livepretend, this distribution also provides other utility programs.
In alphabetical order:
approx_cont_mgau.c | Fast Gaussian distribution computation
agc.c | Automatic gain control (on signal energy)
ascr.c | Senone acoustic scores
beam.c | Pruning beam widths
bio.c | Binary file I/O support
cmn.c | Cepstral mean normalization and variance normalization
corpus.c | Control file processing
cont_mgau.c | Mixture Gaussians (acoustic model)
s3decode.c | Main file for s3decode
dict.c | Pronunciation lexicon
dict2pid.c | Generation of triphones for the pronunciation dictionary
feat.c | Feature vector computation
fillpen.c | Filler word probabilities
gausubvq.c | Standalone acoustic model sub-vector quantizer
hmm.c | HMM evaluation
hyp.h | Recognition hypotheses data type
kb.h | All knowledge bases and search structures used by the decoder
kbcore.c | Collection of core knowledge bases
lextree.c | Lexical search tree
s3live.c | Live decoder functions
lm.c | Trigram language model
logs3.c | Support for log-likelihood operations
main_live_example.c | Main file for s3livedecode, showing use of live_decode_API.h
main_live_pretend.c | Main file for s3livepretend, showing use of live_decode_API.h
mdef.c | Acoustic model definition
mllr.c | Transformation of Gaussian means based on a linear regression matrix (MLLR)
s3types.h | Various data types, for ease of modification
subvq.c | Sub-vector quantized acoustic model
tmat.c | HMM transition matrices (topology definition)
vector.c | Vector operations, quantization, etc.
vithist.c | Backpointer table (Viterbi history)
wid.c | Mapping between LM and lexicon word IDs