Original by Mosur Ravishankar (Ravi)
Maintained by Kevin A. Lenzo (lenzo@cs.cmu.edu)
School of Computer Science
Carnegie Mellon University
Copyright (c) 1997-2001 Carnegie Mellon University.
Sphinx2 consists of a set of libraries that include core speech recognition functions as well as auxiliary ones such as low-level audio capture. The libraries are written in C and have been compiled on several Unix platforms (DEC Alpha, Sun Sparc, HPs) and Pentium/PentiumPro PCs running WindowsNT or Windows95. A number of demo applications based on this recognition engine are also provided.
Several features specifically intended for developing real applications have been included in Sphinx2. For example, many aspects of the decoder can be reconfigured at run time. New language models can be loaded or switched dynamically. Similarly, new words and pronunciations can be added. The audio input data can be automatically logged to files for any future analysis.
Internally, the decoder maintains the senone PDF values as 32-bit int variables. The memory requirements can be considerably reduced by converting the senone PDF values to 8-bit quantities (see Section Building 8-Bit Senone Dump Files). (The senone mapping information is provided by a .phone and a .map file.)
It is not necessary to create a new senone mapping for every distinct dictionary. Smaller dictionaries can use the senone mapping created for larger dictionaries. However, the mapping information must be consistent with the acoustic model being used.
Large LMs load very slowly. The delay can be avoided by providing LM dump files along with the original LMs. The Sphinx2 decoder automatically creates LM dump files for large LMs (see Section Building LM Dump Files).
Details of the recognition engine can be found in Ravishankar's Ph.D. thesis (PostScript file).
The active vocabulary is the intersection of the words in the active language model and the pronunciation dictionary. The recognizer can only output words from this intersection.
The recognition engine can be reconfigured in several ways, but generally only between utterances:
As we shall see below, none of the core decoder API functions directly accesses any audio device. Rather, the application is responsible for collecting audio data to be decoded. This gives applications the freedom to decode audio data originating at any source at all---standard audio devices, pre-recorded files, data from a remote location over a network, etc. Since most applications ultimately need to access common audio devices and to perform some form of silence filtering to detect speech/no-speech conditions, two additional modules are provided with Sphinx2 as a convenience.
(NOTE: The APIs often use int32 and int16 for 32-bit and 16-bit integer types. These are #defined at compile time, usually as int and short, respectively.)

The low-level audio recording API is defined in include/ad.h and is summarized below:
ad_open : Opens an audio device for recording. Returns a handle to the opened device. (Currently 16KHz, 16-bit PCM only.)

ad_start_rec : Starts recording on the audio device associated with the specified handle.

ad_read : Reads up to a specified number of samples into a given buffer. Returns the number of samples actually read, which may be less than the number requested; in particular, it may return 0 samples if no data is available. Most systems have only a limited amount of internal buffering (at most a few seconds), so this function must be called frequently enough to avoid buffer overflow.

ad_stop_rec : Stops recording. (However, the system may still have internally buffered data remaining to be read.)

ad_close : Closes the audio device associated with the specified audio handle.
See examples/adrec.c and examples/adpow.c for two examples demonstrating the use of the above functions.
A similar set of playback functions are provided (currently implemented only on WindowsNT/Windows95 PC platforms):
ad_open_play : Opens an audio device for playback. Returns a handle to the opened device. (Currently 16KHz, 16-bit PCM only.)

ad_start_play : Starts playback on the device associated with the given handle.

ad_write : Sends a buffer of samples for playback. The function may accept fewer samples than provided, depending on available internal buffering; it returns the number of samples actually accepted. The application must provide data sufficiently rapidly to avoid breaks in playback.

ad_stop_play : Ends playback. Playback continues until all buffered data has been consumed.

ad_close_play : Closes the audio device associated with the specified handle.
A utility function ad_mu2li is also provided for converting 8-bit mu-law samples into 16-bit linear PCM samples. See examples/adplay.c for an example that plays back audio samples from a given input file.
The implementation of the audio API for the various platforms is contained in the analog-to-digital (A/D) library for the given architecture.
The silence filtering module is interposed between the raw audio input source
and the application. The application calls the function
cont_ad_read
instead of directly reading the raw A/D
input source (e.g., via the ad_read
function described
above).
cont_ad_read
returns only those segments of
input audio that it determines to be non-silence. Additional timestamp information
is provided to inform the application about silence regions that have been dropped.
The complete continuous listening API is defined in
include/cont_ad.h
and is summarized below:
cont_ad_init : Associates a new continuous listening module instance with a specified raw A/D handle and a corresponding read function pointer. E.g., these may be the handle returned by ad_open and the function ad_read described above.

cont_ad_calib : Calibrates the background silence level by reading the raw audio for a few seconds. It should be done once immediately after cont_ad_init, and after any environmental change.

cont_ad_read : Reads and returns the next available block of non-silence data in a given buffer. (Uses the read function and handle supplied to cont_ad_init to obtain the raw A/D data.) More details are provided below.

cont_ad_reset : Flushes any data buffered inside the module. Useful for discarding accumulated but unprocessed speech.

cont_ad_set_thresh : Adjusts the silence and speech thresholds.

cont_ad_detach : Detaches the specified continuous listening module from the associated audio device.

cont_ad_attach : Attaches the specified continuous listening module to the specified audio device. (Similar to cont_ad_init, but without the need to calibrate the audio device.)

cont_ad_close : Closes the continuous listening module.
More on the cont_ad_read function: operationally, every call to cont_ad_read causes the module to read the associated raw A/D source (as much data as possible and available), scan it for speech (non-silence) segments, and enqueue them internally. It returns the first available segment of speech data, if any. In addition to returning non-silence data, the function also updates a couple of parameters that may be of interest to the application:
- The current signal power level, available in the siglvl member variable of the cont_ad_t structure returned by cont_ad_init().
- A timestamp, measured as the total number of raw samples consumed from the A/D source up to the end of the data returned by the most recent cont_ad_read() call. This is in the read_ts member variable of the cont_ad_t structure.
For example, if upon two successive calls to cont_ad_read the timestamp is 100000 and 116000, respectively, the application can determine that 1 sec (16000 samples, at the 16KHz sampling rate) of silence has been gobbled up between the two calls.
Silence regions aren't chopped off completely. About 50-100ms worth of silence is preserved at either end of a speech segment and passed on to the application.
Finally, the continuous listener won't concatenate speech segments separated by
silence. That is, the data returned by a single call to
cont_ad_read
will not span raw audio separated by
silence that has been gobbled up.
cont_ad_read
must be called frequently enough to avoid
loss of input data owing to buffer overflow. The application is responsible for
turning actual recording on and off, if applicable. In particular, it must ensure
that recording is on during calibration and normal operation.
See examples/cont_adseg.c
for an example that uses the continuous listening module to segment live audio input
into separate utterances. Similarly,
examples/cont_fileseg.c
segments a given pre-recorded file containing audio data into utterances.
The implementation of continuous listening is in
src/libfe/cont_ad.c
.
Applications that use this module are required to link with
libfe
and libcommon
(and libad
if necessary).
The core decoder API is defined in include/fbs.h.
The two functions pertaining to initialization and final cleanup are:
fbs_init : Initializes the decoder. The input arguments (in the form of the common command line argument list argc,argv) specify the input databases (acoustic, lexical, and language models) and various other decoder configuration options. (See Arguments Reference.) If batch-mode processing is indicated (see the -ctlfn option below), it happens as part of this initialization.

fbs_end : Cleans up the internals of the decoder before the application exits.
Sphinx2 applications can use the following functions to decode speech into text,
one utterance at a time:
uttproc_begin_utt : Begins decoding the next utterance. The application can assign an id string to it; if not, one is automatically created and assigned.

uttproc_rawdata : Processes (decodes) the next chunk of raw A/D data in the current utterance. This can be non-blocking, in which case much of the data may simply be queued internally for later processing. Note that only 16-bit linear PCM-encoded samples can be processed. The A/D library provides a separate function, ad_mu2li, for converting 8-bit mu-law encoded data into 16-bit PCM format.

uttproc_cepdata : An alternative to uttproc_rawdata if the application wishes to decode cepstrum data instead of raw A/D data.

uttproc_end_utt : Indicates that no more input data is forthcoming in the current utterance.

uttproc_result : Finishes processing internally queued data and returns the final recognition result string. It can also be non-blocking, in which case it may return after processing only some of the internally queued data.

uttproc_result_seg : Like uttproc_result, but returns word segmentation information (measured in 10msec frames) instead of the recognition string. One can use either this function or uttproc_result to finish decoding, but not both.

uttproc_partial_result : Before the final result is available, this function can be used to obtain the most up-to-date partial result (for example, as feedback to the user).

uttproc_partial_result_seg : Like uttproc_partial_result, but returns word segmentation information (measured in 10msec frames) instead of the recognition string.

uttproc_abort_utt : An alternative to uttproc_end_utt that terminates the current utterance. No further recognition results can be obtained for it.

search_get_alt : Returns N-best hypotheses for the utterance (see further details in include/fbs.h).
The non-blocking mode of operation in uttproc_rawdata and uttproc_result allows the application to respond to user-interface events in real time.
The application code fragment for decoding one utterance typically looks as follows:

    uttproc_begin_utt (....);
    while (not end of utterance) {      /* indicated externally, somehow */
        read any available A/D data;    /* possibly 0 length */
        uttproc_rawdata (A/D data read above, non-blocking);
    }
    uttproc_end_utt ();
    uttproc_result (...., blocking);

See demo applications in examples for several variations.
Multiple, named LMs can be resident with the decoder module, either
read in during initialization, or dynamically at run time. However, exactly
one LM must be selected and active for decoding any given utterance.
As mentioned earlier, the active vocabulary for each utterance is given by the
intersection of the pronunciation dictionary and the currently active
LM. The following auxiliary functions allow the application to control language
modelling related aspects of the decoder:
lm_read : Reads in a new language model from a given file and associates it with a given name. The application only needs this function to create and load LMs dynamically at run time, rather than at initialization.

lm_delete : Deletes the named LM from the decoder repertory.

uttproc_set_lm : Sets the currently active LM to the named one. Must only be invoked between utterances.

uttproc_set_context : Sets a two-word history for the next utterance to be decoded, giving its first words additional context that can be exploited by the LM.
The raw input data for each utterance and/or the cepstrum data derived from it
can be logged to specified directories:
uttproc_set_rawlogdir : Specifies the directory to which utterance A/D data should be logged. An utterance is logged to the file <id>.raw, where <id> is the string assigned to it by uttproc_begin_utt.

uttproc_set_mfclogdir : Specifies the directory to which utterance cepstrum data should be logged. Like the A/D files above, an utterance is logged to the file <id>.mfc.

uttproc_get_uttid : Retrieves the string id for the current or most recent utterance. Useful for locating the logged A/D data and cepstrum files, for example.

uttproc_allphone_cepfile : Performs allphone recognition on the given file and returns the resulting phone segmentation.
The following demo applications are provided in examples:

sphinx2-ptt : demonstrates an application in which the user explicitly indicates the start and end of each utterance using the <RETURN> keyboard key. (On WindowsNT/Windows95 systems, the ending <RETURN> is not used. Instead, the utterance is terminated after a fixed duration.)
sphinx2-continuous : demonstrates the interaction of continuous listening and decoding. An endless audio input stream is automatically segmented into utterances using the continuous listening module, and the utterances are decoded. The timestamps returned by the continuous listening module are used to locate gaps in the speech data of at least 1 sec, thus marking the utterance boundaries.
sh autogen.sh    (if necessary)
./configure
make
make test
make install
The decoder runs in allphone mode if the -allphone flag is TRUE during the initialization. In this mode, no language model should be provided; i.e., the -lmfn and -lmctlfn arguments should be omitted.
In forced alignment mode, too, the -lmfn and -lmctlfn arguments should be omitted. The set of utterances (speech data) is given by the -ctlfn argument, as usual. In addition, the corresponding transcripts should be given in a parallel file, specified by the -tactlfn argument. Each line in this file should contain the transcript for one utterance (and nothing else; in particular, no utterance-id). The first line of this file should contain just the string *align_all*.
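For illustration, a transcript file for a control file listing three utterances might look like the fragment below (the transcripts themselves are made up):

```
*align_all*
HELLO WORLD
THIS IS A TEST
GOOD BYE
```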
Alignments at the word, phone, and state levels can be obtained by setting the flags -taword, -taphone, and -tastate individually to TRUE or FALSE. Alignments are written to stdout (the log file).
All configuration arguments are passed to the decoder through fbs_init(int argc, char *argv[]), defined in include/fbs.h. (Applications built around the Sphinx2 libraries, of course, can have additional arguments.) Many arguments, such as the input model databases, must be specified by the user. We cover the more important ones below (the remaining ones have reasonable default values):
Flag | Description | Default |
---|---|---|
-lmfn
|
Optional DARPA format bigram/trigram backoff LM file with the empty string as its name. | None. |
-lmctlfn
|
Optional LM control file with a list of LM files and associated names (one line per entry). This is how multiple LMs can be loaded during initialization. | None. |
-kbdumpdir
|
Optional directory containing precompiled binary versions of LM files (see Building LM Dump Files). | None. |
-dictfn
|
Main pronunciation dictionary file. | None. |
-oovdictfn
|
Optional out-of-vocabulary (OOV) pronunciation dictionary. These are added to the
unnamed LM (read from -lmfn file) with unigram
probability given by -oovugprob .
|
None. |
-ndictfn
|
Optional "noise" words pronunciation dictionary. Noise words are not part of any LM and, like silence, can be inserted transparently anywhere in the utterance. | None. |
-phnfn -mapfn
|
Phone and map files with senone mapping information for the given dictionary and acoustic model. | None. |
-hmmdir -hmmdirlist -cbdir
|
Directory with Sphinx-II semi-continuous HMM acoustic models and codebooks. | None. |
-sendumpfn -8bsen
|
Optional 8-bit senone model file created from the 32-bit HMM models
(see Building 8-Bit Senone Dump Files).
-8bsen should be TRUE if
the 8-bit senones are used.
|
None. |
Flag | Description | Default |
---|---|---|
-ctlfn -ctloffset -ctlcount
|
Batch-mode control file listing utterance files (without their file-extension)
to decode. -ctloffset is the number of initial
utterances in the file to be skipped, and -ctlcount
the number to be processed (after the skip, if any).
-ctlfn must not be specified for live-mode or
application-driven operation.
|
None 0 All |
-datadir |
If the control file (-ctlfn argument) entries are relative pathnames, an optional directory prefix for them may be specified using this argument. | None |
-allphone |
Should be TRUE to configure the recognition engine for
allphone mode operation.
|
FALSE
|
-tactlfn |
Input transcript file, parallel to the control file (-ctlfn )
in forced alignment mode.
|
None |
-adcin -adcext -adchdr -adcendian
|
In batch mode, -adcin selects A/D
(TRUE ) or cepstrum input data
(FALSE ).
If TRUE , -adcext
is the file extension to be appended to names listed in the
-ctlfn argument file,
-adchdr the number of bytes of header in each
input file, and -adcendian their byte ordering:
0 for big-endian, 1 for little-endian. With these flags, most A/D data file
formats can be processed directly.
|
FALSE raw 0 1 |
-normmean -nmprior
|
Cepstral mean normalization (CMN) option. If -nmprior
is FALSE , CMN computed on current utterance only
(usually batch mode), otherwise based on past history (live mode).
|
TRUE FALSE
|
-compress -compressprior
|
Silence deletion (within decoder, not related to continuous
listening). If -compressprior is
FALSE , based on current utterance statistics
(batch mode), otherwise based on past history (live mode).
-compress should be
FALSE if continuous listening is used.
|
FALSE FALSE
|
-agcmax -agcemax
|
Automatic gain control (AGC) option. In batch mode only
-agcmax should be TRUE ,
and in live mode only -agcemax .
|
FALSE FALSE
|
-live
|
Forces some live-mode flags: sets -nmprior and -compressprior to TRUE, and -agcemax to TRUE if any AGC is on.
|
FALSE
|
-samp
|
Sampling rate; must be 8000 or 16000. |
16000
|
-fwdflat
|
Run flat-lexical Viterbi search after tree-structured pass (for better
accuracy). Usually FALSE in live mode.
|
TRUE
|
-bestpath
|
Run global best path search over Viterbi search word lattice output (for better accuracy). |
TRUE
|
-compallsen
|
Compute all senones, whether active or inactive, in each frame. |
FALSE
|
-latsize
|
Word lattice entries to be allocated. Longer sentences need larger lattices. | 50000 |
Flag | Description | Default |
---|---|---|
-top
|
Number of codewords computed per frame. Usually, narrowed to 1 in live mode. | 4 |
-beam -npbeam
|
Main pruning thresholds for tree search. Usually narrowed down to 2e-6 in live mode. |
1e-6 1e-6 |
-lpbeam
|
Additional pruning threshold for transitions to leaf nodes of lexical tree. Usually narrowed down to 2e-5 in live mode. | 1e-5 |
-lponlybeam -nwbeam
|
Yet more pruning thresholds for leaf nodes and exits from lexical tree. Usually narrowed down to 5e-4 in live mode. |
3e-4 3e-4 |
-fwdflatbeam -fwdflatnwbeam
|
Main and word-exit pruning thresholds for the optional, flat lexical Viterbi search. |
1e-8 3e-4 |
-topsenfrm -topsenthresh
|
No. of lookahead frames for predicting active base phones. (If <=1, all base phones
assumed to be active every frame.) -topsenthresh is
log(pruning threshold) applied to raw senone scores to determine active phones in each
frame.
|
1 -60000 |
Flag | Description | Default |
---|---|---|
-langwt -fwdflatlw -rescorelw
|
Language weights applied during lexical tree Viterbi search, flat-structured Viterbi search, and global word lattice search, respectively. |
6.5 8.5 9.5 |
-ugwt
|
Unigram weight for interpolating unigram probabilities with uniform distribution. Typically in the range 0.5-0.8. | 1.0 |
-inspen -silpen -fillpen
|
Word insertion penalty or probability (for words in the LM),
insertion penalty for the silence word, and
insertion penalty for noise words (from -ndictfn file) if any.
|
0.65 0.005 1e-8 |
-oovugprob
|
Unigram probability (logprob) for OOV words from
-oovdictfn file, if any.
|
-4.5 |
Flag | Description | Default |
---|---|---|
-matchfn
|
Filename to which the final recognition string for each utterance is written. (Old format, word-id at the end.) | None |
-matchsegfn
|
Like -matchfn , but contains word segmentation
info: startframe #frames word...
(New format, word-id at the beginning.)
|
None |
-reportpron
|
Causes word pronunciation to be included in output files. |
FALSE
|
-rawlogdir
|
If specified, logs raw A/D input samples for each utterance to the indicated directory. (One file per utterance, named <uttid>.raw.) | None |
-mfclogdir
|
If specified, logs cepstrum data for each utterance to the indicated directory. (One file per utterance, named <uttid>.mfc.) | None |
-dumplatdir
|
If specified, dumps word lattice for each utterance to a file in this directory. | None |
-logfn
|
Filename to which decoder logging information is written. | stdout/stderr |
-backtrace
|
Includes detailed word backtrace information in log file. |
TRUE
|
-nbest
|
No. of N-best hypotheses to be produced. Currently, this flag is only useful in batch
mode. But an application can always directly invoke
search_get_alt to obtain them.
Also, the current implementation is lacking in some details (e.g., in returning
detailed scores).
|
0 |
-nbestdir
|
Directory to which N-best files are written (one per utterance). | Current dir. |
-taword -taphone -tastate
|
Whether word, phone, and state alignment output should be produced when running in forced alignment mode. |
TRUE TRUE FALSE
|
Finally, one of the arguments can be:
-argfile
filename.
This causes additional arguments to be read in from the given
file. Lines beginning with the '#' character in this file are ignored.
Recursive -argfile
specifications are not allowed.
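For illustration, an arguments file might look like the fragment below. The paths and values are made up, and the layout (whitespace-separated flag/value pairs, assumed here to be one pair per line) simply mirrors the command-line syntax:

```
# Sphinx2 arguments file (lines beginning with '#' are ignored)
-dictfn  mydict.dict
-lmfn    mylm.arpabo
-samp    16000
-live    TRUE
```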
The complete set of flags, in alphabetical order:

Flag | Description |
---|---|
-8bsen | Use 8-bit senone dump file. |
-adcendian | A/D input file byte-ordering. |
-adcext | A/D input file extension. |
-adchdr | No. bytes of header in A/D input file. |
-adcin | Input file contains A/D samples or cepstra (TRUE/FALSE). |
-agcemax | Compute AGC (max C0 normalized to 0; estimated, live mode). |
-agcmax | Compute AGC (max C0 normalized to 0 based on current utterance). |
-argfile | Arguments file. |
-backtrace | Provide detailed backtrace in log file. |
-beam | Main pruning beamwidth. |
-bestpath | Run global best path algorithm on word lattice. |
-cbdir | Codebooks directory. |
-compallsen | Compute all senones. |
-compress | Remove silence frames (based on C0 statistics). |
-compressprior | Remove silence frames (based on C0 statistics from prior history). |
-ctlcount | No. of utterances to decode in batch mode. |
-ctlfn | Control file listing utterances to decode in batch mode. |
-ctloffset | No. of initial utterances to be skipped from control file. |
-datadir | Directory prefix for control file entries. |
-dictfn | Main pronunciation dictionary. |
-dumplatdir | Directory for dumping word lattices. |
-fillpen | Noise word penalty (probability). |
-fwdflat | Run flat-lexical Viterbi search. |
-fwdflatbeam | Main beam width for flat search. |
-fwdflatlw | Language weight for flat search. |
-fwdflatnwbeam | Word-exit beam width for flat search. |
-hmmdir | Directory containing acoustic models. |
-hmmdirlist | Directory containing acoustic models. |
-inspen | Word insertion penalty (probability). |
-kbdumpdir | Directory containing LM dump files. |
-langwt | Language weight for lexical tree search. |
-latsize | Size of word lattice to be allocated. |
-live | Live mode. |
-lmctlfn | Control file listing named language model files to be loaded at initialization. |
-lmfn | Unnamed language model file to load at initialization. |
-logfn | Output log file. |
-lpbeam | Transition to last phone beam width. |
-lponlybeam | Last phone internal beam width. |
-mapfn | Senone mapping file. |
-matchfn | Output match file. |
-matchsegfn | Output match file with word segmentation. |
-mfclogdir | Directory for logging cepstrum data for each utterance. |
-nbest | No. of N-best hypotheses to be produced/utterance. |
-nbestdir | Directory for writing N-best hypotheses files. |
-ndictfn | Noise words dictionary. |
-nmprior | Cepstral mean normalization based on prior utterances' statistics. |
-normmean | Cepstral mean normalization. |
-npbeam | Next phone beam width for tree search. |
-nwbeam | Word-exit beam width for tree search. |
-oovdictfn | Out-of-vocabulary words pronunciation dictionary. |
-oovugprob | Unigram probability for OOV words. |
-phnfn | Phone file (senone mapping information). |
-rawlogdir | Directory for logging A/D data for each utterance. |
-reportpron | Show actual word pronunciation in output match files. |
-rescorelw | Language weight for best path search. |
-samp | Input audio sampling rate (16000/8000). |
-sendumpfn | (8-bit) Senone dump file. |
-silpen | Silence word penalty (probability). |
-tactlfn | Forced alignment transcript file. |
-taphone | Whether phone-level alignment information should be output. |
-tastate | Whether state-level alignment information should be output. |
-taword | Whether word-level alignment information should be output. |
-top | No. of top codewords to evaluate in each frame. |
-topsenfrm | No. of frames to look ahead to determine active base phones. |
-topsenthresh | Pruning threshold applied to determine active base phones. |
-ugwt | Unigram weight for interpolating unigram probability with uniform probability. |
Recognition speed can be traded off against accuracy by adjusting the following:

- Narrowing the pruning beamwidths -beam, -npbeam, -lpbeam, -lponlybeam, and -nwbeam uniformly by a factor >1.
- Reducing -top from 4 to 1.
- Setting -topsenfrm >1, and adjusting the corresponding pruning beamwidth -topsenthresh. The former can be set to 3, and the latter between -50000 and -70000. (Threshold values closer to 0 provide tighter pruning.)
- Setting -compallsen appropriately. When -top is 1, it is generally more efficient to compute all senones, but not when -top is 4. However, when using very small vocabularies of just tens of words, it is preferable to compute only the active senones, regardless of the value of -top. (But if -topsenfrm >1, all senones are computed anyway.)
LM dump files can be created by either a standalone program
examples/lm3g2dmp.c
or the decoder. The standalone version can be compiled from the
examples
directory.
The program takes two arguments, the LM source file and a directory in which the
dump file is to be created. It reads the header from the original LM file to determine
the size of the LM. It then forms the binary dump file name by appending a
.DMP
extension to the LM file name. This file is written
to the second (directory) argument. (NOTE: The dump file must not already
exist!!)
Any version of the decoder can also automatically create binary "dump" files
similar to the standalone version described above. It first looks for the
dump file in the directory given by the
-kbdumpdir
argument. If the dump file is present it reads it and ignores the rest of the
original LM file. Otherwise, it reads the LM file and creates a dump file in the
-kbdumpdir
directory so that it can be used in subsequent decoder runs.
The decoder does not create dump files for small LMs that have fewer than an internally defined number of bigrams and trigrams.
The senone PDFs are normally maintained as 32-bit values (read from the acoustic model directory given by the -hmmdir argument). However, they can be clustered down to 8 bits for memory efficiency, without loss of recognition accuracy. The clustering is carried out by an offline process as follows:
First, run the decoder with the -sendumpfn flag set to a temporary file name, the -8bsen flag set to FALSE, and the -lmfn argument omitted. The decoder can be killed after it creates the 32-bit senone dump file, which happens during the initialization and is announced in the log output.
Then run:

    /afs/cs/project/plus-2/s2/Sphinx2/bin/alpha/pdf32to8b 32bit-file 8bit-file

where the first argument to pdf32to8b is the temporary 32-bit dump file created above, and the second argument is the 8-bit output file.
Finally, supply the resulting 8-bit file via the -sendumpfn argument to the decoder, with the -8bsen argument set to TRUE.