Signal Preprocessing
General Info on Preprocessing
Signal preprocessing is a broad topic, wide enough to fill many
books (and indeed many have been written). It can be so complicated
that the preprocessing even changes several times during a
recognition run. It is obvious that doing recognition on raw
samples is virtually impossible. All working recognizers apply
some kind of transformation from the time domain into the
frequency domain. This is usually some kind of FFT producing
spectral coefficients, or a further step producing cepstral
coefficients. Often other features such as the zero-crossing rate,
power information, etc. are also extracted. In Janus, we've found
that applying a matrix transformation to the frequency-domain
coefficients which performs a linear discriminant analysis (LDA)
is very helpful. In some special cases we also do some extra
tricks like squeezing or warping the time or the frequency axis.
On this page we won't discuss the available kinds of preprocessing
in detail, nor which of the techniques used in the speech research
community are useful under which circumstances.
Preprocessing in Janus
Janus has a very flexible preprocessing module, which offers a
gazillion algorithms and is steadily growing. The central
object class of this module is called FeatureSet. Although
it is entirely possible to do all preprocessing steps manually
and explicitly, you are encouraged to use two mechanisms offered
by the FeatureSet object, namely the definition of a so-called
feature description file, and a feature access rule. The access
rule defines where to find the actual file that contains the
recording. This rule can be a complicated Tcl script, but usually
it just plugs a few variables together, e.g. a speaker name
and an utterance ID, to form the UNIX filesystem path to the
desired recording file. The variables that can be used are defined
in the database that describes all utterances. This way, training
scripts can remain constant across different tasks: every script
does the same thing, namely get an utterance ID from somewhere,
ask the database for information about that utterance, and pass
that information to the FeatureSet, telling it to load the
necessary file(s) and perform the preprocessing steps.
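To make this concrete, an access rule could be sketched as the
following Tcl fragment. The variable names spk and utt and the
directory layout are invented for illustration; the exact form in
a real Janus setup will differ:

```tcl
# Hypothetical access rule: combine database fields into the path
# of the recording file.  The variables spk and utt are assumed to
# have been filled in from the utterance database; the directory
# layout /data/adc/... is made up for this example.
set file /data/adc/$spk/$utt.adc
```

Because the rule is an arbitrary Tcl script, it can also contain
conditionals, e.g. to pick a different directory for compressed
recordings.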
These steps are defined in the feature description file, which is a
Tcl script using variables that have been defined in the access rule.
Such a feature description file could contain commands along the
lines of "load ADC from $file", "compute MEL spectral coefficients",
and "apply LDA transformation", where the $file variable was
defined in the access rule.
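Those three commands might look roughly like this in a feature
description file. The method names and their arguments here are
only a sketch of the general shape, not the definitive Janus API;
$fes stands for the FeatureSet object and $ldaMatrix for an
assumed matrix object:

```tcl
# Hypothetical feature description file.  $fes is the FeatureSet
# object, $file was set by the access rule, and $ldaMatrix is an
# assumed matrix object; command names and options are illustrative.
$fes readADC ADC $file             ;# load raw samples from $file
$fes adc2mel MEL ADC 16ms          ;# compute MEL spectral coefficients
$fes matmul  LDA MEL $ldaMatrix    ;# apply the LDA transformation
```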
The do-it-yourself pages on this topic contain some examples of
feature description files and access rules.