Signal Preprocessing
General Info on Preprocessing
Signal preprocessing is a broad topic, wide enough to fill many
books (and indeed many have been written). It can be so complicated
that the preprocessing even changes several times during a
recognition run. It is obvious that doing recognition on raw
samples is virtually impossible. All working recognizers apply
some kind of transformation from the time domain into the
frequency domain. This is usually some kind of FFT producing
spectral coefficients, or a further step producing cepstral
coefficients. Often other features such as the zero-crossing rate,
power information, etc. are also extracted. In Janus, we've found
that applying a matrix transformation to the frequency-domain
coefficients which performs a linear discriminant analysis (LDA)
is very helpful. In some special cases we also do some extra
tricks like squeezing or warping the time or the frequency axis.
On this page we won't discuss the available kinds of preprocessing
in detail, nor which of the techniques used in the speech research
community are useful under which circumstances.
Preprocessing in Janus
Janus has a very flexible preprocessing module, which offers a
gazillion algorithms and is steadily growing. The central
object class of this module is called FeatureSet. Although
it is entirely possible to do all preprocessing steps manually
and explicitly, you are encouraged to use two mechanisms offered
by the FeatureSet object, namely the definition of a so-called
feature description file, and a feature access rule. The access
rule defines where to find the actual file that contains the
recording. This rule can be a complicated Tcl script, but usually
it just plugs a few variables together, e.g. a speaker name
and an utterance ID, to form the UNIX filesystem path to the
desired recording file. The variables that can be used are defined
in the database that describes all utterances. This way, training
scripts can remain constant across different tasks: every script
does the same thing, namely get an utterance ID from somewhere,
ask the database for information about that utterance, and pass
that information to the FeatureSet, telling it to load the
necessary file(s) and perform the preprocessing steps.
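To make this concrete, an access rule could be sketched as the
following Tcl fragment. The variable names spk and utt and the
directory layout are invented for illustration; the exact form in
a real Janus setup will differ:

```tcl
# Hypothetical access rule: combine database fields into the path
# of the recording file.  The variables spk and utt are assumed to
# have been filled in from the utterance database; the directory
# layout /data/adc/... is made up for this example.
set file /data/adc/$spk/$utt.adc
```

Because the rule is an arbitrary Tcl script, it can also contain
conditionals, e.g. to pick a different directory for compressed
recordings.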
These steps are defined in the feature description file, which is a
Tcl script using variables that have been defined in the access rule.
Such a feature description file could contain commands along the
lines of "load ADC from $file", "compute MEL spectral coefficients",
and "apply LDA transformation", where the $file variable was
defined in the access rule.
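Those three commands might look roughly like this in a feature
description file. The method names and their arguments here are
only a sketch of the general shape, not the definitive Janus API;
$fes stands for the FeatureSet object and $ldaMatrix for an
assumed matrix object:

```tcl
# Hypothetical feature description file.  $fes is the FeatureSet
# object, $file was set by the access rule, and $ldaMatrix is an
# assumed matrix object; command names and options are illustrative.
$fes readADC ADC $file             ;# load raw samples from $file
$fes adc2mel MEL ADC 16ms          ;# compute MEL spectral coefficients
$fes matmul  LDA MEL $ldaMatrix    ;# apply the LDA transformation
```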
The do-it-yourself pages on this topic contain some examples of
feature description files and access rules.