Today we will learn how to compute feature vectors from the audio signal. The Janus object for feature computation is the FeatureSet object. You already created one during exercise 1. It stores two types of data: SVector for audio data and FMatrix for pre-processed features.
The FeatureSet has a very large number of methods and we cannot discuss all of them. Therefore, we will concentrate on the methods we need for this lab.
# First we need a FeatureSet object.
> FeatureSet fs

# Now let us start by reading one of our training audio files.
# This can be done with the readADC method.
# This method also has many options:
> fs readADC -help
Options of 'readADC' are:
 <feature>   name of the new feature
 <filename>  name of ADC file (string:"NULL")
 -hm         header mode, kind or size in byte (string:"auto")
 -bm         byte mode (string:"auto")
 -f          1 => skip unnecessary bytes when reading (int:0)
 -chX        selected channel: 1..chN (int:1)
 -chN        number of channels (int:1)
 -from       from (string:"0")
 -to         to (string:"last")
 -sr         sampling rate in kHz (float:0.000000)
 -offset     subtract offset (string:"0")
 -fadeIn     fade in (string:"0")
 -v          verbosity (int:1)
 -startFile  runon: name of start file (string:"adc.start")
 -readyFile  runon: name of ready file (string:"adc.ready")
 -sleep      runon: time to wait before next try (float:0.100000)
 -rmFiles    runon: remove files (int:1)

# We need only a few of them.
# The first parameter is the destination feature name, followed by the name of the file to read the data from.
# The header mode can be e.g. "WAV" for wav files or "1024" to skip the first 1024 bytes of the file.
# The byte mode can be "10" or "01" for big or little endianness,
# or "shorten" for files compressed by the program shorten (supports only version 1!).
# -from and -to tell which portion should be read,
# e.g. -from 100 -to 600     read audio from sample 100 to 600
#      -from 100s -to 600s   read audio from second 100 to second 600
#      -from 100s -to last   read audio from second 100 to end of file

# Now read a shortened example file
> fs readADC ADC /project/Class-11-753/data/CH/adc/030/CH030_1.adc.shn -bm shorten

# configure tells us something about the feature
> fs:ADC configure
{-samplingRate 16.000000} {-shift 0.000000} {-sampleN 139041} {-dcoeffN 0} {-trans 0}

# Let us see what type our feature is
> fs:ADC type
Feature
> fs:ADC.data type
SVector

# Now that we have some audio to play with, let us look at it.
# Janus has a nice tool to visualize features based on tcl/tk
> fs show ADC

This shows the feature ADC (SVector) waveform in a new window.
Do not close the window; we will use it to look at more features as they come up. To show a list of available features to display hit the button
# As mentioned before, the FeatureSet object provides many methods.
> fs -help
# Too much to be shown and explained here.
#
# Let us start with the power.
#
# Compute the signal power over a 30ms window and store it in the feature POWER0
> fs adc2pow POWER0 ADC 30ms
#
# In the featShow window hit the re-read feature button. Now the feature POWER0 is available!
# Click on the POWER0 feature and the waveform disappears. The picture now looks like a bar code.
# The default mode to display features does not fit here!
# In the display menu go to the mode item and select horizontal.
# The feature is redrawn and regions with high power can be identified.
#
# Question: What type is the POWER0 feature?
#
# Now we compute a smoothed log power and normalize the values.
# After each step re-read the feature list in the featShow window and display the new feature.
#
> fs alog POWER1 POWER0 1.0 4.0
# Question: Why do we add a 1 before we take the logarithm?
> fs filter POWER2 POWER1 {-2 {1 2 3 2 1} }
> fs filter POWER3 POWER2 {-2 {1 2 3 2 1} }
#
# Question: What does filter compute? What happens to the POWER2 feature if we replace -2 by 0?
> fs normalize POWER4 POWER3 -min -0.1 -max 0.5
# This command scales and shifts the values so that the minimum is -0.1 and the maximum is 0.5.
# Values below 0.0 indicate a silence region.
# This is a simple energy-based speech detection.
# Question: What does the method 'thresh' do?
# Task: Compute from the feature POWER4 a binary feature SPEECH using the thresh method
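The energy-based detection above can be mimicked outside Janus to see what each step does to the numbers. The following is only a sketch in plain Python, not the Janus implementation: the function names, the use of the mean squared amplitude as "power", the base-10 logarithm, the division by the coefficient sum in the smoothing filter, the clamping at the signal edges, and the synthetic test signal are all assumptions made for illustration.

```python
import math

def frame_power(samples, rate_hz, win_ms=30.0, shift_ms=10.0):
    """Mean squared amplitude of each analysis window (assumed definition of power)."""
    win = int(rate_hz * win_ms / 1000)
    shift = int(rate_hz * shift_ms / 1000)
    return [sum(x * x for x in samples[s:s + win]) / win
            for s in range(0, len(samples) - win + 1, shift)]

def log_compress(values, offset=1.0):
    """log10(offset + v) for every frame value."""
    return [math.log10(offset + v) for v in values]

def smooth(values, offset, coeffs):
    """Weighted moving average; coeffs[k] is applied to frame t + offset + k.
    Indices outside the signal are clamped to the first or last frame."""
    total = float(sum(coeffs))
    last = len(values) - 1
    out = []
    for t in range(len(values)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            idx = min(max(t + offset + k, 0), last)
            acc += c * values[idx]
        out.append(acc / total)
    return out

def rescale(values, new_min, new_max):
    """Shift and scale so that the smallest value is new_min and the largest new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min] * len(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

# Synthetic signal, made up for this sketch: 0.2 s silence, 0.2 s tone, 0.2 s silence.
RATE = 16000
tone = [1000.0 * math.sin(2 * math.pi * 440 * n / RATE) for n in range(3200)]
signal = [0.0] * 3200 + tone + [0.0] * 3200

power = frame_power(signal, RATE)
log_power = log_compress(power)
smoothed = smooth(smooth(log_power, -2, [1, 2, 3, 2, 1]), -2, [1, 2, 3, 2, 1])
normalized = rescale(smoothed, -0.1, 0.5)

# Frames above 0.0 are treated as speech, frames below as silence.
is_speech = [v > 0.0 for v in normalized]
print("".join("#" if s else "." for s in is_speech))
```

Printing one character per frame gives a quick text view of where the detector fires, similar in spirit to inspecting the feature in the featShow window.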
# The method spectrum computes a power spectrum of an audio signal
> fs spectrum -help
Options of 'spectrum' are:
 <feature>         name of the new feature
 <source_feature>  name of the source feature
 <win>             window size (string:"16msec")
 -shift            shift (string:"10.000000msec")

# Let's compute the power spectrum from ADC;
# the default is a Hamming window; we use a 30ms window
# and the default 10ms frame shift
> fs spectrum FFT ADC 30ms

# Take a look at the feature in the featShow window.
# You have to change the display mode back to Grey.
# Not so exciting, because it is mostly white?
# Compress the values with a logarithm
> fs log lFFT FFT 1 1

# Take a look at the new feature. Can you see some energy peaks?
# They are called formants and their position characterizes different vowels.
# Some people can read spectrograms!
# In some parts of the signal, a harmonic structure is also clearly visible.
# Question: Where does the harmonic structure come from?

# We now create a mel-scale filter; in Janus this is realized as a band matrix
> FBMatrix matrixMEL

# This computes a simple mel-scale filter (30 mel bins),
# sampling rate 16000Hz (max frequency 8kHz);
# the input is 257 frequency bins from the power FFT
> matrixMEL mel -N 30 -p 257 -rate 16000

# More about mel-scale: http://en.wikipedia.org/wiki/Mel_scale
# Question: What is the motivation for mel-scale filtering?
> fs filterbank MEL FFT matrixMEL
> fs log lMEL MEL 1 1

# With log-mel you can already do speech recognition.
# In fact this feature once was state-of-the-art signal processing
# for speech recognition.
# Currently mel-cepstrum is the state-of-the-art signal processing
# for speech recognition.
> FMatrix matrixCOS

# Transform 30 log-mel coefficients into 13 cepstra
# type 1 = DCT-II, type 0 = DCT-I
> matrixCOS cosine 13 30 -type 1
# Question: What are the dimensions of the cosine transform matrix?
# More about cosine transform: http://en.wikipedia.org/wiki/DCT

# Compute MCEP = matrixCOS * lMEL
> fs matmul MCEP lMEL matrixCOS
# More about cepstra: http://en.wikipedia.org/wiki/Cepstrum

# Normalize the mean of the cepstrum (CMN) (no variance normalisation, a=0!).
# If we also want to normalize the variance to
# e.g. 2, the parameter -a has to be set to 2.
> fs meansub MCEP MCEP -a 0
# Question: What happens to convolutional "noise" after the mean is subtracted from MCEP?

# Deltas are dynamic features that measure the change of a signal over time
> fs filter DELTA MCEP {-3 {-3 -2 -1 0 1 2 3}}

# Combine different features into one big FEAT feature
> fs merge FEAT MCEP DELTA
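The last steps of the chain (cosine transform, mean subtraction, deltas, merging) are plain linear operations on frames, so they can be traced in a few lines of Python. This is a sketch under assumptions and not Janus code: the unscaled DCT-II definition, the clamping at the utterance edges, the unnormalized delta filter, and the synthetic log-mel frames are choices made here for illustration; Janus may scale or pad differently.

```python
import math

def dct2_matrix(n_out, n_in):
    """Unscaled DCT-II basis: row k, column n holds cos(pi * k * (n + 0.5) / n_in)."""
    return [[math.cos(math.pi * k * (n + 0.5) / n_in) for n in range(n_in)]
            for k in range(n_out)]

def apply_matrix(matrix, frames):
    """Multiply every frame (a list of n_in values) by the matrix."""
    return [[sum(c * x for c, x in zip(row, frame)) for row in matrix]
            for frame in frames]

def subtract_mean(frames):
    """Remove the per-coefficient mean over all frames of the utterance."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

def delta(frames, offset=-3, coeffs=(-3, -2, -1, 0, 1, 2, 3)):
    """Apply the filter along the time axis; coeffs[k] weights frame t + offset + k.
    Frames beyond the edges are replaced by the first or last frame."""
    last, dim = len(frames) - 1, len(frames[0])
    out = []
    for t in range(len(frames)):
        row = [0.0] * dim
        for k, c in enumerate(coeffs):
            src = frames[min(max(t + offset + k, 0), last)]
            for d in range(dim):
                row[d] += c * src[d]
        out.append(row)
    return out

# Synthetic "log-mel" input, made up for this sketch: 20 frames with 30 bins each.
log_mel = [[0.1 * t + math.sin(b) for b in range(30)] for t in range(20)]

cosine = dct2_matrix(13, 30)
mcep = subtract_mean(apply_matrix(cosine, log_mel))
deltas = delta(mcep)

# Merging concatenates the coefficients of both features frame by frame.
feat = [m + d for m, d in zip(mcep, deltas)]
print(len(feat), "frames with", len(feat[0]), "coefficients each")
```

With 13 cepstral coefficients and 13 deltas the merged frames have 26 coefficients, which is the feature dimension used for the codebooks later in this lab.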
We now know how to create features for our speech recognition system from audio files.
However, we have to process many audio files, and Janus provides a way to compute these features behind the scenes. To do this, the FeatureSet instance has to know how to compute the features. A description contains only the body of the feature processing; we call this a feature description (featDesc).
#
# creating a feature description file (and some more Tcl)
#
> fs setDesc -help

# This example featDesc only demonstrates how parameters are passed
# and how to refer to the current FeatureSet
> fs setDesc { parray arg ; puts "This is the FeatureSet $fes" }
> fs eval { {ADCPATH /project/Class-11-753/data/CH/ } {ADC test.adc.shn} {TEXT yyy} }
arg(ADC)     = test.adc.shn
arg(ADCPATH) = /project/Class-11-753/data/CH/
arg(TEXT)    = yyy
This is the FeatureSet fs

# We can access the list elements in a special array "arg",
# and $fes contains the name of the calling FeatureSet.

# This is the procedure that was created when "fs setDesc" was executed.
proc featureSetEval<fs> {fes {sampleList {}}} {
    set sampleList [$fes access $sampleList]
    makeArray arg $sampleList
    # the featDesc code is inserted here!
    parray arg ; puts "This is the FeatureSet $fes"
}

# With the following Tcl commands you can query the names
# of existing Tcl procedures
> info proc featureSetEval*
featureSetEval featureSetEval<fs>
> info args featureSetEval<fs>
fes sampleList
# The procedure has two parameters:
# fes stores the name of the calling FeatureSet object,
# sampleList is the list passed through the eval method.
> info body featureSetEval<fs>
set sampleList [$fes access $sampleList]
makeArray arg $sampleList
parray arg ; puts "This is the FeatureSet $fes"
# First the access function of the FeatureSet is executed.
# The access function can be used to build path names to find the data.
# Then the sampleList is converted into an array "arg".
# Finally the code provided by us is executed.

> fs setDesc {
    # start feature computation
    puts "$fes process $arg(ADCFILE)"
    $fes readADC ADC $arg(ADCFILE) -bm shorten
    $fes spectrum FFT ADC 30ms
    # more code from you ...
}

# Let's assume you store your feature description in the file featDesc.
# You can also write:
> fs setDesc @./featDesc

# To execute the feature computation just call the eval method of the FeatureSet.
> fs eval { {ADCFILE /project/Class-11-753/data/CH/adc/030/CH030_1.adc.shn } }

Task 7 Write a feature description file that reads an audio file and computes mel-cepstra with deltas (homework).
HINT: The information about the audio file is passed through the arg array and comes from the database you created in the last homework.
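The mechanism behind setDesc and eval is easier to see once it is stripped of everything Janus-specific: a stored body of code is later run with two things in scope, the calling object and a key/value view of the sample list. The Python sketch below imitates only that idea. The class `FeatureSetSketch`, the function `make_array`, the key names, and the file path are hypothetical names invented for this illustration; they are not part of Janus.

```python
def make_array(sample_list):
    """Turn a list of (key, value) pairs into a lookup table, like the arg array."""
    return {key: value for key, value in sample_list}

class FeatureSetSketch:
    """Hypothetical stand-in that only stores and runs a description body."""

    def __init__(self, name):
        self.name = name
        self.desc = None
        self.log = []

    def set_desc(self, body):
        # Store the body; nothing is computed yet.
        self.desc = body

    def eval(self, sample_list):
        # Build the key/value view, then run the stored body with the
        # calling object and that view, as the generated procedure does.
        arg = make_array(sample_list)
        return self.desc(self, arg)

def example_desc(fes, arg):
    # Body of a description: it sees the calling object and the arguments.
    message = f"{fes.name} process {arg['ADCFILE']}"
    fes.log.append(message)
    return message

fs = FeatureSetSketch("fs")
fs.set_desc(example_desc)
result = fs.eval([("ADCFILE", "example.adc.shn"), ("TEXT", "yyy")])
print(result)
```

The point of the design is that the description is written once and can then be evaluated for every utterance by passing a different sample list.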
In Janus, mixture weights and means/covariances are kept in two separate objects. Means/covariances are stored in Codebook objects and mixture weights are stored in Distrib objects. These objects are stored in the CodebookSet and DistribSet and cannot be created without these container objects.
# Let's start some GMM training
> CodebookSet cbs fs
> DistribSet dss cbs
> cbs add SIL-m -help
Options of 'add' are:
 <name>  name of the codebook (string:"SIL-m")
 <feat>  name of the feature space (string:"NULL")
 <refN>  number of reference vectors (int:0)
 <dimN>  dimension of feature space (int:0)
 <type>  type of covariance matrix {NO,RADIAL,DIAGONAL,FULL}
> cbs add SIL-m FEAT 1 26 DIAGONAL
> cbs:SIL-m alloc ;# allocate memory for mean/variance
> cbs add SPEECH-m FEAT 1 26 DIAGONAL
# Question: Why do we use DIAGONAL?

# Tell the mixture weights which Gaussians they point to
> dss add SIL-m SIL-m
> dss add SPEECH-m SPEECH-m

# This also allocates memory to accumulate statistics (side effect:
# missing memory for mean/variance is also allocated)
> cbs createAccus
> dss createAccus

# Let us accumulate the statistics from the first frame of our features.
> dss accuFrame SIL-m 0

# Let us check what we have accumulated so far
> dss:SIL-m.accu
{ 1.0000e+00}
> cbs:SIL-m.accu
{{ 1.0000e+00} {{ -8.1901e+00 4.8879e-01 8.7985e+00 -2.5220e+00 4.4818e-01 1.7419e+00 -5.5965e-01 -9.3969e-02 -1.4680e-01 -3.1432e+00 2.4545e+00 1.8970e-01 4.7703e-01 -1.2587e+02 -8.5686e+01 1.6740e+00 2.6424e+01 5.5329e+01 1.2608e+01 -7.5955e+00 1.8400e+01 6.1627e+00 2.3878e+01 -2.6597e+01 1.3416e+01 -3.6496e+00} }}

# The squares of the observations (DIAGONAL covariance!)
> cbs:SIL-m.accu.cov(0,0)
{ 6.7077e+01 2.3892e-01 7.7414e+01 6.3605e+00 2.0087e-01 3.0341e+00 3.1321e-01 8.8302e-03 2.1549e-02 9.8799e+00 6.0247e+00 3.5984e-02 2.2755e-01 1.5843e+04 7.3422e+03 2.8023e+00 6.9823e+02 3.0613e+03 1.5895e+02 5.7691e+01 3.3858e+02 3.7978e+01 5.7015e+02 7.0741e+02 1.8000e+02 1.3320e+01}

# This is the sum of the observations
> cbs:SIL-m.accu.mat(0)
{ -8.19006348e+00 4.88790512e-01 8.79854774e+00 -2.52200127e+00 4.48180199e-01 1.74186277e+00 -5.59652567e-01 -9.39693451e-02 -1.46796942e-01 -3.14323831e+00 2.45452642e+00 1.89695120e-01 4.77026045e-01 -1.25870605e+02 -8.56864777e+01 1.67400527e+00 2.64240456e+01 5.53293800e+01 1.26075144e+01 -7.59547138e+00 1.84004898e+01 6.16266537e+00 2.38778515e+01 -2.65971718e+01 1.34163513e+01 -3.64960694e+00 }

# Okay, accumulate some more statistics
> dss accuFrame SIL-m 1
> dss accuFrame SIL-m 2
> dss accuFrame SIL-m 3
> dss accuFrame SIL-m 4
> dss accuFrame SIL-m 5

# Now we want to compute the new models.
# Of course this is not enough statistics!
> dss update

# These are the new model parameters for the GMM SIL-m
> cbs:SIL-m.mat
{ 1.929263e+01 1.026237e+01 6.847821e-01 -4.159660e+00 -8.033497e+00 -1.088695e+00 1.355846e+00 -3.016090e+00 -7.507172e-01 -3.812534e+00 3.463488e+00 -2.207632e+00 4.978468e-01 -1.734147e+02 -2.528517e+01 3.536517e+01 1.112829e+01 4.520940e+01 6.829309e+00 -2.170840e+01 1.527374e+01 -4.170792e+00 4.167288e+00 -3.361704e+00 1.410798e+01 -2.447980e+00 }

# We store the inverse of the covariance (efficiency)
> cbs:SIL-m.cov(0)
{ 3.5384e-03 4.0478e-02 4.4786e-02 1.4008e+00 5.7899e-02 4.7885e-01 2.1122e-01 4.9514e-01 1.6728e+00 1.2837e+00 1.0576e+00 5.3214e-01 1.0949e+00 4.4359e-04 4.4802e-04 2.4243e-03 8.6397e-03 3.1798e-03 1.5289e-02 1.2764e-02 3.1694e-02 2.0303e-02 5.3952e-03 3.4078e-03 3.9119e-02 3.0104e-02 8.2677e+01}

# We can copy model parameters if they fit (number of Gaussians and dimensions)
> cbs:SPEECH-m := cbs:SIL-m
> dss:SPEECH-m := dss:SIL-m

# We can clear the accumulators before we start another round.
> dss clearACCUS
> cbs clearACCUS
> cbs:SIL-m.accu
{{ 0.0000e+00} {{ 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00} }}

# We can save the model descriptions (ASCII) and parameters (binary)
> cbs write example-cbs.desc.gz
> cbs save example-cbs.param.gz
> dss write example-dss.desc.gz
> dss save example-dss.param.gz

# Compute the likelihood of a feature frame given a GMM, e.g. SIL-m
> dss score SIL-m 0
# Question: How is the score of a GMM computed?
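The accumulate-then-update pattern shown in the session can be reproduced for a single diagonal Gaussian in a few lines. The sketch below is not how Janus is implemented; the class name, the variance floor, and the log-likelihood function are assumptions made for illustration, and it makes no claim about the exact convention Janus uses for its score. The first frame uses the first two values of the sum vector printed above; the other two frames are made up.

```python
import math

class DiagGaussianAccu:
    """Sufficient statistics for one Gaussian with diagonal covariance."""

    def __init__(self, dim):
        self.count = 0.0
        self.sum = [0.0] * dim      # sum of the observations
        self.sq_sum = [0.0] * dim   # sum of the squared observations

    def accu_frame(self, frame, weight=1.0):
        self.count += weight
        for d, x in enumerate(frame):
            self.sum[d] += weight * x
            self.sq_sum[d] += weight * x * x

    def update(self, var_floor=1e-6):
        """Return mean and variance estimated from the accumulated statistics."""
        mean = [s / self.count for s in self.sum]
        var = [max(q / self.count - m * m, var_floor)
               for q, m in zip(self.sq_sum, mean)]
        return mean, var

def log_likelihood(frame, mean, var):
    """Log density of a frame under a diagonal Gaussian."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (x - m) ** 2 / v
                      for x, m, v in zip(frame, mean, var))

accu = DiagGaussianAccu(dim=2)

# First two dimensions of the first frame, taken from the accumulator dump above.
frame0 = [-8.19006348, 0.488790512]
accu.accu_frame(frame0)
first_sq_sum = list(accu.sq_sum)   # compare with the printed squares

# Two further frames, invented for this sketch.
accu.accu_frame([-6.0, 1.0])
accu.accu_frame([-7.0, 0.0])

mean, var = accu.update()
inv_var = [1.0 / v for v in var]   # storing the inverse avoids divisions when scoring
print(mean, inv_var, log_likelihood(frame0, mean, var))
```

After the first frame the squared sums are about 67.077 and 0.23892, matching the first two entries of the accumulator output in the session. With only three frames the variance estimate is of course as unreliable as the text warns.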
Task 8 Train the single Gaussian of the speech and silence models using all training utterances. To select between speech and silence, use the power-based speech/silence detection described above. Write the description and parameters of the trained GMM to disk (for more details see the homework).