Acoustic models that are built specifically for one speaker generally perform much better than speaker-independent acoustic models. However, training a speaker-specific acoustic model requires many hours of transcribed speech from that speaker, which is not feasible for many speakers and applications. In some applications it is possible to ask the speaker for a small amount of speech. For this reason we need a way to adapt a speaker-independent acoustic model that can be estimated from a small amount of data. Maximum Likelihood Linear Regression (MLLR) is one adaptation method that needs only a small amount of adaptation data.
The idea behind MLLR is to use a linear transform that is shared by many Gaussians to change the model parameters of those Gaussians. Because the transformations are shared, only a small number of parameters needs to be estimated. A very good description of constrained and unconstrained MLLR, including a mathematical derivation, is Mark Gales' paper "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language (Vol. 12, Issue 2, 1998). The MLLR method was introduced by Leggetter and Woodland in "Speaker Adaptation of Continuous Density HMMs using Multivariate Linear Regression", ICSLP 1994, and is also described in "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models", Computer Speech and Language (Vol. 9, Issue 2, 1995).
MLLR can be divided into two cases: an unconstrained case, in which the means and covariances are transformed by two independent transformations A and H, and a constrained case, where A = H. The latter case has the nice advantage that the model transform can easily be converted into a transformation of the input features instead of transforming the model itself.
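In formulas (a sketch following the standard formulation, e.g. Gales 1998; the notation is ours, not copied from the paper): the unconstrained case transforms means and covariances independently, while in the constrained case one matrix is shared, so the adapted model is equivalent to a transformed feature with a Jacobian term:

\hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = H\Sigma H^{\top}   % unconstrained MLLR
\hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = A\Sigma A^{\top}   % constrained MLLR
\mathcal{N}\!\left(x;\, A\mu + b,\, A\Sigma A^{\top}\right) = |A|^{-1}\, \mathcal{N}\!\left(A^{-1}(x - b);\, \mu,\, \Sigma\right)

The last line is why constrained MLLR can be applied in the feature domain: transforming every feature vector x to A^{-1}(x - b) yields the same likelihood, up to the constant log-determinant, as transforming all model parameters.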
#
# In the Janus documentation this is also covered by the section "advanced training" (4.2)
#
# make sure we use utf-8 encoding
encoding system utf-8
#
# Create HMM, Path and search objects.
#
source start-up10.tcl
#
# configure the search space (see exercise/homework 9)
#
ltree configure -cacheN 50 -ncacheN 10
# language model weight, word transition penalty and filler penalty
svmap configure -phonePen 0.0 -wordPen 0.0 -filPen 30 -lz 30
spass configure -stateBeam 130
spass configure -morphBeam 80
spass configure -wordBeam 90
# a good relation between transN and morphN seems to be a factor of 4
spass configure -transN 20 -morphN 5
# open the database
# IMPORTANT: for demonstration purposes we will use the utterances from training.
#            However, in homework 9 you should use the data reserved
#            for system tuning to tune the parameters of the search.
set uttDB utterance
DBase db
db open ${uttDB}.dat ${uttDB}.idx -mode "r"
#
# End of initialization
#
We want to improve the hypotheses by adaptation. Therefore we first decode the audio to get a hypothesis; then we assign each frame to one GMM by computing the Viterbi alignment. Based on this alignment, we accumulate the statistics that are needed to estimate the transform. After we have collected enough statistics, we estimate the adaptation transform W.
#
# Constrained MLLR adaptation
#
# see the paper and/or technical report from Mark Gales
#
SignalAdapt signalAdapt sns
signalAdapt configure -topN 1 -shift 1.0
# define which GMMs are used to estimate the transformation
foreach ds [dss] { signalAdapt add $ds }

# compute the features for utterance spk030_utt1
set xKey spk030_utt1
set uttInfo [db get $xKey]
fs eval $uttInfo

# do the decoding
spass run
spass.stab trace
# score 40538.230469
set hypo [lrange [spass.stab trace] 2 end]
hmm make $hypo
path viterbi hmm -beam 300
# score 5.523549e+04

signalAdapt accu path 1      ;# accumulate statistics into accu index 1

# The constrained MLLR has no closed-form solution and is estimated iteratively,
# so we approximate it with, for example, 10 iterations.
# We take the statistics from the accumulator with index 1
# and use the transformation with index 0 to store the result.
# It is important to note that a previous transform is not cleared:
# the approximation starts from what is currently stored in
# the transformation matrix. To avoid unpredictable behavior
# it is recommended to clear the transform before a
# new transform is computed.
signalAdapt clear 0          ;# clear transform 0
signalAdapt compute 10 1 0   ;# 10 iterations to compute transform 0 from accumulator 1

# Question 10-1: Why can we apply the adaptation in the feature domain?

# apply the constrained transform in the feature space;
# the following line transforms the LDA feature (in place)
signalAdapt adapt fs:LDA.data fs:LDA.data 0
path viterbi hmm -beam 300
# score 5.131486e+04
# the likelihood after applying the estimated transform is improved

# the same for the decoding
spass run
spass.stab trace
# score 37387.937500

# decode a new utterance
set xKey spk030_utt2
set uttInfo [db get $xKey]
fs eval $uttInfo
# decode without adaptation!
spass run
spass.stab trace
# -> score 49930.894531
# apply the transform that was estimated on the previous utterance!
signalAdapt adapt fs:LDA.data fs:LDA.data 0
# decoding with adapted features
spass run
spass.stab trace
# -> score 49194.042969
# This is not the utterance the transformation was estimated on;
# however, the score is improved (smaller).

# decode a new utterance from a NEW speaker
set xKey spk099_utt1
set uttInfo [db get $xKey]
fs eval $uttInfo
spass run
spass.stab trace
# score 34074.410156
signalAdapt adapt fs:LDA.data fs:LDA.data 0
spass run
spass.stab trace
# score 37028.101562
# If we have a speaker and/or environment change,
# it is very likely that the old transform does not fit;
# as in this example, the score gets much worse (larger).
# One approach is to reset the transform and clear the accus.
# Question 10-2: What other approaches may be useful?

# Let us go back to utterance 2 of speaker 30
set xKey spk030_utt2
set uttInfo [db get $xKey]
fs eval $uttInfo
signalAdapt adapt fs:LDA.data fs:LDA.data 0
spass run
set hypo [lrange [spass.stab trace] 2 end]
hmm make $hypo
path viterbi hmm -beam 300
# What we have done so far is use the adapted features
# to compute the hypotheses and the Viterbi alignment.
# To use the new utterance to incrementally improve the
# adaptation matrix we can accumulate the statistics.
# However, we have to accumulate the non-adapted features,
# because we have to use the same model for collecting the statistics;
# otherwise it is like comparing apples and oranges.
# in this setup we only have to recompute the LDA feature from
# the FEAT feature to remove the adaptation
fs matmul LDA FEAT ldaMatrix -cut 32
# now we can accumulate the new statistics
signalAdapt accu path 1      ;# accumulate statistics into accu index 1
# and estimate a new transform based on utterances spk030_utt1 and spk030_utt2
signalAdapt clear 0          ;# clear transform 0
signalAdapt compute 10 1 0   ;# 10 iterations to compute transform 0 from accumulator 1
# Depending on the application we can incrementally improve
# the adaptation or do a batch-wise adaptation (see the sketch below).
# Question 10-3: What do the SignalAdapt methods combine, clearAccu and addAccu do?

Speaker adaptive training (SAT) is a method that applies adaptation during training to remove speaker differences that can be captured by the adaptation. SAT with constrained MLLR is an elegant and effective way to improve the word error rate.
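As mentioned in the comments above, a batch-wise variant collects statistics over several utterances before estimating one transform. A minimal sketch, using only commands introduced above (the procedure name adaptSpeakerBatch and the clearAccu signature are our assumptions; the clearAccu method itself is referenced in Question 10-3):

# hypothetical helper: batch-wise constrained MLLR over a list of utterance keys
proc adaptSpeakerBatch {uttKeys} {
    signalAdapt clearAccu 1          ;# assumed signature: empty accumulator 1 (cf. Question 10-3)
    foreach xKey $uttKeys {
        set uttInfo [db get $xKey]
        fs eval $uttInfo             ;# compute the non-adapted features
        spass run                    ;# first-pass decoding
        set hypo [lrange [spass.stab trace] 2 end]
        hmm make $hypo
        path viterbi hmm -beam 300   ;# frame-to-GMM alignment
        signalAdapt accu path 1      ;# collect statistics into accu 1
    }
    signalAdapt clear 0              ;# do not start from an old transform
    signalAdapt compute 10 1 0       ;# estimate transform 0 from accumulator 1
}
# usage (utterance keys as above):
#   adaptSpeakerBatch {spk030_utt1 spk030_utt2}
#   signalAdapt adapt fs:LDA.data fs:LDA.data 0   ;# then decode with adapted features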
#
# Question 10-4: How can you do speaker adaptive training with constrained MLLR?
#
#
# Question 10-5: How is an SAT acoustic model used during decoding?
#
#
# MLLR
#
# see the paper and/or technical report from Leggetter and Gales
#
set mode     2    ;# use full transforms (default)
set minCount 500  ;# minimum threshold to update a regression class;
                  ;# usually a larger value per transform is used (e.g. 2500)
set depth    3    ;# depth of regression tree (-> 2**(3+1) - 1 = up to 15 transforms)

# for MLLR we collect the same statistics as during training
cbs createAccus
dss createAccus
#
MLAdapt mla cbs -mode $mode -bmem 1
foreach cb [cbs:] { mla add $cb }   ;# all the models
mla cluster -depth $depth   ;# build a regression tree for the Gaussians.
                            ;# Similar Gaussians share the same transformation
mla store                   ;# store (copy) the original acoustic model
                            ;# (restore copies the original acoustic model back)

# Because we collect the same statistics in the same objects as during regular ML training,
# the steps are also similar.

# compute the features for utterance spk030_utt1
set xKey spk030_utt1
dss clearAccus
cbs clearAccus
# collect
foreach xKey {spk030_utt1 spk030_utt2 spk030_utt3 spk030_utt4} {
    set uttInfo [db get $xKey]
    fs eval $uttInfo
    spass run
    set hypo [lrange [spass.stab trace] 2 end]
    hmm make $hypo
    path viterbi hmm -beam 300
    sns accu path   ;# accumulate statistics
}
mla update -minCount $minCount
# Transform the means/covariances of the codebooks.
# To get the original models back we have to use the method restore
# or load the parameters again from disk.
# The return value indicates how many transformations were used.

# Let us see how the scores change for utterance spk030_utt1
set xKey spk030_utt1
set uttInfo [db get $xKey]
fs eval $uttInfo
spass run
spass.stab trace
# score 36817.257812, this is slightly better than the constrained MLLR.
# However, we also have more parameters to estimate, and therefore need more data.
# Question: What has to be taken care of if we want to use this adaptation incrementally?
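When the speaker changes, the transformed codebooks have to be reverted before adapting again. A minimal sketch of per-speaker MLLR, using only methods shown above (the variable newSpeakerUtts is a hypothetical list of utterance keys; using restore this way is our reading of the comments above):

mla restore                      ;# copy the stored original acoustic model back
cbs clearAccus                   ;# drop the previous speaker's statistics
dss clearAccus
foreach xKey $newSpeakerUtts {   ;# hypothetical list, e.g. {spk099_utt1 spk099_utt2}
    set uttInfo [db get $xKey]
    fs eval $uttInfo
    spass run
    set hypo [lrange [spass.stab trace] 2 end]
    hmm make $hypo
    path viterbi hmm -beam 300
    sns accu path                ;# accumulate statistics as during training
}
mla update -minCount $minCount   ;# transform means/covariances for the new speaker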
Vocal Tract Length Normalization (VTLN) is a method to compensate for the differences in vocal tract length between speakers. The effect of this length is that the formants change their position. The approach is to warp the frequency spectrum in a non-linear way to compensate for this effect. In Janus the method is part of the featureSet object.
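A common choice for the warping (a sketch of the typical piecewise-linear scheme, not necessarily the exact one implemented in Janus) scales the frequency axis by a speaker-specific factor \alpha and bends the curve above a break frequency f_0 so that the band edge f_{\max} maps onto itself:

\hat{f} =
\begin{cases}
  \alpha f, & f \le f_0 \\
  \alpha f_0 + \dfrac{f_{\max} - \alpha f_0}{f_{\max} - f_0}\,(f - f_0), & f > f_0
\end{cases}

The factor \alpha is typically estimated per speaker by a maximum-likelihood grid search over a small range around 1.0 (e.g. 0.88 to 1.12).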
Unfortunately we do not have the time to explain these techniques in detail during the lab.
Task: Build a speech recognition system for Mandarin using adaptation; for the decoding you can assume that each utterance has been assigned a speaker ID. The best systems get a Starbucks Gift Card and a nice certificate. If you do this homework, you also have to give a short presentation (15 min) on how you tried to improve the performance (for details, see the optional homework).
Last modified: Mon Mar 9 13:01:16 EST 2006
Maintainer: tschaaf@cs.cmu.edu.