Speaker Adaptive Training for Deep Neural Network Acoustic Models
Yajie Miao, Hao Zhang, Florian Metze
Carnegie Mellon University

Introduction
Speaker adaptive training (SAT) is a standard technique for GMM acoustic models. We apply the concept of SAT to deep neural network (DNN) acoustic models. Training of a SAT-DNN model starts from a fully trained DNN. We then train a smaller neural network, iVecNN, which takes speaker i-vectors as input and outputs linear feature shifts. These shifts are added to the original DNN inputs, producing more speaker-normalized features. The canonical DNN is finally updated in this newly estimated feature space.

[Figure: the DNN and SAT-DNN architectures.]

Training
Suppose that we have a fully trained DNN model and an i-vector i_s for each speaker s. The idea of SAT-DNN can be formulated as:

    a_t = o_t + iVecNN(i_s)

where i_s is the i-vector for speaker s and o_t is the original DNN input feature vector (e.g., filterbanks). iVecNN is a separate neural network, depicted with green circles in the figure, which converts each speaker's i-vector into a linear feature shift. With this shift added, the resulting DNN input vector a_t becomes more speaker-normalized.

The steps to train SAT-DNN models can be summarized as follows:
1) Train the baseline DNN over the training data, as we normally do.
2) Extract i-vectors for the training speakers.
3) Learn the iVecNN network while keeping the DNN fixed.
4) Update the DNN model in the new feature space a_t while keeping the iVecNN network fixed.

Both the learning of iVecNN and the updating of the DNN can be performed with standard error backpropagation and stochastic gradient descent. The implementation can be found in our Kaldi+PDNN scripts and the PDNN toolkit.
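The per-frame computation above can be sketched as follows. This is a minimal NumPy illustration, not the actual PDNN implementation; the layer sizes, two-layer topology, and ReLU nonlinearity are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions are illustrative placeholders, not the sizes used in the paper.
FEAT_DIM, IVEC_DIM, HID, OUT = 40, 100, 512, 1000

def init_layer(n_in, n_out):
    """Small random weights and zero biases for a fully connected layer."""
    return rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out)

# Canonical DNN: one input frame -> senone scores.
dnn = [init_layer(FEAT_DIM, HID), init_layer(HID, OUT)]
# iVecNN: speaker i-vector -> linear shift with the input feature's dimension.
ivecnn = [init_layer(IVEC_DIM, HID), init_layer(HID, FEAT_DIM)]

def forward(layers, x):
    """Plain feed-forward pass: ReLU hidden layers, linear output."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def sat_dnn_input(o_t, i_s):
    """a_t = o_t + iVecNN(i_s): add the speaker-dependent feature shift."""
    return o_t + forward(ivecnn, i_s)

# One filterbank frame o_t from a speaker s with i-vector i_s.
o_t = rng.standard_normal(FEAT_DIM)
i_s = rng.standard_normal(IVEC_DIM)

a_t = sat_dnn_input(o_t, i_s)   # speaker-normalized input feature
scores = forward(dnn, a_t)      # canonical DNN evaluated over a_t
```

Training alternates over this graph: step 3 backpropagates the error through the frozen DNN into the iVecNN weights only, and step 4 freezes iVecNN and updates the DNN weights on the shifted inputs a_t.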

Testing (Decoding)
During decoding, we simply extract the i-vector for each test speaker. Feeding this i-vector into the SAT-DNN architecture automatically adapts the model to the speaker. No initial decoding pass and no DNN fine-tuning on adaptation data are needed.
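The key property is that test-time adaptation reduces to one iVecNN forward pass per speaker, whose output shift is added to every frame. A self-contained sketch, with a single linear layer standing in for a trained iVecNN (the layer and all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT_DIM, IVEC_DIM = 40, 100

# Toy stand-in for a trained iVecNN: a single linear layer (assumption).
W_ivec = rng.standard_normal((IVEC_DIM, FEAT_DIM)) * 0.01

def adapt_utterance(frames, i_spk):
    """Test-time adaptation: compute one shift for the speaker and add it
    to every frame -- no decoding pass, no fine-tuning on adaptation data."""
    shift = i_spk @ W_ivec        # iVecNN forward pass (toy linear version)
    return frames + shift         # broadcast the same shift over all frames

frames = rng.standard_normal((200, FEAT_DIM))  # frames of one test utterance
i_spk = rng.standard_normal(IVEC_DIM)          # test speaker's i-vector
adapted = adapt_utterance(frames, i_spk)       # feed these to the SAT-DNN
```

Because the shift depends only on the speaker, it can be precomputed once per speaker and reused for all of that speaker's utterances.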

Implementation
SAT-DNN has been integrated into our Kaldi+PDNN recipes. You can check out the latest version from the repository and find the following four recipes:

run_swbd_110h/run-dnn-fbank-sat.sh -- Hybrid systems with filterbanks as input features
run_swbd_110h/run-dnn-sat.sh -- Hybrid systems with fMLLRs as input features
run_swbd_110h/run-bnf-fbank-tandem-sat.sh -- BNF tandem systems with filterbanks as input features
run_swbd_110h/run-bnf-tandem-sat.sh -- BNF tandem systems with fMLLRs as input features

Before running each of these recipes, make sure that:
1) I-vectors have been generated by running run_swbd_110h/run-ivec-extract.sh.
2) The corresponding DNN recipe has been run beforehand; for example, run-dnn-fbank.sh before run-dnn-fbank-sat.sh.
3) The following two source files have been downloaded to src/featbin and compiled:
http://www.cs.cmu.edu/~ymiao/codes/kaldipdnn/get-spkvec-feat.cc
http://www.cs.cmu.edu/~ymiao/codes/kaldipdnn/add-feats.cc

Training of the SAT-DNN models is performed by the run_DNN_SAT.py command from the PDNN toolkit.
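For the fbank hybrid system, the prerequisites above imply the following invocation order. This is a sketch of the workflow, not a verified command sequence; the exact working directory and invocation details depend on your Kaldi+PDNN checkout:

```shell
# Order implied by the prerequisites above (invocation details are assumptions):
./run_swbd_110h/run-ivec-extract.sh     # 1) extract i-vectors for all speakers
./run_swbd_110h/run-dnn-fbank.sh        # 2) train the baseline fbank DNN
./run_swbd_110h/run-dnn-fbank-sat.sh    # 3) run the SAT-DNN recipe on top
```

The other three SAT recipes follow the same pattern, each paired with its corresponding baseline DNN recipe.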

Results

Experiments use the Switchboard 110-hour setup. We report WER% on the Switchboard part of the Hub5'00 evaluation set. The DNN input features are either speaker-independent (SI) filterbanks or speaker-adapted (SA) fMLLR features.

[Table: WER% of hybrid models]
[Table: WER% of bottleneck-feature (BNF) tandem systems]

Reference

Yajie Miao, Hao Zhang, Florian Metze, "Towards Speaker Adaptive Training of Deep Neural Network Acoustic Models," INTERSPEECH 2014.
Yajie Miao, Lu Jiang, Hao Zhang, Florian Metze, "Improvements to Speaker Adaptive Training of Deep Neural Networks," SLT 2014.