Bimodal Speech Recognition
Overview
The goal of bimodal speech recognition is to combine
audio and visual information to improve the speech recognition rate under poor
audio conditions (noise or acoustically confusable words). A lipreading system
recognizes a spoken word from the input lip motion. To handle this problem, we
proposed a space-time delay neural network that automatically discovers the
features embedded in the spatiotemporal domain during training and uses
these features to classify different lip motions. Our experimental results
indicated that, using only lip motion video, the lipreading system achieves
a 77.8%~90% recognition rate for Chinese digits and a 44.7%~48.9% recognition
rate for nineteen confusable Chinese words.
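The space-time delay idea extends time-delay neural networks to the spatial domain: a small learnable filter is slid over the frame volume so that features spanning both space and time emerge during training. The sketch below illustrates only this core operation; the filter size, the tanh nonlinearity, and the single-filter setup are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def space_time_delay_layer(frames, kernel):
    """Slide a small spatiotemporal filter over a video volume.

    frames: (T, H, W) array of grayscale lip-region frames
    kernel: (t, h, w) learnable spatiotemporal filter
    Returns a (T-t+1, H-h+1, W-w+1) feature map: each output cell
    responds to a local patch of lip motion in space and time.
    """
    T, H, W = frames.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                patch = frames[i:i + t, j:j + h, k:k + w]
                out[i, j, k] = np.tanh(np.sum(patch * kernel))
    return out

# Hypothetical input: 10 frames of 16x16 lip images, one 3x3x3 filter.
rng = np.random.default_rng(0)
video = rng.standard_normal((10, 16, 16))
filt = rng.standard_normal((3, 3, 3))
features = space_time_delay_layer(video, filt)
print(features.shape)  # (8, 14, 14)
```

In a full network, many such filters would be trained and their feature maps fed to a classifier over the word classes.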
We also implemented an on-line bimodal speech
recognition system to test how lipreading can improve audio-only speech
recognition. The system consists of three DSP processors and one
Pentium processor that concurrently process lip motion video and speech signals.
The whole recognition process, including mouth region centering, 2D-FFT, speech
feature extraction, neural network computation, HMM computation, and decision
fusion, runs in real time.
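The final decision-fusion step merges the audio (HMM) and visual (neural network) classifier outputs into one verdict. The source does not specify the fusion rule; one common scheme, shown here as an illustrative sketch, is a weighted sum of per-word log-scores, with the audio weight raised when the acoustic conditions are expected to be reliable. The word labels and scores below are hypothetical.

```python
def fuse_decisions(audio_scores, visual_scores, audio_weight):
    """Combine per-word audio and visual scores and pick the best word.

    audio_scores / visual_scores: dicts mapping word -> log-score.
    audio_weight in [0, 1]: higher when the acoustic channel is clean,
    lower when noise makes the visual channel more trustworthy.
    """
    fused = {
        word: audio_weight * audio_scores[word]
              + (1.0 - audio_weight) * visual_scores[word]
        for word in audio_scores
    }
    return max(fused, key=fused.get)

# Hypothetical scores for three Chinese digits (pinyin labels).
audio = {"yi": -2.0, "er": -1.8, "san": -3.5}   # noisy audio favors "er"
visual = {"yi": -0.5, "er": -4.0, "san": -3.0}  # lip motion favors "yi"
print(fuse_decisions(audio, visual, audio_weight=0.3))  # -> yi
```

With the audio channel down-weighted (0.3), the visual evidence overrides the noisy acoustic hypothesis; with `audio_weight=1.0` the decision reverts to the audio-only result.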
Publications
·
Chin-Teng Lin, Hsi-Wen Nein, and Wen-Chieh Lin, “A Space-Time Delay
Neural Network for Motion Recognition and Its Application to Lipreading,”
International Journal of Neural Systems, Vol. 9, No. 4, Aug. 1999, pp. 311-334.
·
Wen-Chieh Lin, A Space-Time Delay Neural
Network for Motion Recognition and Its Application to Lipreading in Bimodal
Speech Recognition, Master's thesis, National Chiao-Tung University, Taiwan,
1996.
·
Wen-Chieh Lin, Hsi-Wen Nein, and Shin-Hui Liang, A DSP-based On-line Bimodal Speech Recognition System,
First Prize of the Graduate Student Team in the Texas Instruments DSP Design
Challenge, Taiwan, 1996.