Distant Speech Recognition
Recognition of speech recorded through distant microphones (e.g. table-top microphones in video conferences, hands-free units for mobile devices) remains one of the most challenging tasks in speech processing, with word error rates around 30% even in otherwise ideal conditions. This is mainly due to a mismatch between acoustic models, which are trained on single-channel, close-talking data, and the conditions encountered during testing. Single-channel signal enhancement improves the perceptual quality of speech, but does not improve recognition accuracy. Microphone arrays can be used in some situations, but to work well they require careful placement as well as speaker localization and tracking, which is often impractical.
We propose a training scheme for acoustic models that uses multiple arbitrarily placed microphones (an "ad-hoc microphone array") in parallel, instead of just one. By training and adapting the acoustic model on multiple slightly different inputs, it generalizes well to the microphone conditions encountered during testing. Recent advances in discriminative training of acoustic models allow the joint optimization of speaker-dependent acoustic models in multiple discriminatively trained feature spaces derived from multiple microphones. This makes it possible to train acoustic models independently of the spatial arrangement of microphones in the room. Preliminary results using non-discriminative transformations confirm the effectiveness of this approach, and suggest that discriminatively trained models should be sufficient for indexing end-user videos recorded with hand-held devices, for transcription, summarization, or translation of content in tele-presence systems, and for other applications.
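The following is a minimal sketch of the idea behind the non-discriminative variant, under stated assumptions: each channel of an ad-hoc microphone array gets its own affine feature-space transform (a simplified moment-matching stand-in for CMLLR/fMLLR-style adaptation), and a single shared acoustic model is trained on the pooled, transformed features from all channels. The single-Gaussian "model", the synthetic channel distortions, and all names here are illustrative assumptions, not the actual system described above.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 13  # e.g. MFCC feature dimensionality

def sqrtm_psd(m):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def estimate_transform(feats, model_mean, model_cov):
    """Affine transform y = A x + b mapping this channel's feature
    statistics onto the shared model (a non-discriminative stand-in
    for CMLLR/fMLLR-style feature-space adaptation)."""
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    A = sqrtm_psd(model_cov) @ np.linalg.inv(sqrtm_psd(cov))
    b = model_mean - A @ mu
    return A, b

# Synthetic stand-in for parallel recordings: the same utterances seen
# through several arbitrarily placed microphones, each with its own
# linear channel distortion and offset.
n_frames, n_channels = 2000, 4
clean = rng.normal(size=(n_frames, DIM))
channels = [clean @ (np.eye(DIM) + 0.3 * rng.normal(size=(DIM, DIM))).T
            + rng.normal(size=DIM) for _ in range(n_channels)]

# Shared acoustic model (a single Gaussian here; a GMM/HMM in practice),
# trained jointly on all channels by alternating between re-estimating
# the per-channel transforms and updating the model on pooled features.
# With this toy model the alternation converges almost immediately; with
# GMM/HMM states estimated via EM, several iterations would be needed.
model_mean, model_cov = np.zeros(DIM), np.eye(DIM)
for it in range(3):
    transforms = [estimate_transform(f, model_mean, model_cov) for f in channels]
    mapped = [f @ A.T + b for f, (A, b) in zip(channels, transforms)]
    pooled = np.vstack(mapped)
    model_mean, model_cov = pooled.mean(axis=0), np.cov(pooled, rowvar=False)

raw = np.mean([np.linalg.norm(f.mean(axis=0) - model_mean) for f in channels])
adapted = np.mean([np.linalg.norm(m.mean(axis=0) - model_mean) for m in mapped])
print(f"mean channel-to-model distance: raw={raw:.3f}, adapted={adapted:.3f}")
```

In the discriminative setting described above, the per-channel transforms and the shared model would be estimated under a discriminative training criterion rather than by moment matching, but the structure is the same: parallel channel-specific feature spaces feeding one shared acoustic model, with no dependence on the microphones' spatial arrangement.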
Multi-Channel Discriminative Training