Cheating with Imperfect Transcripts
Abstract
Most speech recognition systems try to reconstruct a word sequence given
an acoustic input, using prior information about the language being
spoken. In some cases, there is more information available to the
decoder than simply the acoustics. When decoding a television news
broadcast, for example, the closed-caption information that is often
recorded for hearing-impaired viewers may also be available. While
these captions are generally not completely accurate transcriptions,
they can be considered a strong hint as to what was actually spoken.
In this paper, we present a formalization of this problem in terms of
the source-channel paradigm. We propose a simple translation model for
mapping caption sequences to word sequences, which updates the language
model with the prior information inherent in the captions. We also
describe an efficient implementation of the search in a Viterbi decoder,
and present results using this system in the broadcast news domain.
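As a rough sketch of how such a formulation might look (the symbols $A$ for
the acoustics, $W$ for the word sequence, and $C$ for the caption sequence
are our own notation, not taken from the abstract), the standard
source-channel decoder
$$\hat{W} = \arg\max_{W} P(A \mid W)\,P(W)$$
could be extended so that the language model is conditioned on the captions,
$$\hat{W} = \arg\max_{W} P(A \mid W)\,P(W \mid C)
          = \arg\max_{W} P(A \mid W)\,P(C \mid W)\,P(W),$$
where the second equality follows from Bayes' rule, since $P(C)$ does not
depend on $W$; $P(C \mid W)$ is one way the caption-to-word translation
model could enter the search.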