Error Correction Using Confidence Annotation in Automatic Speech Recognition
__________________________

This work deals with identifying and rectifying incorrect recognition output in automatic speech recognition. Recent work [1] in this direction uses confidence measures [2,3,4], treating confidence scores as an additional feature alongside acoustic and language model scores to improve recognition performance. In this work, we use an existing confidence annotator [3] to identify incorrectly recognized segments in the decoded utterance.

We propose two measures, "Overall Likelihood Difference" and "Homogeneity Count Difference", which are used to rectify incorrect segments by identifying their correct substitutes in the N-best hypothesis list produced during decoding. The top hypothesis in the N-best list is the output of the speech decoder. The first best alternative of a decoded segment is the hypothesis segment with the highest likelihood over the same time span elsewhere in the N-best list. The proposed measures are defined as follows:

1) Overall Likelihood Difference is the difference in likelihood between a decoded segment and its best alternative. The smaller this difference, the more likely it is that the best alternative is the correct segment.

2) Homogeneity Count Difference is the difference between the number of words in the decoded segment and in its best alternative whose homogeneity scores lie above a threshold. The threshold giving the maximum reduction in cross entropy [4] is chosen. Correct segments are expected to have higher homogeneity counts than incorrect segments.

We consider only the first best alternative segment because most correct segments are either the decoded segments themselves or their first best alternatives; the distribution falls off almost exponentially with position in the N-best list.

The SPHINX-III speech recognizer was used to decode the 1997 Hub4 broadcast news corpus.
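The two measures can be sketched as follows. This is a minimal illustration, assuming each segment carries a total log-likelihood and per-word homogeneity scores; all function names, the data layout, and the combination rule in should_substitute are illustrative assumptions, not the paper's actual implementation.

```python
def overall_likelihood_difference(decoded_loglik, alternative_loglik):
    """Difference in likelihood between the decoded segment and its first
    best alternative; a small value suggests the alternative may in fact
    be the correct segment."""
    return decoded_loglik - alternative_loglik

def homogeneity_count(word_scores, threshold):
    """Number of words in a segment whose homogeneity score exceeds the
    chosen threshold."""
    return sum(1 for s in word_scores if s > threshold)

def homogeneity_count_difference(decoded_scores, alternative_scores, threshold):
    """Difference in above-threshold word counts between the decoded
    segment and its best alternative; correct segments are expected to
    have the higher count."""
    return (homogeneity_count(decoded_scores, threshold)
            - homogeneity_count(alternative_scores, threshold))

def should_substitute(decoded_loglik, alternative_loglik,
                      decoded_scores, alternative_scores,
                      threshold, likelihood_margin):
    """Hypothetical combination rule: replace a segment flagged as
    incorrect when the likelihood gap to its best alternative is small
    and the alternative has the higher homogeneity count. (The paper's
    exact combination of the two measures is not specified here.)"""
    old = overall_likelihood_difference(decoded_loglik, alternative_loglik)
    hcd = homogeneity_count_difference(decoded_scores, alternative_scores, threshold)
    return old < likelihood_margin and hcd < 0
```

For example, a flagged segment whose best alternative is nearly tied in likelihood but scores better on homogeneity would be substituted: `should_substitute(-120.0, -121.5, [0.2, 0.4], [0.7, 0.9], threshold=0.5, likelihood_margin=5.0)` returns `True`.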
The output of the decoder was tagged as correct or incorrect by the confidence annotator [3], which identifies 80% of the incorrect words at a false alarm rate of 45%. As expected, correctly decoded segments show a higher Overall Likelihood Difference than incorrectly decoded segments, and correct segments show a higher Homogeneity Count than incorrect segments. A combined use of the two measures rectifies 50% of the segments where the decoded segment is incorrect and its best alternative is correct; only 15% of the correctly decoded segments were replaced by their incorrect best alternatives.

We propose to continue this work in the following directions:

1. Improving confidence annotation, and using the word lattice to identify correct segments.
2. Since many incorrect segments appear to be syntactically ill-formed, using parse scores as an additional measure.

[1] Rose R. C. et al., "Integration of Utterance Verification with Statistical Language Modeling and Spoken Language Understanding", Proc. ICASSP-98, Vol. 1, pp. 237-241, 1998.
[2] Jeanrenaud et al., "Large Vocabulary Word Scoring as a Basis of Transcription Generation", Proc. Eurospeech, pp. 2149-2152, 1995.
[3] Bansal D. and Ravishankar R., "New Features for Confidence Annotation", Proc. ICSLP-98, Paper No. 829, 1998.
[4] Chase L., "Error Responsive Feedback Mechanisms for Speech Recognizers", Ph.D. thesis, Carnegie Mellon University, TR No. CMU-RI-TR-97-18, Apr. 1997.