Speech Reading Group
When: 12:00pm every Friday
Comments/suggestions to hyu@cs.cmu.edu,
qjin@cs.cmu.edu
-
12:00pm, Fri. Mar.24, LTI Blue Room: Laura Tomokiyo on non-native speech recognition
-
12:00pm, Fri. Mar.17, LTI Blue Room: Christian Fuegen on integrating
dialect, speech rate, and SNR into decision trees
-
12:00pm, Fri. Mar.10, LTI Blue Room: LDA & QDA/FDA/PDA/MDA
-
12:00pm, Fri. Mar.3, ISL Lab: Multimodal People ID
-
12:00pm, Fri. Feb.25, LTI Blue Room: Speech Synthesis Systems
-
12:00pm, Fri. Feb.11, LTI Blue Room:
Text-Independent Speaker Identification, H. Gish and M. Schmidt,
IEEE Signal Processing Magazine, pp. 18-32, Oct. 1994
-
There's a fast way to calculate the log-likelihood of a test utterance
with respect to a Gaussian model. Rather than summing the log-likelihood
over individual frames, we can compute the overall log-likelihood in a
single step from the sample mean (m) and sample covariance (S) of the n frames:
logL(X; u, C) = -n/2 * { log|2*pi*C| +
tr(inv(C)*S) + (m-u)'*inv(C)*(m-u) }
The key to the proof is:
tr(x*y') = y'*x, for any column vectors x and y,
which leads to
tr(inv(C)*S) = 1/n * Sum{ x'*inv(C)*x } when Mean(X) = 0,
since S = 1/n * Sum{ x*x' } in that case.
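A minimal numpy sketch (data, dimensions, and function names are made up for
illustration) checking that the sufficient-statistics form matches the
frame-by-frame sum:

    import numpy as np

    def loglik_per_frame(X, u, C):
        # Frame-by-frame: sum_i log N(x_i; u, C).
        n, d = X.shape
        Cinv = np.linalg.inv(C)
        _, logdet = np.linalg.slogdet(C)
        diff = X - u
        quad = np.einsum('ij,jk,ik->i', diff, Cinv, diff)  # per-frame x'*inv(C)*x
        return -0.5 * (n * (d * np.log(2 * np.pi) + logdet) + quad.sum())

    def loglik_suff_stats(X, u, C):
        # Single step from sample mean m and sample covariance S;
        # log|2*pi*C| = d*log(2*pi) + log|C|.
        n, d = X.shape
        m = X.mean(axis=0)
        S = (X - m).T @ (X - m) / n
        Cinv = np.linalg.inv(C)
        _, logdet = np.linalg.slogdet(C)
        return -n / 2 * (d * np.log(2 * np.pi) + logdet
                         + np.trace(Cinv @ S)
                         + (m - u) @ Cinv @ (m - u))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 13))    # 100 frames of 13-dim features, made up
    u = rng.normal(size=13)
    A = rng.normal(size=(13, 13))
    C = A @ A.T + 13 * np.eye(13)     # a valid (positive definite) covariance
    assert np.isclose(loglik_per_frame(X, u, C), loglik_suff_stats(X, u, C))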
-
12:00pm, Fri. Feb.4, LTI Blue Room:
Duration Modeling in Large Vocabulary Speech Recognition, A. Anastasakos,
R. Schwartz, H. Shu, ICASSP95
Using Relative Duration in Large Vocabulary Speech Recognition,
M. Jones, P. C. Woodland, Eurospeech93
-
10:30am, Fri. Jan.28, ISL Lab: The 1999 NIST Speaker Recognition Evaluation,
Using Summed Two-Channel Telephone Data for Speaker Detection and Speaker
Tracking, M. A. Przybocki, A. F. Martin
-
1:30pm, Thur. Nov.11, NSH4632:
Unified Decoding and Feature Representation for Improved Speech Recognition,
Li Jiang, Xuedong Huang, Eurospeech99
-
Interesting idea. Maybe data-dependent feature switching is different from
phone- or context-dependent feature combination?
-
The moral seems to be to choose the most confident feature. The normalization
factor doesn't feel right; there are better ways to scale scores from
different sources.
-
A comparison with ROVER (with confidence) would be fairer.
-
Improvements on Speech Recognition for Fast Talkers, M. Richardson, M. Hwang,
A. Acero, X. D. Huang, Eurospeech99
-
CLN compensates for both duration and dynamic features.
-
It seems speech rate estimation is more critical than the various compensation
techniques. The two sentence-level speech rate estimators can both be thought
of as averaging individual phone stretch factors, but one works and one
doesn't. It's hard to tell which is correct without doing recognition runs.
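As a toy illustration (the paper's actual estimators aren't spelled out here,
so the two schemes below are only guesses at what they might look like): an
unweighted mean of per-phone stretch factors and a ratio of total durations,
which is a duration-weighted mean of the same factors, can disagree on the
same sentence:

    durations = [80, 90, 120, 40]   # observed phone durations (ms), made up
    expected  = [100, 50, 100, 50]  # reference mean durations, made up

    stretch = [d / e for d, e in zip(durations, expected)]

    rate_unweighted = sum(stretch) / len(stretch)    # mean of per-phone factors
    rate_weighted = sum(durations) / sum(expected)   # ratio of totals

    print(rate_unweighted, rate_weighted)            # 1.15 vs. 1.1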
-
Is ML speech rate estimation possible (VTLN-like)?
-
It's surprising that the resampling method doesn't work as well as frame
interpolation. LM/AM score balancing might be another issue.
-
1:30pm, Thur. Oct.28, WeH4603: Combining
Words and Prosody for Information Extraction from Speech, Dilek Hakkani-Tür,
Gökhan Tür, Andreas Stolcke, Elizabeth Shriberg, Eurospeech99
-
First paper that strongly demonstrates the usefulness of prosodic cues in
three problems: sentence segmentation, topic segmentation, and NE extraction.
Their numbers compare favorably against LM methods that are considered state
of the art on the BN corpus. It might be interesting to apply the approach to
summarization, parsing, etc.
-
BN speech is very well behaved because anchors have to observe certain pause
and prosody constraints during topic changes, so it's understandable that
these are important features.
-
Presents a way (parallel paths within a phoneme) to model trajectories within
the traditional HMM framework. Segmental modeling provides more elaborate ways
of modeling trajectories, but at the cost of departing from efficient,
traditional HMM algorithms.
Related papers: Generalized Mixture of HMMs for Continuous Speech Recognition,
Filipp Korkmazskiy, Biing-Hwang Juang and Frank Soong, ICASSP97
-
One explanation for the improvement might be that different paths correspond
to different speakers, different speaking rates, etc. For example, assuming
a 2-path configuration with no gender normalization in the front end, the
2 paths might just end up as one for male speakers and one for female speakers.
-
Also discussed normalization issues for segment-based (vs. frame-based)
recognition. Anti-units seem to be the predominant approach: for a certain
hypothesis (seg1, seg2, ...), with the corresponding feature vectors
(x1, x2, ...), the score is
P(x1 | seg1) * P(x2 | seg2) * ... /
( P(x1 | antiSeg1) * P(x2 | antiSeg2) * ... )
The idea is explained in detail in Segmentation and Modeling in
Segment-based Recognition, Jane W. Chang and James R. Glass, Eurospeech97
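A minimal sketch of the scoring rule (the names and toy log-likelihoods are
made up): in the log domain the ratio becomes a per-segment log-likelihood
difference summed over the hypothesis:

    def normalized_score(seg_loglikes, anti_loglikes):
        # sum_i [ log P(xi | segi) - log P(xi | antiSegi) ]
        return sum(s - a for s, a in zip(seg_loglikes, anti_loglikes))

    seg_ll  = [-42.0, -37.5, -51.2]   # log P(xi | segi), toy numbers
    anti_ll = [-45.0, -36.0, -55.0]   # log P(xi | antiSegi), toy numbers
    print(normalized_score(seg_ll, anti_ll))   # 3.0 - 1.5 + 3.8 ~= 5.3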
Papers that should've been read but haven't been ...
From HMM's to Segment Models: A Unified View of Stochastic Modeling
for Speech Recognition, Mari Ostendorf, V. Digalakis, Owen Kimball,
IEEE Trans. on Speech and Audio Processing, Vol. 4, No. 5, Sep. 1996