Speech Reading Group
When: 12:00pm every Friday
Comments/suggestions to hyu@cs.cmu.edu,
qjin@cs.cmu.edu
-
12:00pm, Fri. Mar.24, LTI Blue Room: Laura Tomokiyo on non-native speech recognition
-
12:00pm, Fri. Mar.17, LTI Blue Room: Christian Fuegen on integrating
dialect, speech rate, and SNR into decision trees
-
12:00pm, Fri. Mar.10, LTI Blue Room: LDA & QDA/FDA/PDA/MDA
-
12:00pm, Fri. Mar.3, ISL Lab: Multimodal People ID
-
12:00pm, Fri. Feb.25, LTI Blue Room: Speech Synthesis Systems
-
12:00pm, Fri. Feb.11, LTI Blue Room:
Text-Independent Speaker Identification, H. Gish and M. Schmidt,
IEEE Signal Processing Magazine, pp. 18-32, Oct. 1994
-
There's a fast way to calculate the log-likelihood of a test utterance
with respect to a Gaussian model. Rather than summing the log-likelihood
over individual frames, we can compute the overall log-likelihood in a
single step from the sample mean (m) and sample covariance (S) of the n frames:
logL(X; u, C) = -n/2 * { log|2*pi*C| +
tr(inv(C)*S) + (m-u)'*inv(C)*(m-u) }
The key to the proof is:
tr(x*y') = y'*x, for any column vectors x and y,
which leads to
tr(inv(C)*S) = 1/n * Sum{ x'*inv(C)*x } when Mean(X) = 0,
since S = 1/n * Sum{ x*x' } in that case.
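A minimal numpy sketch (data, dimensions, and function names are made up for
illustration) checking that the sufficient-statistics form matches the
frame-by-frame sum:

    import numpy as np

    def loglik_per_frame(X, u, C):
        # Frame-by-frame: sum_i log N(x_i; u, C).
        n, d = X.shape
        Cinv = np.linalg.inv(C)
        _, logdet = np.linalg.slogdet(C)
        diff = X - u
        quad = np.einsum('ij,jk,ik->i', diff, Cinv, diff)  # per-frame x'*inv(C)*x
        return -0.5 * (n * (d * np.log(2 * np.pi) + logdet) + quad.sum())

    def loglik_suff_stats(X, u, C):
        # Single step from sample mean m and sample covariance S;
        # log|2*pi*C| = d*log(2*pi) + log|C|.
        n, d = X.shape
        m = X.mean(axis=0)
        S = (X - m).T @ (X - m) / n
        Cinv = np.linalg.inv(C)
        _, logdet = np.linalg.slogdet(C)
        return -n / 2 * (d * np.log(2 * np.pi) + logdet
                         + np.trace(Cinv @ S)
                         + (m - u) @ Cinv @ (m - u))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 13))    # 100 frames of 13-dim features, made up
    u = rng.normal(size=13)
    A = rng.normal(size=(13, 13))
    C = A @ A.T + 13 * np.eye(13)     # a valid (positive definite) covariance
    assert np.isclose(loglik_per_frame(X, u, C), loglik_suff_stats(X, u, C))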
-
12:00pm, Fri. Feb.4, LTI Blue Room:
Duration Modeling in Large Vocabulary Speech Recognition, A. Anastasakos,
R. Schwartz, H. Shu, ICASSP95
Using Relative Duration in Large Vocabulary Speech Recognition,
M. Jones, P. C. Woodland, Eurospeech93
-
10:30am, Fri. Jan.28, ISL Lab: The 1999 NIST Speaker Recognition Evaluation,
Using Summed Two-Channel Telephone Data for Speaker Detection and Speaker
Tracking, M. A. Przybocki, A. F. Martin
-
1:30pm, Thur. Nov.11, NSH4632:
Unified Decoding and Feature Representation for Improved Speech Recognition,
Li Jiang, Xuedong Huang, Eurospeech99
-
Interesting idea. Maybe data-dependent feature switching is different from
phone- or context-dependent feature combination?
-
The moral seems to be to choose the most confident feature. The normalization
factor doesn't feel right; there are better ways to scale scores from
different sources.
-
A comparison with ROVER (with confidence) would be fairer.
-
Improvements on Speech Recognition for Fast Talkers, M. Richardson, M. Hwang,
A. Acero, X. D. Huang, Eurospeech99
-
CLN compensates for both duration and dynamic features.
-
It seems speech rate estimation is more critical than the various compensation
techniques. The two sentence-level speech rate estimators can both be thought
of as averaging individual phone stretch factors, but one works and one
doesn't. It's hard to tell which is correct without doing recognition runs.
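As a toy illustration (the paper's actual estimators aren't spelled out here,
so the two schemes below are only guesses at what they might look like): an
unweighted mean of per-phone stretch factors and a ratio of total durations,
which is a duration-weighted mean of the same factors, can disagree on the
same sentence:

    durations = [80, 90, 120, 40]   # observed phone durations (ms), made up
    expected  = [100, 50, 100, 50]  # reference mean durations, made up

    stretch = [d / e for d, e in zip(durations, expected)]

    rate_unweighted = sum(stretch) / len(stretch)    # mean of per-phone factors
    rate_weighted = sum(durations) / sum(expected)   # ratio of totals

    print(rate_unweighted, rate_weighted)            # 1.15 vs. 1.1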
-
Is ML speech rate estimation possible (VTLN-like)?
-
It's surprising that the resampling method doesn't work as well as frame
interpolation. LM/AM score balancing might be another issue.
-
1:30pm, Thur. Oct.28, WeH4603: Combining
Words and Prosody for Information Extraction from Speech, Dilek Hakkani-Tür,
Gökhan Tür, Andreas Stolcke, Elizabeth Shriberg, Eurospeech99
-
First paper that strongly demonstrates the usefulness of prosodic cues in
three problems: sentence segmentation, topic segmentation, and NE extraction.
Their numbers compare favorably against LM methods that are considered state
of the art on the BN corpus. It might be interesting to apply the approach to
summarization, parsing, etc.
-
BN speech is very well behaved because anchors have to observe certain pause
and prosody constraints during topic changes, so it's understandable that
these are important features.
-
Presents a way (parallel paths within a phoneme) to model trajectories within
the traditional HMM framework. Segmental modeling provides more elaborate ways
of modeling trajectories, but at the cost of departing from efficient,
traditional HMM algorithms.
Related papers: Generalized Mixture of HMMs for Continuous Speech Recognition,
Filipp Korkmazskiy, Biing-Hwang Juang and Frank Soong, ICASSP97
-
One explanation for the improvement might be that different paths correspond
to different speakers, different speaking rates, etc. For example, assuming
a 2-path configuration with no gender normalization in the front end, the
2 paths might just end up as one for male speakers and one for female speakers.
-
Also discussed normalization issues for segment-based (vs. frame-based)
recognition. Anti-units seem to be the predominant approach: for a certain
hypothesis (seg1, seg2, ...), with the corresponding feature vectors
(x1, x2, ...), the score is
P(x1 | seg1) * P(x2 | seg2) * ... /
( P(x1 | antiSeg1) * P(x2 | antiSeg2) * ... )
The idea is explained in detail in Segmentation and Modeling in
Segment-based Recognition, Jane W. Chang and James R. Glass, Eurospeech97
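A minimal sketch of the scoring rule (the names and toy log-likelihoods are
made up): in the log domain the ratio becomes a per-segment log-likelihood
difference summed over the hypothesis:

    def normalized_score(seg_loglikes, anti_loglikes):
        # sum_i [ log P(xi | segi) - log P(xi | antiSegi) ]
        return sum(s - a for s, a in zip(seg_loglikes, anti_loglikes))

    seg_ll  = [-42.0, -37.5, -51.2]   # log P(xi | segi), toy numbers
    anti_ll = [-45.0, -36.0, -55.0]   # log P(xi | antiSegi), toy numbers
    print(normalized_score(seg_ll, anti_ll))   # 3.0 - 1.5 + 3.8 ~= 5.3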
Papers that should've been read but haven't been ...
From HMM's to Segment Models: A Unified View of Stochastic Modeling
for Speech Recognition, Mari Ostendorf, V. Digalakis, Owen Kimball,
IEEE Trans. on Speech and Audio Processing, Vol. 4, No. 5, Sep. 1996