Abstracts



The Blizzard Challenge 2006 (pdf)
 Christina L. Bennett and Alan W Black.
 Blizzard Challenge Workshop, satellite event of Interspeech 2006 - ICSLP, Pittsburgh, PA, 2006.

Last year the Blizzard Challenge 2005 introduced the speech synthesis community to the concept of large-scale, multi-site evaluation of TTS systems using common data. In this, the second year of the Blizzard Challenge, we again tackled this task. Participation increased dramatically: of the 18 sites that initially expressed interest, 14 from around the world submitted entries. In this paper we discuss the results, difficulties, and differences in this year’s Challenge.



Large Scale Evaluation of Corpus-based Synthesizers: Results and Lessons from the Blizzard Challenge 2005 (pdf)
 Christina L. Bennett.
 In Proceedings of Interspeech 2005 - Eurospeech, Lisbon, Portugal, 2005.

The Blizzard Challenge 2005 was a large-scale international evaluation of corpus-based speech synthesis systems using common datasets.  Six sites from around the world, both academic and industrial, participated in this evaluation, the first ever to compare voices built by different systems using the same data.  Here we describe the results of the evaluation and many of the observations and lessons learned in carrying it out.



The Blizzard Challenge 2005 CMU Entry, a method for improving speech synthesis systems (pdf)
 John Kominek, Christina Bennett, Brian Langner, and Arthur Toth.
 In Proceedings of Interspeech 2005 - Eurospeech, Lisbon, Portugal, 2005.

In CMU's Blizzard Challenge 2005 entry we investigated twelve ideas for improving Festival-based unit selection voices.  We tracked progress by adopting a 3-tiered strategy in which candidate ideas must pass through three stages of listening tests to warrant inclusion in the final build.  This allowed us to evaluate ideas consistently without large human resources at our disposal, and thereby to improve upon our baseline system within a short amount of time.



Prediction of Pronunciation Variations for Speech Synthesis: A Data-driven Approach (pdf)
 Christina L. Bennett and Alan W Black.
 In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, Pennsylvania, 2005.

The fact that speakers vary pronunciations of the same word within their own speech is well known, but little has been done to automatically categorize and predict a speaker’s pronunciation distribution for unit selection speech synthesis.  Recent work demonstrated how to automatically identify a speaker’s choice between full and reduced pronunciations using acoustic modeling techniques from speech recognition.  Here, we extend this approach and show how its results can be used to predict a speaker’s choice of pronunciations for synthesis.  We apply machine learning techniques to the automatically categorized data to produce a pronunciation variation prediction model given only the utterance text – allowing the system to synthesize novel phrases with variations like those the speaker would make.  Empirical studies show that we can improve automatic pronunciation labels and successfully utilize the results for prediction of future synthesized examples.  The prediction results based on these automatic labels are very similar to those trained from human-labeled data – allowing us to reduce manual effort while still achieving comparable results.



Using Acoustic Models to Choose Pronunciation Variations for Synthetic Voices (pdf)
 Christina L. Bennett and Alan W Black.
 In Proceedings of Eurospeech 2003, Geneva, Switzerland, 2003.

Within-speaker pronunciation variation is a well-known phenomenon; however, attempting to capture and predict a speaker's choice of pronunciations has been mostly overlooked in the field of speech synthesis. We propose a method to utilize acoustic modeling techniques from speech recognition in order to detect a speaker's choice between full and reduced pronunciations.



Evaluating and Correcting Phoneme Segmentation for Unit Selection Synthesis (pdf)
 John Kominek, Christina Bennett, and Alan W Black.
 In Proceedings of Eurospeech 2003, Geneva, Switzerland, 2003.

As part of improved support for building unit selection voices, the Festival speech synthesis system now includes two algorithms for automatic labeling of wavefile data. The two methods are based on dynamic time warping and HMM-based acoustic modeling. Our experiments show that DTW is more accurate 70% of the time, but is also prone to gross labeling errors. HMM modeling exhibits a systematic bias of 30 ms. Combining both methods directs human labelers towards the data most likely to be problematic.



The Carnegie Mellon Communicator Corpus (pdf)
 Christina Bennett and Alexander I. Rudnicky.
 In Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado, 2002.

As part of the DARPA Communicator program, Carnegie Mellon has, over the past three years, collected a large corpus of speech produced by callers to its Travel Planning system. To date, a total of 180,605 utterances (90.9 hours) have been collected. The data were used for a number of purposes, including acoustic and language modeling and the development of a spoken dialog system. The collection, transcription and annotation of these data prompted us to develop a number of procedures for managing the transcription process and for ensuring accuracy. We describe these, as well as some results based on these data. A portion of this corpus, covering the years 1999-2001, is being published for research purposes.



Building VoiceXML-Based Applications (pdf)
 Christina Bennett, Ariadna Font Llitjos, Stefanie Shriver, Alexander Rudnicky, and Alan W Black.
 In Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado, 2002.

The Language Technologies Institute (LTI) at Carnegie Mellon University has, for the past several years, conducted a lab course in building spoken-language dialog systems. In the most recent versions of the course, we have used (commercial) web-based development environments to build systems. This paper describes our experiences and discusses the characteristics of applications that are developed within this framework.



Task and Domain Specific Modelling in the Carnegie Mellon Communicator System (pdf)
 Alexander Rudnicky, Christina Bennett, Alan Black, Ananlada Chotimongkol, Kevin Lenzo, Alice Oh, and Rita Singh.
 In Proceedings of the International Conference on Spoken Language Processing, Beijing, China, 2000.

The Carnegie Mellon Communicator is a telephone-based dialog system that supports planning in a travel domain.  The implementation of such a system requires two complementary components: an architecture capable of managing the interaction and the task, and a knowledge base that captures the speech, language, and task characteristics specific to the domain.  Given a suitable architecture, the principal effort in development is taken up in the acquisition and processing of a domain knowledge base.  This paper describes a variety of techniques we have applied to modeling the acoustic, language, task, generation, and synthesis components of the system.



Data Collection and Processing in the Carnegie Mellon Communicator (pdf)
 Maxine Eskenazi, Alexander Rudnicky, Karin Gregory, Paul Constantinides, Robert Brennan, Christina Bennett, and Jwan Allen.
 In Proceedings of Eurospeech 1999, Budapest, Hungary, 1999.

In order to create a useful, gracefully functioning system for travel arrangements, we first observed the task as it is accomplished by a human.  We then imitated the human while leading the user to believe they were speaking with an automatic system.  As we gradually built our system, we devised ways to assess progress and to detect errors.  The following describes the manner in which the Carnegie Mellon Communicator was built, data were collected, and assessment was begun using these criteria.