
Festival: General examples

Audio files are 16-bit Microsoft .WAV at (mostly) 16 kHz sampling.

Simple text-to-speech

This is a short introduction to the Festival Speech Synthesis System. Festival was developed by Alan Black and Paul Taylor, at the Centre for Speech Technology Research, University of Edinburgh.

Festival currently uses a diphone synthesizer; both residual-excited LPC and PSOLA methods are supported. The upper levels, duration and intonation, are generated from statistically trained models built from databases of natural speech. The architecture of the system is designed to be flexible and includes various tools that allow new modules to be added easily.

Multi-lingual text-to-speech

Festival is a multilingual synthesizer. The default language may be set at start-up time or changed easily during a session.

This Welsh synthesizer was ported from a previous CSTR Welsh synthesizer.

Dwi'n gallu llefaru pob llinell heb atal, oherwydd does dim tafod gyda fi.
(I can speak every line without a stammer, as I have no tongue).

A Castilian Spanish synthesizer was built from diphones collected during an MSc project.

m'uchos 'a~nos despu'es, fr'ente al pelot'on de fusilami'ento, el coron'el aureli'ano buend'ia hab'ia de record'ar de aqu'el d'ia lej'ano, en que su p'adre lo llev'o a conoc'er el hi'elo.
(Many years later, in front of the firing squad, Col. Aureliano Buendía remembered that far-off day when his father took him to see the ice.) This is the first sentence of "One Hundred Years of Solitude" by Gabriel García Márquez.

Two German synthesizers were developed as part of a summer project at the Oregon Graduate Institute.

female male
Ihr naht euch wieder, schwankende Gestalten,
Die früh sich einst dem trüben Blick gezeigt.
Goethe Faust

Statistical text analysis aids speech synthesis

A statistical phrase break prediction system ensures that an even distribution of breaks is inserted, and that similar contexts for breaks are not confused.

He wanted to go for a drive in.
He wanted to go for a drive in the country.

A statistical part-of-speech tagger allows Festival to identify the correct pronunciation of homographs.

My cat who lives dangerously had nine lives.
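The idea can be sketched as a pronunciation lookup keyed on both the word and its part-of-speech tag. This is only an illustration: the tag names, the table, and the phone strings are assumptions made for the example and are not taken from Festival's lexicon or tagger.

```python
# Toy homograph lookup: the pronunciation depends on the (word, POS tag) pair.
# The tag names and phone strings are illustrative assumptions,
# not entries from Festival's lexicon.
HOMOGRAPHS = {
    ("lives", "VERB"): "l ih v z",  # as in "who lives dangerously"
    ("lives", "NOUN"): "l ay v z",  # as in "nine lives"
}


def pronounce(word, pos_tag):
    """Return the phone string for a homograph given its POS tag."""
    return HOMOGRAPHS[(word.lower(), pos_tag)]


# The tags are supplied by hand here; a real system would predict them
# with a statistical tagger.
tagged = [("cat", "NOUN"), ("lives", "VERB"), ("nine", "NUM"), ("lives", "NOUN")]
readings = [pronounce(w, t) for w, t in tagged if (w, t) in HOMOGRAPHS]
```

Both occurrences of "lives" in the sentence above share a spelling, so only the tag separates the two readings.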

Certain character sequences may be a Roman numeral pronounced as a simple number, as an ordinal, or as a letter sequence. The cases can be differentiated by a statistically trained model that takes the context into account.

Henry V: Part I Act II Scene XI: Mr X is I believe, V I Lenin, and not Charles I.
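To make the three readings concrete, here is a minimal sketch that classifies a Roman-numeral token from the preceding word alone. The hand-written rules and word lists are assumptions standing in for the trained model, which is not reproduced here. The sketch deliberately shows the limits of such rules: it would misread the initials in "V I Lenin", which is the kind of case that motivates a statistical model.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Hypothetical context word lists; a trained model would learn such cues.
SECTION_WORDS = {"part", "act", "scene", "chapter"}
TITLES = {"mr", "mrs", "ms", "dr"}


def roman_to_int(token):
    """Convert a well-formed Roman numeral such as 'XI' to an integer."""
    total = 0
    # Pair each character with its successor; a space marks the end.
    for ch, nxt in zip(token, token[1:] + " "):
        value = ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtracted (IV = 4).
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def classify(prev_word, token):
    """Guess how to read `token`: 'number', 'ordinal', or 'letter'."""
    prev = prev_word.lower().rstrip(".")
    if prev in SECTION_WORDS:
        return "number"    # "Act II" -> "act two"
    if prev in TITLES:
        return "letter"    # "Mr X" -> "mister ex"
    if prev_word[:1].isupper():
        return "ordinal"   # "Henry V" -> "Henry the fifth"
    return "letter"
```

For example, `classify("Scene", "XI")` yields `"number"`, and `roman_to_int("XI")` then supplies the value to be spoken.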

TTS modes

Special modes, including tokenization, lexicon, and prosody, can be built to deal with special types of text. For example, if you are to read a list of addresses, it is better to do so in an address mode.

Raw default TTS engine analysis
Using special address mode
Smith, Bobbie Q, 3337 St Laurence St, Fort Worth, TX 71611-5484, (817)839-3689
Anderson, W, 445 Sycamore Way NE, Lincoln, NE 98125-5108, (212)404-9988
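As a rough illustration of what mode-specific tokenization might involve, the sketch below expands the ambiguous abbreviation "St" by position and reads ZIP codes and phone numbers digit by digit. The function names and rules are assumptions made for this example; they do not describe how Festival's address mode is implemented.

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def spell_digits(text):
    """Read a ZIP code or phone number digit by digit, skipping punctuation."""
    return " ".join(DIGITS[int(ch)] for ch in text if ch.isdigit())


def expand_st(field):
    """Expand 'St' as 'Saint' before a name and as 'Street' at the end."""
    words = field.split()
    out = []
    for i, word in enumerate(words):
        if word == "St":
            out.append("Saint" if i + 1 < len(words) else "Street")
        else:
            out.append(word)
    return " ".join(out)


street = expand_st("3337 St Laurence St")
zip_code = spell_digits("71611-5484")
phone = spell_digits("(817)839-3689")
```

A general-purpose tokenizer has no reason to prefer either expansion of "St", whereas a mode that knows it is reading addresses can use the position within the field.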

Unit selection

To improve the quality of the waveform itself, we can select sub-word units from a larger corpus rather than using simply one example of each diphone. This is an example from a new implementation of Hunt and Black, ICASSP 96.

This is a short introduction to the Festival Speech Synthesis System.
This is a short introduction to the Festival Speech Synthesis System.

The first was produced from 460 (TIMIT) phonetically balanced sentences, using only phonetic context and pitch as selection features, with hand-tuned weights. No signal processing was applied to modify the pitch or duration of the selected units. The selected units typically contain 2-3 phones. The second example was synthesized using a diphone database from the same speaker. Only the waveform synthesizers differ; that is, both use the same target phones.
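Selection of this kind can be sketched as a Viterbi search that trades off a target cost (how well a corpus unit matches the desired phonetic context and pitch) against a join cost (how smoothly two chosen units concatenate). The code below is a minimal sketch under stated assumptions: the weights, the fixed join penalty, and the dictionary fields are hypothetical, and it is not the implementation used for the samples above.

```python
# Hypothetical hand-tuned weights for the two selection features.
W_CONTEXT = 1.0
W_PITCH = 0.01
JOIN_PENALTY = 0.5  # assumed fixed cost for any non-contiguous join


def target_cost(target, unit):
    """Weighted mismatch between the wanted unit and a corpus unit."""
    context_miss = 0.0 if unit["context"] == target["context"] else 1.0
    return W_CONTEXT * context_miss + W_PITCH * abs(unit["pitch"] - target["pitch"])


def join_cost(left, right):
    """Units adjacent in the corpus join for free; others pay a penalty."""
    if right["index"] == left["index"] + 1:
        return 0.0
    return JOIN_PENALTY + W_PITCH * abs(left["pitch"] - right["pitch"])


def select_units(targets, candidates):
    """Viterbi search; candidates[i] lists the corpus units usable for targets[i]."""
    # Each entry is (cumulative cost, path of units) ending in one candidate.
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for target, units in zip(targets[1:], candidates[1:]):
        step = []
        for u in units:
            # Cheapest way to reach u from any path kept so far.
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda item: item[0],
            )
            step.append((cost + target_cost(target, u), path + [u]))
        best = step
    return min(best, key=lambda item: item[0])[1]
```

Because contiguous corpus units join at zero cost, the search tends to pick runs of neighbouring units, which is consistent with selected units spanning several phones.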

A different technique for finding appropriate units is described in Black and Taylor 97 (available as PostScript and HTML). Here appropriate sub-word units (diphones or demi-phones) are clustered using an acoustic measure.

Original non-synthesized utterance.
Copy synthesis from natural phones, F0 and duration.
Example of full TTS, from a synthesizer built from BU Radio FM corpus.

Note: In both of the above techniques the good examples are good, but the bad examples are much worse than diphones alone. These techniques still need further research before they are stable enough to produce high quality all the time.


This page is maintained by Alan W Black awb@cs.cmu.edu