As it has become apparent that deploying language technologies in the field is desirable, practical issues such as the size, weight, and durability of the device have become a priority. World events have also brought into focus a completely new set of target languages.
The research described in this paper addresses the dual concerns of synthesis of Arabic, a language that has shot to prominence in the past few years, and synthesis on a handheld device, realization of which presents difficult software engineering problems. Our system was developed in conjunction with the DARPA BABYLON project, and has been integrated with English synthesis, English and Arabic ASR, and machine translation on a single off-the-shelf PDA.
The Arabic language offers a number of challenges for speech synthesis. In the written language, vowels are represented partially at best, and must be inferred. Naturally, this is a problem in the generation phase, when one must know which vowel is to be synthesized. It is also a problem in training. In a concatenative synthesis system such as ours, a database is ordinarily annotated at the phoneme level; one must choose either to work from a traditional text and label only consonants, or to transcribe the text phonetically in order to include vowel labels.
Speech synthesis is the ``face'' of a speech-to-speech translation system. Not only must it speak the text given to it intelligibly, but it must also say it in the right register, with the right gender, and in the right dialect. The design of a speech synthesis system must therefore anticipate the reaction of the target listener and adjust its output accordingly. This is particularly true for sensitive contexts like those faced by BABYLON.
In this paper we describe the rapid development of a small-footprint Arabic voice, focusing on the challenges encountered.