[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter7.html .]
[Please send any comments to Robert Frederking (ref+@cs.cmu.edu, Web document maintainer) or Ed Hovy or Nancy Ide.]
Chapter 7
Speaker-Language Identification
and Speech Translation
Editor: Gianni Lazzari
Contributors:
Gianni Lazzari
Robert Frederking
Wolfgang Minker
Abstract
Significant progress has been made in the various tasks of speaker identification, speaker verification, and spoken and written language identification--the last being a completely solved problem. The translation of spoken language, however, remains a significant challenge. Progress in speaker identification and verification is hampered by problems in speaker variability due to stress, illness, etc. Progress in spoken language translation is hampered not only by the traditional problems of machine translation (see Chapter 4), but also by ill-formed speech, non-grammatical utterances, and the like. It is likely to remain a significant problem for some time to come.
7.1 Definitions
Automatic Speaker Identification (Speaker ID) is the ability of a machine to determine the identity of a speaker given a closed set of speakers. It is therefore an n-class task. Speaker Verification (Speaker VE), on the other hand, is a single-target open-set task, since it is the ability of a machine to verify whether a speaker is who he or she claims to be. Both problems can be seen as instances of a more general speaker recognition (Speaker RE) problem. Speaker recognition can be text-dependent or text-independent. In the former case the text is known, i.e., the system employs a sort of password procedure. Knowledge of the text enables the use of systems that combine speech and speaker recognition, whereby the customer is asked to repeat one or more sentences randomly drawn from a very large set of possible sentences. In the case of text-independent speaker recognition, the acceptance procedure should work for any text. Traditionally this problem is related to security applications. More recent application areas include broadcast news annotation and documentation.
Automatic Spoken Language Identification (Language ID) is the ability of a machine to identify the language being spoken from a sample of speech by an unknown speaker (Muthusamy et al., 1994a). The human is by far the best language ID system in operation today. People who know the language being spoken are able to identify it positively within a few seconds; even when they do not know the language, they can often make judgments such as "sounds like French". Several important applications already exist for language ID. A language ID system can be used as a front end by a telephone-based international company, routing the caller to an appropriate operator fluent in the caller's language; such systems can serve businesses, the general public, and police departments handling 911 emergency calls.
Automatic Text Language Identification is a solved problem. Several techniques exist, and it is possible to get near 100% accuracy on just ten words of input. For large sets of languages, this should surpass human abilities. The best technique seems to be training a classifier on documents in the different languages (a machine learning technique).
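To make the classifier idea concrete, the following is a minimal sketch (not a production system) of text language ID based on character n-gram counts, assuming scikit-learn is available; the three training snippets are toy placeholders rather than real corpora.

```python
# Minimal sketch of trainable text language ID using character n-grams.
# The tiny training texts below are toy placeholders, not real corpora.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the cat sat on the mat and looked at the dog",         # English
    "le chat est assis sur le tapis et regarde le chien",   # French
    "die katze sitzt auf der matte und sieht den hund an",  # German
]
train_labels = ["en", "fr", "de"]

# Character 1- to 3-grams capture language-specific letter statistics.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

print(model.predict(["der hund sieht die katze"]))  # likely ['de'] on this toy data
```

Trained on realistic document collections, this kind of model is what reaches near-perfect accuracy on very short inputs.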
Spoken Language Translation (SLT) is the ability of a machine to interpret a multilingual human-human spoken dialog. The feasibility of spoken language translation is strongly related to the scope of application, which ranges from interpretation of the speaker's intent in a narrow domain to unrestricted, fully automatic simultaneous translation. The latter is not feasible in the foreseeable future. Historically Machine Translation (MT) applications (see
Chapter 4) have been divided into two classes. Spoken Language Translation seems to belong to a different class of applications: communication between two individuals (see also
Chapter 5). Often a narrow domain may be sufficient, but it is hard to control style. Bidirectional, real-time operation is necessary, but fairly low quality is acceptable if communication is achieved. An MT system need not produce an absolutely correct translation, as long as it produces an expression in the target language that is sufficient for the dialogue situation.
7.2 Where We Were Five Years Ago -- Speaker ID
7.2.1 Capabilities
The field of speaker recognition shows considerable activity in research institutions and industry, including AT&T, BBN, NTT, TI, the Dalle Molle Institute, ITC-IRST, MIT Lincoln Labs, Nagoya University, and National Tsing Hua University of Taiwan. In the US, NIST and NSA have conducted speaker recognition system evaluations and assessments (Campbell, 1997).
As discussed in
Chapter 5, speech is a very complex signal occurring as a result of several transformations at different levels: semantic, linguistic, articulatory, and acoustic. Differences in these transformations appear as differences in the spectral properties of the speech signal. Speaker-related differences are a result of a combination of anatomical differences inherent in the vocal tract and the learned speaking habits of different individuals. In speaker recognition, all these differences can be used to discriminate between speakers.
The general approach to Speaker RE consists of three steps: digital speech data acquisition, parameter extraction, and pattern matching (which implies an enrollment phase to generate reference models). In the case of Speaker VE, an additional step concerns the decision to accept or reject the claimed speaker.
7.2.2 Major Methods, Techniques, and Approaches
Speech information is primarily conveyed by the short-time spectrum, the spectral information contained in an interval of 10-30 ms. While the short-term spectra do not completely characterize the speech production process, the information they carry is basic to many speech processing systems, including speech and speaker recognition. There are many methods for characterizing a short-time spectrum, but the dominant features used in previous and current systems are cepstral and delta-cepstral parameters derived by filterbank analysis.
Other features often found in the literature are based on the computation of Linear Prediction Coefficients (LPC), from which different parameters can be derived, e.g., the LPC cepstrum, reflection coefficients, and log area ratios. Prosodic features, such as pitch and duration, have been proposed in the past, and methods based on nonlinear discriminant analysis (NLDA) (Gnanadesikan and Kettenring, 1989) have also been evaluated.
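As an illustration of the dominant feature type mentioned above, the sketch below extracts cepstral and delta-cepstral parameters with 25 ms analysis windows and a 10 ms shift. It assumes the librosa toolkit (not mentioned in this chapter) and uses a placeholder file name.

```python
# Sketch of short-time cepstral feature extraction (MFCC + delta),
# using librosa as one possible toolkit; "speech.wav" is a placeholder file.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)

# 25 ms analysis windows with a 10 ms shift, i.e. spectra over roughly 10-30 ms.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
delta = librosa.feature.delta(mfcc)   # delta-cepstral (time-derivative) features

print(mfcc.shape, delta.shape)        # (13, n_frames) each
```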
While the past seven years have seen no great change in the feature selection component of speaker recognition systems, the pattern matching component has followed the trend of the speech recognition area. The methods of VQ (Vector Quantization), DTW (Dynamic Time Warping), and NN (Nearest Neighbors) are now less common than HMMs (Hidden Markov Models) and ANNs (Artificial Neural Networks). In general, statistical modeling seems to deliver the best results when robustness and adaptation are mandatory, i.e., in almost all real applications: over the telephone, with a large number of target speakers (Gish and Schmidt, 1994).
A very popular technique adopts an unsupervised representation of the target speaker. Two models are used. The first, which is dominant in speaker recognition, is the Adapted Gaussian Mixture Model: the background model is a speaker-independent Gaussian Mixture Model (GMM) (Reynolds, 1995; Reynolds and Rose, 1995), while the target model is derived from the background model by Bayesian adaptation. The second is based on Unadapted Gaussian Mixture Models; in this case the target model is trained using ML (Maximum Likelihood) estimation. Other widespread techniques include ergodic HMMs, unimodal Gaussians, and auto-regressive vectors. Mixture modeling is similar to VQ identification in that voices are modeled by components or clusters. Model components may be "phonetic units" learned through supervised labeling by a continuous speech recognizer; speaker-independent and speaker-dependent likelihood scores are then compared. Model components could also be "broad phonetic classes" obtained by a suitable recognizer, in which case target and background label matches are compared.
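A minimal sketch of the adapted-GMM idea follows: a speaker-independent background GMM is trained, its means are adapted towards the target speaker's enrollment data, and a test utterance is scored by a log-likelihood ratio. It uses scikit-learn, random vectors in place of real cepstral frames, and a simple mean-only adaptation; actual systems differ in detail.

```python
# Sketch of GMM-UBM speaker verification with MAP mean adaptation,
# in the spirit of adapted Gaussian mixture models.
# Random data stands in for real cepstral feature frames.
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 13))   # background (UBM) training frames
target = rng.normal(0.5, size=(300, 13))   # enrollment frames of the target speaker
test = rng.normal(0.5, size=(100, 13))     # test utterance frames

ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(background)

# Bayesian (MAP) adaptation of the UBM means towards the target data.
relevance = 16.0
resp = ubm.predict_proba(target)                  # (n_frames, n_components)
n_c = resp.sum(axis=0) + 1e-10
e_c = resp.T @ target / n_c[:, None]              # per-component data means
alpha = (n_c / (n_c + relevance))[:, None]
speaker = copy.deepcopy(ubm)
speaker.means_ = alpha * e_c + (1.0 - alpha) * ubm.means_

# Verification score: average log-likelihood ratio of target vs. background model.
llr = speaker.score_samples(test).mean() - ubm.score_samples(test).mean()
print("LLR score:", llr)
```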
Temporal decomposition plus neural nets have also been exploited: MLPs are trained to discriminate target and non-target VQ-labeled training data. This technique has not benefited speaker recognition, but has proven useful in language recognition.
Two other techniques--normalization and fusion--have been pursued in order to improve robustness. Normalization with respect to speaker and handset is very important in order to overcome the mismatch between training and test conditions. In particular, handset type mapping (electret to carbon-button speech or vice-versa) is of great importance, given the degree of mismatch caused by handset differences.
Fusion is also very important in order to increase system performance, especially in the case of secure speaker verification. The most important methods, used in typical pattern recognition systems, are linear combinations of systems, voting schemes, and MLP-based fusion.
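As a sketch of the simplest of these methods, a linear combination of per-system scores with a fixed threshold might look as follows; the weights and threshold are illustrative only and would normally be tuned on held-out data (or replaced by a learned fuser such as an MLP).

```python
# Sketch of simple linear score fusion across several verification systems.
# Weights and threshold are illustrative placeholders, not tuned values.
import numpy as np

def fuse(scores, weights):
    """Linearly combine per-system scores for one verification trial."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(weights @ scores / weights.sum())

system_scores = [0.8, 0.3, 0.6]   # e.g. acoustic, prosodic, and biometric systems
weights = [0.5, 0.2, 0.3]
threshold = 0.5

fused = fuse(system_scores, weights)
print("accept" if fused > threshold else "reject", fused)
```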
7.2.3 Major Bottlenecks and Problems
The major bottlenecks and problems are related to the same factors that occur in speech recognition.
A large number of people could be potential users of these systems. There is high intra-speaker variability over time due to health (respiratory illness, laryngitis, etc.), stress, emotional factors, speech effort and speaking rate, aging, gender, etc. Moreover, telephone channel variability, noise, and microphone/handset differences have to be taken into account. Difficulties also arise when dealing with the effects of the communication channel through which the speech is received. This variability distorts the speakers' patterns in the feature space, increasing confusion. Crosstalk is another type of event that increases variability.
7.3 Where We Will Be in Five Years -- Speaker ID
7.3.1 Expected Capabilities
Speaker recognition is adopted because some other technology or application demands it. Limited performance may be acceptable in some applications; in other situations, Speaker RE can be used together with other technologies, such as biometrics (Brunelli and Falavigna, 1995) or word recognition. The performance of current speaker recognition systems is suitable for many practical applications. Existing products already on the market (Campbell, 1997) are mainly used for speaker verification applications such as access control, telephone credit cards, and banking. The number of speaker verification applications will grow in the next five years, and this will drive research towards more robust modeling, in order to cover unexpected noise and acoustic events as far as possible and to keep the number of successful high-tech thieves as low as possible.
Two new classes of applications seem to be growing in demand: speech archiving and broadcast news documentation (including wire tapping), and multimodal interaction. In the first case the problem is to identify and track a speaker, in batch mode or in real time, in a radio, video, or more generally multimedia archive. In the second, knowing who is interacting will not only log the user on to some service but also help the system provide better user modeling. For more details, see
Chapter 9. In general, speaker identification will be a value-added function wherever a spoken communication channel is available.
7.3.2 Expected Methods and Techniques (Breakthroughs)
Progress has generally been gradual. Comparing different systems presented in the literature is hard, given the different kinds and amounts of data used for testing and the different types of tests, e.g., the binary-choice verification task versus the multiple-choice identification task. Nevertheless, the general trend shows accuracy improvements over time, from seven years ago, with larger data sets; over the last five years data set sizes have increased by a factor of ten or even more. Error rates range from 0.3% (for both speaker verification and identification in text-dependent mode, on non-telephone speech, with a minimum of 2 seconds of speech data) to 16% (for speaker verification in text-independent mode, on telephone-quality speech with mismatched handsets and at least 3 seconds of speech). In the latter case, the verification error drops to 5% after 30 seconds of speech. For a detailed analysis of recent results see
http://www.nist.gov/speech/spkrec98.htm and (Martin and Przybocki, 1998).
From the point of view of new techniques and approaches, novel training and learning methods devoted to broader coverage of different and unexpected acoustic events will be necessary. New feature selection methods and stochastic modeling will grow in importance, taking into account the better performance they can offer when more flexibility is required. Stochastic modeling, as known from speech recognition, also offers a more theoretically meaningful probabilistic score.
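For readers unfamiliar with how the verification error rates quoted earlier in this subsection are measured, the following sketch computes an equal error rate (EER) by sweeping a decision threshold over synthetic genuine and impostor scores; real evaluations use the NIST trial protocols rather than this toy setup.

```python
# Sketch of how verification error rates are measured from trial scores:
# sweep a threshold and find the equal error rate (EER). Scores are synthetic.
import numpy as np

rng = np.random.default_rng(1)
target_scores = rng.normal(1.0, 1.0, 1000)      # genuine-speaker trials
impostor_scores = rng.normal(-1.0, 1.0, 1000)   # impostor trials

thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
fr = np.array([(target_scores < t).mean() for t in thresholds])     # false reject rate
fa = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false accept rate

i = np.argmin(np.abs(fr - fa))
print(f"EER ~ {(fr[i] + fa[i]) / 2:.3f} at threshold {thresholds[i]:.2f}")
```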
Moreover, if new applications are envisaged, methodologies for data fusion will be necessary.
7.3.3 Expected Bottlenecks
A major bottleneck for the future is data availability and data collection, both for training and for testing. Fast adaptation methods and efficient training will be critical in the near future; if such methods become available, a great expansion of real-world applications will occur, especially in the field of speaker verification.
7.4 Where We Were Five Years Ago -- Language ID
7.4.1 Capabilities
Advances in spoken language understanding and the need for global communication have increased the importance of Language Identification, making feasible multilingual information services for tasks such as checking into a hotel, arranging a meeting, or making travel arrangements, which are difficult actions for non-native speakers. Telephone companies can handle foreign-language calls when a Language ID system is available; this is particularly important for routing (emergency) telephone calls.
This research subject has flourished in the last four or five years. In the past, language ID was a niche research topic, with few studies and an incoherent picture (Muthusamy et al., 1994a).
In March 1993, the OGI_TS database (Muthusamy et al., 1992) was designated by NIST as the standard for evaluating Language ID research and algorithms. Since that time, many institutions have contributed to this field and have participated in evaluations. Although Language ID has become a major interest only recently, the field has since been able to establish objective comparisons among the various approaches.
7.4.2 Major Methods, Techniques, and Approaches
Before describing methods and techniques, it is necessary to define the sources of information useful for Language ID. It is also very important to understand how humans are able to identify languages.
Generally, in speech recognition, acoustic unit modeling is sufficient for decoding the content of a speech signal. The problem is that in text-independent Language ID, phonemes or other subword units are not sufficient cues to determine the "acoustic signature" of a language. Different sources of information are necessary to identify a language, the most important obviously being the vocabulary: above all, languages are distinct because they use different sets of words. Non-native speakers of a language may use the phonetic inventory and prosodic features of their native language, e.g., German, and thus be identified as German speakers. Second, acoustic phonetics and phonotactics differ from language to language. Finally, the duration of phones, intonation, and speech rate are typical language cues, making prosody an important source of information for Language ID.
Perceptual studies provide benchmarks for evaluating machine performance. It is known that humans use very short, selected speech events, where the choice is based on several different sources of information. While many experiments have provided interesting results (Muthusamy et al., 1994b), the differences among subjects and languages make it difficult to determine the features that humans use to distinguish among unfamiliar languages.
Language ID is highly related to speech recognition and speaker ID in many ways. Both acoustic modeling (AM) and language modeling (LM) in speech recognition have strong relations with AM and LM in Language ID.
The basic architecture of a Language ID system (Zissman, 1996) is a cascade of the following components: an acoustic representation (parameter extraction), followed by a pattern recognition component that performs an alignment, taking into account two sources of knowledge, an acoustic model and a language model. The alignment procedure produces a language score. Approaches differ first in their acoustic representation (are prosodic features used?) and second in their acoustic modeling (is there a single stochastic model per language, several acoustic unit models per language, or stochastic grammars of unit sequences?) (Nakagawa et al., 1994).
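A minimal sketch of the scoring stage of such a cascade is given below, using phone-bigram "language models" per language; the acoustic front end that produces the phone string is abstracted away, and the tiny phone corpora are placeholders, not real data.

```python
# Sketch of the scoring stage of a Language ID cascade: each language is
# represented by a phone-bigram model, and the decoded phone string from the
# acoustic front end (abstracted away here) is scored against each model.
import math
from collections import Counter

def train_bigram(phone_strings):
    """Add-one smoothed phone-bigram model built from training transcriptions."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for s in phone_strings:
        phones = s.split()
        vocab.update(phones)
        unigrams.update(phones[:-1])
        bigrams.update(zip(phones, phones[1:]))
    return bigrams, unigrams, vocab

def score(model, phones):
    """Log-probability of a phone sequence under one language's bigram model."""
    bigrams, unigrams, vocab = model
    v = len(vocab) + 1
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
               for a, b in zip(phones, phones[1:]))

# Toy phone-sequence "corpora" per language (placeholders for real data).
models = {
    "en": train_bigram(["dh ax k ae t", "s ih t s aa n"]),
    "de": train_bigram(["d iy k a ts ax", "z i ts t a uf"]),
}
test = "dh ax d ao g s ih t s".split()
print(max(models, key=lambda lang: score(models[lang], test)))
```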
Finally, another important distinction is whether the system is text-dependent or not. For text-independent systems, such as front ends for services like 911 in which Language ID connects the user with the right human translator, it is generally not feasible to build word models in each of the target languages. For text-dependent systems, such as a front-end Language ID for multilingual information services (e.g., flight or train information), an implicit identification can be obtained: the trained recognizers of the languages to be identified, the lexicon, and the language models are combined into one multilingual recognizer. Each hypothesized word chain is made only of words from one language, and language identification is an implicit by-product of the speech recognizer (Noeth et al., 1996).
7.4.3 Major Bottlenecks and Problems
The major bottlenecks and problems are related to the same factors that occur in speech recognition. A major problem is the mismatch between the communication channel characteristics of the training and test conditions. Another bottleneck is the type of system to be created. When text-independent features are needed and acoustic units (e.g., phoneme-like units) are to be trained, or the training phase must be bootstrapped, a large amount of phonetically labeled data in each of the target languages is needed. On the other hand, when text-dependent systems are needed, a multilingual speech recognizer has to be built, and in the near future this is feasible only for subsets of languages.
Of the main sources of variability across languages, prosodic information has not been successfully incorporated into a Language ID system. This is also true in speech recognition: only recently, in the framework of Verbmobil (Niemann et al., 1997), has prosody been successfully integrated into a spoken language system.
Performance is not adequate for managing a large number of languages, whereas it is acceptable for a restricted class of languages (e.g., English, German, Spanish, Japanese) with clearly different cues (Corredor-Ardoy et al., 1997).
7.5 Where We Will Be in Five Years -- Language ID
7.5.1 Expected Capabilities
Following the trends of multilinguality in speech recognition, Language ID capabilities will increase in the coming years. Multilingual information systems need Language ID as a front end, both for technological reasons (performance would not be acceptable otherwise) and for multilinguality requirements. More difficult will be the development of a general-purpose Language ID system for special telecom services, such as 911 in the USA or 113 in Italy. The difficulties come from the high number of languages and from time constraints.
7.5.2 Expected Methods and Techniques (Breakthroughs)
Progress has generally been gradual. Comparing different systems presented in the literature is hard, given the different kinds and amounts of data used for testing and the different types of tests, e.g., the pairwise classification task versus the multiple-choice (e.g., closed-set over 12 languages) identification task. Nevertheless, the general trend shows accuracy improvements over time, with larger test data and improving acoustic and phonotactic modeling. The error rate depends on many factors: first on the duration of the utterance (15 seconds vs. 30 minutes), then on the architecture (a single acoustic model or several acoustic models), and finally on the type of classifier (pairwise or multiple-choice). Results of the NIST 1996 evaluation are reported in (Zissman, 1996).
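The two test conditions mentioned above reduce to different decision rules over the same per-language scores, as in this small illustration (the scores are invented):

```python
# Sketch of the two evaluation conditions, given per-language scores
# (e.g. log-likelihoods) for one utterance. Values are purely illustrative.
scores = {"en": -214.3, "de": -220.1, "es": -217.8, "ja": -225.4}

closed_set_choice = max(scores, key=scores.get)                  # multiple-choice ID
pairwise_choice = "en" if scores["en"] > scores["de"] else "de"  # pairwise task

print(closed_set_choice, pairwise_choice)
```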
Methods and techniques for Language ID will follow the trends in speech recognition and spoken language understanding. The integration of prosodic information, both at the feature extraction and decoding levels, will be the next important milestone. A potential improvement is expected from perceptual studies: knowing the human strategies will suggest suitable machine strategies (e.g., keywords or key phrases to be decoded at a particular level). The biggest difficulty is, and will be, the statistical modeling of such information, i.e., how to add such knowledge in a probabilistic framework. For text-independent Language ID, an improvement in the statistical modeling or adaptation of phones (Zissman, 1997) is very important, given its dependence on the manner (monologue, dialogue, etc.) of speech data collection.
7.5.3 Expected Bottlenecks
A major bottleneck, also in the future, will be data availability and data collection, both for training and for testing. Fast adaptation methods for channel normalization and efficient training will be critical in the near future. When such methods become available, a great expansion of real-world applications will occur, especially in the field of Language ID.
Another bottleneck will be the short period of time available for critical services, i.e., less than 30 seconds.
7.6 Where We Were Five Years Ago -- Spoken Language Translation
Early speech translation systems implemented in the eighties mainly served to demonstrate the feasibility of the concept of speech translation. Their main features included restricted domains, a fixed speaking style, severe limitations on grammatical coverage, and limited-size vocabularies. The system architecture was usually strictly sequential, involving speech recognition, language analysis and generation, and speech synthesis in the target language. Developed at industrial and academic institutions such as NEC, AT&T, ATR, Carnegie Mellon University, Siemens AG, University of Karlsruhe, and SRI, as well as in consortia, they represented a significant first step and demonstrated that multilingual communication by speech might be possible.
The VEST system (Roe et al., 1992), successfully demonstrated at EXPO'92 in Seville, was developed in a collaboration between AT&T and Telefonica in Spain. It used a vocabulary of 374 morphological entries and a finite state grammar for both language modeling and translation between English and Spanish in the domain of currency exchange.
NEC's Intertalker system, successfully demonstrated at GlobCom'92, handled utterances in the domains of concert ticket reservation and travel information. This system also used a finite state grammar for processing input sentences.
An interesting attempt to extend spontaneous multilingual human-machine dialogue to translation is a system developed at SRI in collaboration with Telia (Rayner et al., 1993). It is based on previously developed components from SRI's air travel information system (ATIS), interfaced with a generation component. The system's input language is English and it produces output in Swedish.
Speech translation encourages international collaborations. Prominent among these, the C-STAR I Consortium for Speech Translation Research was set up as a voluntary group of institutions. Its members, ATR Interpreting Telephony Laboratories (now ATR Interpreting Telecommunications Research Laboratories) in Kyoto, Japan; Siemens AG in Munich and the University of Karlsruhe (UKA) in Germany; and Carnegie Mellon University (CMU) in Pittsburgh, PA, USA, developed prototype systems that accept speech in each of the members' languages (English, German, and Japanese) and produce output text in all the others (Morimoto et al., 1993; Waibel et al., 1991; Woszczyna et al., 1994).
Another prominent collaboration is Verbmobil phase I, a large research effort sponsored by the BMFT, the German Ministry for Science and Technology (Wahlster, 1993; Kay et al., 1994; Niemann et al., 1997). Launched in 1993, the program sponsored over 30 German industrial and academic partners working on different aspects of the speech translation problem and delivering components for a complete speech translation system. Verbmobil is aimed at face-to-face negotiations rather than telecommunication applications, and assumes that the two participants have some passive knowledge of a common language, English. It aims to provide translation on demand for speakers of German and Japanese when they request assistance in an otherwise English conversation. Verbmobil is an eight-year project with an initial four-year phase.
7.6.1 Capability Now
The feasibility of speech translation depends mainly on the scope of the application. Applications such as voice-activated dictionaries are already feasible, while unrestricted simultaneous translation will remain impossible for the foreseeable future. Current research goals therefore lie between these extremes. The language and discourse modeling in these systems restricts what the user can talk about, and hence constrains the otherwise daunting task of modeling the world of discourse. There are no commercial speech translation systems on the market to date, but a number of industrial and government projects are exploring their feasibility.
Spoken language translation systems could be of practical and commercial interest when used to provide language assistance in some critical situations, such as between medical doctor and patient, in police assistance situations, or in foreign travel interactions such as booking hotels, flights, car rentals, getting directions, and so on.
Spoken language translation, even in limited domains, still presents considerable challenges, which are the object of research in several large research undertakings around the world.
7.6.2 Major Methods, Techniques, and Approaches
The translation of spoken language (unlike text) is complicated by ill-formed speech, human noise (coughing, laughter, etc.) and non-human noise (door slams, telephone rings, etc.), and has to cope with speech recognition errors. The spoken utterance is not segmented into words like text, and often contains information that is irrelevant to the given application and should not be translated. In everyday person-to-person dialogues, even simple concepts are expressed in quite different ways. A successful system should therefore be capable of interpreting the speaker's intent, instead of literally translating the speaker's utterances, in order to produce an appropriate message in the target language.
There are many approaches to spoken language translation; some of them are also mentioned in
Chapter 5. They can roughly be divided into two classes: direct approaches that try to link speech recognition and machine translation techniques, and interlingual approaches that try to embed both recognition and understanding in a common, consistent framework. Both have many different instantiations. An example of the first class is followed at ATR (Iida, 1998). The recognizer outputs a word graph that is directly accepted by an MT system using a chart parser mechanism with an integrated translation method. This framework can be called cooperative integrated translation, simultaneously executing both example-based translation and dependency structure analysis. In this context, the basic technologies would include the calculation of logical forms instead of syntactic parsing, abduction instead of deduction, a chart structure under graph connection instead of a structure under linear connection, maximally partial understanding and translation instead of definitive understanding, creating and roughly handling images that correspond to object representations with feature properties (frame representations), and flexible generation adaptable to situations (courteous conversation, customer-clerk conversation, conversation for persuasion, etc.). Other approaches include Example-Based Translation (EBMT) (Iida, 1998), which improves portability and reduces development cost through the use of large parallel corpora. Robust transfer approaches are also explored, with robust and stochastic analysis to account for fragmentary input.
An example of the second class is followed in the framework of Verbmobil phase II and C-STAR II. Present activity has shifted toward a greater emphasis on interpretation of spoken language, i.e., the system's ability to extract the intent of a speaker's utterance (Bub and Schwinn, 1996). Several institutions involved in C-STAR therefore stress an interlingual representation and the development of generation components from the given interlingual representation (CMU, UKA, ETRI, IRST, and CLIPS) (Angelini et al., 1997). Each C-STAR II partner builds a complete system that, at the very least, accepts input in one language and produces output in another language of the consortium. In a multinational consortium, building full systems maximizes the technical exchange between the partners while minimizing costly software/hardware interfacing work.
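To make the interlingual idea concrete, here is a deliberately schematic sketch: analysis maps an utterance to a language-neutral frame, and per-language generators realize the frame. The frame slots, the single analysis rule, and the output templates are invented for illustration and bear no relation to the actual C-STAR or Verbmobil interlinguas.

```python
# Schematic sketch of interlingua-based translation: analysis produces a
# language-neutral frame, and a per-language generator realizes it. The frame
# slots, rules, and templates below are invented purely for illustration.
import re

def analyze(utterance):
    """Map a (cleaned-up) source utterance to an interlingua frame."""
    m = re.search(r"room for (\d+) night", utterance)
    if m:
        return {"speech_act": "request", "concept": "hotel_reservation",
                "nights": int(m.group(1))}
    return {"speech_act": "unknown"}

generators = {
    "de": lambda f: f"Ich möchte ein Zimmer für {f['nights']} Nächte reservieren.",
    "ja": lambda f: f"{f['nights']}泊の部屋を予約したいのですが。",
}

frame = analyze("uh I'd like a room for 3 nights please")
if frame["speech_act"] == "request":
    print(generators["de"](frame))
```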
Discourse and domain knowledge and prosodic information are being explored, for more robust interpretation of ambiguous utterances.
7.6.3 Major Bottlenecks and Problems
Speech translation involves aspects of both person-to-person and person-machine dialogues. Person-to-person speech contains more dysfluencies, more speaking rate variation, and more coarticulation, resulting in lower recognition and understanding rates than person-machine speech, as indicated by experiments over several speech databases in several languages. Further technological advances are required, based on a new common speech and language processing strategy that results in a closer integration of the acoustic and linguistic levels of processing.
To develop more practical spoken language translation systems, greater robustness is needed in the modeling and processing of spontaneous, ill-formed speech. The generation of an interlingua representation, which requires an underlying semantic analysis, is typically done in a specific context. As long as a speaker stays within the expected bounds of the topic at hand (such as appointments or train schedules), a system that can process topic-related utterances will be satisfactory, even if it fails when given more general input.
In response to these general constraints, approaches applied to generate interlingua representations are typically those that allow semantic (that is, topic-related) and syntactic information to be captured simultaneously in a grammar (Fillmore, 1968; Bruce, 1975). These robust semantic grammars are very popular, as they provide a convenient method for grouping related concepts and features while allowing syntactic patterns that arise repeatedly to be shared easily. Semantic grammars are usually parsed using some method that allows a subset of the input words to be ignored or skipped.
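A minimal sketch of the word-skipping idea follows: each concept is a keyword pattern that must occur in order, and all other words in the spontaneous input are simply ignored. The patterns are toy examples, not an actual Phoenix or GLR* grammar.

```python
# Minimal sketch of robust semantic-grammar parsing: each concept is a sequence
# of keywords that must appear in order, and any other input words are skipped.
# The patterns are toy examples, not an actual Phoenix or GLR* grammar.
def match_concept(words, pattern):
    """Return True if the pattern words occur in order, skipping extra words."""
    it = iter(words)
    return all(any(w == p for w in it) for p in pattern)

grammar = {
    "meet_day":  ["meet", "on", "tuesday"],
    "meet_time": ["at", "two"],
}

utterance = "uh could we maybe meet you know on tuesday at two ish".split()
concepts = [c for c, pat in grammar.items() if match_concept(utterance, pat)]
print(concepts)   # ['meet_day', 'meet_time']
```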
However, the achieved robustness may turn out to be a drawback. A grammar formalism that is based on a purely semantic analysis may ignore important information that is conveyed by syntactic relations. Ignoring syntactic information prevents a sufficiently detailed interlingua representation, which is necessary for a smooth translation.
A number of research activities aiming at the translation of spontaneous speech are under way. Several industrial and academic institutions, as well as large national research efforts in Germany and Japan, are now working on this problem. The goals are oriented towards removing the limitations of a fixed vocabulary and of requiring the user to produce well-formed (grammatical) spoken sentences, and towards accepting spontaneous spoken language in restricted domains. One example system that aims to integrate both the robustness of semantic analysis and the smoothness of translation is Janus, a speech translation system that processes spontaneous human-to-human dialogs in which two people from different language groups negotiate to schedule a meeting (Waibel et al., 1996). The system operates in a multilingual appointment scheduling task and focuses on the translation of spontaneous conversational speech in a limited domain in different languages. Janus explores several approaches for generating the interlingua, including a robust version of the Generalized Left-Right parser, GLR* (Lavie and Tomita, 1993), which is similar to Lexical Functional Grammar based parsers, and an extension of the CMU Phoenix parser (Ward and Issar, 1995) that uses a robust semantic grammar formalism. This integration is intended to provide high-fidelity translation whenever possible, and robust parsing when facing ill-formed or misrecognized input.
Evaluation in spoken language translation is a bottleneck, both with regard to methodology and effort required.
Generation and synthesis are also topics of interest. Current concatenative speech synthesis technology is reasonably good; at least, it works better than the speech recognition and MT components do.
7.7 Where We Will Be in Five Years -- Spoken Language Translation
7.7.1 Expected Capabilities
A major target for the future is the Portable Translator Application. The desiderata for this helpful device include physical portability, real-time operation, good human factors design, and coverage of as many (minor) languages as possible. Rapid development will also be a necessary key feature for the success of such a device, as is being investigated by the DIPLOMAT project (Frederking et al., 1997). The spoken input has to be managed as well as possible in order to deal with degraded input, due mainly to spontaneous speech dysfluencies and speech recognizer errors.
7.7.2 Expected Methods and Techniques (Breakthroughs)
To achieve greater portability across domains it is mandatory to improve language component reusability. Most state-of-the-art translation systems apply rule-based methods and are typically well tuned to limited applications. However, manual development is costly, as each application requires its own adaptation or, in the worst case, a completely new implementation. More recently there has been interest in extending statistical modeling techniques from the acoustic and syntactic levels of speech recognition systems to other levels, such as the modeling of the semantic content of the sentence. As discussed in
Chapter 1, most language work still requires a great amount of resources; for example, language models and grammar development require large amounts of transcribed data within each domain. On the other hand, acoustic models can be reused to a certain extent across domains, and multilingual acoustic models are promising.
The limitation to restricted domains of discourse must be relaxed for real applications. Intermediate goals might be given by large domains of discourse that involve several subdomains. Integration of subdomains will need to be studied. The C-STAR II and Verbmobil II projects aim to demonstrate the feasibility of integrating subdomains by merging appointment scheduling, hotel reservation, transport information, and tourist information tasks into one large travel planning domain.
Finally, better person-computer communication strategies have to be developed if the purpose is to interpret the speaker's intent rather than to translate the speaker's words literally. A useful speech translation system should be able to inform the user about misunderstandings, offer and negotiate alternatives, and handle interactive repairs. As a consequence, an important requirement is a robust model of out-of-domain utterances.
7.7.3 Expected Bottlenecks
Optimal translation systems need to integrate and balance four issues that are sometimes contradictory: the robustness of the translation system versus the correctness and smoothness of the translation, and application-specific tuning versus portability of the system to new applications.
An expected bottleneck is evaluation, including the appropriate corpora required.
A second bottleneck, more complex than the first, is the development of the interlingua approach, mainly across languages of different heritage, including Western and Asian languages.
7.8 Juxtaposition of this Area with Other Areas and Fields
The areas discussed in this chapter relate closely to several other areas of Language Processing in fairly obvious ways. Language and speaker recognition may be used together to route messages. Language recognition may serve as a front end for further speaker or speech processing. Speaker recognition may assist speech recognition. Speaker and speech recognition may be used together for access control. Routing (emergency) telephone calls is an important language application.
7.9 The Treatment of Multiple Languages in this Area
Multilingual applications are the aim of this research area. To meet the challenges in developing multilingual technology, an environment and infrastructure must be developed. In contrast to research fostered and supported at the national level, multilingual research tends to involve cooperation across national boundaries. It is important to define and support efficient international consortia that agree to jointly develop such mutually beneficial technologies. An organizational style of cooperation with little or no overhead is crucial, involving groups who are in a position to build complete speech translation systems for their own language. There is a need for common multilingual databases and for data involving foreign accents. Moreover, better evaluation methodology over common databases is needed to assess the performance of speech translation systems in terms of accuracy and usability. Research in this direction needs to be supported more aggressively across national boundaries.
7.10 References
Angelini, B., M. Cettolo, A. Corazza, D. Falavigna, and G. Lazzari. 1997. Person to Person Communication at IRST. Proceedings of ICASSP-97 (91-94). Munich, Germany.
Bruce, B. 1975. Case Systems for Natural Language. Artificial Intelligence vol. 6 (327-360).
Brunelli, R. and D. Falavigna. 1995. Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(10) (955-966). Also IRST Technical Report, 1994.
Bub, W.T. and J. Schwinn. 1996. Verbmobil: The Evolution of a Complex Large Speech-to-Speech Translation System. Proceedings of ICSLP (2371-2374).
Campbell, J.P. 1997. Speaker Recognition: A Tutorial. Proceedings of the IEEE 85(9) (1437-1462).
Corredor-Ardoy, C., J.L. Gauvain, M. Adda-Decker, and L. Lamel. 1997. Language Identification with Language-Independent Acoustic Models. Proceedings of EUROSPEECH, vol. 1 (55-58). Rhodes, Greece.
Fillmore, Ch. J. 1968. The Case for Case. In E. Bach and R.T. Harms (eds.), Universals in Linguistic Theory. Holt, Rinehart and Winston Inc. (1-90).
Frederking, R., A. Rudnicky, and C. Hogan. 1997. Interactive Speech Translation in the DIPLOMAT Project. Proceedings of the Workshop on Spoken Language Translation at ACL/EACL-97. Madrid, Spain.
Gish, H. and M. Schmidt. 1994. Text-independent speaker identification. IEEE Signal Processing Magazine 11 (18-32).
Gish, H., M. Schmidt, and A. Mielke. 1994. A robust segmental method for text-independent speaker identification. Proceedings of ICASSP-94, vol. 1 (145-148). Adelaide, South Australia.
Gnanadesikan, R. and J.R. Kettenring. 1989. Discriminant analysis and clustering. Statistical Science 4(1) (34-69).
Iida, H. 1998. Speech Communication and Speech Translation. Proceedings of the Workshop on Multilingual Information Management: Current Levels and Future Abilities. Granada, Spain.
Kay, M., J.M. Gawron, and P. Norvig. 1994. Verbmobil: A Translation System for Face-to-Face Dialog. CSLI Lecture Notes No. 33, Stanford University.
Lavie, A. and M. Tomita. 1993. GLR*--An Efficient Noise Skipping Parsing Algorithm for Context Free Grammars. Proceedings of IWPT-93 (123-134).
Martin, A. and M. Przybocki. 1998. NIST speaker recognition evaluation. Proceedings of the First International Conference on Language Resources and Evaluation (331-335). Granada, Spain.
Minker, W. 1998. Semantic Analysis for Automatic Spoken Language Translation and Information Retrieval. Proceedings of the Workshop on Multilingual Information Management: Current Levels and Future Abilities. Granada, Spain.
Morimoto, T., T. Takezawa, F. Yato, S. Sagayama, T. Tashiro, M. Nagata, and A. Kurematsu. 1993. ATR speech translation system: ASURA. Proceedings of the Third Conference on Speech Communication and Technology (1295-1298). Berlin, Germany.
Muthusamy, Y.K., E. Barnard, and R.A. Cole. 1994a. Reviewing Automatic Language Identification. IEEE Signal Processing Magazine.
Muthusamy, Y.K., R.A. Cole, B.T. Oshika. 1992. The OGI Multi-Language Telephone Speech Corpus. Proceedings of the International Conference on Spoken Language Processing.
Muthusamy, Y.K., N. Jain, and R.A. Cole. 1994b. Perceptual Benchmarks For Natural Language Identification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
Nakagawa, S., T. Seino, and Y. Ueda. 1994. Spoken Language Identification by Ergodic HMMs and its State Sequences. Electronics and Communications in Japan, part 3, 77(6) (70-79).
Niemann, H., E. Noeth, A. Kiessling, R. Kompe, and A. Batliner. 1997. Prosodic Processing and its Use in Verbmobil. Proceedings of ICASSP-97 (75-78). Munich, Germany.
Noeth, E., S. Harbeck, H. Niemann, and V. Warnke. 1996. Language Identification in the Context of Automatic Speech Understanding. Proceedings of the 3rd Slovenian-German Workshop on Speech and Image Understanding (59-68). Ljubljana, Slovenia.
Rayner, M. et al. 1993. A speech to speech translation system built from standard components. Proceedings of the 1993 ARPA Human Language Technology Workshop. Princeton, New Jersey.
Reynolds, D.A. 1995. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication 17 (91-108).
Reynolds, D.A. and R. Rose. 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 3(1) (72-83).
Roe, D.B., F.C. Pereira, R.W. Sproat, and M.D. Riley. 1992. Efficient grammar processing for a spoken language translation system. Proceedings of ICASSP-92, vol. 1 (213-216). San Francisco.
Wahlster, W. 1993. Verbmobil: translation of face-to-face dialogs. Proceedings of the Fourth Machine Translation Summit (127-135). Kobe, Japan.
Waibel, A., A. Jain, A. McNair, H. Saito, A. Hauptmann, and J. Tebelskis. 1991. JANUS: a speech-to-speech translation system using connectionist and symbolic processing strategies. Proceedings of ICASSP-91, vol. 2 (793-796). Toronto, Canada.
Waibel, A., M. Finke, D. Gates, M. Gavaldà, T. Kemp, A. Lavie, M. Maier, L. Mayfield, A. McNair, I. Rogina, K. Shima, T. Sloboda, M. Woszczyna, T. Zeppenfeld, and P. Zahn. 1996. JANUS-II: Translation of Spontaneous Conversational Speech. Proceedings of ICASSP (409-412).
Ward, W. and S. Issar. 1995. The CMU ATIS System. Proceedings of the ARPA Workshop on Spoken Language Technology (249-251).
Woszczyna, M., N. Aoki-Waibel, F.D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C.P. Rose, T. Schultz, B. Suhm, M. Tomita, and A. Waibel. 1994. Towards spontaneous speech translation. Proceedings of ICASSP-94, vol. 1 (345-349). Adelaide, South Australia.
Zissman, M.A. 1996. Comparison of Four Approaches to Automatic Language Identification of Telephone Speech. IEEE Transactions on Speech and Audio Processing 4(1).
Zissman, M.A. 1997. Predicting, Diagnosing and Improving Automatic Language Identification Performance. Proceedings of EUROSPEECH, vol. 1 (51-54). Rhodes, Greece.
Websites
Projects
C-STAR: http://www.is.cs.cmu.edu/cstar/
SQEL: http://faui56s1.informatik.unierlangen.de:8080/HTML/English/Research/Projects/SQEL/SQEL.htm
VERBMOBIL: http://www.dfki.de/verbmobil/