Speech recognition is the most computationally expensive part of the speech-to-speech translation process. Unless a decoder is specifically designed for a PDA platform, which has limited memory bandwidth and no floating-point hardware, recognition will likely be too slow for practical use.
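To make the constraint concrete, the following is a minimal sketch (not the actual decoder implementation) of how a Gaussian log-likelihood, the inner loop of HMM decoding, can be evaluated with integer-only arithmetic on hardware without floating point. The Q10 fixed-point format and all helper names here are illustrative assumptions.

```python
# Illustrative fixed-point Gaussian scoring for a floating-point-free PDA.
# Values use Q10 fixed point: an integer n represents the real number n / 2**10.

Q = 10          # number of fractional bits
ONE = 1 << Q    # fixed-point representation of 1.0

def to_fixed(x: float) -> int:
    """Quantize a float to Q10 fixed point (done offline, at model-build time)."""
    return int(round(x * ONE))

def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q10 values and renormalize the result back to Q10."""
    return (a * b) >> Q

def log_likelihood(obs, means, half_inv_vars, log_const):
    """Diagonal-covariance Gaussian log score:
    log_const - sum_d (x_d - mu_d)^2 * (1 / (2 * sigma_d^2)),
    computed entirely with integer adds, subtracts, multiplies and shifts."""
    score = log_const
    for x, mu, hiv in zip(obs, means, half_inv_vars):
        d = x - mu
        score -= fixed_mul(fixed_mul(d, d), hiv)
    return score
```

In a real port, the quantization of means and variances would happen once when the models are compiled for the device, so the runtime loop touches only integers.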
Multimodal Technologies Inc. has been working on small-footprint, fast HMM-based recognition decoders for some years, and has significant experience with multiple languages and speech-to-speech translation systems.
The audio input devices on PDAs are not of high quality. Given the small size of the hardware, the audio channel commonly picks up substantial electrical noise from the power supply and motherboard, so recordings on these devices are not clean. Furthermore, in our experience the amount of noise in the audio channel may differ from device to device. Digitizing the audio externally, for example with an off-device USB audio unit, or designing better shielding around the audio hardware, might be options in the long term, but our goal was to use standard PDAs, so such alternatives were not available.
The acoustic models were bootstrapped from the GlobalPhone [3] Arabic collection as well as the recordings described above. The data contains both male and female speakers, though we have tested more extensively with male speakers than with female speakers.
As the Speechalator is a domain-based translation system, we exploit that constraint in the recognition engines. Rather than using a separate language model followed by a parser, as we have done in other translation systems we have built [7], we integrate the parsing stage into the recognizer's language model. This makes the decoder more efficient, allowing us to handle larger vocabularies and more utterance types than would otherwise be possible.
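The idea of folding the parser into the language model can be sketched as follows: a small domain grammar is expanded into the set of word sequences the decoder is allowed to hypothesize, so any recognized path is by construction already a parse. The toy grammar and function names below are illustrative assumptions, not the system's actual medical-domain grammar.

```python
# Toy context-free domain grammar: nonterminals map to lists of productions.
# Any symbol not in the table is a terminal word.
GRAMMAR = {
    "S": [["GREETING"], ["WHERE_PAIN"]],
    "GREETING": [["hello"], ["good", "morning"]],
    "WHERE_PAIN": [["where", "does", "it", "hurt"]],
}

def expand(symbol):
    """Enumerate every word sequence derivable from a grammar symbol.
    In a real system this network would be compiled into the decoder's
    search graph rather than enumerated, but the constraint is the same:
    only in-grammar, pre-parsed utterances can be recognized."""
    if symbol not in GRAMMAR:            # terminal word
        return [[symbol]]
    results = []
    for production in GRAMMAR[symbol]:
        seqs = [[]]
        for sym in production:           # concatenate expansions left to right
            seqs = [s + t for s in seqs for t in expand(sym)]
        results.extend(seqs)
    return results
```

Because the nonterminal that generated each path is known, the semantic class of an utterance (here, GREETING versus WHERE_PAIN) falls out of recognition for free, which is what removes the need for a separate parsing pass.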
The final part of the recognition system is adaptation to the acoustic environment and speaker. This is fairly standard in most recognition engines, and we adopt such techniques here as well.