Disclaimer: Information provided here may not represent the views of Carnegie Mellon University or the CMU Sphinx Group.

Sphinx FAQ

First created in March 2004

Copyright by Arthur Chan

Author : Arthur Chan

I wrote this FAQ for speech recognition (mainly for Sphinx) because I think the existing FAQs may not be the most comprehensive from the point of view of a researcher/developer of speech recognition. There are also many misconceptions out there, even among engineers/programmers with long experience. That is quite scary. Hence, in this FAQ I try to hit some misconceptions very hard, and hopefully they will be gone from the world. :-)

Disclaimer again: this is called "Arthur's Sphinx Manual", not the "Official Sphinx Manual". Yell at me.

1, Why is the performance of Sphinx X's default models so poor?

Many people yell when they find that the bundled models of Sphinx don't work well. Most of the time I get these responses from laymen of speech recognition; sometimes, surprisingly, from hard-core hackers in the field.

The official answer to this question is "Of course!": if you train a model under conditions different from your target platform, you will not get ideal performance. "Garbage in, garbage out." Sphinx's acoustic models were trained under some special conditions, for example, on the Wall Street Journal task initiated by NIST in 1992, where speakers were invited to read news from the Wall Street Journal, and later many sites tried their best to transcribe those recordings automatically.

Now, consider this: what would happen if a developer used this model (acoustic model and language model) in their task, let's say a task that requires recognition of only digit strings? The developer would be very disappointed to see that the famous WSJ model gives very poor performance.

Why? From the language modeling point of view, this is because the language model covers a 20,000-word vocabulary, and the chance that a string of digits appears under this language model is very low.

From the acoustic modeling point of view, the model was trained on a large amount of data that is very different from the target application. As a result, the resulting Gaussian distributions may not be sharp enough to do well on a digit task.
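To make the language-model point concrete, here is a minimal sketch of how a bigram LM scores an in-domain sentence versus a digit string. Every word and probability in it is invented for illustration; none of this comes from a real WSJ model:

    import math

    # Toy bigram probabilities, loosely imitating a news-domain LM.
    # Every number here is invented for illustration only.
    bigram = {
        ("<s>", "the"): 0.2, ("the", "stock"): 0.05,
        ("stock", "fell"): 0.1, ("fell", "</s>"): 0.3,
    }
    UNSEEN = 1e-7  # floor probability for bigrams the LM never saw

    def log_prob(words):
        """Sum log P(w_i | w_{i-1}) over the whole sentence."""
        words = ["<s>"] + words + ["</s>"]
        return sum(math.log(bigram.get(pair, UNSEEN))
                   for pair in zip(words, words[1:]))

    print(log_prob("the stock fell".split()))  # in-domain: mild penalty
    print(log_prob("nine one one".split()))    # all bigrams unseen: huge penalty

The digit string is not impossible under such an LM, just so improbable that the decoder will almost always prefer some acoustically similar in-domain words instead.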

No pain, no gain. Currently there is no public archive of acoustic models; our job is to build one. Before that, help us train more acoustic models. Developers of later generations will thank all of us.

Also take a look at my answer to "Why can company X do that well, but not Sphinx?"

2, I talked to the speech recognizer and said the same thing. Why doesn't Sphinx give me the same answer?

This happens with every speech recognizer. :-) Speech is a random signal, believe it or not. The waveform of the "hello" you spoke 5 minutes ago can look entirely different from that of the "hello" you speak now. Hence, a speech recognizer matches this random signal against some known models (and the models are probabilistic in nature, too). So unless you set the grammar to allow only one word (in which case, yes, you always get the same answer), there is still a chance that you will get two different answers even though you said the same thing.
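To see why, here is a minimal sketch of the decision the decoder makes. All the scores are invented for illustration and are not from any real Sphinx decoder: the language-model scores stay fixed, but the acoustic scores differ between two takes of the same word, and a small difference is enough to flip the winner:

    import math

    # Invented log-domain scores for two competing word hypotheses.
    # LM scores stay fixed; acoustic scores change with each utterance
    # because the speech signal itself is never the same twice.
    lm_score = {"hello": math.log(0.6), "fellow": math.log(0.4)}

    def decode(acoustic_score):
        """Return the word with the best acoustic + LM score."""
        return max(acoustic_score,
                   key=lambda w: acoustic_score[w] + lm_score[w])

    # Two takes of the same spoken "hello":
    print(decode({"hello": -10.0, "fellow": -10.5}))  # -> hello
    print(decode({"hello": -11.0, "fellow": -10.5}))  # -> fellow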

3, Training using SphinxTrain is hard. This stuff should be simple enough that everyone can do it.

Yes, why not? Anyone who has gone through training knows that it takes a lot of time to prepare. You need to be pretty fluent in shell/Perl scripting because there is usually a lot of text editing. You also need to make sure all the data match up.

I guess one key issue here is that training is actually the hardest part of a speech researcher's learning. Even in the field, many people don't know how to do training at all. Those who know how to train seldom understand the math. Those who understand the math, well, they are already pretty good at this stuff. Why bother to make it easy?

Another key issue is that there is no single common format for speech transcription files yet. Why? Well, some researchers want to do research A, which requires tags X, Y, and Z. Other researchers don't want those tags, so what can they do? Most likely, they will just write a sh/Perl script to do the conversion. Frankly, if you know this stuff, it usually takes you less than one minute to do something simple like lower-case to upper-case conversion or a replacement. (You don't believe it? You must have used Microsoft Windows for a long time. ;-))
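For instance, that one-minute job might look like the following sketch. It is written in Python rather than sh/Perl, and the <noise> tag and the file handling are made up for illustration:

    import sys

    # Upper-case each transcript line and strip a hypothetical <noise> tag.
    # Usage: python fix_trans.py < old.transcription > new.transcription
    for line in sys.stdin:
        line = line.upper().replace("<NOISE>", "")
        print(" ".join(line.split()))  # normalize leftover whitespace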

Well, to get back to the question: I think training could be somewhat simpler if someone wrote a GUI for it. Researchers just have no time, and companies just don't think it is profitable. So it really takes people's spare time to work on this.

4, "There is something wrong in SphinxTrain, it took me an hour to get an acoutic model.", what's wrong?

(to be written.)

5, Why can company X do that well, but not Sphinx?

You may say, well, company X can do it very well! Why not Sphinx? I guess the main reason is that CMU is an academic institution, and we put much more focus on fundamental research and on evaluating new scenarios.

One may say that fewer than 10 speech companies can really survive in the world. Let me use ViaVoice as an example. ViaVoice spent a lot of effort to 1) collect data, 2) clean up the data, and 3) research how to make use of this data. They spent long years doing this, and they dedicated themselves only to the dictation domain. They have different LM data sets for different users, such as medical transcription and email writing. They spent a lot of time building acoustic models just for the dictation task.

In this case, how could you compare Sphinx with these recognizers? The dictation task has definitely not been an academic focus for a long, long time.

Another issue in speech recognition is that only a few sites share source code for large-vocabulary continuous speech recognizers, and even fewer share models and training data. We (CMUers) are quite dedicated to solving this problem. If you have a model that was trained for research purposes using SphinxTrain and there are no IP issues on your side, we suggest you contribute the model/scripts/data to the community. Yes, we need more ready-made models for 1) different environments, 2) different scenarios, and 3) different languages. These models will be used by researchers/developers around the world. Please send me a mail if you can.

6, Why is Sphinx not GPLed?

(to be written.)

7, Why does Sphinx have no manual?

This is a problem, and we are currently working on it. Please also understand our difficulty: it is not that easy to write a manual.

8, Why doesn't Sphinx have feature X?

Many people complain to me that there is not enough documentation on using Sphinx. Well, when you think about it, such documentation is actually not easy to produce in the first place. Sphinx development was carried out by many developers, each of whom wrote a small piece of the code. Hence, no one can really describe the big picture of everything. Anyone who did understand it all would surely be very busy on other projects.

Many people have also tried to compare Sphinx with other recognizers. Well, I don't like to make too many statements before I have enough evidence. Most of the time such statements come from beginners, and they usually end up comparing apples and oranges. There are many issues, very specific to recognizer design, where trade-offs have to be made, and these result in the different architectures and characteristics of each recognizer. If one considers all these factors, one tends to see many good recognizers rather than just simple differences.

Let me try to give you an example of what I mean. I regard HTK and ISIP as great open-source projects. I was an HTK user for a couple of years. It is great to use and has a lot of functionality. However, when I delved into the source code, I found that it is very hard to change. Partially that is because I don't know how to use tools such as etags, grep, and emacs. However, a hypothesis I formed after reading that page seems more likely to me: making the recognizer easy to use and modular meant that a lot of complexity had to be encoded inside the software.

Do you call this bad? I guess there are always two sides to the coin. HTK allows a lot of beginners to try out acoustic modeling ideas, and many papers do cite it. I think, as a whole, it stimulates the development of the field. However, it also has limitations that one still needs to work on. They do work on them a lot. That is good for the field.

Sphinx is just the same. There are always some limitations in a recognizer. Say some people claim that the use of multiplexed cross-word triphones is black magic, and that what we should do instead is full fan-in and fan-out (like HTK). In reality, all commercial recognizers (cool or not) have some black magic. If a recognizer did full fan-in and fan-out, it would definitely not be able to run in real time. HTK can do it partially because it provides an extra mechanism to prune the cross-word triphone arcs (I am not sure whether it is the fan-in or the fan-out). Sphinx did it the other way because, in practice, the multiplexed cross-word triphones don't hurt. Well, you may still argue that this hurts accuracy. That is just like arguing that heuristic search is not useful. In reality, everyone knows heuristics are useful, and they appear everywhere.
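As an aside on heuristic search: the classic example in decoding is beam pruning, which throws away any partial hypothesis whose score falls too far behind the current best. The sketch below uses invented hypotheses and scores; it is not Sphinx's actual pruning code:

    def beam_prune(hyps, beam=5.0):
        """Keep only hypotheses whose log-score is within `beam`
        of the current best; everything else is thrown away."""
        best = max(score for _, score in hyps)
        return [(w, s) for w, s in hyps if s >= best - beam]

    # Invented partial hypotheses with made-up log-scores.
    hyps = [("the stock", -10.0), ("the sock", -12.5), ("thus talk", -17.0)]
    print(beam_prune(hyps))  # "thus talk" falls outside the beam: pruned

The trade-off is exactly the one described above: you occasionally risk pruning the correct hypothesis, but in exchange the search can run in real time.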

I admit that more work needs to be done to make Sphinx more approachable. We are making an effort on it. Drop me a mail if you want to help, and tell me your comments. I am more than willing to listen.