Folks, today you are going to build the first Janus speech recognizer! It will be a very small recognition engine with simple context-independent acoustic models, which we will borrow from another engine. For now we won't be able to do a real live demo where you speak into it and so on... but hey - it's a start! The goal of this exercise is to get familiar with all the Janus objects you need to get the recognizer running and to learn more about the Janus-script language.
Repeat the formula for the multinomial multivariate Gaussian distribution from the Janus Tutorial.
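If you do not have the tutorial at hand: the density meant here is the usual weighted mixture of multivariate Gaussians defined by the codebook (means and covariances) and the mixture weights. A sketch in standard notation (the symbols are not necessarily the ones the tutorial uses):

    p(x) = \sum_{k=1}^{K} c_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)
         = \sum_{k=1}^{K} c_k \, \frac{1}{\sqrt{(2\pi)^d \, |\Sigma_k|}}
           \exp\left( -\frac{1}{2} (x-\mu_k)^\top \Sigma_k^{-1} (x-\mu_k) \right),
    \qquad \sum_{k=1}^{K} c_k = 1

where x is the d-dimensional feature vector, the mu_k and Sigma_k are the K codebook means and covariances, and the c_k are the mixture weights.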
Task 11: Create a small codebook, mixture weights, and a reference vector. Get the value of the multinomial distribution for this reference vector by following the instructions below:
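As a sanity check you can also compute such a value by hand in plain Tcl, outside of the Janus objects. The following sketch uses a made-up 2-dimensional codebook with two reference vectors, diagonal covariances, and mixture weights; all numbers are assumptions, not values from the tutorial:

    # toy codebook: two 2-dimensional reference vectors (means),
    # diagonal covariances and mixture weights -- all values made up
    set means   {{0.0 0.0} {1.0 2.0}}
    set vars    {{1.0 1.0} {0.5 0.5}}   ;# diagonal covariance entries
    set weights {0.6 0.4}
    set x       {0.5 1.0}               ;# the vector to score

    set pi 3.14159265358979
    set p  0.0
    foreach mu $means var $vars w $weights {
        # evaluate one diagonal-covariance Gaussian at x
        set exponent 0.0
        set det      1.0
        foreach xi $x mi $mu vi $var {
            set d        [expr {$xi - $mi}]
            set exponent [expr {$exponent + $d * $d / $vi}]
            set det      [expr {$det * $vi}]
        }
        set dim  [llength $x]
        set norm [expr {1.0 / sqrt(pow(2 * $pi, $dim) * $det)}]
        set g    [expr {$norm * exp(-0.5 * $exponent)}]
        # add the weighted component to the mixture density
        set p    [expr {$p + $w * $g}]
    }
    puts "mixture density p(x) = $p"

Comparing such a hand computation with the value Janus reports is a good way to convince yourself that you understand what the codebook and distribution objects actually store.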
Question 8: Reconsider the most important Janus objects, their purposes, and their relations and dependencies, and then answer the following questions:
Typing the name of an object followed by a . gives you all subobjects as output. Typing the name of an object followed by a : gives you all names of the subobjects.
Set up the environment
You will now set up your training and test environment and arrange the files and data you already prepared in the last sessions. For information about the database we are using, see also Janus Tutorial Step1.
Initialize weights / Run first Viterbi
Read Janus Tutorial Step3 and follow the instructions. Run the commands interactively in your directory step3. Try out some alternative parameters, apply the Viterbi to different utterances, and get familiar with this procedure.
To avoid typing all the commands again and again, it is very useful to write a startup script. Instead of typing the lines or cutting and pasting them, you can now source this script by typing % source startup.tcl at the Janus prompt.
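A minimal sketch of what such a startup script might look like; the file names and variables below are placeholders (assumptions), and the real content is simply the sequence of commands you typed interactively in step3, in the same order:

    # startup.tcl -- re-creates the interactive environment in one go.
    # All paths and names below are placeholders; replace them with
    # whatever you actually use in your step3 directory.
    set dataDir  ../data                 ;# hypothetical data directory
    set descDir  ../desc                 ;# hypothetical description files

    # paste here (or source from separate files) the object-creation
    # commands from your interactive session, e.g.:
    #   source createObjects.tcl         ;# hypothetical helper script

    puts "startup.tcl done: environment loaded (data in $dataDir)"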
Get recognition results
In order to test the resulting recognizer we need a Language Model and a vocabulary.
Task 12: Reuse your script countPairs.tcl from Session 1, which produces word pairs. Modify your script so that it produces a language model according to the NIST specification (see below; a possible Tcl sketch follows the format example). The language model consists of one entry for each word (unigram) and word pair (bigram), together with the probability of the word/word pair in the training corpus.
Question 9: Look into the resulting language model file and find the bigram with the highest probability. Which one is it? Is it reasonable?

NIST Language Model:
comments
\data\
ngram 1=Number of Unigrams
ngram 2=Number of Bigrams

\1-grams:
log(p(word)) word -99.9
...
log(p(word)) word -99.9

\2-grams:
log(p(word2|word1)) word1 word2
...
log(p(word2|word1)) word1 word2

\end\

Assuming we have a corpus of the following three sentences
<s> B C A </s>
<s> A A B </s>
<s> C A A </s>

then a NIST language model could look like:
\data\
ngram 1=5
ngram 2=9

\1-grams:
-1.18045354881 </s> -99.9
-0.700418578079 <s> -99.9
-0.477989694795 A -99.9
-0.87723631318 B -99.9
-0.87723631318 C -99.9

\2-grams:
-0.481485034036 <s> A
-0.481485034036 <s> B
-0.481485034036 <s> C
-0.400116075245 A </s>
-0.400116075245 A A
-0.703333310875 A B
-0.305394150245 B </s>
-0.305394150245 B C
-0.00217691461508 C A

\end\

The NIST format allows many different kinds of language models. In our experiments we are using a very simple one. You can think of others, such as trigram language models using different back-off schemes, in which the back-off factors differ from the -99.9 used above, and much more fancy stuff.
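A sketch of how the modified countPairs.tcl could turn unigram and bigram counts into this file format. The corpus file name, variable names, and the plain relative-frequency probabilities (no smoothing or discounting, so the numbers need not exactly reproduce the example above) are assumptions; adapt them to your own script:

    # makeLM.tcl -- write a NIST-style unigram+bigram language model
    # from a corpus file with one sentence per line (file name assumed)
    set corpus "transcripts.txt"
    set lmFile "lm.nist"

    set f [open $corpus r]
    while {[gets $f line] >= 0} {
        # surround every sentence with the boundary tokens
        set words [concat <s> $line </s>]
        set prev ""
        foreach w $words {
            incr uni($w)                              ;# unigram count
            if {$prev != ""} { incr bi([list $prev $w]) }  ;# bigram count
            set prev $w
        }
    }
    close $f

    # total number of tokens for the unigram probabilities
    set total 0
    foreach w [array names uni] { incr total $uni($w) }

    set out [open $lmFile w]
    puts $out "\\data\\"
    puts $out "ngram 1=[array size uni]"
    puts $out "ngram 2=[array size bi]"
    puts $out ""
    puts $out "\\1-grams:"
    foreach w [lsort [array names uni]] {
        # relative frequency; -99.9 is the dummy back-off weight
        set logp [expr {log10(double($uni($w)) / $total)}]
        puts $out "$logp $w -99.9"
    }
    puts $out ""
    puts $out "\\2-grams:"
    foreach pair [lsort [array names bi]] {
        set w1 [lindex $pair 0]
        set w2 [lindex $pair 1]
        # p(w2|w1) = count(w1 w2) / count(w1)
        set logp [expr {log10(double($bi($pair)) / $uni($w1))}]
        puts $out "$logp $w1 $w2"
    }
    puts $out ""
    puts $out "\\end\\"
    close $out

The sketch relies on incr creating array entries on first use (Tcl 8.5 or newer); on older Tcl versions you would initialize the counts explicitly.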
Read Janus Tutorial Step7 and do the following things:
Last modified: Fri Mar 16 00:47:50 EST 2001
Maintainer: tanja@cs.cmu.edu.