The goal of Exercise 2 is to do more exercises in Tcl/Tk and to introduce you to the Janus database and dictionary objects. In addition we will talk about training, cross-validation, and evaluation test sets, as well as about the word error rate as a measure of a recognizer's output.
Pronunciation Dictionary
Typically a pronunciation dictionary consists of two fields per entry: the first field gives a word, the second describes the word's pronunciation in terms of phonemes. From the last exercise you learnt that our dictionary uses only 16 different phonemes. This is not a typical size for an English phoneme set; the reduction was made for simplification only. The first 5 lines of the file dict look like this:
    ABLE E Y P A L
    ABOUT A P A T
    ACTION E K S A N
    ACTS E K T S
    ADAMS E T A N S

Fortunately this format differs only slightly from what Janus expects:
    {ABLE} {E Y P A L}
    {ABOUT} {A P A T}
    ...

The curly braces are used by Tcl, because the dictionary is interpreted by Tcl when it is read. Obviously you wouldn't need the braces around the words in the above example, but you could imagine words that include special characters, which would have to be wrapped in braces. However, you are strongly encouraged NOT to use special characters in the names of words or phonemes; they are almost certain to cause you trouble.
Task 3-1 (If not done in the last exercise) Write a script coverage.tcl that outputs the number of words in the text file steps/data/transcripts which are not covered by the dictionary steps/data/dict. Caution: Don't forget to skip the utterance ID in the transcripts.
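The counting logic of such a script can be sketched as follows. The helper name countOOV and the inline sample data are made up for illustration; in the real script the two lists would be read line by line from steps/data/dict and steps/data/transcripts with open/gets:

```tcl
# Sketch of the logic for coverage.tcl: count the transcript words that
# are missing from the dictionary. countOOV and the sample data are
# made up for illustration.

proc countOOV {dictLines transLines} {
    # collect the covered words: the first field of every dictionary line
    set covered {}
    foreach line $dictLines {
        lappend covered [lindex $line 0]
    }
    # count transcript words not in the dictionary,
    # skipping the utterance ID (first field of every transcript line)
    set oov 0
    foreach line $transLines {
        foreach word [lrange $line 1 end] {
            if { [lsearch -exact $covered $word] < 0 } { incr oov }
        }
    }
    return $oov
}

# tiny made-up example: ACTION is not in the dictionary
set dictLines  {{ABLE E Y P A L} {ABOUT A P A T}}
set transLines {{utt-001 ABLE ABOUT ACTION}}
puts [countOOV $dictLines $transLines]   ;# prints 1
```

Note that lindex and lrange handle both the brace-free and the braced dictionary format, since the braces are just Tcl list quoting.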
Question 3-1: How high is the OOV-rate?
Question 3-2: What are the different ways to define a vocabulary for a speech recognizer?
Task 3-2 Identify the words in the transcripts which are not covered by the vocabulary. Manually create a phoneme sequence for each of these words according to our phoneme set (the output of count.tcl from the last session) and add the pronunciations to your local copy of the dictionary, steps/Mydict. Add the curly braces to your dict.
Janus Database
To train a recognition engine with Janus we need to create a database. For this purpose Janus provides the object type DBase. These objects can be used for all kinds of data, but one of the most common usages is a task database describing the recordings of the task, giving all the needed information about every utterance.
To get familiar with this object, type the following:
    DBase                                shows you the methods defined for objects of type DBase (method puts)
    DBase db                             creates a DBase object db
    db open db.dat db.idx -mode rwc      opens the database with data and index file
    db add one {{frz un} {ger eins}}     adds an entry with two fields under the entry key "one"
    db add two {{frz deux} {ger zwei}}   adds another entry to the database
    db close                             closes the database (incl. files)
Exit Janus and look at the files that have been created in your directory. Then restart Janus and type in the following lines:
    DBase mybase
    mybase open db.dat db.idx -mode r
    makeArray two [mybase get two]
    puts $two(ger)
Now you have an idea of how easily we can create databases in Janus. In the following step we will create a database from our data. In our example, the file transcripts contains the most essential information about the task's utterances, namely an utterance ID and a transcription. Other tasks may be organized differently, so you'll have to figure out yourself what is the best way to structure your data into a Janus database.
Task 4 Create a Janus database from the transcripts file. Use the utterance ID as the entry-key.
    [DBase db] open db.dat db.idx -mode rwc
    set fp [open ../data/transcripts r]
    while { [gets $fp line] != -1 } {
        set utt ...
        db add ...
    }
    db close
    exit
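The two lines left open inside the loop have to split each transcript line into the utterance ID and the spoken words. One possible way to do this, as a sketch (the helper name splitTranscript and the field name "text" are our own choices, not prescribed by Janus):

```tcl
# Sketch: split a transcript line into the entry key (utterance ID) and
# the transcription. splitTranscript and the field name "text" are our
# own choices, not prescribed by Janus.
proc splitTranscript {line} {
    set utt  [lindex $line 0]        ;# utterance ID = first field
    set text [lrange $line 1 end]    ;# the remaining fields = spoken words
    return [list $utt $text]
}

# Inside the while loop above this could be used as:
#   lassign [splitTranscript $line] utt text
#   db add $utt [list [list text $text]]

puts [splitTranscript {utt-001 THIS IS A TEST}]   ;# utt-001 {THIS IS A TEST}
```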
Defining a Training and a Test Set
Our data set contains 126 utterances. For training and evaluating a real system, you would use a training set for training the system, a cross-validation set for tuning, and an unseen evaluation set for testing. In our "toy world" we try to keep things very simple; therefore we will split our 126 sentences into a training set of 110 and a test set of 26 sentences. From adding up those numbers you can see that there will be an overlap - well, let's see later what happens.
Task 5 Create two files trainIDs and testID; the former contains the utterance IDs of the training set, the latter the utterance IDs of the test set.
In continuous speech recognition three different kinds of errors can occur: a spoken word is misrecognized, i.e. substituted by another word (= substitution error); a spoken word isn't recognized at all, i.e. deleted (= deletion error); or the recognition system recognizes a word which was not spoken, i.e. it inserts a word (= insertion error).
Based on these three error types the Word Error Rate (WER) is defined as a measure for the performance of the recognition engine:
          #Deletions + #Substitutions + #Insertions
    WER = ------------------------------------------ * 100
                 # words to be recognized

Consider the following example:
    REF: This great machine can recognize      speech
    HYP: This       machine can wreck     nice beach
              DEL                SUB      INS  SUB

            1 + 2 + 1
    WER  =  --------- * 100 = 66.6
                6

So the recognizer has a Word Error Rate of 66.6% or a Word Accuracy WA (WA = 100 - WER) of 33.3%. Make sure you understand that the WER is always meant to be the minimal error rate. Otherwise one could interpret the above example as the second reference word "great" being substituted by "machine", then the reference word "machine" being substituted by "can", the word "can" by "wreck", and so on.
Question 6-1: Is it possible for WA to become < 0 or > 100? If so, give an example.
Question 6-2: Is it possible that for a pair of reference/hypothesis the minimal word error rate results from different types of errors (different combination of error types)?
Task 6 Write a Tcl script that gets two lists as input (first list = list of reference strings, second list = list of hypotheses) and outputs the Word Error Rate.
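One way to obtain the minimal error count is dynamic programming over the two word sequences: the classic Levenshtein distance with unit costs for substitutions, deletions, and insertions. A possible sketch, with procedure names of our own choosing:

```tcl
# Sketch of a WER computation. editDistance computes the minimal number
# of substitutions, deletions, and insertions (Levenshtein distance with
# unit costs) between two word lists; wer sums the errors over all
# reference/hypothesis pairs. Both procedure names are our own choices.

proc editDistance {ref hyp} {
    set n [llength $ref]
    set m [llength $hyp]
    # prev holds the previous row of the dynamic-programming matrix
    for {set j 0} {$j <= $m} {incr j} { set prev($j) $j }
    for {set i 1} {$i <= $n} {incr i} {
        set im1 [expr {$i - 1}]
        set cur(0) $i
        for {set j 1} {$j <= $m} {incr j} {
            set jm1  [expr {$j - 1}]
            set cost [expr {[lindex $ref $im1] eq [lindex $hyp $jm1] ? 0 : 1}]
            set cur($j) [expr {min($prev($j) + 1, $cur($jm1) + 1, $prev($jm1) + $cost)}]
        }
        array set prev [array get cur]
    }
    return $prev($m)
}

proc wer {refList hypList} {
    set errors 0
    set words  0
    foreach ref $refList hyp $hypList {
        incr errors [editDistance $ref $hyp]
        incr words  [llength $ref]
    }
    return [expr {100.0 * $errors / $words}]
}

# the example from above: 4 errors in 6 reference words
puts [format %.1f [wer {{This great machine can recognize speech}} \
                       {{This machine can wreck nice beach}}]]   ;# prints 66.7
```

Note that the dynamic-programming table only reports the minimal count; recovering which words were substituted, deleted, or inserted would require backtracking through the table.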