The goal of Exercise 2 is to give you more practice with Tcl/Tk and to introduce you to the Janus objects DBase, Dictionary, Tags, Phones and PhonesSet.
Pronunciation Dictionary
Typically, a pronunciation dictionary consists of two fields per entry. The first field gives a word; the second field describes the word's pronunciation in terms of phonemes or other units, e.g. demi-syllables. Because we are developing a Mandarin speech recognition system, we will use demi-syllables. How to do this is part of the homework.
To create a Dictionary object we first have to create a Tags object and a Phones object.
Tags are used to add more information to the units of a pronunciation, such as word boundaries or tone.
We will use the tag WB for word boundaries and T1 T2 T3 T4 T5 for the five different tones used in Mandarin.
To create a Tags object, start janus in your steps directory and type:
> Tags tags ;# create an instance tags of class Tags
> tags puts ;# output content of the tags instance
> tags add WB ;# add a tag for word boundaries
> tags add SMOKER
> tags puts ;# what tags do we have?
SMOKER WB
;# add all the tonal tags with a loop
> foreach tag {T1 T2 T3 T4 T5} {
> tags add $tag
> }
> tags puts
SMOKER T1 T2 T3 T4 T5 WB
# show sub-objects that can be accessed
> tags.
item(0..6) list
# we store 7 items starting with an index 0 up to 6 (0..6)
# how can we access these items?
# access item 4
> tags.item(4) puts
T3
> tags delete SMOKER ;# we can also remove tags
> tags.item(4) puts
T4
# But the place/index of items may move if we delete something!
> tags write ./tags.desc ;# creates an ASCII file with the tags
# take a look at the created file
Check the created tags.desc file.
To create a pronunciation dictionary object we have to define the basic acoustic units used to describe the "words". We will now create a Phones object. This is only an example! You have to come up with the full list of units yourself.
The Phones object has functionality similar to that of the Tags object.
> Phones phones
> phones add SIL ;# SIL stands for Silence
> phones add @ ;# This is the padding phoneme. It has a special meaning that we will not discuss now.
> phones add "zh ong b iao" ;# this is a short version to add 4 units at once
> phones puts
@ SIL b iao ong zh
# write the description into an ASCII file
> phones write ./example-phones.desc
# Look at the created file
Now let us create a pronunciation dictionary!
# get some help on which other objects are needed
> Dictionary dict -help
ERROR itf.c(0406) <ITF,FCO> Failed to create 'dict' object.
Options of 'dict' are:
<name> name of the dictionary (string:"dict")
<Phones> phones (Phones:)
<Tags> tags (Tags:)
# The "ERROR itf.c(0406)" appears because we did not provide enough information to create the object.
# In addition to the instance name we have to provide first an object of type Phones and then an object of type Tags,
# using the objects we have created above
> Dictionary dict phones tags
> dict add -help ;# we have to provide two parameters: the lexical form and a list of {phoneme tag*} elements
dict add -help
Options of 'add' are:
<name> name (spelling) of the word (string:"NULL")
<pronunciation> pronunciation of the word
> dict add zhong3biao2 { {zh WB} {ong T3} b {iao WB T2} }
> dict add $ { {SIL WB} } ;# we will use this word as optional silence, this will be explained later
> dict write ./example.dict ;# write an ASCII version of the dictionary
Take a look at the example.dict file.
Task 3 Write a janus-script that creates a training dictionary (train.dict). (For details see homework).
Now we take a closer look at the PhonesSet object.
Because we will model
> PhonesSet phonesSet
# The list of PHONES, INITIALS etc. is incomplete!
> phonesSet add PHONES "SIL @ zh ong b iao"
> phonesSet add INITIALS "zh b"
> phonesSet add FINALS "ong iao"
> phonesSet add SILENCE "SIL"
# We can get a list of elements stored in the set
> phonesSet
FINALS INITIALS PHONES SILENCE
# We can access the content of e.g. the finals by typing
> phonesSet:FINALS
iao ong
# The type method tells us the type of a Janus object
> phonesSet:FINALS type
Phones
# The type of the phonesSet instance is of course
> phonesSet type
PhonesSet
# Write an ASCII description into a file
> phonesSet write example-phonesSet.desc
# Look at the created file
Because the type of phonesSet:PHONES is Phones, it can be used to create a Dictionary object.
> Dictionary dict2 phonesSet:PHONES tags
Question 3-1: What other groups/classes could be useful?
Janus Database
To train a recognition engine with Janus we need to create a database. For this purpose Janus provides us with an object type DBase. These objects can be used for anything, but one of the most common usages is a task database, describing the recordings of the task, giving all the needed information about every utterance.
To get familiar with this object, type the following:
> DBase ;# shows you the methods defined for the objects of type DBase (method puts)
> DBase db ;# create a DBase object db
> db open db.dat db.idx -mode rwc ;# opens database with data and index file
> db add spk030_utt001 "{ADC CH030/CH030_1.adc.shn} {TEXT guo2wu4yuan4 zhao4kai1 di4}" ;# add an entry with two fields with entry-key = "spk030_utt001"
> db add spk030_utt002 "{ADC CH030/CH030_2.adc.shn} {TEXT li3peng2 zong3li3 jin1tian1 zhu3chi2}" ;# add another entry to the database
> db close ;# close the database (incl. files)
Exit janus and look at the files that have been created in your directory. Now restart janus and type in the following lines:
> DBase mybase
> mybase open db.dat db.idx -mode r
> makeArray uttInfo [mybase get spk030_utt002]
> puts $uttInfo(TEXT)
li3peng2 zong3li3 jin1tian1 zhu3chi2
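To see what happens to an entry once it has been retrieved, the following plain-Tcl sketch mimics the effect of makeArray without needing janus. The procedure name listToArray is hypothetical and only illustrates the idea; it assumes that an entry is a list of {FIELD value ...} sublists, as in the db add calls above, and it is not the actual Janus implementation.

```tcl
# Hypothetical stand-in for makeArray (an assumption, not the Janus code):
# copy a list of {FIELD value ...} sublists into the caller's array,
# using the field name as the array key.
proc listToArray {arrayName fieldList} {
    upvar 1 $arrayName arr
    foreach field $fieldList {
        set arr([lindex $field 0]) [lrange $field 1 end]
    }
}

# The same entry string that was stored under spk030_utt002 above.
set entry "{ADC CH030/CH030_2.adc.shn} {TEXT li3peng2 zong3li3 jin1tian1 zhu3chi2}"
listToArray uttInfo $entry

puts $uttInfo(TEXT)   ;# prints: li3peng2 zong3li3 jin1tian1 zhu3chi2
puts $uttInfo(ADC)    ;# prints: CH030/CH030_2.adc.shn
```

Because every field ends up under its own key, more fields can be added to an entry later without changing the code that reads the existing ones.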
Now you have an idea how easily we can create databases in Janus. In the following step, we will create a database from our data. In our example, we use the Romanized transcripts in ./data/CH/rmn/. The structure of the database you should create is described in the homework for this session.
Task 4 Create a Janus database from the transcript files. Use the speaker and utterance ID as the entry-key (homework).
[DBase db] open db.dat db.idx -mode rwc
foreach f [glob ./data/CH/rmn/*.rmn] {
    set fp [open $f r]
    while { [gets $fp line] != -1 } {
        # analyze the text etc.
        set utt ...
        db add ...
    }
    close $fp
}
db close
exit
Question 4-1: Why is it not sufficient to use the utterance ID alone as a key?
Defining a Training and a Test Set
Remember that our data set contains 30 speakers with a total of 2589 utterances. The utterances of speakers matching "0[36]*" are for training and those of speakers matching "09*" are for parameter tuning.
Task 5 Create two files train.utt and test.utt, the first contains the utterance keys of the training set, the latter the utterance keys of the test set.
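The speaker patterns given above are glob patterns, so Tcl's string match can apply them directly. The sketch below shows only the selection step on a few made-up keys. It assumes keys of the form spkNNN_uttNNN as in the DBase example and assumes that the speaker ID is the digit run after "spk"; the sample keys are hypothetical, and writing the two lists to train.utt and test.utt is left out.

```tcl
# Hypothetical sample keys in the spkNNN_uttNNN form used in the DBase example.
set keys {spk030_utt001 spk030_utt002 spk061_utt001 spk091_utt001 spk092_utt005}

set trainKeys {}
set testKeys  {}
foreach key $keys {
    # Assumption: the speaker ID is the run of digits that follows "spk".
    if {![regexp {^spk(\d+)_} $key -> spk]} { continue }
    if {[string match {0[36]*} $spk]} {
        lappend trainKeys $key      ;# speakers "0[36]*": training
    } elseif {[string match {09*} $spk]} {
        lappend testKeys $key       ;# speakers "09*": held-out set
    }
}

puts "train: $trainKeys"
puts "test:  $testKeys"
```

The patterns are written in braces so that Tcl does not try to execute [36] as a command.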
In continuous speech recognition, three different kinds of error can occur: a spoken word is misrecognized, i.e. substituted by another word (= substitution error); a spoken word is not recognized at all, i.e. deleted (= deletion error); or the recognition system recognizes a word that was not spoken, i.e. it inserts a word (= insertion error).
Based on these three error types the Word Error Rate (WER) is defined as a measure for the performance of the recognition engine:
       #Deletions + #Substitutions + #Insertions
WER = ------------------------------------------- * 100
               # words to be recognized
Consider the following example:
REF: This great machine can recognize      speech
HYP: This       machine can wreck     nice beach
          DEL               SUB       INS  SUB

       1 + 2 + 1
WER = ----------- * 100 = 66.7
           6
Therefore, the recognizer has a Word Error Rate of about 66.7% or a Word Accuracy WA (WA = 100 - WER) of about 33.3%. Make sure you understand that WER is always meant to be the minimal error rate. Otherwise, one could read the above example differently: the reference word "great" substituted by "machine", "machine" by "can", "can" by "wreck" and so on, which would give five substitutions instead of the four errors counted above.
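The minimal number of errors can be found with a word-level edit distance computed by dynamic programming. The following is a minimal sketch for a single reference/hypothesis pair; the procedure name wordErrors is hypothetical. It returns only the total error count, does not report which errors are deletions, substitutions or insertions, and does not handle lists of sentences.

```tcl
# Minimal word-level edit distance between one reference and one hypothesis
# (a sketch; wordErrors is a hypothetical name, not a Janus procedure).
proc wordErrors {ref hyp} {
    set n [llength $ref]
    set m [llength $hyp]
    # prev is row i-1 of the DP table: prev[j] is the minimal cost of
    # aligning the first i-1 reference words with the first j hypothesis words.
    set prev {}
    for {set j 0} {$j <= $m} {incr j} { lappend prev $j }
    for {set i 1} {$i <= $n} {incr i} {
        set cur [list $i]
        for {set j 1} {$j <= $m} {incr j} {
            # match or substitution
            set best [lindex $prev [expr {$j - 1}]]
            if {[lindex $ref [expr {$i - 1}]] ne [lindex $hyp [expr {$j - 1}]]} {
                incr best
            }
            # deletion: reference word i has no counterpart
            set del [expr {[lindex $prev $j] + 1}]
            # insertion: hypothesis word j has no counterpart
            set ins [expr {[lindex $cur [expr {$j - 1}]] + 1}]
            if {$del < $best} { set best $del }
            if {$ins < $best} { set best $ins }
            lappend cur $best
        }
        set prev $cur
    }
    return [lindex $prev $m]
}

set ref {This great machine can recognize speech}
set hyp {This machine can wreck nice beach}
set errors [wordErrors $ref $hyp]
set wer [expr {100.0 * $errors / [llength $ref]}]
puts [format "errors = %d, WER = %.1f" $errors $wer]   ;# errors = 4, WER = 66.7
```

Keeping only two rows of the table is enough when just the error count is needed; recovering the individual error types would require storing the whole table and tracing back through it.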
Question 6-1: Is it possible that WA becomes < 0 or > 100? If so, give an example.
Question 6-2: Is it possible that for a pair of reference/hypothesis the minimal word error rate results from different types of errors (different combination of error types)?
Task 6 Write a tcl-script that
gets two lists as input (first list = list of reference strings, second list = list of hypotheses) and
gives the Word Error Rate as output.