Utterance IDs

In JANUS, every utterance has its own ID. What counts as an utterance is up to you: one utterance can be spread over multiple speech recording files, or one file can contain many utterances. Usually, however, you will have one utterance per speech file.

Many tasks (like the Wall Street Journal task) already use unique IDs for their utterances. Other tasks use IDs for speakers, dialogs, sentences within a dialog, etc.; you can compose an utterance ID from those IDs.

Later you will create a task database that uses the utterance IDs as its keys. You will also have to define rules (Tcl scripts) that derive the name(s) of the speech file(s) and the transcription of an utterance from its ID. You can create subsets of the entire database by listing all the utterances of the subset in a file. This way you can create training sets, development sets, test sets, or subsets of the training set such as gender-dependent sets or cross-validation sets.
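
The idea of composing an ID, deriving file names from it, and listing subsets can be sketched outside of JANUS. The following is a minimal illustration in Python, not JANUS code: in JANUS itself these rules are written as Tcl scripts. The ID layout (speaker_dialog_sentence), the directory layout, the ".adc" file extension, and all function names are assumptions made up for this example.

```python
# Illustrative sketch only; none of these names are part of JANUS.
# Assumed ID scheme: <speaker>_<dialog>_<sentence>, e.g. "spk07_d12_003".


def make_utt_id(speaker, dialog, sentence):
    """Compose an utterance ID from speaker, dialog, and sentence number."""
    return f"{speaker}_{dialog}_{sentence:03d}"


def split_utt_id(utt_id):
    """Recover (speaker, dialog, sentence number) from an utterance ID."""
    speaker, dialog, sentence = utt_id.split("_")
    return speaker, dialog, int(sentence)


def speech_file_names(utt_id, root="speech"):
    """Rule mapping an ID to its speech file(s); the layout is hypothetical.

    A list is returned because one utterance may span several files.
    """
    speaker, dialog, _ = split_utt_id(utt_id)
    return [f"{root}/{speaker}/{dialog}/{utt_id}.adc"]


def select_by_gender(utt_ids, speaker_gender, gender):
    """Pick the utterances whose speaker has the requested gender."""
    return [u for u in utt_ids if speaker_gender[split_utt_id(u)[0]] == gender]


def write_subset(path, utt_ids):
    """Define a subset by listing its utterance IDs, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for utt_id in utt_ids:
            f.write(utt_id + "\n")


def read_subset(path):
    """Read a subset file back into a list of utterance IDs."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

With such rules in place, a gender-dependent training set is just the output of select_by_gender written to a file with write_subset; the task database only needs the IDs as keys, and everything else is derived from them.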