Index of ghn's corpora
I've decided to index and document my corpora using HTML rather than
SDE. These corpora are fairly unstructured; I may try to put some
structure on them later... we'll see.
The corpora that we're using with Sphinx are a bit unusual. Because
the recognition process can introduce errors, the sentences NL gets
as input only approximate the actual sentences (as determined by a
human listener). What I've included here, at least initially, are the
human-transcribed reference sentences, which may differ from the
sentences we actually need to push through NL-Soar.
- The ATIS December 1993 Demo Corpus, male utterances only. We
have split the corpus into two sets so that we can use one as a
"training" set and the other as a test set; this also makes it
possible to batch process the corpus in one night, before our
Kerberos tickets expire.
There are two corpora: the human transcriptions of the sentences,
and the Sphinx unigram+acoustics hypotheses for the recorded
utterances.
Each utterance is tagged with an integer identifier and an ATIS
subject data identifier. The subject identifiers can be read as:
- The traffic report corpus, collected (without permission) from
WDUQ's broadcast traffic reports. I've used a similar set of
identifiers for the utterances in this corpus, except that I have
reversed the sentence and discourse ids, made the discourse id a
two-digit number, and eliminated the cruft at the end.
Maintainer: ghn@cs.cmu.edu (Last updated 95-09-05)