Index of ghn's corpora

I've decided to index and document my corpora using HTML rather than SDE. These corpora are fairly unstructured; I may try to put some structure on them later... we'll see.

The corpora that we're using with Sphinx are a bit unusual. The sentences NL gets as input are only approximations of the sentences actually spoken (as determined by a human listener), because the recognition process can introduce errors. What I've included here, at least initially, are the human-transcribed reference sentences, which might differ from the sentences we actually need to push through NL-Soar.

The ATIS December 1993 Demo Corpus, male utterances only. We have split the corpus into two sets so that we can use one as a "training" set and the other as a test set; this also makes it possible to batch process the corpus in one night before our Kerberos tickets expire.
There are two corpora: the human transcriptions of the sentences, and the Sphinx unigram+acoustics hypotheses for the recorded utterances.
Each utterance is tagged with an integer identifier and an ATIS subject data identifier. The subject identifiers can be read as:
The traffic report corpus, collected (without permission) from WDUQ's broadcast traffic reports. I've used a similar set of identifiers for the utterances in this corpus, except that I have swapped the sentence and discourse ids, made the discourse id a two-digit number, and eliminated the cruft at the end.

Maintainer: ghn@cs.cmu.edu (Last updated 95-09-05)