The data is available from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/co-training/data/course-cotrain-data.tar.gz
(GNU tar'ed and gzip'ped).
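As a minimal sketch of fetching and unpacking the archive with the Python standard library: the helper names (download_archive, extract_archive) and the local paths are hypothetical, and the URL is the one given above, which may no longer be served.

```python
import os
import tarfile
import urllib.request

# URL taken from the text above; availability is not guaranteed.
DATA_URL = ("http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/"
            "co-training/data/course-cotrain-data.tar.gz")


def download_archive(url: str, archive_path: str) -> str:
    """Download the archive to archive_path (hypothetical helper)."""
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_archive(archive_path: str, dest_dir: str) -> str:
    """Unpack a gzip-compressed tar archive into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir


if __name__ == "__main__":
    # Local file and directory names here are illustrative only.
    path = download_archive(DATA_URL, "course-cotrain-data.tar.gz")
    extract_archive(path, "course-cotrain-data")
```

On the command line, the equivalent is a plain "tar xzf" on the downloaded file.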
The files are organized into a directory structure with two directories at the top level.
Under each of the two directories, there is one directory for each class (course, non-course). These directories in turn contain the Web pages. The file name of each page corresponds to its URL, with each '/' replaced by '^'. Note that each page starts with a MIME header.
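The layout described above can be read with a short Python sketch. The function names are hypothetical, and two things are assumptions rather than facts from this description: that the MIME header is separated from the page body by the first blank line (the usual MIME convention), and that decoding the body as Latin-1 is acceptable. The top-level directory is passed in as an argument, and class names are taken from whatever subdirectories are found.

```python
from pathlib import Path
from typing import Iterator, Tuple


def filename_to_url(filename: str) -> str:
    """Undo the '/' -> '^' substitution used in the file names."""
    return filename.replace("^", "/")


def split_mime_header(raw: bytes) -> Tuple[bytes, bytes]:
    """Split raw page bytes into (header, body).

    Assumption: the header ends at the first blank line, written
    with either CRLF or LF line endings. If no blank line is found,
    the header is returned empty and the input is treated as body.
    """
    candidates = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        pos = raw.find(sep)
        if pos != -1:
            candidates.append((pos, len(sep)))
    if not candidates:
        return b"", raw
    pos, sep_len = min(candidates)  # earliest blank line wins
    return raw[:pos], raw[pos + sep_len:]


def iter_pages(view_dir: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (class_label, url, body_text) for every page under view_dir.

    view_dir is one of the two top-level directories; each of its
    subdirectories is treated as a class label.
    """
    for class_dir in sorted(Path(view_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        for page in sorted(class_dir.iterdir()):
            if not page.is_file():
                continue
            _, body = split_mime_header(page.read_bytes())
            # Latin-1 never fails to decode; this choice is an assumption.
            yield class_dir.name, filename_to_url(page.name), body.decode("latin-1")
```

Calling list(iter_pages(path)) on one of the two top-level directories gives labeled (class, URL, text) triples ready for feature extraction.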
If you have any questions about this dataset, send mail to Rayid Ghani (rayid@cs.cmu.edu).
last update: November 24, 1999 (Rayid)
created: November 24, 1999 (Rayid)