REUTERS
Synopsis
REUTERS newswire corpora
Description
There are many versions of the REUTERS corpus in use.
See Yiming's paper on text categorization for more details.
On AFS we have several versions:
Reuters_21450:
Reuters-21578:
- The original corpus contains the files in SGML. You can use
tdt-reuters to extract the smart files, and then use
eugeneng_clean_cat.c
to extract the 90-category sets described below.
- A subset of ApteMod containing only documents
whose categories belong to one of the 90 categories for which there
exists at least one training document and one testing document was created by e Sing Eugene Ng. A version of it, whose documents do not have titles, was called ApteMod.
- Well documented by David D. Lewis
(AT&T Labs - Research, lewis@research.att.com).
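The 90-category selection rule described above can be illustrated with a minimal Python sketch. It is not the logic of eugeneng_clean_cat.c, which is not shown here. The `Doc` record, the `"train"`/`"test"` split labels, and the function names are hypothetical, and the sketch assumes that a document is kept when at least one of its categories survives the filter, with its other categories dropped.

```python
from collections import namedtuple

# Hypothetical record: document id, split label, and list of category names.
Doc = namedtuple("Doc", ["doc_id", "split", "categories"])


def eligible_categories(docs):
    """Return categories with at least one training and one testing document."""
    train_cats = set()
    test_cats = set()
    for d in docs:
        if d.split == "train":
            train_cats.update(d.categories)
        elif d.split == "test":
            test_cats.update(d.categories)
    return train_cats & test_cats


def filter_docs(docs):
    """Keep documents with at least one eligible category (assumed rule)."""
    keep = eligible_categories(docs)
    result = []
    for d in docs:
        cats = [c for c in d.categories if c in keep]
        if cats:
            result.append(Doc(d.doc_id, d.split, cats))
    return result


# Toy data, not taken from the corpus.
docs = [
    Doc(1, "train", ["earn"]),
    Doc(2, "test", ["earn", "rare-test"]),
    Doc(3, "train", ["rare-train"]),
    Doc(4, "test", []),
]
filtered = filter_docs(docs)
```

On the toy data only the category present in both splits is retained, so documents 1 and 2 remain. Run on the real collection, the same rule would be expected to leave the 90 categories mentioned above.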