Naive Bayes algorithm for learning to classify text
Companion to Chapter 6 of the textbook Machine Learning.
Naive Bayes classifiers are among the most successful known algorithms
for learning to classify text documents. This page provides an
implementation of the Naive Bayes learning algorithm similar to that
described in Table 6.2 of the textbook. It also provides a dataset
containing 20,000 newsgroup messages drawn from the 20 newsgroups
described in Table 6.3. As mentioned in the textbook, the dataset
contains 1000 documents from each of the 20 newsgroups.
Note on downloading
This code and data are supported only under the Unix and Linux
operating systems. (If you would like to volunteer support for
Windows, please contact me.) To reconstruct the original files from a
downloaded file such as xxx.tar.gz, type the following two commands
at the Unix prompt:
gunzip xxx.tar.gz
tar -xf xxx.tar
Code
This code is based on the Rainbow/Libbow software package developed by
Andrew McCallum. It includes efficient C code for indexing text
documents along with code implementing the Naive Bayes learning
algorithm. Libbow also provides implementations of two additional
text learning algorithms: TFIDF and prTFIDF. This code may be used
either as a building block for creating other programs or as a
stand-alone learning/classification system.
Note: this code is a minor variant of the algorithm described in Table
6.2 of Machine Learning.
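For readers who want to see the shape of the algorithm before
downloading anything, here is a minimal Python sketch in the spirit of
the Table 6.2 procedure. It is not the Rainbow/Libbow code and does not
reproduce its indexing or options. The function names, the
whitespace tokenization, and the toy training sentences are all
assumptions made for illustration. It estimates each word probability
as (n_k + 1) / (n + |Vocabulary|), and it sums log probabilities rather
than multiplying probabilities, which is a common way to avoid
numerical underflow on long documents.

```python
import math
from collections import Counter, defaultdict


def learn_naive_bayes_text(examples):
    """Estimate log P(v) and log P(w|v) from (tokens, label) pairs.

    Hypothetical helper for illustration; not part of Rainbow/Libbow.
    """
    # Vocabulary: every distinct token seen in any training document.
    vocabulary = {w for tokens, _ in examples for w in tokens}
    doc_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)

    log_prior = {}
    log_likelihood = {}
    for label, n_docs in doc_counts.items():
        # P(v) = fraction of training documents with this label.
        log_prior[label] = math.log(n_docs / len(examples))
        # n = total number of word positions in this label's documents.
        n = sum(word_counts[label].values())
        denom = n + len(vocabulary)
        # P(w|v) = (n_k + 1) / (n + |Vocabulary|)
        log_likelihood[label] = {
            w: math.log((word_counts[label][w] + 1) / denom)
            for w in vocabulary
        }
    return vocabulary, log_prior, log_likelihood


def classify_naive_bayes_text(model, tokens):
    """Return the label with the highest posterior score for tokens."""
    vocabulary, log_prior, log_likelihood = model
    # Only positions holding words found in the vocabulary are scored.
    known = [w for w in tokens if w in vocabulary]
    scores = {
        label: log_prior[label] + sum(log_likelihood[label][w] for w in known)
        for label in log_prior
    }
    return max(scores, key=scores.get)


# Toy data, invented for this example (not drawn from the dataset).
training = [
    ("the puck slid past the goalie".split(), "rec.sport.hockey"),
    ("the team scored in overtime".split(), "rec.sport.hockey"),
    ("the shuttle reached orbit".split(), "sci.space"),
    ("the probe left orbit around mars".split(), "sci.space"),
]
model = learn_naive_bayes_text(training)
print(classify_naive_bayes_text(model, "goalie stopped the puck".split()))
# prints: rec.sport.hockey
```

The word "stopped" never appears in the toy training data, so it is
skipped at classification time. Words that were seen, but only under
the other label, still receive a nonzero probability because of the
"+ 1" in the estimate.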
Newsgroup Data
On-Line Documentation
Visitors from outside CMU are invited to use this material free of
charge for any educational purpose, provided attribution is given in
any lectures or publications that make use of this material.
This page organized by Jason Rennie.
jr6b@cs.cmu.edu | Last updated 4/6/97