Naive Bayes algorithm for learning to classify text

Companion to Chapter 6 of Machine Learning textbook.

Naive Bayes classifiers are among the most successful known algorithms for learning to classify text documents. This page provides an implementation of the Naive Bayes learning algorithm similar to that described in Table 6.2 of the textbook. It also provides a dataset containing 20,000 newsgroup messages drawn from the 20 newsgroups described in Table 6.3. As mentioned in the textbook, the dataset contains 1,000 documents from each of the 20 newsgroups.

Note on downloading

This code and data are supported only under the Unix and Linux operating systems. (If you would like to volunteer support for Windows, please contact me.) To reconstruct the original files from a downloaded file such as xxx.tar.gz, type the following two commands at the Unix shell prompt:

gunzip xxx.tar.gz
tar -xf xxx.tar
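
For readers who would rather unpack the archive from a script, the two commands above can be approximated with Python's standard tarfile module. This is only a sketch: the function name extract_archive and its arguments are hypothetical, it is not part of the distributed code, and it has not been checked against the actual archives on this page.

```python
import tarfile


def extract_archive(archive_path, destination="."):
    """Unpack a gzip-compressed tar archive into `destination`.

    Roughly equivalent to running gunzip followed by tar -xf,
    except that the original .tar.gz file is left in place.
    """
    # "r:gz" tells tarfile to decompress with gzip while reading.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(destination)
```

A call such as extract_archive("xxx.tar.gz") would unpack into the current directory. As with the tar command itself, only unpack archives obtained from a source you trust.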

Code

This code is based on the Rainbow/Libbow software package developed by Andrew McCallum. It includes efficient C code for indexing text documents along with code implementing the Naive Bayes learning algorithm. Libbow also provides implementations of two additional text learning algorithms: TFIDF and prTFIDF. This code may be used either as a building block for creating other programs or as a stand-alone learning/classification system.

Note: this code implements a minor variant of the algorithm described in Table 6.2 of Machine Learning.
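
To give a feel for the algorithm before downloading the C code, here is a minimal Python sketch of a Naive Bayes text classifier in the spirit of Table 6.2. It is not the Rainbow/Libbow implementation and makes several simplifying assumptions: the whitespace tokenizer, the function names (tokenize, train_naive_bayes, classify), and the model layout are all hypothetical, and the toy training sentences are invented for illustration. Word probabilities are estimated as (n_k + 1) / (n + |Vocabulary|), where n_k is the number of times the word occurs in a class and n is the total number of word positions in that class.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Estimate log P(class) and log P(word | class) from (text, label) pairs."""
    vocabulary = set()
    docs_per_class = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        words = tokenize(text)
        vocabulary.update(words)
        docs_per_class[label] += 1
        word_counts[label].update(words)

    log_priors = {}
    log_likelihoods = {}
    for label, n_docs in docs_per_class.items():
        log_priors[label] = math.log(n_docs / len(examples))
        # n = total word positions in this class; add-one smoothing
        # gives the estimate (n_k + 1) / (n + |Vocabulary|).
        n = sum(word_counts[label].values())
        denominator = n + len(vocabulary)
        log_likelihoods[label] = {
            word: math.log((word_counts[label][word] + 1) / denominator)
            for word in vocabulary
        }
    return vocabulary, log_priors, log_likelihoods


def classify(text, model):
    """Return the class with the highest posterior log score for `text`."""
    vocabulary, log_priors, log_likelihoods = model
    # Words never seen during training are ignored.
    words = [w for w in tokenize(text) if w in vocabulary]
    scores = {
        label: log_priors[label] + sum(log_likelihoods[label][w] for w in words)
        for label in log_priors
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Invented toy data; the real dataset has 1000 messages per newsgroup.
    toy_examples = [
        ("the team won the hockey game", "rec.sport.hockey"),
        ("goalie saved the puck", "rec.sport.hockey"),
        ("the shuttle reached orbit", "sci.space"),
        ("nasa launched a rocket into orbit", "sci.space"),
    ]
    model = train_naive_bayes(toy_examples)
    print(classify("hockey puck", model))
```

Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow on long documents. The sketch omits the efficient indexing that the C code provides.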

Newsgroup Data

The tarred and gzipped data directory (easiest for downloading).
A tarred and gzipped subset of the Newsgroup data which contains 100 randomly selected messages from each newsgroup. This is a useful dataset for learning to use Rainbow.

On-Line Documentation

Rainbow Documentation

Visitors from outside CMU are invited to use this material free of charge for any educational purpose, provided attribution is given in any lectures or publications that make use of this material.

This page organized by Jason Rennie.

jr6b@cs.cmu.edu | Last updated 4/6/97