Gentle Introduction to Rainbow

This document was created for Libbow version 0.7 / Rainbow version 0.2

The primary author/coordinator of Libbow/Rainbow is Andrew McCallum <mccallum@cs.cmu.edu>
This tutorial was written by Jason Rennie <jr6b@cs.cmu.edu>

This document is intended for users of Rainbow who have installed the Libbow/Rainbow package on their system and now want to learn how to use it. This document will give you a few simple examples of what Rainbow can do, and will help to increase your familiarity with its wealth of options. Additionally, it will give you an idea of what to expect and will help you to interpret the information Rainbow returns to you.

To make full use of this documentation, you will need to have a set of documents to classify. It is highly recommended that you obtain the 20 newsgroup dataset from the main Rainbow/Libbow web page. If you are unable to obtain the 20 newsgroup dataset, you will need a set of text documents which are grouped into directories according to their classification.

Before we start, it's always nice to know where to turn to get help. In the case of Rainbow, a list of options and their functionalities is available with the following command:

# rainbow --help | more
If you are using a pre-0.2 version of rainbow, you would use the following syntax:
# rainbow -X | more
Note that I will use "# " to indicate a command prompt. As above and wherever else applicable in this document, I will describe both the long and short command line options. Rainbow versions before 0.2 do not understand long command line options.

With that helpful tidbit in mind, we can continue. Rainbow was primarily designed for running batch training and testing runs. Rainbow also allows for individual document classification according to a training model. However, it is not yet possible to have Rainbow perform incremental training (all training for a model must be performed in a single execution of Rainbow).

Now, let's try some actual examples which use the Rainbow executable. First, we will index the documents from two of the twenty newsgroups. This is what might otherwise be called the training phase.

# rainbow --index [newsgroup_dir]/comp.graphics [newsgroup_dir]/rec.autos
If you are using a pre-0.2 version of rainbow, you would use the following syntax:
# rainbow -i [newsgroup_dir]/comp.graphics [newsgroup_dir]/rec.autos
Here, [newsgroup_dir] is the path (relative or absolute) where the 20 newsgroup dataset is located. If you are using an alternate dataset, replace the comp.graphics and rec.autos directories with the locations of two directories filled with text documents. Make sure NOT to append a trailing slash to the directory names, as this may confuse Rainbow. After performing that operation, you should receive output similar to the following:
Class `comp.graphics'
  Counting words... files : unique-words ::     1000 :  12601
Class `rec.autos'
  Counting words... files : unique-words ::     1000 :  18543
Class `comp.graphics'
  Gathering stats... files : unique-words ::     1000 :  13141
Class `rec.autos'
  Gathering stats... files : unique-words ::     1000 :  13141
Making vector-per-class... words ::     41
Normalizing weights:         0
The directory ~/.rainbow should now exist and contain all the necessary information concerning the documents in the comp.graphics and rec.autos directories. To have Rainbow store the model information in a directory other than ~/.rainbow, you can use the -d or --data-dir=DIR option to specify an alternate directory. This is very useful if you want to store more than one learned model at a time. Here's how you would perform the above operation and store the results in /usr/local/rainbow/data instead of ~/.rainbow:
# ./rainbow --data-dir=/usr/local/rainbow/data --index [newsgroup_dir]/comp.graphics [newsgroup_dir]/rec.autos
If you are using a pre-0.2 version of rainbow, you would use the following syntax:
# ./rainbow -d /usr/local/rainbow/data -i [newsgroup_dir]/comp.graphics [newsgroup_dir]/rec.autos
As was mentioned before, you can use Rainbow to perform single document queries. Here's an example of how you would perform a query using the entire database of documents as the training set and the single query document as the testing set:
# rainbow --query=[newsgroup_dir]/rec.autos/101610
Loading data files...

[newsgroup_dir]/rec.autos 1
[newsgroup_dir]/comp.graphics 3.19284e-30
If you are using a pre-0.2 version of rainbow, you would use the following syntax:
# rainbow -q [newsgroup_dir]/rec.autos/101610
This specific query is "unfair" as the query document is part of the rec.autos category, but it should at least give you an idea of what to expect. The output lists each of the categories for which the model has information. Beside each category, it lists the probability, according to the model, that the query document would be assigned to that category.

One quite nice feature of Rainbow is that it does not require you to indicate from the beginning which documents are training documents. Once Rainbow has indexed a set of documents, it has the flexibility to designate some of those documents as training documents and others as testing documents.

With the documents indexed, we can now do some testing runs. For our first run, we will use 67% of the indexed documents as training documents and 33% of the indexed documents for testing:

# ./rainbow --test-percentage=33 --test=1 > rainbow_output
If you are using a pre-0.2 version of rainbow, you would use the following syntax:
# ./rainbow -p 33 -t 1 > rainbow_output
The --test-percentage=33 option specifies that 33% of the indexed documents should be set aside for testing. These documents are randomly chosen from the pool of indexed documents, and information from these documents is not part of the model used to classify documents (on this run). The --test=1 option indicates that you wish to perform one testing run.

During the testing run, you should see output similar to the following:

Loading data files...
Making vector-per-class... words ::     41
Normalizing weights:         0
The output from a testing run is sent to standard out and is quite extensive. In most cases, as has been done here, you will want to direct the output to a file.

Once the above run is complete, you will be left with a file named rainbow_output which contains all the result information. It includes the classification for each individual document and also the confidence rating for each classification. To get summary results for the testing runs, you run this file through the rainbow-stats program:

# rainbow-stats < rainbow_output
This will give you results similar to the following. Note that your resulting accuracy may be different as the training/test document split is made at random.
Trial 0

Correct: 655 out of 660 (99.24)

 - Confusion details
Actual: comp.graphics
comp.graphics:320 rec.autos:4 
Actual: rec.autos
comp.graphics:1 rec.autos:335 

Overall_accuracy_and_stderr 99.24 0.00

For each trial, the output lists the overall accuracy, along with confusion details for each category from which testing documents were drawn.

The above runs have not made full use of the wealth of options which Rainbow offers. Without specifying any other options, Rainbow will use a stoplist to remove uninformative words (such as 'a', 'the', etc.). To have Rainbow not use this stoplist, you should use the -s or --no-stoplist option.
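For example, you could re-index the two newsgroups with the stoplist disabled. This is an illustrative sketch following the earlier examples; [newsgroup_dir] remains a placeholder for your dataset location:

```shell
# Re-index the two classes without removing stoplist words such as 'a' and 'the'.
rainbow --no-stoplist --index [newsgroup_dir]/comp.graphics [newsgroup_dir]/rec.autos
```

Disabling the stoplist usually hurts accuracy, but it can be useful when function words carry signal (for example, authorship attribution).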

Besides using a stoplist, another fairly standard method for increasing the accuracy of a text learning algorithm is to use stemming (a method which represents each word by its root, removing suffixes which might differentiate two words which have the same meaning). Rainbow has code to perform stemming during the lexing and indexing of the documents. This can be turned on by using the -S or --use-stemming option with the execution of Rainbow. Beginning with the 0.2 version of Rainbow, the default is for Rainbow to NOT stem words.
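Following the indexing examples above, a stemmed index could be built like this (illustrative; the option must be given at indexing time since stemming happens during lexing):

```shell
# Index with stemming enabled, so e.g. 'driving' and 'drives' map to one root.
rainbow --use-stemming --index [newsgroup_dir]/comp.graphics [newsgroup_dir]/rec.autos
```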

Another important option is the -m <mname> or --method=<mname> option. This allows you to specify the method which is used for determining the classification of documents. By default, the Naive Bayesian (naivebayes) algorithm will be used. Libbow allows the flexibility to use any algorithm, and Rainbow currently implements the use of Naive Bayesian (naivebayes), TFIDF (tfidf_words), Fuhr's Probabilistic Indexing (prind), Cross Entropy (crossentropy), TFIDF using log of frequency (tfidf_log_words) and TFIDF using log of occurrences (tfidf_log_occur).
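As a sketch, a testing run using TFIDF instead of the default Naive Bayes could look like this (combining --method with the test options shown earlier):

```shell
# One test trial, 33% of documents held out, classified with the tfidf_words method.
rainbow --method=tfidf_words --test-percentage=33 --test=1 > rainbow_output
```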

Under the lexing options of Rainbow, you will also find options to do such things as ignoring HTML tokens, removing e-mail headers and lexing words as N-grams.

The -H or --skip-html option will ignore the HTML tokens in an HTML document. Essentially, it will ignore all blocks of text beginning with < and ending with >.

The --lex-for-usenet option will ignore Newsgroup: and Path: headers from e-mail and newsgroup messages and will also remove uuencoded blocks. This is especially useful for newsgroup messages, as these headers may include information which directly indicates the proper classification of a message. For our example above using the 20 newsgroup dataset, the --lex-for-usenet option should always be used.
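Since this is a lexing option, it must be given when the documents are indexed. An indexing run for the 20 newsgroup example might therefore look like this (illustrative, following the earlier examples):

```shell
# Index with the usenet-aware lexer, which drops the revealing
# Newsgroup: and Path: headers and strips uuencoded blocks.
rainbow --lex-for-usenet --index [newsgroup_dir]/comp.graphics [newsgroup_dir]/rec.autos
```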

In a similar vein, the -h or --skip-header option will remove ALL headers from e-mail and newsgroup messages, restricting classification of those documents purely to the content of the message.

The -g <N> or --gram-size=N option will use N-grams (N consecutive words in a message) for individual tokens, rather than using a single word for each token.
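These lexing options can be combined at indexing time. For instance, a sketch of an index built from word bigrams with all headers stripped (again, an illustrative combination, not a recommendation):

```shell
# Index using 2-grams (pairs of consecutive words) as tokens,
# ignoring all message headers.
rainbow --skip-header --gram-size=2 --index [newsgroup_dir]/comp.graphics [newsgroup_dir]/rec.autos
```

Note that larger N-grams greatly increase vocabulary size, so expect indexing to use more memory and disk.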

jr6b@cs.cmu.edu | Last updated 6/21/97