Rudimentary documentation for Rainbow

This document was written for Libbow version 0.6 / Rainbow version 0.1

There are three main modes of operation: indexing, querying, and testing.

Here is a simple indexing example. The argument here is a set of directories, one directory per class. Within each directory, there must be one file per document. This tokenizes the files, builds weight vectors according to Naive Bayes, and saves the vectors in ~/.rainbow/*

     ./rainbow -i /usr1/mitchell/datasets/homepagesections/NEG /usr1/mitchell/datasets/homepagesections/POS
     Class `NEG'
       Counting words... files : unique-words ::      522 :   5535
     Class `POS'
       Counting words... files : unique-words ::      164 :   6979
     Class `NEG'
       Gathering stats... files : unique-words ::      522 :   3830
     Class `POS'
       Gathering stats... files : unique-words ::      164 :   3830
     Making vector-per-class... words ::     30
     Normalizing weights:         0
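
For reference, -i expects one subdirectory per class, with one file per document inside each. The layout for the run above would look roughly like this (file names other than the one queried later are made up for illustration):

     homepagesections/
       NEG/
         SOMEFILE.h2-1
         ...
       POS/
         ZABOWSKI-DAVID.h4-3
         ...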
Here is an example query:
     ./rainbow -q /usr1/mitchell/datasets/homepagesections/POS/ZABOWSKI-DAVID.h4-3
     Loading data files...
     
     Hit number 0, with score 1
     Class `/usr1/mitchell/datasets/homepagesections/POS'
     
     Hit number 1, with score 1.27581e-14
     Class `/usr1/mitchell/datasets/homepagesections/NEG'
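
To control how much a query prints, use the -n option from the usage summary below ("prints the N best-matching documents to the query"); for example:

     ./rainbow -q /usr1/mitchell/datasets/homepagesections/POS/ZABOWSKI-DAVID.h4-3 -n 1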
Here is an example of running an experiment, in two steps. The first step uses the data from the most recent -i run of Rainbow. It performs multiple (here, 10) iterations of a train/test split. The second step uses a Perl script called rainbow-stats to summarize the results of these trials in lovely, human-readable form.
 
     ./rainbow -t 10 > ~/rainbow.output
     Loading data files...
     Making vector-per-class... words ::     79
     Normalizing weights:         0
     Making vector-per-class... words ::     79
     Normalizing weights:         0
          ...eight more times...

Now let's look at the results...
 
     cat ~/rainbow.output | ~/rainbow-stats
     Trial 0
     
     Correct: 154 out of 200 (77.00)
     
      - Confusion details
     Actual: NEG
     NEG:124 POS:31 
     Actual: POS
     NEG:15 POS:30 
     
     Trial 1
     
     Correct: 152 out of 201 (75.62)
     
      - Confusion details
     Actual: NEG
     NEG:125 POS:36 
     Actual: POS
     NEG:13 POS:27 
     
     
          ...more...

Here is a nice way to see the 15 terms with the highest information gain:

     ./rainbow -I 15
     Loading data files...
     Calculating info gain... words ::     79
      0.12477 project
      0.05289 research
      0.05225 lyco
      0.03599 html
      0.03523 system
      0.03445 vasc
      0.02966 http
      0.02792 home
      0.02779 vision
      0.02741 href
      0.02730 cmu
      0.02342 wa
      0.01995 www
      0.01952 parallel
      0.01942 laboratori
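
To keep only the most informative terms in the model itself, the -T option from the usage summary below prunes all but the top N words by information gain when indexing. For example, re-indexing the same data with an (arbitrarily chosen) 50-word vocabulary:

     ./rainbow -T 50 -i /usr1/mitchell/datasets/homepagesections/NEG /usr1/mitchell/datasets/homepagesections/POS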
Here is a way to get online help (any unrecognized option makes rainbow print its usage message):
     [jr6b@stomach bow]$ ./rainbow -X
     rainbow: illegal option -- X
     usage:
      ./rainbow [-d datadir] [-v <verbosity_level>] [-b]
       (where
        `datadir' is the directory in which to read/write data
        `verbosity_level' is 0=silent, 1=quiet, 2=show-progress, ... 5=max)
       [-b]  don't use backspace when verbosifying (good for use in emacs)
       [-T <N>]  prune all but top N words by info-gain (default: infinity)
       [-m <mname>]  set method to <mname> (eg. naivebayes, tfidf, prind)
       [-U]  in the PrInd method, use non-uniform prior probabilities
       [-G]  in the PrInd method, scale Pr(w|d) by Foil-gain
       [-V]  print version information and exit
     lexing options
       [-s]  don't use the stoplist (i.e. don't prune frequent words)
       [-S]  turn off stemming of the tokens
       [-H]  ignore HTML tokens
       [-g <N>]  set N for N-gram lexer (default=1)
       [-h]  skip over email or news header
     then, for indexing and setting weights
      -i <class1_dir> <class2_dir> ...
       [-L]  don't lex to get word counts, instead read archived barrel
       [-f <file>]  prints file contents instead of class_dir at query time
       [-R <N>]  remove words with occurrence counts less than N
     or, for querying
      -q [<file_containing_query>]
       [-n <N>]  prints the N best-matching documents to the query
     or, for testing
       [-t <N>]  perform N testing trials
       [-p <N>]  (with -t) Use N% of the documents as test instances
       [-x <class1_dir> <class2_dir>...] use these files as test instances
       [-N]  in the PrInd method, do not normalize the scores.
     or, for diagnostics
       [-I <N>]  prints the top N words with highest information gain
       [-W <classname>]  prints the weight-vector for <classname>
       [-F <classname>]  print the unsorted foilgain #'s for <classname>
       [-P]  print score contribution of each word to each class
       [-B]  print barrel word vectors in awk-processable form
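
As an example of the diagnostic options, this prints the weight vector for one of the classes indexed earlier (rainbow reports classnames as the class directories, so adjust the argument to match your own run):

     ./rainbow -W /usr1/mitchell/datasets/homepagesections/POS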


To have Rainbow index and store the 20 Newsgroups data, you can call it with the following arguments:

     ./rainbow -M -m naivebayes -i [news data directory]/*

where 'news data directory' is the location into which you untarred the newsgroups file (note that the '/*' will pass each subdirectory to Rainbow as a separate class).
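
Since newsgroup articles start with headers (including a 'Newsgroups:' line that names the class), you may want the -h lexing option from the usage summary to skip them; for example:

     ./rainbow -M -m naivebayes -h -i [news data directory]/*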



To then have Rainbow perform test runs on the data, you can call it with the following arguments:

     ./rainbow -t [number of test runs] -p 33 > rainbow.output
     cat rainbow.output | ./rainbow-stats
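
To test on a fixed held-out set instead of random splits, the usage summary lists a -x option that names the test directories explicitly. A sketch, assuming you have set aside a parallel tree of held-out documents (the exact combination with -t may need adjusting for your version):

     ./rainbow -t 1 -x [held-out news directory]/* > rainbow.output
     cat rainbow.output | ./rainbow-stats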