There are three main modes of operation: indexing, querying, and testing. Here is an example of indexing:
./rainbow -i /usr1/mitchell/datasets/homepagesections/NEG /usr1/mitchell/datasets/homepagesections/POS
Class `NEG'
Counting words... files : unique-words :: 522 : 5535
Class `POS'
Counting words... files : unique-words :: 164 : 6979
Class `NEG'
Gathering stats... files : unique-words :: 522 : 3830
Class `POS'
Gathering stats... files : unique-words :: 164 : 3830
Making vector-per-class... words :: 30
Normalizing weights: 0

Here is an example query:
./rainbow -q /usr1/mitchell/datasets/homepagesections/POS/ZABOWSKI-DAVID.h4-3
Loading data files...
Hit number 0, with score 1
Class `/usr1/mitchell/datasets/homepagesections/POS'
Hit number 1, with score 1.27581e-14
Class `/usr1/mitchell/datasets/homepagesections/NEG'

Here is an example of running an experiment, in two steps. The first step uses the data from the most recent -i run of Rainbow and performs multiple (here, 10) iterations of a train/test split. The second step uses a Perl script called rainbow-stats to summarize the results of these trials in lovely, human-readable form.
./rainbow -t 10 > ~/rainbow.output
Loading data files...
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0

Now let's look at the results...
cat ~/rainbow.output | ~/rainbow-stats
Trial 0
Correct: 154 out of 200 (77.00)
 - Confusion details
Actual: NEG   NEG:124 POS:31
Actual: POS   NEG:15 POS:30
Trial 1
Correct: 152 out of 201 (75.62)
 - Confusion details
Actual: NEG   NEG:125 POS:36
Actual: POS   NEG:13 POS:27
...more...
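As a quick sanity check, the per-trial accuracies can be pooled into one overall figure. The awk one-liner below is not part of the Bow package; it assumes only the `Correct: N out of M (P)' line format shown above, with the two sample lines from this transcript inlined as input:

```shell
awk '/^Correct:/ { right += $2; total += $5 }
     END { printf "overall: %d/%d (%.2f%%)\n", right, total, 100 * right / total }' <<'EOF'
Correct: 154 out of 200 (77.00)
Correct: 152 out of 201 (75.62)
EOF
```

In practice you would feed it the real output instead of a here-document, e.g. `cat ~/rainbow.output | ~/rainbow-stats | awk '...'`.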
Here is a nice way to see the 15 terms with the highest information gain:
./rainbow -I 15
Loading data files...
Calculating info gain... words :: 79
0.12477 project
0.05289 research
0.05225 lyco
0.03599 html
0.03523 system
0.03445 vasc
0.02966 http
0.02792 home
0.02779 vision
0.02741 href
0.02730 cmu
0.02342 wa
0.01995 www
0.01952 parallel
0.01942 laboratori

Here is the way to get online help:
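A word's information gain is the drop in class entropy from learning whether the word appears in a document: IG(w) = H(C) - [P(w) H(C|w) + P(not w) H(C|not w)]. As a toy illustration of that computation (the document counts below are made up, not taken from the homepage data, and this is a sketch rather than rainbow's own code):

```shell
awk 'function h(p) { return (p <= 0 || p >= 1) ? 0 : -(p*log(p) + (1-p)*log(1-p)) / log(2) }
BEGIN {
  # hypothetical counts: the word occurs in 3 of 4 class-A documents
  # and in 1 of 4 class-B documents
  wa = 3; wb = 1            # documents containing the word, per class
  na = 1; nb = 3            # documents lacking the word, per class
  n  = wa + wb + na + nb
  Hc   = h((wa + na) / n)   # class entropy H(C)
  Hw   = h(wa / (wa + wb))  # H(C | word present)
  Hnot = h(na / (na + nb))  # H(C | word absent)
  pw   = (wa + wb) / n      # P(word present)
  printf "%.4f\n", Hc - (pw * Hw + (1 - pw) * Hnot)
}'
```

A word that split the two classes perfectly would score 1 bit here; one distributed evenly across classes would score 0.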
[jr6b@stomach bow]$ ./rainbow -X
rainbow: illegal option -- X
usage: ./rainbow [-d datadir] [-v <verbosity_level>] [-b]
  (where `datadir' is the directory in which to read/write data
   `verbosity_level' is 0=silent, 1=quiet, 2=show-progress, ... 5=max)
  [-b]            don't use backspace when verbosifying (good for use in emacs)
  [-T <N>]        prune all but top N words by info-gain (default: infinity)
  [-m <mname>]    set method to <mname> (eg. naivebayes, tfidf, prind)
  [-U]            in the PrInd method, use non-uniform prior probabilities
  [-G]            in the PrInd method, scale Pr(w|d) by Foil-gain
  [-V]            print version information and exit
 lexing options
  [-s]            don't use the stoplist (i.e. don't prune frequent words)
  [-S]            turn off stemming of the tokens
  [-H]            ignore HTML tokens
  [-g <N>]        set N for N-gram lexer (default=1)
  [-h]            skip over email or news header
 then, for indexing and setting weights
  -i <class1_dir> <class2_dir> ...
  [-L]            don't lex to get word counts, instead read archived barrel
  [-f <file>]     prints file contents instead of class_dir at query time
  [-R <N>]        remove words with occurrence counts less than N
 or, for querying
  -q [<file_containing_query>]
  [-n <N>]        prints the N best-matching documents to the query
 or, for testing
  [-t <N>]        perform N testing trials
  [-p <N>]        (with -t) Use N% of the documents as test instances
  [-x <class1_dir> <class2_dir>...]  use these files as test instances
  [-N]            in the PrInd method, do not normalize the scores.
 or, for diagnostics
  [-I <N>]        prints the top N words with highest information gain
  [-W <classname>]  prints the weight-vector for <classname>
  [-F <classname>]  print the unsorted foilgain #'s for <classname>
  [-P]            print score contribution of each word to each class
  [-B]            print barrel word vectors in awk-processable form
To index the newsgroup data with the naive Bayes method:

./rainbow -M -m naivebayes -i [news data directory]/*

where 'news data directory' is the location into which you untarred the newsgroup file (note that the '/*' will pass each subdirectory to Rainbow as a separate class).
./rainbow -t [number of test runs] -p 33 > rainbow.output
cat rainbow.output | ./rainbow-stats