DTree - Decision Tree Classifier
================================
by Ralf Brown
11/4/97

Contents:
=========
   * What are Decision Trees
   * Running DTree
   * DTree Options
   * Data Format

=======================
What are Decision Trees
=======================

!!!

=============
Running DTree
=============

The DTree program is invoked as follows:

        dtree [options] docfile [docfile ...]

The available options are described in the following section.  Each of
the one or more document files named on the command line is assumed to
be a separate collection of documents to be processed.

!!!

=============
DTree Options
=============

The DTree program supports the following main options on the command
line (there are others listed on the usage screen, but they are either
of little use or have never been fully implemented):

   * Induction Options
   * Output Options
   * Testing Options
   * Data-File Options
   * Subprocess Options
   * Miscellaneous Options

-----------------
Induction Options
-----------------

-F      Force processing even if there are too few positive training
        items (as selected with the -T option).

-Ln     Stop growing the decision tree once a node represents fewer
        than 'n' training instances.

-Px     Stop growing the decision tree once the proportion of positive
        examples on a node reaches at least 'x' (0.00-1.00).

-fN     Use only the top N features (words) by information gain.
        DTree computes the information gain of every word in the
        training collection in order to choose the word with the
        highest gain as the root of the decision tree; with this
        option, it also discards all but the top N words from further
        consideration in the subtrees.

-sFILE  Read a list of word roots/synonyms from FILE.  Each line of
        FILE must contain two words; any time the first one is
        encountered in the data file, the second one is substituted.

-SFILE  Substitute documents.
        When this option is used, FILE contains the document bodies
        to be used, while the normal data file continues to provide
        the document numbers, classification, and other header items.

-u      Distinguish between unique and multiple occurrences of a word
        within a document.  When this option is specified, DTree will
        generate a pseudo-feature for each word which occurs at least
        twice in a document.  The name of the pseudo-feature is the
        word prefixed with a plus sign.

--------------
Output Options
--------------

-Nname  Set the null class's name to 'name'.

-cFILE  Check the decision tree predictions against the 'truth', and
        write the results to FILE.  After all of the predictions, the
        file also contains a summary of the performance statistics,
        with the overall statistics at the very end of the file (use
        "tail -2 FILE" to see just the overall statistics).

-tFILE  Write the induced decision trees to FILE.

-eFILE  Write computed entropy values to FILE.

-oFILE  Write the decision tree judgements to FILE (this option is
        more-or-less obsolete).

---------------
Testing Options
---------------

-2      Multiple runs, powers of two.  This option causes DTree to act
        as though it were run multiple times with different values of
        the -T parameter, beginning with one positive training
        instance and progressing in powers of two until reaching
        either the value set with -M or the value set with an explicit
        -T option.  If -M is used, the test set will remain fixed over
        all runs, starting with the document after the nth positive
        training instance as specified by the -M parameter.  If -M is
        not specified, the test set will be variable, starting with
        the document immediately following the last training instance.

-+N
-+N:min,max
        Multiple runs, arithmetic progression.  This option is the
        same as -2, but the number of positive training instances is
        increased by 'N' each time, rather than doubling.
        The second form gives explicit start and stop values for the
        progression; the first form starts with one positive training
        instance and continues until reaching either the value set
        with -M or with an explicit -T, as described for -2 above.

-M      Maximum number of positive examples to use for -2 and -+.

-Tp,n   Set the number of positive and negative training instances.
        The document collection is split into training and test sets
        based on the selected number of positive training instances.
        If the training set contains more than 'n' negative instances,
        only the last 'n' in the set are considered in building the
        decision tree.

-hN     Time Horizon.  When this option is given, only the first N
        documents in the test set are actually evaluated; any further
        documents are always evaluated as 'No' for the class under
        consideration.

-haN,T
-haN,=T
        Adaptive Time Horizon.  When this option is given, documents
        in the test set are evaluated until the density of 'Yes'
        decisions falls below the specified threshold, after which all
        further documents will be evaluated as 'No' for the class
        under consideration.  The threshold is specified by the
        parameters N and T.  For the first form, the threshold test is
        met when the most recent N documents contain fewer than T
        times as many 'Yes' decisions as the first N documents of the
        test set.  For the second form, the threshold test is met when
        the most recent N documents contain fewer than T 'Yes'
        documents.

-Rname  Restrict the DTree run to class 'name' only, ignoring any
        other classes which may be present in the collection.

-R=FILE Read a list of classes from FILE, then restrict the DTree run
        to only those classes which were listed, ignoring any other
        classes which may be present in the document collection.

-Vp,t   Reserve 'p' positive examples for testing (if not overridden
        by another option), and test only the first 't' documents of
        the test set.  The second parameter, 't', is optional and
        defaults to unlimited testing if omitted.
        If this option is given as "-V0", no testing or validation is
        performed unless indicated by some other option such as -2 or
        -+.

-wN     When DTree generates scores for a particular judgement, it
        uses two main factors: the percentage of examples at the leaf
        node which are positive examples, and the percentage of all
        positive examples located at that leaf node.  This option sets
        the weight of the first factor (percentage of positive
        examples at the leaf) to N, and the weight of the second
        (percentage of total positives at the leaf) to (1-N).  This
        option is purely for use with external programs that are
        interested in the reported scores rather than Yes/No
        decisions.

-----------------
Data-File Options
-----------------

-jFILE  Read the actual judgements ('truth') from FILE.  This option
        is used to exclude documents labelled BRIEF from the test set
        and to avoid counting them towards total positive training
        instances when BRIEF documents have not been excluded from the
        document collection.

-lFILE  Read the event-label to number mapping from FILE.  Without
        this option, all per-event output will be identified by event
        label; with the mapping file, that output will be identified
        by event number instead.

------------------
Subprocess Options
------------------

-rS
-rL
-rS-
-rL-
        Run a child program for each selected training/test set (see
        "Testing Options" above), passing it three files in either
        SMART format ('S') or LSF4 format ('L').  The child program is
        specified by the next item on the command line, and may
        optionally contain three occurrences of "%s", which will be
        replaced by (in order) the names of the files in which the
        training, validation, and test collections have been placed.
        If the trailing minus sign to the -r option is included, DTree
        will skip its own test runs and *only* run the child program.

-rS=
-rL=
-r=
        Run a child program for each selected training/test set (see
        "Testing Options" above), passing it the training, validation,
        and test sets via a pipe, in either SMART format ('S') or LSF4
        format ('L').
        The three sets of documents are separated by empty dummy
        documents with document number 0.

--tDIR  Store the temporary files for -rS and -rL in directory DIR
        instead of in the current directory.

---------------------
Miscellaneous Options
---------------------

-m      Show how much memory DTree uses.

-v      Run with verbose output.  Verbose output includes such
        statistics as the sizes of training and test sets and the
        amount of time each step of the induction and checking
        processes requires.

===========
Data Format
===========

The DTree program can accept data in either LSF4 or SMART format, and
auto-detects the format based on the first line of the data file.
Further, when running a child process on the training, validation, and
test sets that it extracts from the document collection, DTree can
export the collections in either format, independent of the original
data file's format.

LSF4 format consists of one (very long) line per document, in the form

        docnum|class1|class2|...|classN|word1 word2 word3 ... wordN

where 'docnum' is a unique document number, 'class1' through 'classN'
are the classes to which the document belongs, and 'word1' through
'wordN' are the words of the document body (in arbitrary order).

SMART format consists of several free-form sections per document, with
each section delineated by a 'dot-command'.  The only sections which
DTree processes (any others are ignored) are

   .I docnum
        Specifies the unique document number ('docnum' must either be
        an actual number, or the string "TDT" followed by a number).

   .C   This section specifies the classes to which the document
        belongs; each class specifier consists of the class name
        followed by a number, and multiple class specifiers are
        separated by semicolons.

   .T   This section contains the document's title, which will be
        included in the set of words considered a part of the document
        if it is present and non-empty.

   .W   This main section contains the complete text of the body of
        the document.

--- End of File ---
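
As an illustration of the LSF4 layout described under Data Format, the
following Python sketch splits one LSF4 line into its document number,
class list, and body words.  It is not part of DTree: the names
Lsf4Document and parse_lsf4_line are hypothetical, and the sketch
assumes that the body is always the last '|'-separated field and that
words never contain a '|' character, neither of which is confirmed by
the text above.

```python
from typing import List, NamedTuple


class Lsf4Document(NamedTuple):
    """One document read from an LSF4 line (hypothetical helper type)."""
    docnum: str
    classes: List[str]
    words: List[str]


def parse_lsf4_line(line: str) -> Lsf4Document:
    """Split 'docnum|class1|...|classN|word1 word2 ...' into its parts.

    Assumption: the final '|'-separated field is the document body and
    every field between the first and the last is a class name.
    """
    fields = line.rstrip("\r\n").split("|")
    if len(fields) < 2:
        raise ValueError("expected at least 'docnum|words'")
    docnum = fields[0].strip()
    # Empty class fields are skipped rather than kept as '' entries.
    classes = [c.strip() for c in fields[1:-1] if c.strip()]
    # The body is whitespace-separated; word order is arbitrary.
    words = fields[-1].split()
    return Lsf4Document(docnum, classes, words)


if __name__ == "__main__":
    doc = parse_lsf4_line("17|sports|finance|stocks rally game\n")
    print(doc.docnum, doc.classes, doc.words)
```

The document number is kept as a string so that no particular numeric
format is imposed on it.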