DTree - Decision Tree Classifier
		   ================================
			    by Ralf Brown
			       11/4/97

Contents:
=========
	* What are Decision Trees
	* Running DTree
	* DTree Options
	* Data Format

=======================
What are Decision Trees
=======================

!!!


=============
Running DTree
=============

The DTree program is invoked as follows:

	dtree [options] docfile [docfile ...]

The available options are described in the following section.  Each
document file named on the command line is treated as a separate
collection of documents to be processed.

!!!


=============
DTree Options
=============

The main command-line options supported by the DTree program fall into the
groups listed below (the usage screen lists others, but they are either of
little use or were never fully implemented):
	* Induction Options
	* Output Options
	* Testing Options
	* Data-File Options
	* Subprocess Options
	* Miscellaneous Options

-----------------
Induction Options
-----------------

	-F
		Force processing even if there are too few positive training
		items (as selected with the -T option).

	-Ln
		Stop growing the decision tree once a node represents fewer
		than 'n' training instances.

	-Px
		Stop growing the decision tree once the proportion of positive
		examples on a node reaches at least 'x' (0.00-1.00).
	
	-fN
		Use only the top N features (words) by information gain.
		DTree computes the information gain of every word in the
		training collection in order to choose the root of the
		decision tree (the word with the highest gain).  With this
		option, it also discards all but the top N words from
		further consideration in the subtrees.

	-sFILE
		Read a list of word roots/synonyms from FILE.  Each line of
		FILE must contain two words; whenever the first one is
		encountered in the data file, the second one is substituted
		for it.

	-SFILE
		Substitute documents.  When this option is used, FILE contains
		the document bodies to be used, while the normal data file
		continues to provide the document numbers, classification,
		and other header items.

	-u
		Distinguish between unique and multiple occurrences of a
		word within a document.  When this option is specified,
		DTree will generate a pseudo-feature for each word which
		occurs at least twice in a document.  The name of the
		pseudo-feature is the word prefixed with a plus sign.
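
As an illustration of the -f and -u options, the following Python sketch
ranks features by information gain and keeps the top N.  It is not taken
from the DTree source: the function names (doc_features, information_gain,
top_features) are hypothetical, and the gain formula is the standard one
for a binary present/absent split, which is assumed rather than confirmed
to match DTree's own computation.

```python
import math
from collections import Counter


def doc_features(words, mark_repeats=False):
    """Return the set of features for one document.

    With mark_repeats=True (the idea behind -u), every word that occurs
    at least twice also contributes a pseudo-feature named '+word'.
    """
    counts = Counter(words)
    feats = set(counts)
    if mark_repeats:
        feats.update("+" + w for w, c in counts.items() if c >= 2)
    return feats


def entropy(pos, neg):
    """Entropy in bits of a set with pos positive and neg negative items."""
    total = pos + neg
    result = 0.0
    for k in (pos, neg):
        if k:
            p = k / total
            result -= p * math.log2(p)
    return result


def information_gain(docs, feature):
    """Gain from splitting docs on the presence of feature.

    docs is a non-empty list of (feature_set, is_positive) pairs.
    """
    def h(labels):
        pos = sum(labels)
        return entropy(pos, len(labels) - pos)

    everything = [label for _, label in docs]
    present = [label for feats, label in docs if feature in feats]
    absent = [label for feats, label in docs if feature not in feats]
    n = len(docs)
    return (h(everything)
            - len(present) / n * h(present)
            - len(absent) / n * h(absent))


def top_features(docs, n):
    """Return the n features with the highest gain (ties broken by name)."""
    vocab = set().union(*(feats for feats, _ in docs))
    return sorted(vocab, key=lambda f: (-information_gain(docs, f), f))[:n]
```

In this sketch the first element of top_features(docs, N) would play the
role of the root of the tree, and the remaining elements are the only
words still considered when growing the subtrees.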

--------------
Output Options
--------------

	-Nname
		Set the null class's name to 'name'.

	-cFILE
		Check the decision tree predictions against the 'truth', and
		write the results to FILE.  After all of the predictions, the
		file contains a summary of the performance statistics,
		followed by the overall statistics at the very end of the
		file (use "tail -2 FILE" to see just the overall statistics).

	-tFILE
		Write the induced decision trees to FILE.

	-eFILE
		Write computed entropy values to FILE.

	-oFILE
		Write the decision tree judgements to FILE (this option is
		more-or-less obsolete).

---------------
Testing Options
---------------

	-2
		Multiple runs, powers of two.  This option causes DTree to
		act as though it were run multiple times with different values
		of the -T parameter, beginning with one positive training
		instance and progressing in powers of two until reaching
		either the value set with -M or the value set with an
		explicit -T option.  If -M is used, the test set will remain
		fixed over all runs, starting with the document after the
		nth positive training instance as specified by the -M
		parameter.  If -M is not specified, the test set will be
		variable, starting with the document immediately following
		the last training instance.

	-+N
	-+N:min,max
		Multiple runs, arithmetic progression.  This option is the
		same as -2, but the number of positive training instances
		is increased by 'N' each time, rather than doubling.  The
		second form gives explicit start and stop values for the
		progression; the first form starts with one positive training
		instance and continues until reaching either the value set
		with -M or with an explicit -T, as described for -2 above.

	-M
		Maximum number of positive training examples to use for the
		multiple-run options -2 and -+.

	-Tp,n
		Set the number of positive and negative training instances.
		The document collection is split into training and test
		sets based on the selected number of positive training
		instances.  If the training set contains more than 'n'
		negative instances, only the last 'n' in the set are
		considered in building the decision tree.

	-hN
		Time Horizon.  When this option is given, only the first
		N documents in the test set are actually evaluated; any
		further documents are always evaluated as 'No' for the
		class under consideration.

	-haN,T
	-haN,=T
		Adaptive Time Horizon.  When this option is given, documents
		in the test set are evaluated until the density of 'Yes'
		decisions falls below the specified threshold, after which
		all further documents will be evaluated as 'No' for the
		class under consideration.  The threshold is specified by
		the parameters N and T.  In the first form, the threshold
		test is met when the most recent N documents contain fewer
		than T times as many 'Yes' decisions as the first N
		documents of the test set.  In the second form, the
		threshold test is met when the most recent N documents
		contain fewer than T 'Yes' decisions.

	-Rname
		Restrict the DTree run to class 'name' only, ignoring any
		other classes which may be present in the collection.

	-R=FILE
		Read a list of classes from FILE, then restrict the DTree
		run to process only those classes that were listed,
		ignoring any other classes which may be present in the
		document collection.

	-Vp,t
		Reserve 'p' positive examples for testing (unless overridden
		by another option), and test only the first 't' documents
		of the test set.  The second parameter, 't', is optional;
		if it is omitted, testing is unlimited.  If this option is
		given as "-V0", no testing or validation is performed
		unless indicated by some other option such as -2 or -+.

	-wN
		When DTree generates scores for a particular judgement, it
		uses two main factors: the percentage of examples at the
		leaf node which are positive examples, and the percentage
		of all positive examples which are located at that leaf
		node.  This option sets the weight of the first factor
		(percentage of positives at the leaf) to N, and the weight
		of the second factor (percentage of total positives at the
		leaf) to (1-N).  This option is purely for use with
		external programs that are interested in the reported
		scores rather than Yes/No decisions.
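
The multiple-run options (-2, -+) and the adaptive time horizon (-ha)
described above can be sketched in Python as follows.  This is only an
illustration of the documented behaviour, not DTree's code: all function
names are hypothetical, and two details are assumptions.  First, the
doubling schedule here stops at the last power of two that does not exceed
the limit; whether DTree also performs a final run at the limit itself is
not stated.  Second, the horizon test is first applied once N test
documents have been seen.

```python
def run_sizes_doubling(limit):
    """Positive-training-set sizes for a -2 style run: 1, 2, 4, ..."""
    sizes, p = [], 1
    while p <= limit:
        sizes.append(p)
        p *= 2
    return sizes


def run_sizes_arithmetic(step, start, stop):
    """Positive-training-set sizes for a -+N:min,max style run."""
    return list(range(start, stop + 1, step))


def adaptive_horizon(decisions, n, t, absolute=False):
    """Apply an adaptive time horizon to a list of Yes/No decisions.

    decisions is a list of bools in test-set order.  With absolute=False
    (the -haN,T form) the cutoff is reached when the most recent n
    decisions contain fewer than t times as many Yes values as the first
    n decisions.  With absolute=True (the -haN,=T form) the cutoff is
    reached when the most recent n decisions contain fewer than t Yes
    values.  Every decision after the cutoff becomes No.
    """
    out = list(decisions)
    baseline = sum(decisions[:n])
    threshold = t if absolute else t * baseline
    for i in range(n, len(decisions) + 1):
        recent = sum(decisions[i - n:i])  # Yes count in the last n documents
        if recent < threshold:
            out[i:] = [False] * (len(out) - i)
            break
    return out
```

For example, run_sizes_doubling(10) yields the sizes 1, 2, 4 and 8, and
adaptive_horizon(decisions, 4, 0.5) forces every decision to 'No' once a
window of four documents holds fewer than half as many 'Yes' decisions as
the first four documents did.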

-----------------
Data-File Options
-----------------

	-jFILE
		Read the actual judgements ('truth') from FILE.  This option
		is used to exclude documents labelled BRIEF from the test
		set, and to avoid counting them towards the total number of
		positive training instances, when BRIEF documents have not
		been removed from the document collection.

	-lFILE
		Read the event-label-to-number mapping from FILE.  Without
		this option, all per-event output will be identified by
		event label; with the mapping file, that output will be
		identified by event number instead.
	
------------------
Subprocess Options
------------------

	-rS
	-rL
	-rS-
	-rL-
		Run a child program for each selected training/test set
		(see "Testing Options" above), passing it three files
		in either SMART format ('S') or LSF4 format ('L').
		The child program is specified by the next item on the
		command line, and may optionally contain three occurrences
		of "%s", which will be replaced by (in order) the names
		of the files in which the training, validation, and test
		collections have been placed.  If the trailing minus
		sign to the -r option is included, DTree will skip its
		own test runs and *only* run the child program.

	-rS=
	-rL=
	-r=
		Run a child program for each selected training/test set
		(see "Testing Options" above), passing it the training,
		validation, and test sets via a pipe, in either SMART
		format ('S') or LSF4 format ('L').  The three sets of
		documents are separated by empty dummy documents with
		document number 0.

	--tDIR
		Store the temporary files for -rS and -rL in directory DIR
		instead of in the current directory.
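
A child program started with -rL= has to split its input back into the
three collections.  The Python sketch below shows one way this might be
done for LSF4 input.  The exact layout of a dummy document is not
described here, so the sketch assumes that it is an ordinary LSF4 line
whose document-number field is "0"; the function name split_piped_lsf4 is
likewise hypothetical.

```python
import sys


def split_piped_lsf4(lines):
    """Split an LSF4 stream into lists of lines, one list per collection.

    A line whose first '|'-separated field is "0" is treated as a dummy
    separator document (an assumption) and starts a new collection.
    """
    collections = [[]]
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        docnum = line.split("|", 1)[0].strip()
        if docnum == "0":
            collections.append([])
        else:
            collections[-1].append(line)
    return collections


if __name__ == "__main__":
    parts = split_piped_lsf4(sys.stdin)
    if len(parts) == 3:
        training, validation, test = parts
        print(len(training), len(validation), len(test))
```

With two separator documents in the stream, the function returns exactly
three lists: training, validation, and test, in that order.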


---------------------
Miscellaneous Options
---------------------

	-m	Show how much memory DTree uses.

	-v	Run with verbose output.  Verbose output includes such
		statistics as the sizes of the training and test sets and
		the amount of time each step of the induction and checking
		processes requires.

===========
Data Format
===========

The DTree program can accept data in either LSF4 or SMART format, and auto-
detects the format based on the first line of the data file.  Further, when
running a child process on the training, validation, and test sets that it
extracts from the document collection, DTree can export the collections in
either format, independent of the original data file's format.

LSF4 format consists of one (very long) line per document, in the form
	docnum|class1|class2|...|classN|word1 word2 word3 ... wordM
where 'docnum' is a unique document number, 'class1' through 'classN' are
the classes to which the document belongs, and 'word1' through 'wordM' are
the words of the document body (in arbitrary order).
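
A line in this format can be taken apart as in the following Python
sketch.  The names Document and parse_lsf4_line are hypothetical, and the
handling of empty class fields (they are dropped) is an assumption, since
the behaviour of DTree in that case is not described here.

```python
from typing import List, NamedTuple


class Document(NamedTuple):
    docnum: str
    classes: List[str]
    words: List[str]


def parse_lsf4_line(line):
    """Parse one 'docnum|class1|...|classN|word1 word2 ...' line."""
    fields = line.rstrip("\n").split("|")
    if len(fields) < 2:
        raise ValueError("expected at least 'docnum|words'")
    docnum = fields[0]
    # The last field is the document body; everything between the
    # document number and the body is a class name.
    classes = [c for c in fields[1:-1] if c]
    words = fields[-1].split()
    return Document(docnum, classes, words)
```

For instance, the line "17|sports|finance|stock market stock" gives the
document number "17", the classes "sports" and "finance", and a body of
three words.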

SMART format consists of several free-form sections per document, with each
section delineated by a 'dot-command'.  The only sections which DTree
processes (any others are ignored) are
	.I docnum
		specifies the unique document number ('docnum' must either
		be an actual number, or the string "TDT" followed by a number)
	.C
		this section specifies the classes to which the document
		belongs; each class specifier consists of the class name
		followed by a number, and multiple class specifiers are
		separated by semicolons.
	.T
		this section contains the document's title, which will be
		included in the set of words considered a part of the
		document if it is present and non-empty.
	.W
		this main section contains the complete text of the body
		of the document.


			 --- End of File ---