The basic incremental growth procedure may be outlined as follows. Every stage of the process is characterized by a set of active features $\mathcal{S}$. These determine a space of models
\[
C(\mathcal{S}) \equiv \{\, p \in \mathcal{P} \mid p(f) = \tilde{p}(f) \ \mbox{for all } f \in \mathcal{S} \,\}
\]
The optimal model in this space, denoted by $p_{\mathcal{S}}$, is the model with the greatest entropy:
\[
p_{\mathcal{S}} \equiv \arg\max_{p \in C(\mathcal{S})} H(p)
\]
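To make this constrained optimization concrete, here is a minimal Python sketch (an illustration, not the paper's implementation). It assumes a small finite outcome space with features represented as functions on outcomes, and computes $p_{\mathcal{S}}$ by plain gradient ascent on the maximum-entropy dual, in place of the parameter-fitting routine this document calls Algorithm 1; the name \verb|fit_maxent| and the step-count and learning-rate defaults are illustrative choices.
\begin{verbatim}
import numpy as np

def fit_maxent(samples, outcomes, features, steps=2000, lr=0.5):
    """Compute p_S: the maximum-entropy model whose expectation of each
    active feature matches its empirical expectation on the sample."""
    # F[j, i] = f_i(outcomes[j]); with no active features, p_S is uniform.
    F = np.array([[f(y) for f in features] for y in outcomes], dtype=float)
    # Empirical expectations p~(f_i) over the training sample.
    emp = np.array([[f(y) for f in features] for y in samples],
                   dtype=float).mean(axis=0)
    lam = np.zeros(len(features))
    for _ in range(steps):
        logits = F @ lam                  # log p(y) up to normalization
        p = np.exp(logits - logits.max())
        p /= p.sum()
        lam += lr * (emp - p @ F)         # dual gradient: p~(f) - p(f)
    return lam, p
\end{verbatim}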
By adding feature $\hat{f}$ to $\mathcal{S}$, we obtain a new set of active features $\mathcal{S} \cup \hat{f}$. Following (22), this set of features determines a set of models
\[
C(\mathcal{S} \cup \hat{f}) \equiv \{\, p \in \mathcal{P} \mid p(f) = \tilde{p}(f) \ \mbox{for all } f \in \mathcal{S} \cup \hat{f} \,\}
\]
The optimal model in this space of models is
\[
p_{\mathcal{S} \cup \hat{f}} \equiv \arg\max_{p \in C(\mathcal{S} \cup \hat{f})} H(p)
\]
Adding the feature $\hat{f}$ allows the model $p_{\mathcal{S} \cup \hat{f}}$ to better account for the training sample; this results in a gain $\Delta L(\mathcal{S}, \hat{f})$ in the log-likelihood of the training data:
\[
\Delta L(\mathcal{S}, \hat{f}) \equiv L(p_{\mathcal{S} \cup \hat{f}}) - L(p_{\mathcal{S}})
\]
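Under the same sketch, the gain can be computed literally from this definition: fit the optimal model with and without the candidate feature and take the difference in average training log-likelihood. The helper names \verb|log_likelihood| and \verb|gain| below are again illustrative.
\begin{verbatim}
def log_likelihood(samples, outcomes, p):
    """Average log-likelihood L(p) of a sample under model p."""
    idx = {y: i for i, y in enumerate(outcomes)}
    return float(np.mean([np.log(p[idx[y]]) for y in samples]))

def gain(samples, outcomes, active, f):
    """Delta L(S, f) = L(p_{S u {f}}) - L(p_S)."""
    _, p_s = fit_maxent(samples, outcomes, active)
    _, p_sf = fit_maxent(samples, outcomes, active + [f])
    return (log_likelihood(samples, outcomes, p_sf)
            - log_likelihood(samples, outcomes, p_s))
\end{verbatim}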
At each stage of the model-construction process, our goal is to select the candidate feature $\hat{f} \in \mathcal{F}$ which maximizes the gain $\Delta L(\mathcal{S}, \hat{f})$; that is, we select the candidate feature which, when adjoined to the set of active features $\mathcal{S}$, produces the greatest increase in likelihood of the training sample. This strategy is implemented in Algorithm 2.
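A sketch of the greedy strategy itself, built on the helpers above; the cap on the number of features and the \verb|min_gain| threshold are placeholder stopping rules, since termination is discussed next.
\begin{verbatim}
def select_features(samples, outcomes, candidates,
                    max_features=10, min_gain=1e-4):
    """Greedy incremental growth: at each stage, adjoin the candidate
    feature with the largest training log-likelihood gain."""
    active, remaining = [], list(candidates)
    while remaining and len(active) < max_features:
        scored = [(gain(samples, outcomes, active, f), f)
                  for f in remaining]
        best_gain, best_f = max(scored, key=lambda t: t[0])
        if best_gain < min_gain:
            break
        active.append(best_f)
        remaining.remove(best_f)
    return active
\end{verbatim}
For instance, with \verb|outcomes = list('abc')|, a sample drawn mostly as \verb|'a'|, and indicator candidates such as \verb|lambda y: y == 'a'|, the loop adjoins the \verb|'a'| indicator first, since matching its empirical expectation yields the largest likelihood gain.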
One issue left unaddressed by this algorithm is the termination condition. Obviously, we would like a condition which applies exactly when all the ``useful'' features have been selected. One reasonable stopping criterion is to subject each proposed feature to cross-validation on a held-out sample of data: if the feature does not lead to an increase in the likelihood of the held-out sample, it is discarded.
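In the sketch above, this criterion amounts to comparing held-out rather than training likelihood before and after adjoining the candidate (again an illustrative rendering, not the paper's procedure).
\begin{verbatim}
def passes_heldout_check(train, heldout, outcomes, active, f):
    """Keep f only if it raises the likelihood of the held-out sample."""
    _, p_s = fit_maxent(train, outcomes, active)
    _, p_sf = fit_maxent(train, outcomes, active + [f])
    return (log_likelihood(heldout, outcomes, p_sf)
            > log_likelihood(heldout, outcomes, p_s))
\end{verbatim}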
Another outstanding issue is the efficiency of Algorithm 2. For a candidate feature set $\mathcal{F}$ of any reasonable size, it is impractical to perform Algorithm 1 once for each candidate $\hat{f} \in \mathcal{F}$ at every iteration. The interested reader is referred to the list of further readings at the end of this document for pointers to articles containing efficient algorithms for model construction.