As mentioned before, the idea of using an ensemble of classifiers rather than the single best classifier has been proposed by several people. In Section 2, we present a framework for these systems, some theories of what makes an effective ensemble, an extensive covering of the Bagging and Boosting algorithms, and a discussion on the bias plus variance decomposition. Section 3 referred to empirical studies similar to ours; these methods differ from ours in that they were limited to decision trees, generally with fewer data sets. We cover additional related work in this section.
Lincoln and Skrzypek [1989], Mani [1991] and the forecasting literature [Clemen1989,Granger1989] indicate that a simple averaging of the predictors generates a very good composite model; however, many later researchers [Alpaydin1993,Asker Maclin1997a,Asker Maclin1997b,Breiman1996a,Hashem1997,Maclin1998,Perrone1992,Wolpert1992,Zhang et al.1992] have further improved generalization with voting schemes that are complex combinations of each predictor's output. One must be careful in this case, since optimizing the combining weights can easily lead to the problem of overfitting which simple averaging seems to avoid [Sollich Krogh1996].
Most approaches only indirectly try to generate highly correct classifiers that disagree as much as possible. These methods try to create diverse classifiers by training classifiers with dissimilar learning parameters [Alpaydin1993], different classifier architectures [Hashem1997], various initial neural-network weight settings [Maclin Opitz1997,Maclin Shavlik1995], or separate partitions of the training set [Breiman1996c,Krogh Vedelsby1995]. Boosting on the other hand is active in trying to generate highly correct networks since it accentuates examples currently classified incorrectly by previous members of the ensemble.
ADDEMUP [Opitz Shavlik1996a,Opitz Shavlik1996b] is another example of an approach that directly tries to create a diverse ensemble. ADDEMUP uses genetic algorithms to search explicitly for a highly diverse set of accurate trained networks. ADDEMUP works by first creating an initial population, then uses genetic operators to create new networks continually, keeping the set of networks that are highly accurate while disagreeing with each other as much as possible. ADDEMUP is also effective at incorporating prior knowledge, if available, to improve the quality of its ensemble.
An alternate approach to the ensemble framework is to train individual networks on a subtask, and to then combine these predictions with a ``gating'' function that depends on the input. Jacobs et al.'s [1991] adaptive mixtures of local experts, Baxt's [1992] method for identifying myocardial infarction, and Nowlan and Sejnowski's [1992] visual model all train networks to learn specific subtasks. The key idea of these techniques is that a decomposition of the problem into specific subtasks might lead to more efficient representations and training [Hampshire Waibel1989].
Once a problem is broken into subtasks, the resulting solutions need to be combined. Jacobs et al. [1991] propose having the gating function be a network that learns how to allocate examples to the experts. Thus the gating network allocates each example to one or more experts, and the backpropagated errors and resulting weight changes are then restricted to these networks (and the gating function). Tresp and Taniguchi [1995] propose a method for determining the gating function after the problem has been decomposed and the experts trained. Their gating function is an input-dependent, linear-weighting function that is determined by a combination of the networks' diversity on the current input with the likelihood that these networks have seen data ``near'' that input.
Although the mixtures of experts and ensemble paradigms seem very similar, they are in fact quite distinct from a statistical point of view. The mixtures-of-experts model makes the assumption that a single expert is responsible for each example. In this case, each expert is a model of a region of the input space, and the job of the gating function is to decide from which model the data point originates. Since each network in the ensemble approach learns the whole task rather than just some subtask and thus makes no such mutual exclusivity assumption, ensembles are appropriate when no one model is highly likely to be correct for any one point in the input space.