Early work on ensembles [Hansen & Salamon 1990] suggested that ensembles
with as few as ten members were adequate to obtain most of the reduction in test-set error.
While this claim may hold for the ensemble methods proposed at that time, the Boosting
literature [Schapire et al. 1997] has more recently suggested, based on a few
data sets with decision trees, that test-set error can be reduced
further even after ten members have been added to an ensemble (and
the authors note that this result also applies to Bagging).
In this section, we perform additional experiments to further investigate
the appropriate size of an ensemble.
Figure 5
shows the composite error rate over all of our data sets for neural network
and decision tree ensembles using up to 100 classifiers.
For both Bagging and Boosting applied to neural networks, much of the reduction in error occurs by the time ten to fifteen classifiers have been added. A similar conclusion holds for Bagging with decision trees, which is consistent with Breiman [1996a]. Ada-boosting and Arcing with decision trees, however, continue to measurably reduce test-set error until around 25 classifiers, at which point the error reduction for both methods appears to have nearly reached a plateau. The results reported in this paper are therefore for an ensemble size of 25, a size large enough to approach this plateau yet manageable for qualitative analysis.

It was traditionally believed [Freund & Schapire 1996] that boosting's small reductions in test-set error might continue indefinitely; however, Grove and Schuurmans [1998] demonstrated that Ada-boosting can indeed begin to overfit at very large ensemble sizes (10,000 or more members).
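The plateau effect can be reproduced in miniature with a staged evaluation of a boosted ensemble. The sketch below is illustrative only: scikit-learn's AdaBoostClassifier stands in for the Ada-boosting implementation used in this paper, and the breast-cancer benchmark stands in for our data sets.

```python
# Minimal sketch (assumed stand-ins, not the paper's implementation):
# track test-set error as members are added to a boosted ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit 100 members once; staged_score then reports accuracy after
# each member is added, yielding the full error-vs-size curve.
ensemble = AdaBoostClassifier(n_estimators=100, random_state=0)
ensemble.fit(X_train, y_train)

for size, score in enumerate(ensemble.staged_score(X_test, y_test), start=1):
    if size in (1, 5, 10, 15, 25, 50, 100):
        print(f"{size:3d} members: test-set error = {1 - score:.3f}")
```

Bagging has no staged interface in scikit-learn, but the analogous curve can be obtained by taking majority votes over the first k members of a fitted BaggingClassifier's estimators_ list.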