Early work on ensembles [Hansen & Salamon 1990] suggested that ensembles
with as few as ten members were adequate to obtain most of the reduction in test-set error.
While this claim may hold for the ensemble methods proposed at that time, the Boosting
literature [Schapire et al. 1997] has more recently suggested, based on a few
data sets with decision trees, that test-set error can be reduced
further even after ten members have been added to an ensemble (and
the authors note that this result also applies to Bagging).
In this section, we perform additional experiments to further investigate
the appropriate size of an ensemble.
Figure 5
shows the composite error rate over all of our data sets for neural network
and decision tree ensembles using up to 100 classifiers.
For both Bagging and Boosting applied to neural networks, much of the reduction in error occurs by the time ten to fifteen classifiers have been added. A similar conclusion holds for Bagging with decision trees, which is consistent with Breiman [1996a]. Ada-boosting and Arcing with decision trees, however, continue to measurably reduce test-set error until around 25 classifiers, at which point the error reduction for both methods appears to have nearly reached a plateau. The results reported in this paper are therefore for an ensemble size of 25, a size large enough to approach this plateau yet manageable for qualitative analysis.

It was traditionally believed [Freund & Schapire 1996] that boosting's small reductions in test-set error might continue indefinitely; however, Grove and Schuurmans [1998] demonstrated that Ada-boosting can indeed begin to overfit at very large ensemble sizes (10,000 or more members).
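The plateau effect can be reproduced in miniature with a staged evaluation of a boosted ensemble. The sketch below is illustrative only: scikit-learn's AdaBoostClassifier stands in for the Ada-boosting implementation used in this paper, and the breast-cancer benchmark stands in for our data sets.

```python
# Minimal sketch (assumed stand-ins, not the paper's implementation):
# track test-set error as members are added to a boosted ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit 100 members once; staged_score then reports accuracy after
# each member is added, yielding the full error-vs-size curve.
ensemble = AdaBoostClassifier(n_estimators=100, random_state=0)
ensemble.fit(X_train, y_train)

for size, score in enumerate(ensemble.staged_score(X_test, y_test), start=1):
    if size in (1, 5, 10, 15, 25, 50, 100):
        print(f"{size:3d} members: test-set error = {1 - score:.3f}")
```

Bagging has no staged interface in scikit-learn, but the analogous curve can be obtained by taking majority votes over the first k members of a fitted BaggingClassifier's estimators_ list.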