For the oil dataset, we also followed a slightly different line of experiments to obtain results comparable to [9]. To alleviate the problem of imbalanced datasets, the authors of [9] have proposed (a) one-sided selection for under-sampling the majority class [17] and (b) the SHRINK system [9]. Table 5.5 contains the results from [9]. Acc+ is the accuracy on positive (minority) examples and Acc- is the accuracy on negative (majority) examples. Figure 25 shows the trend of Acc+ and Acc- for one combination of the SMOTE strategy with varying degrees of under-sampling of the majority class. The Y-axis represents the accuracy and the X-axis represents the percentage at which the majority class is under-sampled. The graphs indicate that, in the band of under-sampling between 50% and 125%, the results are comparable to those achieved by SHRINK, and better than SHRINK in some cases. Table 5.5 summarizes the results for the combination of SMOTE at 500% with under-sampling. We also tried combinations of SMOTE at 100% to 400% with varying degrees of under-sampling and achieved comparable results. The SHRINK approach and our SMOTE approach are not directly comparable, however, because they see different data points. SMOTE offers no clear improvement over one-sided selection.
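To make the two measures concrete, the following is a minimal sketch of how Acc+ and Acc- can be computed from true and predicted labels. The function name `class_accuracies`, the label encoding (1 for the minority class, 0 for the majority class), and the toy labels are assumptions made for illustration; they are not taken from the experiments reported here.

```python
def class_accuracies(y_true, y_pred, positive=1):
    """Return (Acc+, Acc-): accuracy on positive and on negative examples."""
    pos_total = pos_correct = neg_total = neg_correct = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            pos_total += 1
            pos_correct += int(pred == truth)
        else:
            neg_total += 1
            neg_correct += int(pred == truth)
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both classes must be present in y_true")
    return pos_correct / pos_total, neg_correct / neg_total


# Toy labels (hypothetical): 4 minority and 6 majority examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

acc_pos, acc_neg = class_accuracies(y_true, y_pred)
print(f"Acc+ = {acc_pos:.1%}, Acc- = {acc_neg:.1%}")
# prints: Acc+ = 75.0%, Acc- = 66.7%
```

Reporting the two accuracies separately keeps the majority class from masking errors on the minority class, which is why both columns appear in the tables.
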
Method | Acc+ | Acc- |
SHRINK | 82.5% | 60.9% |
One-sided selection | 76.0% | 86.6% |

Under-sampling % | Acc+ | Acc- |
10% | 64.7% | 94.2% |
15% | 62.8% | 91.3% |
25% | 64.0% | 89.1% |
50% | 89.5% | 78.9% |
75% | 83.7% | 73.0% |
100% | 78.3% | 68.7% |
125% | 84.2% | 68.1% |
150% | 83.3% | 57.8% |
175% | 85.0% | 57.8% |
200% | 81.7% | 56.7% |
300% | 89.0% | 55.0% |
400% | 95.5% | 44.2% |
500% | 98.0% | 35.5% |
600% | 98.0% | 40.0% |
700% | 96.0% | 32.8% |
800% | 90.7% | 33.3% |
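
As a rough check of the comparison drawn in the text, the sketch below lists the under-sampling levels at which SMOTE at 500% is at least as good as a baseline on both Acc+ and Acc-. The numbers are copied from the tables above; the variable names, the helper `at_least_as_good`, and the "at least as good on both measures" criterion are illustrative assumptions, not a procedure used in the experiments. Because the methods see different data points, the output is only indicative.

```python
# Baselines from [9]: (Acc+, Acc-) in percent.
SHRINK = (82.5, 60.9)
ONE_SIDED_SELECTION = (76.0, 86.6)

# SMOTE at 500%: under-sampling % -> (Acc+, Acc-) in percent.
SMOTE_500 = {
    10: (64.7, 94.2), 15: (62.8, 91.3), 25: (64.0, 89.1),
    50: (89.5, 78.9), 75: (83.7, 73.0), 100: (78.3, 68.7),
    125: (84.2, 68.1), 150: (83.3, 57.8), 175: (85.0, 57.8),
    200: (81.7, 56.7), 300: (89.0, 55.0), 400: (95.5, 44.2),
    500: (98.0, 35.5), 600: (98.0, 40.0), 700: (96.0, 32.8),
    800: (90.7, 33.3),
}


def at_least_as_good(baseline):
    """Under-sampling levels matching or exceeding the baseline on both measures."""
    base_pos, base_neg = baseline
    return [
        level
        for level, (acc_pos, acc_neg) in SMOTE_500.items()
        if acc_pos >= base_pos and acc_neg >= base_neg
    ]


vs_shrink = at_least_as_good(SHRINK)
vs_one_sided = at_least_as_good(ONE_SIDED_SELECTION)
print("vs. SHRINK:", vs_shrink)                  # prints: vs. SHRINK: [50, 75, 125]
print("vs. one-sided selection:", vs_one_sided)  # prints: vs. one-sided selection: []
```

Under this criterion, the levels that match or exceed SHRINK all fall inside the 50% to 125% band, and no level matches one-sided selection on both measures at once.
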