While our SMOTE approach currently does not handle datasets with all nominal features, we generalized it to handle mixed datasets of continuous and nominal features. We call this approach
Synthetic Minority Over-sampling TEchnique-Nominal Continuous [SMOTE-NC]. We tested this approach on the Adult
dataset from the UCI repository. The SMOTE-NC algorithm is described below.
1. Median computation: Compute the median of the standard deviations of all continuous features for the minority class.
If the nominal features differ between a sample and its potential nearest neighbors, then this median
is included in the Euclidean distance computation. We use the median to penalize differences in nominal features by an
amount related to the typical difference in continuous feature values.
2. Nearest neighbor computation: Compute the Euclidean distance between the feature vector
for which the k-nearest neighbors are being identified (a minority class sample)
and the other feature vectors (minority class samples) using the continuous feature
space. For every nominal feature that differs between the considered feature
vector and its potential nearest neighbor, include the median of the
standard deviations computed previously in the Euclidean distance computation. Table 6 demonstrates an example.
Table 6: Example of nearest neighbor computation for SMOTE-NC.

F1 = 1 2 3 A B C   [the sample for which we are computing nearest neighbors]
F2 = 4 6 5 A D E
F3 = 3 5 6 A B K

So, the Euclidean distance between F2 and F1 would be:

Eucl = sqrt[(4-1)^2 + (6-2)^2 + (5-3)^2 + Med^2 + Med^2]

where Med is the median of the standard deviations of the continuous features
of the minority class. The Med term is included twice, once for feature 5 (B vs. D)
and once for feature 6 (C vs. E), which differ between the two feature vectors F1 and F2.
3. Populate the synthetic sample: The continuous features of the new synthetic minority class sample
are created using the same approach as SMOTE, described earlier. Each nominal feature is given
the value occurring in the majority of the k-nearest neighbors. A brief illustrative sketch of these steps is given below.
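To make the three steps concrete, the following is a minimal sketch in Python, assuming the continuous and nominal features are held in separate NumPy arrays. The function names (median_std, nc_distance, populate_synthetic) and the Med value used in the worked example are illustrative only and are not taken from the original implementation.

import numpy as np
from collections import Counter

def median_std(cont_minority):
    # Step 1: median of the standard deviations of all continuous
    # features, computed over the minority class samples.
    return float(np.median(np.std(cont_minority, axis=0)))

def nc_distance(a_cont, a_nom, b_cont, b_nom, med):
    # Step 2: Euclidean distance over the continuous features, adding
    # Med^2 once for every nominal feature that differs.
    dist_sq = float(np.sum((a_cont - b_cont) ** 2))
    dist_sq += (med ** 2) * int(np.sum(a_nom != b_nom))
    return float(np.sqrt(dist_sq))

def populate_synthetic(sample_cont, nn_cont, nn_nom, rng):
    # Step 3: interpolate the continuous features toward one randomly
    # chosen nearest neighbor, as in plain SMOTE; each nominal feature
    # takes the value occurring most often among the k nearest neighbors.
    chosen = nn_cont[rng.integers(len(nn_cont))]
    gap = rng.random()
    new_cont = sample_cont + gap * (chosen - sample_cont)
    new_nom = np.array([Counter(col).most_common(1)[0][0] for col in nn_nom.T])
    return new_cont, new_nom

# Worked example matching Table 6; the Med value here is made up.
med = 1.5
f1_cont, f1_nom = np.array([1.0, 2.0, 3.0]), np.array(["A", "B", "C"])
f2_cont, f2_nom = np.array([4.0, 6.0, 5.0]), np.array(["A", "D", "E"])
# sqrt[(4-1)^2 + (6-2)^2 + (5-3)^2 + Med^2 + Med^2]
print(nc_distance(f1_cont, f1_nom, f2_cont, f2_nom, med))

In this sketch the nominal mismatch simply contributes Med^2 to the squared distance once per differing feature, which reproduces the two Med^2 terms in the Table 6 example.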
The SMOTE-NC experiments reported here are set up the same way as those with SMOTE, except that
we examine only one dataset. SMOTE-NC on the Adult dataset differs from our typical result: it performs worse than plain
under-sampling based on AUC, as shown in Figures 26 and 27. To separate the effects of SMOTE
and SMOTE-NC on this dataset, and to determine whether this oddity was due to our handling of nominal features,
we extracted only the continuous features. As shown in Figure 28, even SMOTE applied to only the continuous features
of the Adult dataset does not achieve any better performance than plain under-sampling.
Some of the continuous features of the minority class have a very high variance, so
the synthetically generated minority class samples could overlap with the majority class
space, leading to more false positives than plain under-sampling. This hypothesis is also
supported by the decrease in AUC as we apply SMOTE at degrees greater than 50%. Higher degrees
of SMOTE place more minority class samples in the dataset, and thus create a greater overlap with the
majority class decision space.
Figure 26: Adult. Comparison of SMOTE-C4.5, Under-C4.5, and Naive Bayes. The SMOTE-C4.5 and Under-C4.5 ROC curves overlap for most of the ROC space.

Figure 27: Adult. Comparison of SMOTE-Ripper, Under-Ripper, and modifying the Loss Ratio in Ripper. The SMOTE-Ripper and Under-Ripper ROC curves overlap for most of the ROC space.

Figure 28: Adult with only continuous features. The overlap of SMOTE-C4.5 and Under-C4.5 is observed in this scenario as well.