While our SMOTE approach currently does not handle datasets with all nominal features, we generalized it to handle mixed datasets of continuous and nominal features. We call this approach
Synthetic Minority Over-sampling TEchnique-Nominal Continuous [SMOTE-NC]. We tested this approach on the Adult
dataset from the UCI repository. The SMOTE-NC algorithm is described below.
1. Median computation: Compute the median of the standard deviations of all continuous features for the minority class.
If the nominal features differ between a sample and its potential nearest neighbors, then this median
is included in the Euclidean distance computation. We use the median to penalize differences in nominal features by an
amount related to the typical difference in continuous feature values.
2. Nearest neighbor computation: Compute the Euclidean distance between the feature vector
for which the k-nearest neighbors are being identified (a minority class sample)
and the other feature vectors (minority class samples) using the continuous feature
space. For every nominal feature that differs between the considered feature
vector and its potential nearest neighbor, include the median of the
standard deviations computed previously in the Euclidean distance computation. Table 6 demonstrates an example.
Table 6: Example of nearest neighbor computation for SMOTE-NC.

F1 = 1 2 3 A B C   [the sample for which we are computing nearest neighbors]
F2 = 4 6 5 A D E
F3 = 3 5 6 A B K

So, the Euclidean distance between F2 and F1 would be:

Eucl = sqrt[(4-1)^2 + (6-2)^2 + (5-3)^2 + Med^2 + Med^2]

where Med is the median of the standard deviations of the continuous features
of the minority class. The Med term is included twice, once for feature 5 (B vs. D)
and once for feature 6 (C vs. E), which differ between the two feature vectors F1 and F2.
3. Populate the synthetic sample: The continuous features of the new synthetic minority class sample
are created using the same approach as SMOTE, described earlier. Each nominal feature is given
the value occurring in the majority of the k-nearest neighbors. A brief illustrative sketch of these steps is given below.
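To make the three steps concrete, the following is a minimal sketch in Python, assuming the continuous and nominal features are held in separate NumPy arrays. The function names (median_std, nc_distance, populate_synthetic) and the Med value used in the worked example are illustrative only and are not taken from the original implementation.

import numpy as np
from collections import Counter

def median_std(cont_minority):
    # Step 1: median of the standard deviations of all continuous
    # features, computed over the minority class samples.
    return float(np.median(np.std(cont_minority, axis=0)))

def nc_distance(a_cont, a_nom, b_cont, b_nom, med):
    # Step 2: Euclidean distance over the continuous features, adding
    # Med^2 once for every nominal feature that differs.
    dist_sq = float(np.sum((a_cont - b_cont) ** 2))
    dist_sq += (med ** 2) * int(np.sum(a_nom != b_nom))
    return float(np.sqrt(dist_sq))

def populate_synthetic(sample_cont, nn_cont, nn_nom, rng):
    # Step 3: interpolate the continuous features toward one randomly
    # chosen nearest neighbor, as in plain SMOTE; each nominal feature
    # takes the value occurring most often among the k nearest neighbors.
    chosen = nn_cont[rng.integers(len(nn_cont))]
    gap = rng.random()
    new_cont = sample_cont + gap * (chosen - sample_cont)
    new_nom = np.array([Counter(col).most_common(1)[0][0] for col in nn_nom.T])
    return new_cont, new_nom

# Worked example matching Table 6; the Med value here is made up.
med = 1.5
f1_cont, f1_nom = np.array([1.0, 2.0, 3.0]), np.array(["A", "B", "C"])
f2_cont, f2_nom = np.array([4.0, 6.0, 5.0]), np.array(["A", "D", "E"])
# sqrt[(4-1)^2 + (6-2)^2 + (5-3)^2 + Med^2 + Med^2]
print(nc_distance(f1_cont, f1_nom, f2_cont, f2_nom, med))

In this sketch the nominal mismatch simply contributes Med^2 to the squared distance once per differing feature, which reproduces the two Med^2 terms in the Table 6 example.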
The SMOTE-NC experiments reported here are set up the same way as those with SMOTE, except that
we examine only one dataset. SMOTE-NC on the Adult dataset differs from our typical result: it performs worse than plain
under-sampling based on AUC, as shown in Figures 26 and 27. To separate the effects of SMOTE
and SMOTE-NC on this dataset, and to determine whether this oddity was due to our handling of nominal features,
we extracted only the continuous features. As shown in Figure 28, even SMOTE applied to only the continuous features
of the Adult dataset does not achieve any better performance than plain under-sampling.
Some of the continuous features of the minority class have a very high variance, so
the synthetically generated minority class samples could overlap with the majority class
space, leading to more false positives than plain under-sampling. This hypothesis is also
supported by the decrease in AUC as we apply SMOTE at degrees greater than 50%. Higher degrees
of SMOTE place more minority class samples in the dataset, and thus create a greater overlap with the
majority class decision space.
Figure 26: Adult. Comparison of SMOTE-C4.5, Under-C4.5, and Naive Bayes. The SMOTE-C4.5 and Under-C4.5 ROC curves overlap for most of the ROC space.

Figure 27: Adult. Comparison of SMOTE-Ripper, Under-Ripper, and modifying the Loss Ratio in Ripper. The SMOTE-Ripper and Under-Ripper ROC curves overlap for most of the ROC space.

Figure 28: Adult with only continuous features. The overlap of SMOTE-C4.5 and Under-C4.5 is observed in this scenario as well.