A Brief Literature Review of the Class Imbalance Problem
Yanjun Qi
In recent years, many difficult real-world machine learning problems have been characterized by imbalanced training data, where at least one class is under-represented relative to the others. Examples include (but are not limited to) fraud/intrusion detection, medical diagnosis/monitoring, bioinformatics, and text categorization. The problem of imbalanced data is often associated with asymmetric costs of misclassifying elements of different classes. Additionally, the distribution of the test data may differ from that of the training sample, and the true misclassification costs may be unknown at learning time. Although much awareness of the issues related to data imbalance has been raised, many of the key problems remain open and are in fact encountered ever more often, especially in massive datasets. In our work, we concentrate on the two-class case. We first summarize the previous literature that aims to tackle this problem.
Foster Provost gave a good summary of the issues related to imbalanced data set classification in [1]. He first pointed out why standard classifiers cause problems on imbalanced data sets: they implicitly assume that (i) maximizing accuracy is the goal, and (ii) in use, the classifier will operate on data drawn from the same distribution as the training data. He also pointed out that, when studying problems with imbalanced data, using the classifiers produced by standard machine learning algorithms without adjusting the output threshold may well be a critical mistake. He further raised the question of whether the natural class distribution is the best training distribution for building a classifier. In real-world learning tasks, typically either (i) you have far more data than your algorithms can deal with and have to select and sample, or (ii) you have no data at all and have to go through some process to create them. In the first case, a practical question is how much to sample and in what proportion; in the second, creating data is costly and the question is how many to create and in what proportion.
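As a concrete illustration of the threshold point above, the following minimal sketch (ours, not from [1]) trains a scikit-learn logistic regression on synthetic 95:5 data and compares the default 0.5 threshold against one tuned on a held-out split; the data set, model choice, and F1 objective are all illustrative assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 95:5 imbalanced problem (an illustrative stand-in data set).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_va)[:, 1]

# Default behaviour: threshold the positive-class probability at 0.5.
default_f1 = f1_score(y_va, scores >= 0.5)

# Sweep thresholds on the validation split and keep the best one by F1.
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=lambda t: f1_score(y_va, scores >= t))
print(default_f1, best_t, f1_score(y_va, scores >= best_t))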
Gary Weiss [2] used an empirical study to answer the question raised in [1]: should the class distribution of the training data match the “natural” distribution of the data? From their experiments with C4.5 on 25 imbalanced data sets (at different imbalance levels; 20 of them from UCI), the natural distribution is usually not the best distribution for learning: a different class distribution should generally be chosen when the data set size must be limited. They analyzed why the minority-class error rate is so high: (1) there are many more majority- than minority-class instances; (2) the class “priors” in the natural training distribution are biased strongly in favor of the majority class; (3) the minority class suffers from the problem of small disjuncts. They also observed that although the minority-class learning curves begin with a much higher error rate than the majority-class curves, they show more rapid improvement and plateau at a later point. They used error rate and AUC to make the performance comparison. Finally, they suggested developing a progressive, adaptive sampling strategy that incrementally requests new examples based on the improvement in classifier performance due to the recently added minority and majority examples.
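The core experiment is easy to sketch. Below is a hedged Python mock-up (a DecisionTreeClassifier stands in for C4.5; the synthetic data and sizes are our own choices, not [2]'s) that fixes the total training-set size, varies the minority fraction, and compares test AUC across distributions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 95:5 data; a decision tree stands in for C4.5.
X, y = make_classification(n_samples=20000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rng = np.random.default_rng(0)
pos, neg = np.flatnonzero(y_tr == 1), np.flatnonzero(y_tr == 0)

# Fix the total training-set size and vary the minority fraction,
# mimicking the "which class distribution is best?" experiment.
n_total = 2000
for frac_pos in (0.05, 0.2, 0.5):
    n_pos = int(n_total * frac_pos)
    idx = np.concatenate([
        rng.choice(pos, n_pos, replace=n_pos > len(pos)),
        rng.choice(neg, n_total - n_pos, replace=False),
    ])
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    print(frac_pos, round(roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]), 3))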
[3][6][7] used synthetic data sets to study the class imbalance problem systematically. They tried to answer three questions: (1) when is class imbalance damaging to classification performance? (2) for C5.0, how effective are several basic re-sampling or cost-modifying methods? (3) how does class imbalance affect other classifiers? They argue that performance on imbalanced data is related to three factors: the complexity of the problem, the training set size, and the degree of imbalance. They found that, independently of the training size, linearly separable domains are not sensitive to imbalance, and that with very large training sets the imbalance does not hinder C5.0's performance much. They then compared the effectiveness of several techniques for the imbalance problem: (a) over-sampling, (b) under-sampling, and (c) cost modifying. They concluded that poor performance on imbalanced data is usually caused by small sub-clusters (small disjuncts) that cannot be classified accurately. Moreover, they reported that SVM and the Perceptron seem less sensitive to imbalance than C5.0 and kNN [11].
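For reference, here is what the three basic techniques look like as index-level operations; this is a generic sketch under our own function names, not code from [3][6][7]:

import numpy as np

rng = np.random.default_rng(0)

def oversample_indices(y):
    # (a) Over-sampling: duplicate minority indices (with replacement)
    # until they match the majority count.
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    return np.concatenate([majority, rng.choice(minority, len(majority), replace=True)])

def undersample_indices(y):
    # (b) Under-sampling: discard majority indices down to the minority count.
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    return np.concatenate([minority, rng.choice(majority, len(minority), replace=False)])

# Either index set is then used as X[idx], y[idx] before training.
# (c) Cost modifying needs no resampling at all: many learners accept
# per-class misclassification costs (e.g. class_weight={0: 1, 1: 10} in
# scikit-learn), which reweight the loss instead of the data.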
[4][5] used C4.5 to investigate sampling strategies on imbalanced data sets. [4] pointed out that C4.5 plus under-sampling establishes a baseline for the imbalanced classification problem. Their experiments, evaluated with a measurement called the cost curve, claimed that under-sampling is better than over-sampling. [5] identified three factors relevant to using C4.5 on imbalanced data: first, the sampling used for preprocessing; second, the probability estimation criterion at the decision tree leaves; third, the resulting tree structure (pruned or not). They also observed that under-sampling beats over-sampling, and that no pruning seems preferable in the imbalanced case. Finally, the probability estimates at the tree leaves should be adjusted for the change in class priors between training and test in the imbalanced classification case. [6] did similar comparison work to [4][5], but considered more classifiers beyond C4.5 as well as the effect of varying the decision threshold. They used ROC curves to compare the different results and claimed that sampling, moving the decision threshold, and adjusting the cost matrix all have the same effect.
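The leaf-probability adjustment mentioned by [5] can be written down directly. The function below is the standard prior-shift correction (our sketch; [5]'s exact recipe may differ in details):

def prior_corrected_probability(p, pi_train, pi_test):
    # Re-weight a leaf's positive-class estimate p, learned under training
    # prior pi_train, to a deployment prior pi_test (Bayes re-weighting).
    num = p * pi_test / pi_train
    den = num + (1.0 - p) * (1.0 - pi_test) / (1.0 - pi_train)
    return num / den

# A leaf that looks 60% positive on a balanced (e.g. under-sampled)
# training set maps back to about 7.3% under a 5% natural prior:
print(prior_corrected_probability(0.6, pi_train=0.5, pi_test=0.05))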
Overall, the methods aiming to tackle the imbalanced data problem can be divided into three big categories by [13]:
1. Algorithm Specific Approach
2. Pre-processing for the data (under-, over-, progressive, active)
3. Post-processing for the learned model
For the “algorithm-specific” kind, [8][9] are two examples. They modified SVMs to re-align the decision boundary for the imbalanced data case.
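[8][9] alter the SVM formulation itself; the closest off-the-shelf analogue, sketched below with scikit-learn (our substitution, not the method of [8][9]), scales the error penalty per class so that the boundary is pushed away from the minority class:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# A plain SVM versus one whose error penalty is scaled per class
# ('balanced' makes the penalty inversely proportional to class
# frequency), which shifts the boundary in favor of the minority class.
plain = SVC(kernel="rbf").fit(X, y)
weighted = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
print(plain.predict(X).sum(), weighted.predict(X).sum())  # predicted positives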
For the “pre-processing” kind, most current methods are sampling methods. Sampling can be divided further into sampling for cost-sensitive learning and sampling for query/active learning ([13]). Precision-recall is often used as the evaluation metric for the learning algorithms here.
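As a reminder of why precision-recall is preferred here, a tiny made-up example: accuracy rewards predicting the majority class everywhere, while precision and recall do not:

from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

# Precision = TP / (TP + FP); Recall = TP / (TP + FN). Predicting all
# zeros here would score 80% accuracy but zero recall, which is why
# accuracy is misleading on imbalanced data.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # 0.5 0.5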
Within query-learning sampling, we can do (see the sketch after this list):
· active / query learning
· uncertainty sampling
· selective sampling
· query by bagging
· importance sampling (empirically better than query by bagging)
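Of these, uncertainty sampling is the simplest to sketch. The loop below (a generic illustration with a logistic model and invented sizes, not from any of the cited papers) repeatedly queries the unlabeled point whose predicted probability is closest to 0.5:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)

# Seed with a small labeled pool containing both classes.
pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
labeled = list(rng.choice(pos, 5, replace=False)) + list(rng.choice(neg, 45, replace=False))

for _ in range(10):  # ten query rounds
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X)[:, 1]
    # Query the point the current model is least sure about
    # (probability closest to 0.5), masking out already-labeled points.
    uncertainty = -np.abs(proba - 0.5)
    uncertainty[labeled] = -np.inf
    labeled.append(int(np.argmax(uncertainty)))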
Within cost-sensitive-learning sampling, we can do (see the sketch after this list):
· cost-proportionate weighted sampling
· cost-proportionate rejection sampling
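Cost-proportionate rejection sampling in particular is nearly a one-liner: keep each example independently with probability proportional to its cost. A hedged sketch with our own toy data and function name:

import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(X, y, costs):
    # Keep each example independently with probability cost / max_cost,
    # so an unweighted learner trained on the kept set behaves
    # cost-sensitively on the original distribution.
    costs = np.asarray(costs, dtype=float)
    keep = rng.random(len(costs)) < costs / costs.max()
    return X[keep], y[keep]

# Toy example: false negatives cost 10x false positives, so positives
# survive with probability 1 and negatives with probability 0.1.
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)  # dummy features for illustration
costs = np.where(y == 1, 10.0, 1.0)
X_s, y_s = rejection_sample(X, y, costs)
print(len(y_s), y_s.mean())  # roughly balanced subsample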
[13] then further pointed out the relationship between these methods and the traditional over-sampling and under-sampling techniques:
· Cost-proportionate rejection sampling and resampling with replacement correspond to under-sampling and over-sampling, respectively
· Under-sampling and over-sampling are the special case in which F1·C1 = F0·C0 (equal frequency-cost products for the two classes) and where P = R is optimal (assuming the slope of the PR curve is -1)
· “Rejection sampling > Resampling” is consistent with and generalizes “Undersampling > Oversampling”
[16] is a more advanced kind of sampling method: it combines different simple sampling schemes in an attempt to do better than any single one.
Besides sampling methods, there are also some clustering-based pre-processing methods. Their objective is the small-disjuncts problem (some literature points out that part of the imbalanced-classification problem is caused by these small disjuncts). [15] presented several ways to handle it: do cluster-based over-sampling; cluster the majority class first; or consider between-class versus within-class imbalance, that is, cluster each class to identify sub-clusters, then re-sample each sub-cluster to maximize class size and remove within-class imbalance. A sketch of the last idea follows.
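A minimal sketch of that last variant, in the spirit of [15] but with our own parameter choices (k-means with a fixed k, resampling with replacement):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

def cluster_based_oversample(X, y, k=3):
    # Cluster each class separately to expose its sub-clusters (disjuncts).
    clusters = []
    for c in np.unique(y):
        Xc = X[y == c]
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xc)
        clusters.extend((c, Xc[labels == j]) for j in range(k))
    # Re-sample every sub-cluster (with replacement) up to the size of the
    # largest one, so between-class and within-class imbalance both vanish.
    target = max(len(members) for _, members in clusters)
    X_out, y_out = [], []
    for c, members in clusters:
        idx = rng.choice(len(members), target, replace=True)
        X_out.append(members[idx])
        y_out.append(np.full(target, c))
    return np.vstack(X_out), np.concatenate(y_out)

X_bal, y_bal = cluster_based_oversample(X, y)
print(np.bincount(y_bal))  # equal class counts after resampling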
References: