
IR-Lab Project of Yanjun Qi (Fall 2004)

A Brief Literature Review of the Class Imbalance Problem


Yanjun Qi
In recent years, many difficult "real-world" machine learning problems have been characterized
by imbalanced learning data, where at least one class is under-represented relative to the others.
Examples include (but are not limited to): fraud/intrusion detection, medical
diagnosis/monitoring, bioinformatics, and text categorization. The problem of imbalanced
data is often associated with asymmetric costs of misclassifying elements of different classes.
Additionally, the distribution of the test data may differ from that of the learning sample, and the
true misclassification costs may be unknown at learning time. Although much awareness of the
issues related to data imbalance has been raised, many of the key problems remain open and
are in fact encountered more often than before, especially in massive datasets. In our work, we
concentrate on the two-class case. We first summarize the previous literature that aims to
tackle this problem.

Foster Provost gave a good summary of the issues in imbalanced data set
classification in [1]. He first pointed out why a standard classifier causes problems on
imbalanced data sets: it assumes (i) that maximizing accuracy is the goal, and (ii) that, in use,
the classifier will operate on data drawn from the same distribution as the training data. He
also pointed out that when studying problems with imbalanced data, using the classifiers
produced by standard machine learning algorithms without adjusting the output threshold may
well be a critical mistake, and he raised the question of whether the natural class distribution is
the best training distribution for building a classifier. In real-world learning tasks, typically
either (i) you have far more data than your algorithms can deal with and have to select and
sample, or (ii) you have no data at all and have to go through some process to create them. In
the first case, the practical question is how much to sample and in what proportion; in the
second, creating data is costly and the question is how many to create and in what proportion.
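
To make the threshold point concrete, here is a minimal Python sketch (the cost values are
arbitrary, chosen only for illustration) of the standard decision-theoretic threshold for a
two-class problem with false-positive cost c_fp and false-negative cost c_fn:

    import numpy as np

    def cost_sensitive_threshold(c_fp, c_fn):
        # Probability cutoff that minimizes expected misclassification
        # cost for a two-class problem (standard decision theory).
        return c_fp / (c_fp + c_fn)

    def predict_with_costs(prob_positive, c_fp=1.0, c_fn=10.0):
        # Label an example positive when P(y=1|x) exceeds the
        # cost-adjusted cutoff instead of the default 0.5.
        return (prob_positive >= cost_sensitive_threshold(c_fp, c_fn)).astype(int)

    # With c_fn = 10 * c_fp the cutoff drops to 1/11 (about 0.09), so many
    # more examples get labeled as the rare positive class.
    print(predict_with_costs(np.array([0.05, 0.12, 0.40, 0.80])))  # [0 1 1 1]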

Gary Weiss [2] used an empirical study to answer the question raised in [1]: should the class
distribution of the training data match the natural distribution of the data or not? From
their experiments with C4.5 on 25 imbalanced data sets (at different imbalance levels; 20 of
them from UCI), the natural distribution is usually not the best distribution for learning, and a
different class distribution should generally be chosen when the data set size must be limited.
They analyzed why the minority-class error rate is so high: (1) there are many
more majority-class than minority-class instances; (2) the class priors in the natural training
distribution are biased strongly in favor of the majority class; (3) the minority class suffers from
the problem of small disjuncts. They also observed that although the minority-class learning
curves begin with a much higher error rate than the majority-class learning curves, they show
more rapid improvement and plateau at a later point. They used error rate and AUC for the
performance comparison. Finally, they suggested that a progressive, adaptive sampling strategy
be developed that incrementally requests new examples based on the improvement in classifier
performance due to the recently added minority and majority examples.
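
This adaptive strategy was left as a suggestion in [2]; the loop below is only a rough sketch of
the idea, under invented synthetic data pools and a deliberately simple nearest-centroid scorer:
at each step, a batch is requested from whichever class currently yields the larger AUC gain.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical pools standing in for the example-generation process,
    # plus a fixed evaluation set (all numbers here are made up).
    maj_pool = rng.normal(0.0, 1.0, size=(5000, 2))
    min_pool = rng.normal(1.5, 1.0, size=(500, 2))
    eval_maj = rng.normal(0.0, 1.0, size=(1000, 2))
    eval_min = rng.normal(1.5, 1.0, size=(100, 2))

    def auc(train_min, train_maj):
        # AUC of a nearest-centroid scorer trained on the given samples.
        c_min, c_maj = train_min.mean(axis=0), train_maj.mean(axis=0)
        def score(x):  # higher = more minority-like
            return np.linalg.norm(x - c_maj, axis=1) - np.linalg.norm(x - c_min, axis=1)
        s_pos, s_neg = score(eval_min), score(eval_maj)
        return (s_pos[:, None] > s_neg[None, :]).mean()

    n_min, n_maj, batch = 20, 20, 20
    for _ in range(10):  # fixed labeling budget of 10 batches
        gain_min = auc(min_pool[:n_min + batch], maj_pool[:n_maj])
        gain_maj = auc(min_pool[:n_min], maj_pool[:n_maj + batch])
        if gain_min >= gain_maj:
            n_min += batch  # the minority batch helped more
        else:
            n_maj += batch
    print(n_min, n_maj)  # final class proportions chosen by the loop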

[3][6][7] used a synthetic data set to make a systematic study of the class imbalance
problem in a controlled setting. They tried to answer three questions: (1) when are class
imbalances damaging to classification performance? (2) for C5.0, how effective are several
basic re-sampling or cost-modifying methods? (3) how does class imbalance affect other
classifiers? They argue that performance on an imbalanced data set is related to three factors:
the complexity of the problem, the training set size, and the degree of imbalance. They found
that, independently of the training size, linearly separable domains are not sensitive to
imbalance, and that with very large training data the imbalance does not hinder C5.0's
performance much. They then compared the effectiveness of several techniques for the
imbalance problem: (a) over-sampling, (b) under-sampling, and (c) cost modifying; a sketch of
the two sampling schemes follows below. They concluded that the bad performance on
imbalanced data sets is usually caused by small sub-clusters (small disjuncts) that cannot be
classified accurately. Moreover, they reported that SVM and Perceptron do not seem as
sensitive to imbalance as C5.0 and kNN [11].
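
The two re-sampling schemes themselves are simple to state in code; below is a minimal numpy
sketch of both on synthetic placeholder data (the cost-modifying alternative was already
illustrated by the threshold sketch above):

    import numpy as np

    def random_oversample(X_min, n_target, rng):
        # Grow the minority class by drawing with replacement.
        return X_min[rng.integers(0, len(X_min), size=n_target)]

    def random_undersample(X_maj, n_target, rng):
        # Shrink the majority class by drawing without replacement.
        return X_maj[rng.choice(len(X_maj), size=n_target, replace=False)]

    rng = np.random.default_rng(0)
    X_maj = rng.normal(size=(1000, 3))  # 1000 majority examples
    X_min = rng.normal(size=(50, 3))    # 50 minority examples

    # Balance by growing the minority or shrinking the majority class.
    X_min_big = random_oversample(X_min, len(X_maj), rng)     # 1000 x 3
    X_maj_small = random_undersample(X_maj, len(X_min), rng)  # 50 x 3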

[4][5] used C4.5 to investigate sampling strategies on imbalanced data sets. [4]
pointed out that C4.5 plus under-sampling establishes a baseline for the imbalanced
classification problem; their experiments, evaluated with a measurement called the cost curve,
claimed that under-sampling is better than over-sampling. [5] argued that there are three factors
relevant to using C4.5 to classify imbalanced data: first, the sampling preprocessing; second,
the probability estimation criterion at the decision tree leaves; third, the resulting tree structure
(pruned or not). They also observed that under-sampling beats over-sampling, and that no
pruning seems preferable in the imbalanced case. Finally, the probability estimates at the tree
leaves should be adjusted for the change in class priors between training and test in the
imbalanced classification case. [6] did similar comparison work to [4][5], but considered more
classifiers than C4.5 and also the effect of varying the decision threshold. They used ROC
curves to compare the different results, and claimed that sampling, moving the decision
threshold, and adjusting the cost matrix all have the same effect.
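
The leaf-probability adjustment for changed priors follows directly from Bayes' rule; here is a
small sketch (the prior values are invented for illustration):

    def adjust_posterior(p, prior_train, prior_test):
        # Re-weight an estimated P(y=1|x) when the test-time class prior
        # differs from the training prior (standard Bayes correction).
        r_pos = prior_test / prior_train
        r_neg = (1 - prior_test) / (1 - prior_train)
        return p * r_pos / (p * r_pos + (1 - p) * r_neg)

    # A leaf estimating 0.5 on a 50/50 under-sampled training set maps
    # back to 0.09 when the true (test) minority prior is only 9%.
    print(adjust_posterior(0.5, prior_train=0.5, prior_test=0.09))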

Overall, the methods aiming to tackle the imbalanced data problem can be divided
into three big categories by [13]:
1. Algorithm-specific approaches

2. Pre-processing of the data (under-, over-, progressive, active sampling)

3. Post-processing of the learned model

For the algorithm-specific kind, [8][9] are two examples. They tried to modify SVMs and
re-align the decision boundary for the imbalanced data case.
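
[8] re-aligns the class boundary through the kernel itself, which is beyond a short sketch; the
closest simple mechanism is an asymmetric penalty on the two classes, shown below with
scikit-learn's SVC on synthetic data (the 20x weight is arbitrary):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # 500 majority vs. 25 minority points from overlapping Gaussians.
    X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(2, 1, (25, 2))])
    y = np.array([0] * 500 + [1] * 25)

    # Penalizing minority errors 20x more pushes the boundary toward the
    # majority class, so fewer minority examples are swallowed by it.
    weighted = SVC(kernel="rbf", class_weight={0: 1.0, 1: 20.0}).fit(X, y)
    plain = SVC(kernel="rbf").fit(X, y)
    probe = rng.normal(2, 1, (200, 2))  # points in the minority region
    print(plain.predict(probe).mean(), weighted.predict(probe).mean())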

For the pre-processing kind, most of the current methods are sampling methods. Sampling can
be divided further into sampling for cost-sensitive learning and sampling for query / active
learning ([13]). Precision and recall are often used as evaluation metrics for the learning
algorithms here.
Within query-learning sampling, we can do:

- active / query learning

- uncertainty sampling (sketched below)

- selective sampling

- query by bagging

- importance sampling (empirically better than query by bagging)
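
Of these, uncertainty sampling is the easiest to sketch: query the unlabeled examples whose
current predicted probability is closest to 0.5 (the probabilities below are invented for
illustration):

    import numpy as np

    def uncertainty_sample(prob_positive, k):
        # Return indices of the k unlabeled examples whose predicted
        # P(y=1|x) is closest to 0.5, i.e. the least certain ones.
        return np.argsort(np.abs(prob_positive - 0.5))[:k]

    probs = np.array([0.02, 0.48, 0.91, 0.55, 0.30])
    print(uncertainty_sample(probs, 2))  # -> [1 3], i.e. 0.48 and 0.55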

Within cost-sensitive-learning sampling, we can do (both schemes are sketched below):

- cost-proportionate weighted sampling

- cost-proportionate rejection sampling
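
A minimal sketch of both cost-proportionate schemes, assuming each training example carries a
known misclassification cost (the costs below are invented for illustration):

    import numpy as np

    def rejection_sample(X, costs, rng):
        # Keep each example independently with probability cost/max_cost;
        # the kept sample then reflects the cost distribution unweighted.
        return X[rng.random(len(X)) < costs / costs.max()]

    def weighted_sample(X, costs, n, rng):
        # Draw n examples with replacement, with probability
        # proportional to each example's misclassification cost.
        return X[rng.choice(len(X), size=n, replace=True, p=costs / costs.sum())]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    # Suppose the 100 minority examples cost 10x more to misclassify.
    costs = np.concatenate([np.full(100, 10.0), np.full(900, 1.0)])
    print(len(rejection_sample(X, costs, rng)))  # roughly 100 + 90 = ~190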

[13] then further pointed out the relationship between these methods and the
traditional over-sampling and under-sampling techniques:

- Cost-proportionate rejection sampling and resampling with replacement correspond to
under-sampling and over-sampling, respectively.

- Under-sampling and over-sampling are the special case in which F1·C1 = F0·C0 and where
P = R is optimal (assuming the slope of the PR-curve is -1).

- "Rejection sampling > resampling" is consistent with, and generalizes,
"under-sampling > over-sampling".

[16] is a more advanced kind of sampling method. They tried to combine different simple
sampling schemes in order to do better than any single one.

Besides sampling methods, there are also some clustering-based pre-processing methods.
These target the small-disjuncts problem (some of the literature points out that part of the
imbalanced classification problem is caused by these small disjuncts). For this, [15] presented
ways to handle it: you can do cluster-based over-sampling, clustering the majority class first, or
consider between-class versus within-class imbalance, that is, cluster each class to identify its
sub-clusters, then re-sample each sub-cluster to equalize class sizes and get rid of the
within-class imbalance (a sketch follows below).
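
A minimal sketch of this cluster-based re-sampling idea, assuming k-means as the clustering
step with arbitrary cluster counts and per-cluster targets (not necessarily the exact procedure
of [15]):

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_based_oversample(X, k, n_target, rng):
        # Cluster one class with k-means, then over-sample every cluster
        # (with replacement) to n_target examples, so the sub-clusters
        # end up equal-sized and within-class imbalance disappears.
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        parts = [X[labels == c] for c in range(k)]
        return np.vstack([p[rng.integers(0, len(p), size=n_target)] for p in parts])

    rng = np.random.default_rng(0)
    X_min = rng.normal(size=(60, 2))   # small minority class
    X_maj = rng.normal(size=(600, 2))  # majority class

    # Balance both classes at k * n_target = 600 examples each.
    X_min_bal = cluster_based_oversample(X_min, k=3, n_target=200, rng=rng)
    X_maj_bal = cluster_based_oversample(X_maj, k=3, n_target=200, rng=rng)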

References:
1. Provost, F. Machine Learning from Imbalanced Data Sets 101. Invited paper for the
AAAI'2000 Workshop on Imbalanced Data Sets, 2000.
2. Weiss, G. and Provost, F. The Effect of Class Distribution on Classifier Learning.
Technical Report ML-TR-43, Department of Computer Science, Rutgers University,
January 2001.
3. Japkowicz, N. and Stephen, S. The Class Imbalance Problem: A Systematic Study.
Intelligent Data Analysis, 6(5):429-450, 2002.
4. Drummond, C. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats
Over-Sampling. ICML-KDD'2003 Workshop: Learning from Imbalanced Data Sets, 2003.
5. Chawla, N. C4.5 and Imbalanced Data Sets: Investigating the Effect of Sampling
Method, Probabilistic Estimate, and Decision Tree Structure. ICML-KDD'2003 Workshop:
Learning from Imbalanced Data Sets, 2003.
6. Japkowicz, N. Class Imbalances: Are We Focusing on the Right Issue? ICML-KDD'2003
Workshop: Learning from Imbalanced Data Sets, 2003.
7. Japkowicz, N. The Class Imbalance Problem: Significance and Strategies. In
Proceedings of the 2000 International Conference on Artificial Intelligence (IC-AI'2000),
pp. 111-117, 2000.
8. Wu, G. Class-Boundary Alignment for Imbalanced Dataset Learning. ICML-KDD'2003
Workshop: Learning from Imbalanced Data Sets, 2003.
9. Raskutti, B. Extreme Re-balancing for SVMs: A Case Study. ICML-KDD'2003
Workshop: Learning from Imbalanced Data Sets, 2003.
10. Maloof, M. Learning When Data Sets are Imbalanced and When Costs are Unequal
and Unknown. ICML-KDD'2003 Workshop: Learning from Imbalanced Data Sets, 2003.
11. Zhang, J. kNN Approach to Unbalanced Data Distributions: A Case Study Involving
Information Extraction. ICML-KDD'2003 Workshop: Learning from Imbalanced Data
Sets, 2003.
12. Kolcz, A. Data Duplication: An Imbalance Problem? ICML-KDD'2003
Workshop: Learning from Imbalanced Data Sets, 2003.
13. Abe, N. Sampling Approaches to Learning from Imbalanced Data Sets: Active Learning,
Cost Sensitive Learning and Beyond. ICML-KDD'2003 Workshop: Learning from
Imbalanced Data Sets, 2003.
14. Juszczak, P. and Duin, R. P. W. Uncertainty Sampling Methods for One-Class
Classifiers. ICML-KDD'2003 Workshop: Learning from Imbalanced Data Sets, 2003.
15. Jo, T. and Japkowicz, N. Class Imbalances versus Small Disjuncts. SIGKDD
Explorations, 6(1), June 2004.
16. Estabrooks, A., Jo, T., and Japkowicz, N. A Multiple Resampling Method for
Learning from Imbalanced Data Sets. Computational Intelligence, 20(1), 2004.
