
Genetic Programming for Classification with Unbalanced Data

A research paper by Urvesh Bhowan, Mengjie Zhang, and Mark Johnston
(Evolutionary Computation Research Group, Victoria University of Wellington, New Zealand)

Presented by Noorulain and Amina Asif
Pattern Recognition Lab
Department of Computer Science & Information Sciences
Pakistan Institute of Engineering & Applied Sciences

OUTLINE
- Abstract of the paper
- Introduction to the basic concepts
  - Classification
  - Unbalanced data
  - Performance bias
- GP framework for classification
  - Program representation and classification strategy
  - Evolutionary parameters
  - Standard fitness function for classification
- Improving GP with new and improved fitness functions
  - 4 variations of the fitness function



Abstract of the paper

This paper compares two Genetic Programming (GP) approaches for classification with unbalanced data.
- The first adapts the fitness function to evolve classifiers with good classification ability across both the minority and majority classes.
- The second uses a multi-objective approach to simultaneously evolve a Pareto front (or set) of classifiers along the minority/majority class trade-off surface.

Introduction

Classification: a way of predicting class membership for a set of examples using properties (features) of the examples.

Unbalanced dataset: a data set with an uneven distribution of class examples.
- Minority class: only a small number of examples in the data set.
- Majority class: makes up the large part of the data set.

Introduction

Performance bias: poor accuracy on the minority class but high accuracy on the majority class.

Solution? Assign higher misclassification costs to minority-class examples.


GP Approaches

Two GP approaches are discussed:
- Adaptation of the fitness function
- Multi-Objective Genetic Programming (MOGP)


GP Framework for Classification

Program representation:
- Terminals: example features and constants
- Functions: +, -, x, % (protected division), and a conditional if

Classification strategy:
Translates the output of a genetic program (a floating-point number) into two class labels using the division between positive and non-positive output:
- Minority class: output is positive or 0
- Majority class: output is negative
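This sign-based translation can be sketched in a few lines (an illustration, not the authors' code; the function name is ours):

```python
def classify(program_output: float) -> str:
    """Map a genetic program's floating-point output to a class label.

    Per the classification strategy: zero or positive output -> minority
    class, negative output -> majority class.
    """
    return "minority" if program_output >= 0.0 else "majority"
```

For example, `classify(0.7)` and `classify(0.0)` both give "minority", while `classify(-2.3)` gives "majority".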

GP Framework for Classification

Evolutionary parameters:
- Initial population: ramped half-and-half
- Crossover: 60%
- Mutation: 30%
- Elitism: 10%

Training and test data:
Half of each data set was randomly chosen as the training set and the other half as the test set, both preserving the original class-imbalance ratio.
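The evolutionary setup can be captured in a small configuration sketch (the dictionary layout is ours, not the paper's):

```python
# GP run configuration from the paper (illustrative layout).
GP_PARAMS = {
    "initialisation": "ramped half-and-half",
    "crossover_rate": 0.60,
    "mutation_rate": 0.30,
    "elitism_rate": 0.10,
}

# The three genetic-operator rates account for the whole population
# each generation.
rate_total = sum(v for k, v in GP_PARAMS.items() if k.endswith("_rate"))
assert abs(rate_total - 1.0) < 1e-9
```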

Standard Fitness Function

The standard fitness measure is the success rate (overall accuracy):

f_overall = (TP + TN) / (TP + TN + FP + FN)

Standard Fitness Function

Two-class classification (confusion matrix):

                        Predicted Positive    Predicted Non-Positive
Actual Positive         TP                    FN
Actual Non-Positive     FP                    TN
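From the confusion matrix, the success rate is straightforward to compute; the sketch below (our naming) also shows why it can mislead on unbalanced data:

```python
def overall_accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Success rate: fraction of all examples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# A degenerate classifier that labels everything "majority" on a
# 10-vs-990 data set still scores 99% overall accuracy:
biased = overall_accuracy(tp=0, fn=10, fp=0, tn=990)
assert biased == 0.99
```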



Adapting the Standard Fitness Function

f_overall can be unsuitable: it favors solutions with a performance bias toward the majority class.

Fitness functions should be modified:
- to treat the accuracy of each class as equally important
- to improve the minority-class accuracy


New Fitness Function 1

Correctly classifying TPs may be more important than correctly classifying TNs.

Designed to investigate two aspects:
- How to effectively balance the TP and TN rates?
- Is the overall classification ability better?

New Fitness Function 1

f = W x (TP / (TP + FN)) + (1 - W) x (TN / (TN + FP))

where:
- W is the weight given to TPs and (1 - W) the weight given to TNs
- TP / (TP + FN) is the proportion of correctly classified minority-class objects (the TP rate)
- TN / (TN + FP) is the proportion of correctly classified majority-class objects (the TN rate)

New Fitness Function 1

- When W = 0.5: the two class accuracies are given equal importance.
- When W > 0.5: minority-class accuracy is given more importance, by a factor W.
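A minimal sketch of this weighted fitness (function and argument names are ours):

```python
def weighted_accuracy(tp: int, fn: int, fp: int, tn: int, w: float = 0.5) -> float:
    """W * TP-rate + (1 - W) * TN-rate."""
    tp_rate = tp / (tp + fn)   # minority-class accuracy
    tn_rate = tn / (tn + fp)   # majority-class accuracy
    return w * tp_rate + (1 - w) * tn_rate

# A classifier that labels everything as the majority class on a
# 10-vs-990 data set scores only 0.5 here, despite 99% overall accuracy:
assert weighted_accuracy(tp=0, fn=10, fp=0, tn=990, w=0.5) == 0.5
```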

New Fitness Function 2

Based on the correlation ratio:
- measures the relationship between the statistical dispersion within individual sets of observations and the dispersion across the whole population of observations.

The correlation ratio is adapted:
- to measure how well two sets of GP class observations (program outputs) are separated
- the higher the correlation ratio, the better the separation

Goal: explore the effectiveness of a separability-based evaluation metric.

New Fitness Function 2

The correlation ratio:

r = sqrt( sum_{c=1..M} Nc x (mean_c - mean)^2  /  sum_{c=1..M} sum_{i=1..Nc} (Pci - mean)^2 )

where:
- M = number of classes
- Nc = number of examples in class c
- mean = overall (population) mean of the program outputs
- mean_c = mean output for class c
- Pci = output of the classifier P for the i-th example of class c

The numerator sums the distances between the class means and the overall mean: the larger the distance, the larger the ratio. The denominator sums the distances between the individual observations and the population mean.

New Fitness Function 2

r returns values between 0 and 1:
- close to 1 => better separation
- close to 0 => poor separation

Separation should agree with the classification strategy:
- minority-class observations should be positive numbers
- majority-class observations should be negative numbers

New Fitness Function 2

The fitness function:

fitness = r + Ind

where Ind is an indicator function incorporating the ordering preference: it returns 1 if the means of the minority and majority classes are positive and negative respectively, and 0 otherwise.

Fitness values lie between 0 and 2:
- values close to 2 => optimal fitness
- values close to 0 => poor fitness
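Putting the pieces together, a self-contained sketch of the correlation ratio and the ordering-preference fitness (all names are ours):

```python
import math

def correlation_ratio(classes: list[list[float]]) -> float:
    """r: dispersion of the class means relative to total dispersion."""
    values = [v for cls in classes for v in cls]
    mu = sum(values) / len(values)                      # overall mean
    between = sum(len(c) * ((sum(c) / len(c)) - mu) ** 2 for c in classes)
    total = sum((v - mu) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0

def fitness_corr(minority: list[float], majority: list[float]) -> float:
    """fitness = r + Ind, where Ind enforces the ordering preference:
    minority mean positive and majority mean negative."""
    ind = 1 if (sum(minority) / len(minority) > 0
                and sum(majority) / len(majority) < 0) else 0
    return correlation_ratio([minority, majority]) + ind
```

Perfectly separated, correctly ordered outputs score 2.0; the same separation with the classes swapped scores only 1.0.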

Improved Fitness Function 1

A relatively recent improvement: equally weighted class accuracy plus a new objective, the level of error for each class.
- The level of error is estimated using the largest and smallest incorrect observations for a particular class.
- Class observations may be positive or negative, so the absolute value is taken.
- Values are scaled between 0 and 1: 1 => highest level of error, 0 => no error.

Improved Fitness Function 1

The fitness function:
- equal weights for the accuracy of both classes
- the smaller the level of error, the higher the fitness
- fitness values lie between 0 and 4: values close to 4 => optimal fitness, values close to 0 => poor fitness
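A sketch of the combination (summing the four terms is our assumption, chosen to match the stated 0-to-4 fitness range; the per-class error levels err_min / err_maj are taken as inputs already scaled to [0, 1]):

```python
def fitness_accuracy_error(tp: int, fn: int, fp: int, tn: int,
                           err_min: float, err_maj: float) -> float:
    """Equally weighted class accuracies plus (1 - level of error) per class.

    Each of the four terms lies in [0, 1], so fitness lies in [0, 4]:
    4 is optimal, 0 is poor.
    """
    tp_rate = tp / (tp + fn)
    tn_rate = tn / (tn + fp)
    return tp_rate + tn_rate + (1 - err_min) + (1 - err_maj)
```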

Improved Fitness Function 2

Uses the Wilcoxon-Mann-Whitney (WMW) statistic:
- a well-known approximation of the AUC, without having to compute the curve itself
- incorporates the separability-based metric directly into the program fitness
- expensive to compute

Improved Fitness Function 2

The fitness function:

f = ( sum over minority examples i and majority examples j of I(Pi, Pj) ) / (Nmin x Nmaj)

where I(Pi, Pj) is an indicator over pair-wise comparisons of the program outputs.

Each pair is rated on two metrics:
- classification accuracy (the minority example is correctly classified)
- separability (the minority output is ranked above the majority output)

The larger the value, the more the separability.
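A sketch of the pair-wise computation (our naming; the indicator condition, requiring a minority output to be both positive and larger than the majority output, is our reading of the two rating metrics):

```python
def fitness_wmw(min_outputs: list[float], maj_outputs: list[float]) -> float:
    """WMW-style AUC approximation over all (minority, majority) output pairs.

    A pair counts when the minority output is positive (classification
    accuracy) and ranked above the majority output (separability).
    """
    hits = sum(1 for p_i in min_outputs for p_j in maj_outputs
               if p_i > 0 and p_i > p_j)
    return hits / (len(min_outputs) * len(maj_outputs))
```

The O(Nmin x Nmaj) double loop is what makes this fitness expensive to compute.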

Summary
- Having an unbalanced data set may cause a performance bias towards the majority class.
- In GP, class-imbalance problems can be treated in two ways:
  - Adapting the fitness function by introducing new metrics (discussed today):
    - Weights
    - Correlation ratio (separability)
    - Levels of error
  - MOGP (to be discussed in the next presentation)

THANK YOU
