
Genetic Programming for Classification with Unbalanced Data

A research paper by Urvesh Bhowan, Mengjie Zhang, and Mark Johnston
(Evolutionary Computation Research Group, Victoria University of Wellington, New Zealand)

Presented by Noorulain and Amina Asif
Pattern Recognition Lab
Department of Computer Science & Information Sciences
Pakistan Institute of Engineering & Applied Sciences

OUTLINE
- Abstract of the paper
- Introduction to the basic concepts
  - Classification
  - Unbalanced data
  - Performance bias
- GP framework for classification
  - Program representation and classification strategy
  - Evolutionary parameters
  - Standard fitness function for classification
- Improving GP with new and improved fitness functions
  - 4 variations of the fitness function



Abstract of the paper

This paper compares two Genetic Programming (GP) approaches for classification with unbalanced data.
- The first adapts the fitness function to evolve classifiers with good classification ability across both the minority and majority classes.
- The second uses a multi-objective approach to simultaneously evolve a Pareto front (or set) of classifiers along the minority/majority class trade-off surface.

Introduction

Classification: a way of predicting class membership for a set of examples using properties (features) of the examples.

Unbalanced dataset: a data set with an uneven distribution of class examples.
- Minority class: only a small number of examples in the data set.
- Majority class: makes up the large part of the data set.

Introduction

Performance bias: poor accuracy on the minority class but high accuracy on the majority class.

Solution? Assign higher misclassification costs to minority-class examples.


GP Approaches

Two GP approaches are discussed:
- Adaptation of the fitness function
- Multi-Objective Genetic Programming (MOGP)


GP Framework for Classification

Program representation:
- Terminals: example features and constants
- Functions: +, -, x, % (protected division), and a conditional if

Classification strategy:
Translates the output of a genetic program (a floating-point number) into two class labels using the division between positive and non-positive output:
- Minority class: output is positive or 0
- Majority class: output is negative
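This sign-based translation can be sketched in a few lines (an illustration, not the authors' code; the function name is ours):

```python
def classify(program_output: float) -> str:
    """Map a genetic program's floating-point output to a class label.

    Per the classification strategy: zero or positive output -> minority
    class, negative output -> majority class.
    """
    return "minority" if program_output >= 0.0 else "majority"
```

For example, `classify(0.7)` and `classify(0.0)` both give "minority", while `classify(-2.3)` gives "majority".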

GP Framework for Classification

Evolutionary parameters:
- Initial population: ramped half-and-half
- Crossover: 60%
- Mutation: 30%
- Elitism: 10%

Training and test data:
Half of each data set was randomly chosen as the training set and the other half as the test set, both preserving the original class-imbalance ratio.
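The evolutionary setup can be captured in a small configuration sketch (the dictionary layout is ours, not the paper's):

```python
# GP run configuration from the paper (illustrative layout).
GP_PARAMS = {
    "initialisation": "ramped half-and-half",
    "crossover_rate": 0.60,
    "mutation_rate": 0.30,
    "elitism_rate": 0.10,
}

# The three genetic-operator rates account for the whole population
# each generation.
rate_total = sum(v for k, v in GP_PARAMS.items() if k.endswith("_rate"))
assert abs(rate_total - 1.0) < 1e-9
```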

Standard Fitness Function

The standard fitness measure is the success rate (overall accuracy):

f_overall = (TP + TN) / (TP + TN + FP + FN)

Standard Fitness Function

Two-class classification (confusion matrix):

                        Predicted Positive    Predicted Non-Positive
Actual Positive         TP                    FN
Actual Non-Positive     FP                    TN
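From the confusion matrix, the success rate is straightforward to compute; the sketch below (our naming) also shows why it can mislead on unbalanced data:

```python
def overall_accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Success rate: fraction of all examples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# A degenerate classifier that labels everything "majority" on a
# 10-vs-990 data set still scores 99% overall accuracy:
biased = overall_accuracy(tp=0, fn=10, fp=0, tn=990)
assert biased == 0.99
```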



Adapting the Standard Fitness Function

f_overall can be unsuitable: it favors solutions with a performance bias toward the majority class.

Fitness functions should be modified:
- to treat the accuracy of each class as equally important
- to improve the minority-class accuracy


New Fitness Function 1

Correctly classifying TPs may be more important than correctly classifying TNs.

Designed to investigate two aspects:
- How to effectively balance the TP and TN rates?
- Is the overall classification ability better?

New Fitness Function 1

f = W x (TP / (TP + FN)) + (1 - W) x (TN / (TN + FP))

where:
- W is the weight given to TPs and (1 - W) the weight given to TNs
- TP / (TP + FN) is the proportion of correctly classified minority-class objects (the TP rate)
- TN / (TN + FP) is the proportion of correctly classified majority-class objects (the TN rate)

New Fitness Function 1

- When W = 0.5: the two class accuracies are given equal importance.
- When W > 0.5: minority-class accuracy is given more importance, by a factor W.
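A minimal sketch of this weighted fitness (function and argument names are ours):

```python
def weighted_accuracy(tp: int, fn: int, fp: int, tn: int, w: float = 0.5) -> float:
    """W * TP-rate + (1 - W) * TN-rate."""
    tp_rate = tp / (tp + fn)   # minority-class accuracy
    tn_rate = tn / (tn + fp)   # majority-class accuracy
    return w * tp_rate + (1 - w) * tn_rate

# A classifier that labels everything as the majority class on a
# 10-vs-990 data set scores only 0.5 here, despite 99% overall accuracy:
assert weighted_accuracy(tp=0, fn=10, fp=0, tn=990, w=0.5) == 0.5
```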

New Fitness Function 2

Based on the correlation ratio:
- measures the relationship between the statistical dispersion within individual sets of observations and the dispersion across the whole population of observations.

The correlation ratio is adapted:
- to measure how well two sets of GP class observations (program outputs) are separated
- the higher the correlation ratio, the better the separation

Goal: explore the effectiveness of a separability-based evaluation metric.

New Fitness Function 2

The correlation ratio:

r = sqrt( sum_{c=1..M} Nc x (mean_c - mean)^2  /  sum_{c=1..M} sum_{i=1..Nc} (Pci - mean)^2 )

where:
- M = number of classes
- Nc = number of examples in class c
- mean = overall (population) mean of the program outputs
- mean_c = mean output for class c
- Pci = output of the classifier P for the i-th example of class c

The numerator sums the distances between the class means and the overall mean: the larger the distance, the larger the ratio. The denominator sums the distances between the individual observations and the population mean.

New Fitness Function 2

r returns values between 0 and 1:
- close to 1 => better separation
- close to 0 => poor separation

Separation should agree with the classification strategy:
- minority-class observations should be positive numbers
- majority-class observations should be negative numbers

New Fitness Function 2

The fitness function:

fitness = r + Ind

where Ind is an indicator function incorporating the ordering preference: it returns 1 if the means of the minority and majority classes are positive and negative respectively, and 0 otherwise.

Fitness values lie between 0 and 2:
- values close to 2 => optimal fitness
- values close to 0 => poor fitness
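Putting the pieces together, a self-contained sketch of the correlation ratio and the ordering-preference fitness (all names are ours):

```python
import math

def correlation_ratio(classes: list[list[float]]) -> float:
    """r: dispersion of the class means relative to total dispersion."""
    values = [v for cls in classes for v in cls]
    mu = sum(values) / len(values)                      # overall mean
    between = sum(len(c) * ((sum(c) / len(c)) - mu) ** 2 for c in classes)
    total = sum((v - mu) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0

def fitness_corr(minority: list[float], majority: list[float]) -> float:
    """fitness = r + Ind, where Ind enforces the ordering preference:
    minority mean positive and majority mean negative."""
    ind = 1 if (sum(minority) / len(minority) > 0
                and sum(majority) / len(majority) < 0) else 0
    return correlation_ratio([minority, majority]) + ind
```

Perfectly separated, correctly ordered outputs score 2.0; the same separation with the classes swapped scores only 1.0.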

Improved Fitness Function 1

A relatively recent improvement: equally weighted class accuracy plus a new objective, the level of error for each class.
- The level of error is estimated using the largest and smallest incorrect observations for a particular class.
- Class observations may be positive or negative, so the absolute value is taken.
- Values are scaled between 0 and 1: 1 => highest level of error, 0 => no error.

Improved Fitness Function 1

The fitness function:
- equal weights for the accuracy of both classes
- the smaller the level of error, the higher the fitness
- fitness values lie between 0 and 4: values close to 4 => optimal fitness, values close to 0 => poor fitness
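A sketch of the combination (summing the four terms is our assumption, chosen to match the stated 0-to-4 fitness range; the per-class error levels err_min / err_maj are taken as inputs already scaled to [0, 1]):

```python
def fitness_accuracy_error(tp: int, fn: int, fp: int, tn: int,
                           err_min: float, err_maj: float) -> float:
    """Equally weighted class accuracies plus (1 - level of error) per class.

    Each of the four terms lies in [0, 1], so fitness lies in [0, 4]:
    4 is optimal, 0 is poor.
    """
    tp_rate = tp / (tp + fn)
    tn_rate = tn / (tn + fp)
    return tp_rate + tn_rate + (1 - err_min) + (1 - err_maj)
```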

Improved Fitness Function 2

Uses the Wilcoxon-Mann-Whitney (WMW) statistic:
- a well-known approximation of the AUC, without having to compute the curve itself
- incorporates the separability-based metric directly into the program fitness
- expensive to compute

Improved Fitness Function 2

The fitness function:

f = ( sum over minority examples i and majority examples j of I(Pi, Pj) ) / (Nmin x Nmaj)

where I(Pi, Pj) is an indicator over pair-wise comparisons of the program outputs.

Each pair is rated on two metrics:
- classification accuracy (the minority example is correctly classified)
- separability (the minority output is ranked above the majority output)

The larger the value, the more the separability.
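A sketch of the pair-wise computation (our naming; the indicator condition, requiring a minority output to be both positive and larger than the majority output, is our reading of the two rating metrics):

```python
def fitness_wmw(min_outputs: list[float], maj_outputs: list[float]) -> float:
    """WMW-style AUC approximation over all (minority, majority) output pairs.

    A pair counts when the minority output is positive (classification
    accuracy) and ranked above the majority output (separability).
    """
    hits = sum(1 for p_i in min_outputs for p_j in maj_outputs
               if p_i > 0 and p_i > p_j)
    return hits / (len(min_outputs) * len(maj_outputs))
```

The O(Nmin x Nmaj) double loop is what makes this fitness expensive to compute.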

Summary
- Having an unbalanced data set may cause a performance bias towards the majority class.
- In GP, class-imbalance problems can be treated in two ways:
  - Adapting the fitness function by introducing new metrics (discussed today):
    - Weights
    - Correlation ratio (separability)
    - Levels of error
  - MOGP (to be discussed in the next presentation)

THANK YOU
