
Instructions on Using the Tool (Building a Tree-based Classification Model)

Step 1: Enter Your Data


(A) Enter your data in the Data worksheet, starting from cell L24.
The number of rows in your data should be between 10 and 10,000.
The application won't build a model for fewer than 10 data points, and
it can't handle more than 10,000 data points.
(B) The observations should be in rows and the variables should be in columns.
(C) Above each column, choose the appropriate Type (Omit, Class, Cont, Cat).
To drop a column from the model, set type = Omit
To treat a column as a categorical predictor, set type = Cat
To treat a column as a continuous predictor, set type = Cont
To treat a column as the Class variable, set type = Class
You can have at most 50 predictor variables.
You should have exactly 1 Class variable. The application will treat the Class variable as categorical.
Each of your categorical predictors (including the Class variable) should have at most 20 categories.

(D) Please make sure that your data does not have blank rows or blank columns.
(E) Your Class variable should not have any missing values.
(F) Continuous Predictors:
Any non-number in a Cont column will be treated as a missing value.
The application will replace it with the column median.
(G) Categorical Predictors:
Any blank cell or cell containing an Excel error in a Cat column will be treated as a missing value.
The application will replace it with the most frequently occurring category.
Category labels are case insensitive: the labels good, Good, GoOd, GOOD will all be treated as the same category.
There should be at least 2 observations in each category of a Cat column.
If one of the categories of a Cat column has only 1 observation, you should do one of the following:
Remove that observation, OR
Rename the category to one of the other categories of that Cat column.
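
The missing-value handling described for Cont and Cat columns can be sketched outside Excel as follows. This is only an illustration in Python, not the tool's own code: the function names are hypothetical, Excel error cells are not modeled (only blanks are), and every label is lower-cased to mimic the case-insensitive matching.

```python
from collections import Counter
from statistics import median


def impute_continuous(values):
    """Replace every non-numeric entry with the median of the numeric entries."""
    def is_number(v):
        # bool is excluded because Python treats True/False as integers
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    numeric = [v for v in values if is_number(v)]
    fill = median(numeric)
    return [v if is_number(v) else fill for v in values]


def impute_categorical(values):
    """Lower-case the labels and replace blanks with the most frequent label."""
    def is_missing(v):
        return v is None or str(v).strip() == ""

    cleaned = [None if is_missing(v) else str(v).strip().lower() for v in values]
    counts = Counter(v for v in cleaned if v is not None)
    fill = counts.most_common(1)[0][0]
    return [fill if v is None else v for v in cleaned]


# Made-up example columns (not taken from the workbook)
cont = impute_continuous([2, 14, "n/a", 33, 50])
cat = impute_categorical(["good", "Good", "", "BAD", "GoOd"])
print(cont)  # [2, 14, 23.5, 33, 50]
print(cat)   # ['good', 'good', 'good', 'bad', 'good']
```
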

Step 2: Fill up Model Inputs


(A) Fill up the model inputs in the User Input Page.
(B) Make sure that your inputs are within the range of values allowed by the application.
(C) Click the 'Build Tree' button (near cell K48) to start modeling.

Step 3: Results of Modeling


(A) When the run is over, the classification tree can be seen in the Tree sheet.
You can enter the values of the predictors here, and the class predicted by the tree is shown in cell H7.
(B) You can select a cell in any of the nodes and click the View Node button to see
detailed information on that node in the NodeView sheet.
(C) In cell F7 of the NodeView sheet you can enter any node number
to see the class distribution and other information about that node.

Step 4: Rule Generation


After the tree is grown, it is further processed to generate rules.
The rule sets generated here are for viewing ONLY. The tool does NOT have the capability of using these rules to classify new data points.
The Rule Summary Table tells you the quality of the individual rules. Quality is measured by 3 different metrics explained below.
For example, consider a rule: IF Petal Width > 7 and Petal Length <= 70 THEN Species = Setosa
In the context of the above rule, the quality metrics are explained as follows.

Support: % of training data for which the Left Hand Side (LHS) of the rule (i.e. PetalWidth > 7 and PetalLength <= 70) is true.
If the LHS of the rule is true for an observation, we say that the rule APPLIES to that observation.
This measures how widely applicable the rule is.

Confidence: Out of the training records for which the LHS of the rule is true, % of records for which the Right Hand Side is also true.
In other words, the % of the observations to which the rule applies for which the rule is true.
This measures the accuracy of the rule.

Capture: The % of Setosa records correctly captured by this rule. This is more of a reflection of the structure of the problem.
If there is a rule with Capture close to 100%, it means that, in the predictor space, all observations with this class
sit close to each other and the rule has been able to capture that part of the predictor space very well.

Suppose there are 150 observations in the training data, out of which 50 are Setosa.
Also assume that the above rule applies to 70 records, out of which 30 are Setosa.

Then,
Support: 47% (= 70 / 150) Confidence: 43% (= 30 / 70) Capture: 60% (= 30 / 50)
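
The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch; the function and argument names are hypothetical and simply restate the three definitions above.

```python
def rule_quality(n_total, n_class, n_applies, n_correct):
    """Return (support, confidence, capture) as fractions.

    n_total   -- training observations
    n_class   -- training observations of the class named on the rule's RHS
    n_applies -- observations for which the rule's LHS is true
    n_correct -- of those, observations that really have the RHS class
    """
    support = n_applies / n_total
    confidence = n_correct / n_applies
    capture = n_correct / n_class
    return support, confidence, capture


# Numbers from the Setosa example: 150 records, 50 Setosa, rule applies to 70, 30 of them Setosa
support, confidence, capture = rule_quality(150, 50, 70, 30)
print(f"Support: {support:.0%}  Confidence: {confidence:.0%}  Capture: {capture:.0%}")
# Support: 47%  Confidence: 43%  Capture: 60%
```
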

Few more points on User Inputs …

(1) Adjust for # categories of a categorical predictor

While growing the tree, child nodes are created by splitting parent nodes.
Which predictor to use for this split is decided by a certain criterion.
This criterion has an inherent bias towards choosing predictors with more categories.
This bias can be adjusted for by switching on this option.
(2) Minimum Node Size Criterion
This criterion is optional. However, if you select it then you should enter a valid minimum node size
expressed as % of total observations.
A valid minimum node size should be strictly greater than 0% and strictly less than 100%.
The higher this value, the SMALLER the tree will be.
(3) Maximum Purity Criterion
This criterion is optional. However, if you select it then you should enter a valid maximum purity
which should be between 0% and 100%.
The higher this value, the LARGER the tree will be.
(4) Maximum Depth Criterion
This criterion is optional. However, if you select it then you should enter a valid maximum depth
which should be an integer strictly greater than 1 and less than 20.
The higher this value, the LARGER the tree will be.
(5) Pruning Option
Ideally you should keep this option on by selecting YES.
This option is provided to let you study the effect of pruning.
(6) Training and Test data
You may like to use a subset of your data to build the model and the rest to study the performance of the model.
You may also use the whole data set as training data and not use any test data.
If you want to use test data, you may either randomly select the test set or use the last few rows as the test set.
If you are selecting the test set manually (Option 2), it is a good idea to check whether all the Class categories
are present in both the training and the test data.
(7) Save Model Option
If you choose YES here, the application will save the model in a separate workbook.
Since the worksheets in the application are protected, you will not be able to
edit the tree. In such a case you should save the tree in a separate file and experiment on it further.

Classification Tree Inputs

Node Splitting Criteria


Adjust for # categories of a categorical predictor
While splitting a node, the algorithm has a bias towards preferring predictors with more categories.
This can be adjusted by turning the above option on.

Leaf Node Criteria


While growing the tree, whether to stop splitting a node
and declare it a leaf node will be determined by the following criteria.
You may choose none, one or more criteria. If you choose none, the application will use default values.

Minimum Node Size (Default = 5 records)
Stop splitting a node if the number of records in that node is
1% or less of the total number of records

Maximum Purity (Default = 100% purity)
Stop splitting a node if its purity is 95% or more
(e.g. a purity of 90% means that 90% of the records in the node belong to the majority class)

Maximum Depth (Default = Maximum Depth 20)
Stop splitting a node if its depth is 6 or more
(The root node has depth 1. Any node's depth is its parent's depth + 1)

In addition to these criteria:

If, for any predictor, the values are identical for all records in the node,
that predictor can't be used to split the node.
So if this happens for all predictors in the node, the node can't be split any further.
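
The leaf-node criteria above can be read as a single stopping test. The sketch below is an illustration only: the function names and the `rows` argument (one tuple of predictor values per record) are hypothetical, and the default thresholds are the example values shown on this input page (1%, 95%, depth 6), not the application defaults.

```python
from collections import Counter


def purity(labels):
    """Share of records in the node that belong to its majority class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def is_leaf(labels, rows, n_total, depth,
            min_size_pct=1.0, max_purity_pct=95.0, max_depth=6):
    """Return True if any stopping criterion says the node must not be split."""
    if 100.0 * len(labels) / n_total <= min_size_pct:   # Minimum Node Size
        return True
    if 100.0 * purity(labels) >= max_purity_pct:        # Maximum Purity
        return True
    if depth >= max_depth:                              # Maximum Depth
        return True
    # Every predictor is constant within the node, so no split is possible
    return all(len(set(column)) == 1 for column in zip(*rows))


# Made-up node contents: 140 records, three classes, two predictors
root_labels = ["setosa"] * 50 + ["verginica"] * 50 + ["versicolor"] * 40
root_rows = [(i % 25, i % 70) for i in range(140)]

print(is_leaf(root_labels, root_rows, 140, depth=1))              # False
print(is_leaf(["setosa"] * 50, root_rows[:50], 140, depth=2))      # True (100% pure)
print(is_leaf(["a", "b", "a", "b"], [(1, 1)] * 4, 140, depth=2))   # True (constant predictors)
```
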

Tree Pruning Option
After growing the tree, do you want to prune? Yes

Training / Test Set
Partition data into Training / Test set: 1
If you want to partition, how do you want to select the test set?
Please choose one option: 2
Please fill up the input necessary for the selected option
Option 1: Randomly select 10% of data (between 1% and 50%) as test set
Option 2: Use last 10 rows of the data as test set
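
The two partitioning options, and the advice to check that every class appears in both parts, can be sketched as follows. The helper names, the fixed random seed and the 150 made-up class labels are assumptions for illustration; how the application itself draws the random sample is not documented here.

```python
import random


def split_random(rows, test_pct, seed=0):
    """Option 1: hold out a random test_pct percent of the rows."""
    n_test = round(len(rows) * test_pct / 100)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    test_idx = set(rng.sample(range(len(rows)), n_test))
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


def split_last_rows(rows, n_test):
    """Option 2: hold out the last n_test rows."""
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    return rows[:-n_test], rows[-n_test:]


def missing_classes(part, all_labels):
    """Classes that occur in the full data but not in this part."""
    return set(all_labels) - set(part)


# Made-up class column sorted by class, as a sorted data sheet would be
labels = ["setosa"] * 50 + ["verginica"] * 50 + ["versicolor"] * 50

train, test = split_last_rows(labels, 10)
print(len(train), len(test))                  # 140 10
print(sorted(missing_classes(test, labels)))  # ['setosa', 'verginica']

train_r, test_r = split_random(labels, 10)
print(len(train_r), len(test_r))              # 135 15
```

With sorted data, taking the last rows as the test set leaves it with a single class, which is exactly the situation the check is meant to catch.
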

Save model in a separate workbook? NO

Rule Generation Option
Do you want to generate Rules? Yes

Rule Cleaning Option

Minimum Confidence (Default = 50%)
Do not generate rules with confidence 95% or less

Minimum Support (Default = 0%)
Do not generate rules with support 30% or less
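
The effect of the two rule-cleaning thresholds can be illustrated with a small filter. This is a sketch under assumptions: the function name is hypothetical, the "or less" wording is read as a strict comparison, and the rule figures are copied from the Rule Summary Table shown later in this workbook (Rule0, the default rule without conditions, is left out).

```python
# (rule id, class, support %, confidence %) copied from the Rule Summary Table
RULES = [
    (1, "setosa", 35.7, 100.0),
    (2, "verginica", 32.1, 97.8),
    (3, "versicolor", 30.7, 88.4),
    (4, "verginica", 33.6, 95.7),
    (5, "verginica", 1.4, 100.0),
    (6, "verginica", 34.3, 91.7),
]


def clean_rules(rules, min_confidence=50.0, min_support=0.0):
    """Keep only rules whose confidence and support are strictly above the minimums."""
    return [r for r in rules if r[3] > min_confidence and r[2] > min_support]


kept_default = [r[0] for r in clean_rules(RULES)]
kept_strict = [r[0] for r in clean_rules(RULES, min_confidence=95.0, min_support=30.0)]
print(kept_default)  # [1, 2, 3, 4, 5, 6]
print(kept_strict)   # [1, 2, 4]
```
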
Enter your Data in this sheet
Instructions:
Start entering your data from cell G24. Specify the variable name in row 23.
Specify the variable type in row 22.
Class - Class variable
Cat - Categorical Predictor
Cont - Continuous Predictor
Omit - If you don't want to use the variable in the model

Var Type: Omit Class Cont Cont Omit Omit
Var Name: Species_No Species_name Petal_width Petal_length Sepal_width Sepal_length
1 Setosa 2 14 33 50
1 Setosa 2 10 36 46
1 Setosa 2 16 31 48
1 Setosa 1 14 36 49
1 Setosa 2 13 32 44
1 Setosa 2 16 38 51
1 Setosa 2 16 30 50
1 Setosa 4 19 38 51
1 Setosa 2 14 30 49
1 Setosa 2 14 36 50
1 Setosa 4 15 34 54
1 Setosa 2 14 42 55
1 Setosa 2 14 29 44
1 Setosa 1 14 30 48
1 Setosa 3 17 38 57
1 Setosa 4 15 37 51
1 Setosa 2 13 35 55
1 Setosa 2 13 30 44
1 Setosa 2 16 32 47
1 Setosa 2 12 32 50
1 Setosa 1 11 30 43
1 Setosa 2 14 35 51
1 Setosa 4 16 34 50
1 Setosa 1 15 41 52
1 Setosa 2 15 31 49
1 Setosa 4 17 39 54
1 Setosa 2 13 32 47
1 Setosa 2 15 34 51
1 Setosa 1 15 31 49
1 Setosa 2 15 37 54
1 Setosa 4 13 39 54
1 Setosa 3 13 23 45
1 Setosa 3 15 38 51
1 Setosa 2 15 35 52
1 Setosa 3 14 34 46
1 Setosa 5 17 33 51
1 Setosa 2 14 34 52
1 Setosa 6 16 35 50
1 Setosa 3 14 30 48
1 Setosa 2 19 34 48
1 Setosa 2 12 40 58
1 Setosa 2 14 32 46
1 Setosa 4 15 44 57
1 Setosa 2 15 34 52
1 Setosa 2 15 31 46
1 Setosa 3 13 35 50
1 Setosa 3 14 35 51
1 Setosa 2 16 34 48
1 Setosa 2 17 34 54
1 Setosa 2 15 37 53
3 Verginica 24 56 31 67
3 Verginica 23 51 31 69
3 Verginica 20 52 30 65
3 Verginica 19 51 27 58
3 Verginica 17 45 25 49
3 Verginica 19 50 25 63
3 Verginica 18 49 27 63
3 Verginica 21 56 28 64
3 Verginica 19 51 27 58
3 Verginica 18 55 31 64
3 Verginica 15 50 22 60
3 Verginica 23 57 32 69
3 Verginica 20 49 28 56
3 Verginica 18 58 25 67
3 Verginica 21 54 31 69
3 Verginica 25 61 36 72
3 Verginica 21 55 30 68
3 Verginica 22 56 28 64
3 Verginica 15 51 28 63
3 Verginica 23 59 32 68
3 Verginica 23 54 34 62
3 Verginica 25 57 33 67
3 Verginica 18 51 30 59
3 Verginica 23 53 32 64
3 Verginica 21 57 33 67
3 Verginica 18 60 32 72
3 Verginica 18 49 30 61
3 Verginica 23 61 30 77
3 Verginica 18 48 30 60
3 Verginica 20 51 32 65
3 Verginica 25 60 33 63
3 Verginica 18 55 30 65
3 Verginica 22 67 38 77
3 Verginica 21 66 30 76
3 Verginica 13 52 30 67
3 Verginica 20 64 38 79
3 Verginica 20 67 28 77
3 Verginica 14 56 26 61
3 Verginica 18 48 28 62
3 Verginica 24 56 34 63
3 Verginica 16 58 30 72
3 Verginica 21 59 30 71
3 Verginica 18 56 29 63
3 Verginica 23 69 26 77
3 Verginica 19 61 28 74
3 Verginica 18 63 29 73
3 Verginica 22 58 30 65
3 Verginica 19 53 27 64
3 Verginica 20 50 25 57
3 Verginica 24 51 28 58
2 Versicolor 13 45 28 57
2 Versicolor 16 47 33 63
2 Versicolor 14 47 32 70
2 Versicolor 12 40 26 58
2 Versicolor 10 33 23 50
2 Versicolor 10 41 27 58
2 Versicolor 15 45 29 60
2 Versicolor 10 33 24 49
2 Versicolor 14 39 27 52
2 Versicolor 12 39 27 58
2 Versicolor 15 42 30 59
2 Versicolor 13 44 23 63
2 Versicolor 15 49 25 63
2 Versicolor 11 30 25 51
2 Versicolor 13 36 29 56
2 Versicolor 14 44 30 66
2 Versicolor 17 50 30 67
2 Versicolor 15 45 22 62
2 Versicolor 14 46 30 61
2 Versicolor 11 39 25 56
2 Versicolor 15 45 32 64
2 Versicolor 15 45 30 54
2 Versicolor 14 44 31 67
2 Versicolor 10 35 26 57
2 Versicolor 13 42 26 57
2 Versicolor 13 42 27 56
2 Versicolor 16 45 34 60
2 Versicolor 10 35 20 50
2 Versicolor 13 54 29 62
2 Versicolor 10 50 22 60
2 Versicolor 12 47 28 61
2 Versicolor 13 41 28 57
2 Versicolor 15 49 31 69
2 Versicolor 13 40 25 55
2 Versicolor 14 48 28 68
2 Versicolor 15 46 28 65
2 Versicolor 18 48 32 59
2 Versicolor 16 51 27 60
2 Versicolor 13 40 28 61
2 Versicolor 11 38 24 55
2 Versicolor 12 44 26 55
2 Versicolor 15 45 30 56
2 Versicolor 12 42 30 57
2 Versicolor 13 56 29 66
2 Versicolor 10 37 24 55
2 Versicolor 15 47 31 67
2 Versicolor 13 41 30 56
2 Versicolor 13 43 29 64
2 Versicolor 14 47 29 61
2 Versicolor 13 40 23 55
Var Type (remaining columns): Omit
Var Name: MyStuff (an omitted column holding one of the labels a, b or c for each of the 150 rows)
Tree sheet

Predictors and values entered: Petal_width = 13, Petal_length = 40
Predicted Class: versicolor

Root                                    TRUE
  Node 1: Petal_width < 10              FALSE   Leaf: setosa      Support: 36%  Conf: 100%
  Node 2: Petal_width >= 10             TRUE
    Node 3: Petal_width < 18            TRUE
      Node 5: Petal_length < 50         TRUE
        Node 7: Petal_width < 17        TRUE    Leaf: versicolor  Support: 25%  Conf: 100%
        Node 8: Petal_width >= 17       FALSE   Leaf: verginica   Support: 1%   Conf: 100%
      Node 6: Petal_length >= 50        FALSE
        Node 9: Petal_length < 55       FALSE
          Node 11: Petal_width < 16     FALSE   Leaf: verginica   Support: 4%   Conf: 60%
          Node 12: Petal_width >= 16    FALSE   Leaf: versicolor  Support: 1%   Conf: 100%
        Node 10: Petal_length >= 55     FALSE   Leaf: verginica   Support: 1%   Conf: 100%
    Node 4: Petal_width >= 18           FALSE   Leaf: verginica   Support: 32%  Conf: 98%
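
How the Tree sheet arrives at a predicted class can be sketched by walking the node conditions listed above. This is an illustration, not the tool's code: the function name is hypothetical, and the class attached to each leaf assumes that the leaf labels are listed in the same order as the leaf nodes (Node 1, 7, 8, 11, 12, 10, 4).

```python
def predict_species(petal_width, petal_length):
    """Follow the node conditions of the Tree sheet from the root to a leaf."""
    if petal_width < 10:        # Node 1
        return "setosa"
    if petal_width >= 18:       # Node 4 (child of Node 2)
        return "verginica"
    # Node 3: 10 <= Petal_width < 18
    if petal_length < 50:       # Node 5
        return "versicolor" if petal_width < 17 else "verginica"   # Node 7 / Node 8
    # Node 6: Petal_length >= 50
    if petal_length >= 55:      # Node 10
        return "verginica"
    # Node 9: 50 <= Petal_length < 55
    return "verginica" if petal_width < 16 else "versicolor"       # Node 11 / Node 12


# The values entered on the Tree sheet
print(predict_species(13, 40))  # versicolor
```
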
NodeView sheet

Node ID 0
Non-leaf Node

Node Size
Number of Records 140
% of total Records 100.00%
Majority Class setosa
% Misclassified 64.29%

Class Distribution
Class Label Proportion
1 setosa 35.71%
2 verginica 35.71%
3 versicolor 28.57%

[Pie chart: Class Distribution (setosa 36%, verginica 36%, versicolor 29%)]
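
The NodeView figures for the root node follow directly from the class counts. The sketch below assumes counts of 50, 50 and 40 records (which reproduce the proportions shown) and breaks the tie for the majority class alphabetically; both the function name and the tie-break rule are assumptions, not documented behavior of the tool.

```python
from collections import Counter


def node_summary(labels):
    """Return (majority class, misclassified fraction, class proportions)."""
    counts = Counter(labels)
    n = len(labels)
    # Ties are broken alphabetically here (an assumption)
    majority = max(sorted(counts), key=counts.get)
    misclassified = 1 - counts[majority] / n
    proportions = {label: counts[label] / n for label in sorted(counts)}
    return majority, misclassified, proportions


root = ["setosa"] * 50 + ["verginica"] * 50 + ["versicolor"] * 40
majority, misclassified, proportions = node_summary(root)

print(majority)                  # setosa
print(f"{misclassified:.2%}")    # 64.29%
for label, share in proportions.items():
    print(label, f"{share:.2%}")  # 35.71%, 35.71%, 28.57%
```
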
Classification Tree Model

Tree Information
Number of Training observations 140
Number of Test observations 10
Number of Predictors 2
Class Variable Species_name
Number of Classes 3
Majority Class setosa
% Misclassified if Majority Class is used as Predicted Class 60%
Total Number of Nodes 12
Number of Leaf Nodes 7
Number of Levels 6

% Misclassified
On Training Data 2.14%
On Test Data 10.00%

Time Taken
Data Processing 0 Sec
Tree Growing 2 Sec
Tree Pruning 0 Sec
Tree Drawing 0 Sec
Classification using final tree 0 Sec
Rule Generation 1 Sec
Total 3 Sec

Confusion Matrix

Training Data (rows: True Class, columns: Predicted Class)
True Class    setosa  verginica  versicolor  Total
setosa            50          0           0     50
verginica          0         50           0     50
versicolor         0          3          37     40
Total             50         53          37    140

Test Data (rows: True Class, columns: Predicted Class)
True Class    setosa  verginica  versicolor  Total
versicolor         0          1           9     10
Total              0          1           9     10
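
The misclassification percentages can be recomputed from the confusion matrices. In this sketch the dictionary layout and the function name are hypothetical; the counts are the ones shown in the two matrices, with empty cells taken as zero.

```python
def misclassification_rate(matrix):
    """matrix[true_class][predicted_class] = number of records."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row.get(true_class, 0) for true_class, row in matrix.items())
    return (total - correct) / total


training = {
    "setosa": {"setosa": 50},
    "verginica": {"verginica": 50},
    "versicolor": {"verginica": 3, "versicolor": 37},
}
test = {
    "versicolor": {"verginica": 1, "versicolor": 9},
}

print(f"{misclassification_rate(training):.2%}")  # 2.14%
print(f"{misclassification_rate(test):.2%}")      # 10.00%
```
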
[Chart: How well do the Rules separate the Classes (x-axis: Observations, y-axis: Rule ID)]

Rule Summary Table (# Rules: 6)

Rule ID Class Length Support Confidence Capture


0 setosa 0 100.0% 35.7% 100.0%
1 setosa 1 35.7% 100.0% 100.0%
2 verginica 1 32.1% 97.8% 88.0%
3 versicolor 2 30.7% 88.4% 95.0%
4 verginica 1 33.6% 95.7% 90.0%
5 verginica 2 1.4% 100.0% 4.0%
6 verginica 1 34.3% 91.7% 88.0%

Rule Text

Rule0 Species_name = setosa

Rule1 IF Petal_width < 10
      THEN Species_name = setosa

Rule2 IF Petal_width >= 18
      THEN Species_name = verginica

Rule3 IF Petal_width < 17
      AND Petal_width >= 10
      THEN Species_name = versicolor

Rule4 IF Petal_width >= 17
      THEN Species_name = verginica

Rule5 IF Petal_length >= 55
      AND Petal_width < 18
      THEN Species_name = verginica

Rule6 IF Petal_length >= 50
      THEN Species_name = verginica
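
Which rules apply to a given observation can be worked out by testing each rule's left-hand side, as in the sketch below. The data structure and function name are hypothetical; Rule0 has no condition and is therefore treated as applying to every observation. Keep in mind that the tool shows these rules for viewing only and does not classify new data with them.

```python
# Rule id -> (condition on petal width w and petal length l, class on the right-hand side)
RULES = {
    1: (lambda w, l: w < 10, "setosa"),
    2: (lambda w, l: w >= 18, "verginica"),
    3: (lambda w, l: 10 <= w < 17, "versicolor"),
    4: (lambda w, l: w >= 17, "verginica"),
    5: (lambda w, l: l >= 55 and w < 18, "verginica"),
    6: (lambda w, l: l >= 50, "verginica"),
}


def applicable_rules(petal_width, petal_length):
    """Return the ids of all rules whose left-hand side is true for the record."""
    matched = [0]  # Rule0 is unconditional
    matched += [rid for rid, (cond, _) in RULES.items() if cond(petal_width, petal_length)]
    return matched


print(applicable_rules(2, 14))   # [0, 1]
print(applicable_rules(13, 40))  # [0, 3]
print(applicable_rules(24, 56))  # [0, 2, 4, 6]
print(applicable_rules(17, 45))  # [0, 4]
```
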
140
10

True Class ID, Obs Number
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
1 11
1 12
1 13
1 14
1 15
1 16
1 17
1 18
1 19
1 20
1 21
1 22
1 23
1 24
1 25
1 26
1 27
1 28
1 29
1 30
1 31
1 32
1 33
1 34
1 35
1 36
1 37
1 38
1 39
1 40
1 41
1 42
1 43
1 44
1 45
1 46
1 47
1 48
1 49
1 50
2 51
2 52
2 53
2 54
2 55
2 56
2 57
2 58
2 59
2 60
2 61
2 62
2 63
2 64
2 65
2 66
2 67
2 68
2 69
2 70
2 71
2 72
2 73
2 74
2 75
2 76
2 77
2 78
2 79
2 80
2 81
2 82
2 83
2 84
2 85
2 86
2 87
2 88
2 89
2 90
2 91
2 92
2 93
2 94
2 95
2 96
2 97
2 98
2 99
2 100
3 101
3 102
3 103
3 104
3 105
3 106
3 107
3 108
3 109
3 110
3 111
3 112
3 113
3 114
3 115
3 116
3 117
3 118
3 119
3 120
3 121
3 122
3 123
3 124
3 125
3 126
3 127
3 128
3 129
3 130
3 131
3 132
3 133
3 134
3 135
3 136
3 137
3 138
3 139
3 140
FullData Rule1 Rule2 Rule3 Rule4 Rule5 Rule6
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 1
0 4
0 3 6
0 3 6
0 3 6
0 3 5 6
0 3 5 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4
0 2 4 6
0 2 4
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 2 4 6
0 4 6
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3
0 3 6
0 3 6
0 3
0 3
0 3
0 3
0 3
0 3
0 3 6
0 3
0 3
0 2 4
