
DISCRIMINANT ANALYSIS

Statistics 407, ISU

WHAT IS?
Supervised classification, alternatively called discriminant analysis, includes multivariate techniques for finding ________ ______________________________, and using this rule to classify new observations. The process starts with a training sample, that is, the full data set with known classes. Typically the variables that will be used to generate the classification rule are easy/cheap to measure, but the class is more difficult to measure. It is important to be able to classify new observations using variables that are easy to measure.

VISUAL METHODS FOR DISCRIMINATION


Use ____________ to code the class/group information in the plots. Then use the full range of plotting methods described in the section on graphics. Look for separations of the points into the color/glyph groupings. Determine which variables are potentially good separators.

EXAMPLE: AUSTRALIAN CRABS


This data is from a study of Australian crabs. There are 5 physical measurements recorded on 2 species (blue and orange) and both sexes of each species, giving 4 groups. This is a scatterplot of the blue species with the two sexes identified.
[Figure: scatterplot of Frontal Lobe vs Rear Width for the blue species, with males and females marked separately.]
Where would you draw the boundary for this data?



LINEAR DISCRIMINANT ANALYSIS

LDA is based on the assumption that the data comes from a multivariate normal distribution with equal variance-covariance matrices. Comparing the density functions reduces the rule to:
Allocate a new observation $X_0$ to group 1 if

$(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} X_0 - \frac{1}{2}(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1}(\bar{X}_1 + \bar{X}_2) \geq 0,$

else allocate to group 2.

LDA RULE FOR P=1, G=2


LDA Boundary is where the two densities intersect

The LDA rule results from assuming that the data for each class comes from a MVN with different means but the same variance-covariance matrix. The boundary between the two groups is _________ ___________________.


EXAMPLE
The sample means for the blue crabs (Frontal Lobe, Rear Width) are

$\bar{X}_{Male} = (14.8,\ 11.7), \qquad \bar{X}_{Fem} = (13.3,\ 12.1), \qquad n_{Male} = 50,\ n_{Fem} = 50,$

with sample variance-covariance matrices

$S_{Male} = \begin{pmatrix} 10.3 & 6.5 \\ 6.5 & 4.5 \end{pmatrix}, \qquad S_{Fem} = \begin{pmatrix} 6.9 & 6.3 \\ 6.3 & 5.9 \end{pmatrix}.$

In general,

$S_{pooled} = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{(n_1 - 1) + (n_2 - 1)},$

which here gives

$S_{pooled} = \begin{pmatrix} 8.6 & 6.4 \\ 6.4 & 5.2 \end{pmatrix}, \qquad S_{pooled}^{-1} = \begin{pmatrix} 1.47 & -1.81 \\ -1.81 & 2.42 \end{pmatrix}.$

Notice the first part of the classification rule,

$(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} = (3.01,\ -3.86),$

where $(\bar{X}_1 - \bar{X}_2) = (14.8 - 13.3,\ 11.7 - 12.1) = (1.5,\ -0.4)$. This forms the coordinates of a vector giving the direction of maximum separation. It defines a 1-D line (vector) through 0, given by the equation $X_2 = -\frac{3.86}{3.01} X_1 \approx -1.26 X_1$. $X_0$ is projected onto this vector, as is the overall mean $\frac{1}{2}(\bar{X}_1 + \bar{X}_2)$.
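A minimal R sketch of this calculation, using the rounded summary statistics above (so the results differ slightly from the slide's unrounded values):

# Pooled covariance and discriminant direction for the blue crabs
xbar1 <- c(14.8, 11.7)                    # males: (Frontal Lobe, Rear Width)
xbar2 <- c(13.3, 12.1)                    # females
S1 <- matrix(c(10.3, 6.5, 6.5, 4.5), 2, 2)
S2 <- matrix(c(6.9, 6.3, 6.3, 5.9), 2, 2)
n1 <- 50; n2 <- 50
Sp <- ((n1 - 1) * S1 + (n2 - 1) * S2) / ((n1 - 1) + (n2 - 1))
w  <- solve(Sp) %*% (xbar1 - xbar2)       # direction of maximum separation
b  <- -0.5 * t(xbar1 - xbar2) %*% solve(Sp) %*% (xbar1 + xbar2)
c(w)                                      # compare with (3.01, -3.86)
# allocate a new observation x0 to males if t(w) %*% x0 + b >= 0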

EXAMPLE
Direction of maximum separation:

EXAMPLE

Data projected into the ___________________. The boundary between the groups is at _____.

[Figures: the data projected onto LD1, shown separately for males and females.]

EXAMPLE
The resulting rule is: classify the new observation $X_0$ as Male if

$(3.01,\ -3.86)\, X_0 + 2.93 \geq 0,$

else allocate as Female.

[Figure: the LDA space and boundary.]
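The same fit can be obtained with the lda() function in MASS; the group means on these slides match the blue species in the crabs data that ships with MASS, so (a sketch):

library(MASS)
blue <- subset(crabs, sp == "B")          # blue species; FL = frontal lobe, RW = rear width
fit  <- lda(sex ~ FL + RW, data = blue)
fit$scaling                               # coefficients of LD1, proportional to the rule above
head(predict(fit)$class)                  # predicted sex for the training cases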


INCORPORATING PRIORS

Let $p_1$ be the prior probability of an observation belonging to group 1, and $p_2$ the prior probability of an observation belonging to group 2, with $p_1 + p_2 = 1$. Typically $p_1, p_2$ are based on the sample sizes, so that $p_1 = \frac{n_1}{n_1+n_2}$, $p_2 = \frac{n_2}{n_1+n_2}$. It shifts the boundary _______ from the group with the highest prior.

There is also often a greater cost associated with incorrect classification. For example, classifying a growth as a benign lump when it is malignant has severe negative consequences for a patient, so it is important to be able to incorporate the relative risk of misclassification. Let $c(1|2)$ be the cost of misclassifying an observation as group 1 when it is really group 2, and $c(2|1)$ the cost of misclassifying an observation as group 2 when it is really group 1.

Then the classification rule becomes: allocate $X_0$ to group 1 if

$(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} X_0 - \frac{1}{2}(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1}(\bar{X}_1 + \bar{X}_2) \geq \ln\left(\frac{p_2}{p_1}\right).$

MISCLASSIFICATION TABLE

Using the training sample to calculate the error rate: after computing the classification rule, take all the training observations and predict their class membership based on this rule. Tabulate the predictions against the true class. This provides a table such as this:

                          Actual membership
                          Group 1               Group 2
Predicted    Group 1      n1C                   n2M = n2 - n2C
membership   Group 2      n1M = n1 - n1C        n2C

The apparent error rate is $\frac{n_{1M} + n_{2M}}{n_1 + n_2}$.

EXAMPLE
                     Actual
                     Male     Female
Predicted   Male      45          1
            Female     5         49

APR = ________
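A sketch of the apparent error rate computation in R, continuing the lda() fit on the blue crabs:

library(MASS)
blue <- subset(crabs, sp == "B")
fit  <- lda(sex ~ FL + RW, data = blue)
tab  <- table(Actual = blue$sex, Predicted = predict(fit)$class)
tab
1 - sum(diag(tab)) / sum(tab)             # apparent error rate, (n1M + n2M)/(n1 + n2)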

DISCRIMINANT FUNCTIONS

The LDA rule can be split into parts corresponding to each group, the discriminant (classification) functions:

$c_j = \bar{X}_j' S_{pooled}^{-1} X_0 - \frac{1}{2}\bar{X}_j' S_{pooled}^{-1} \bar{X}_j + \ln(p_j), \qquad j = 1, 2,$

and the rule is to allocate the new observation $X_0$ to the group with the largest $c_j$.

CLOSEST MEAN?

The LDA rule corresponds to allocating a new observation to the group that has the smallest squared Mahalanobis distance $d_j$ between the new observation and the group mean:

$d_j = \frac{1}{2}(X_0 - \bar{X}_j)' S_{pooled}^{-1}(X_0 - \bar{X}_j) - \ln(p_j), \qquad j = 1, 2.$

MORE THAN 2 GROUPS

The rules extend directly: with g groups, allocate the new observation $X_0$ to the group with the largest $c_j$, for $j = 1, \ldots, g$. The LDA discriminant functions are:

$c_j = \bar{X}_j' S_{pooled}^{-1} X_0 - \frac{1}{2}\bar{X}_j' S_{pooled}^{-1} \bar{X}_j + \ln(p_j), \qquad j = 1, \ldots, g.$

CANONICAL COORDINATES

When the variance-covariance matrices are equal, the low-dimensional space in which the groups are best separated is given by the canonical coordinates. These are the eigenvectors of $W^{-1}B$, where

$B = \sum_{i=1}^{g} n_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})', \qquad W = \sum_{i=1}^{g} (n_i - 1) S_i,$

g is the number of groups and $\bar{X}$ is the overall mean. For g groups at most $(g - 1)$ dimensions (given by the eigenvectors) are needed to separate the groups. E.g., for g = 3, 1 or 2 dimensions are needed.

[Figures: an example where a 1-D canonical space is enough, and an example where a 2-D canonical space is needed.]
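In R, lda() returns the canonical coordinates directly; a sketch using all four crabs groups (species by sex), which is an assumption here rather than one of the examples in these notes:

library(MASS)
crabs$group <- interaction(crabs$sp, crabs$sex)    # 4 groups: B.F, B.M, O.F, O.M
fit <- lda(group ~ FL + RW + CL + CW + BD, data = crabs)
ld  <- predict(fit)$x                              # at most g - 1 = 3 canonical coordinates
plot(ld[, 1], ld[, 2], col = crabs$group, xlab = "LD1", ylab = "LD2")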

QUADRATIC DISCRIMINANT ANALYSIS

Suppose that the variance-covariance matrices are ___________ for each group. Then the rule becomes: allocate a new observation $X_0$ to group 1 if

$-\frac{1}{2} X_0'(S_1^{-1} - S_2^{-1})X_0 + (\bar{X}_1' S_1^{-1} - \bar{X}_2' S_2^{-1})X_0 - \frac{1}{2}\left(\ln\frac{|S_1|}{|S_2|} + \bar{X}_1' S_1^{-1}\bar{X}_1 - \bar{X}_2' S_2^{-1}\bar{X}_2\right) \geq \ln\left(\frac{p_2}{p_1}\right),$

else allocate to group 2.

DISCRIMINANT FUNCTIONS

Allocate the new observation to the group with the largest value of the quadratic discriminant function:

$c_j = -\frac{1}{2} X_0' S_j^{-1} X_0 + \bar{X}_j' S_j^{-1} X_0 - \frac{1}{2}\ln(|S_j|) - \frac{1}{2}\bar{X}_j' S_j^{-1} \bar{X}_j + \ln(p_j), \qquad j = 1, 2.$
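A sketch of QDA in R with qda() from MASS, again on the blue crabs (assumed, as before, to be the source of the example):

library(MASS)
blue <- subset(crabs, sp == "B")
qfit <- qda(sex ~ FL + RW, data = blue)            # separate covariance matrix per group
table(Actual = blue$sex, Predicted = predict(qfit)$class)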

RELATIONSHIP BETWEEN LDA AND REGRESSION

The structure of the data for classification is a matrix of real-valued variables used to predict a categorical response:

$X_{n\times p} = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}, \qquad Y = \begin{pmatrix} 1 \\ \vdots \\ 1 \\ 2 \\ \vdots \\ 2 \end{pmatrix}$

LINEAR REGRESSION

Linear regression can be used to model this type of data:

$Y = b_0 + b_1 X_1 + \ldots + b_p X_p$

[Figure: fitted regression line for the two classes (coded 1 and 2) plotted against $X_1$.]

Problems: ________ ________________

LOGISTIC REGRESSION

When the response is categorical it is more common to model the data using a logistic regression model:

$p_k(X_0) = \dfrac{\exp(b_{k0} + \sum_{j=1}^{p} b_{kj} X_{0j})}{1 + \sum_{l=1}^{g-1}\exp(b_{l0} + \sum_{j=1}^{p} b_{lj} X_{0j})}, \quad k = 1, \ldots, g-1; \qquad p_g(X_0) = \dfrac{1}{1 + \sum_{l=1}^{g-1}\exp(b_{l0} + \sum_{j=1}^{p} b_{lj} X_{0j})},$

and the classification rule is to allocate to the group with the highest $p_k(X_0)$.
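For two groups this is a standard logistic regression, fit in R with glm(); for g > 2 groups, multinom() in the nnet package fits the model above. A sketch for the blue crabs:

library(MASS)
blue <- subset(crabs, sp == "B")
lfit <- glm(sex ~ FL + RW, data = blue, family = binomial)
phat <- predict(lfit, type = "response")           # P(sex == "M"), the second factor level
table(Actual = blue$sex, Predicted = ifelse(phat > 0.5, "M", "F"))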

When LDA is fitted using one variable alone on the olive oils, the lowest errors are obtained with oleic and linoleic acids; fitting LDA with linoleic plus one other variable, the lowest errors occur when the other variable is linolenic or arachidic. Oleic and linoleic acids are strongly correlated, so when oleic acid is used alone, palmitic acid emerges as the most important second variable, but lower error is obtained by using linoleic as the primary variable in conjunction with linolenic or arachidic. Why don't these variables emerge as important when oleic acid is used alone? LDA uses linear boundaries and assumes equal variance-covariance matrices. The tree algorithm, described next, instead builds classification rules by sequentially doing binary splits on individual variables.

CLASSIFICATION TREES

The tree algorithm generates classification rules by sequentially doing binary splits on the data. Splits are made on individual variables: on each variable the values are sorted, and the splits between each pair of values are examined using a criterion function. The criterion compares the purity of the cases to the left of the split, the proportion which are in each class, and similarly for the cases to the right of the split. A common criterion is entropy, which for two classes is

$-p_0 \log p_0 - p_1 \log p_1,$

where $p_0 = \frac{N_0}{N}$ and $p_1 = \frac{N_1}{N} = 1 - p_0$ are the relative proportions of cases in classes 0 and 1. This is lowest if either $N_0$ or $N_1$ is 0. A good split has pure groups on each side (bucket): all class 0 to the left and all class 1 to the right. To measure the quality of a split we measure the impurity in each bucket and combine the buckets as a weighted average:

$p^L\,(-p_0^L \log p_0^L - p_1^L \log p_1^L) + p^R\,(-p_0^R \log p_0^R - p_1^R \log p_1^R),$

where $p^L$ and $p^R$ are the proportions of cases in the left and right buckets, respectively. This is a weighted average of the impurity, as measured by entropy, in each bucket.

ALGORITHM
1. For each ________, and for each possible ______, calculate the impurity measure.
2. Pick the split with the smallest impurity; ______ the data into two using this split. Each split is called a ____ on the resulting tree.
3. On each subset, repeat steps 1-2.
4. Splitting a node is controlled by the number of cases in the subset at that node, and also the amount of impurity at the node. Stop splitting when either of these gets below a tolerance.

A small worked sketch of the impurity calculation in step 1 is given below.
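A minimal R sketch of the impurity computation (the weighted-entropy criterion above), evaluated for every candidate split of the toy data used on the "A closer look" slide later; the function names are just for illustration:

# Weighted-entropy impurity of splitting variable x at value s
entropy <- function(cls) {
  p <- table(cls) / length(cls)
  -sum(ifelse(p > 0, p * log(p), 0))
}
split_impurity <- function(x, cls, s) {
  left <- cls[x < s]; right <- cls[x >= s]
  (length(left) / length(cls)) * entropy(left) +
    (length(right) / length(cls)) * entropy(right)
}

# Toy data: x = 1..8, first four cases in class 1, last four in class 2
x   <- 1:8
cls <- c(1, 1, 1, 1, 2, 2, 2, 2)
s   <- (x[-1] + x[-8]) / 2                # candidate split points (midpoints)
round(sapply(s, function(si) split_impurity(x, cls, si)), 3)
# the minimum (0) is at the split between points 4 and 5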


DEVIANCE: MEASURING FIT


The deviance at a node i is defined to be:

$D_i = -\sum_{k=1}^{g} p_{ik} \log p_{ik},$

and thus the deviance for the classifier is $\sum_{i=1}^{T} D_i$, where T is the number of terminal nodes.

EXAMPLE: OLIVE OILS 3 REGIONS, ALL VARIABLES

> library(rpart)
> olive.rp <- rpart(d.olive[,1]~., d.olive[,-c(1,2)],
+                   method="class", parms=list(split='information'))
> olive.rp
n= 572

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 572 249 1 (0.5646853 0.1713287 0.2639860)
  2) eicosenoic>=6.5 323   0 1 (1.0000000 0.0000000 0.0000000) *
  3) eicosenoic< 6.5 249  98 3 (0.0000000 0.3935743 0.6064257)
    6) linoleic>=1053.5  98   0 2 (0.0000000 1.0000000 0.0000000) *
    7) linoleic< 1053.5 151   0 3 (0.0000000 0.0000000 1.0000000) *

node) is the arbitrary numbering of nodes from top to bottom of the tree
split is the ____ for the split from that node
n is the _______ of cases at this node
loss is the number of cases __________ at this node
yval is the _________ value for all cases at this node
(yprob) are the __________ in each class
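The apparent error of this tree can be checked by predicting the training cases (a sketch, assuming d.olive as above with region in column 1):

library(rpart)
pred <- predict(olive.rp, type = "class")
table(Actual = d.olive[, 1], Predicted = pred)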

EXAMPLE: OLIVE OILS 3 REGIONS, ALL VARIABLES

The first split is on __________ acid and the next split is on __________ acid. It only uses these ____ variables! And there is __________!

The R code that produced the scatterplot in Figure 8:

par(mar=c(4,4,1,1))
plot(d.olive[,10], d.olive[,7], type="n", xlab="Eicosenoic", ylab="Linoleic")
points(d.olive[d.olive[,1]==1,10], d.olive[d.olive[,1]==1,7], pch="1")
points(d.olive[d.olive[,1]==2,10], d.olive[d.olive[,1]==2,7], pch="2")
points(d.olive[d.olive[,1]==3,10], d.olive[d.olive[,1]==3,7], pch="3")
abline(v=6.5)
lines(c(0,6.5), c(1053.5,1053.5))

Figure 8: Tree classifier on the three regions of the olive oils data, and a scatterplot illustrating the splits. The misclassification rate is 0.

A CLOSER LOOK.....

Let's take a closer look at how the tree algorithm works. Consider the data x = (1, 2, 3, 4, 5, 6, 7, 8) with class = (1, 1, 1, 1, 2, 2, 2, 2). Then all possible splits would be as follows (class 1 and class 2 counts in each bucket):

Left    Right
(1,0)   (3,4)
(2,0)   (2,4)
(3,0)   (1,4)
(4,0)   (0,4)
(4,1)   (0,3)
(4,2)   (0,2)
(4,3)   (0,1)

Calculate the impurity for each possible split: the lowest value is between points 4 and 5. That's the split to use.

How does it work for a nonsensical class structure? Consider data x = (1, 2, 3, 4, 5, 6, 7, 8) and class = (1, 2, 1, 2, 1, 2, 1, 2). The rightmost plot in Figure 9 shows the entropy values for this case.

Figure 9: Entropy split criterion, for two simple data sets.

There are two other common impurity measures:
Gini: $p_0 p_1$, which can be interpreted as the mean square error when fitting means to a Bernoulli variable.
Misclassification: $\min(p_0, p_1)$.
These are typically combined for each bucket using a weighted average: $p^L \times$ left bucket purity $+\ p^R \times$ right bucket purity. Buja and Lee (2001) suggest alternatives for combining the bucket purities, for example taking the minimum or the maximum of the two.

HOW DOES IT WORK ON THE OLIVE OILS DATA?


In practice the impurity functions can be quite _____. The next two sets of plots show the impurity measure calculated to separate the (1) southern oils from the other two regions, and (2) northern from Sardinian oils. ________ acid is the variable with the lowest impurity overall, 0. It would be chosen as the most important variable at the top of the tree. ________ acid is the variable with the lowest impurity, 0, when region 1 is removed. It would be chosen as the second split variable.


Figure 10: Entropy split criterion for each variable, for separating southern oils from the other regions' oils.

Figure 11: Entropy split criterion for each variable, for separating Sardinian oils from northern oils.


STRENGTHS AND WEAKNESSES



The solutions are usually ________, and easy to implement. There are few probabilistic assumptions underlying trees, which complicate the solution. For example, because LDA assumes that the variance-covariances of the groups are equal, it doesn't see the ``perfect'' split of northern and Sardinian oils in linoleic acid. The fitting ___________________ in the sense that the first best fit is used at each split, but a better final result might be obtained by a less optimal previous step.


STRENGTHS AND WEAKNESSES


The additive model approach, _______________, is too limited for problems where separations between groups are due to combinations of variables. But because it works variable-by-variable it can ______________________, using complete data on each variable. Trees can also accommodate complex data, where some variables are continuous and some are categorical. Because it is an algorithmic method it can be easy to ___________ (_______) the data. The tree will then not have inferential power: it will have worse error on new data. Split the current data into training and test sets, use the training subset to build the tree, and the test set to estimate the error, as sketched below.
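A sketch of the training/test split in R for the tree on the olive oils (assuming the d.olive data frame from the earlier example):

library(rpart)
set.seed(407)
n     <- nrow(d.olive)
train <- sample(n, round(2 * n / 3))               # two-thirds for training
fit   <- rpart(d.olive[train, 1] ~ ., d.olive[train, -c(1, 2)], method = "class")
pred  <- predict(fit, d.olive[-train, -c(1, 2)], type = "class")
mean(as.character(pred) != as.character(d.olive[-train, 1]))   # test error estimate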


TREES DON'T DO SO WELL IN THE PRESENCE OF COVARIANCE BETWEEN VARIABLES

Figure 12: Trees don't do well when the best split is in a linear combination of variables.


OTHER COMMON CLASSIFICATION METHODS

_______________ - fit many trees to samples of the data, and subsets of the variables, and combine the predictions.
_______________ - a mixture of logistic regression models.
Feed-forward Neural Networks - loosely based on the neuron systems in organisms, where dendrites pass information through a network based on a chemical threshold: as the level of a chemical builds up in the neuron it reaches a threshold at which it fires off the chemical signal to the next neuron.
____________________ - find gaps between groups and fit a hyperplane to the points bordering the gaps.
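The first item in this list is what the randomForest package implements; a minimal sketch (assuming the package is installed, and d.olive is the olive oils data frame used earlier):

library(randomForest)
rf <- randomForest(x = d.olive[, -c(1, 2)], y = factor(d.olive[, 1]), ntree = 500)
rf$confusion                                       # out-of-bag confusion matrix and class errors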


NEURAL NETWORK

Feed-forward neural networks (FFNN) were developed from this concept: combining small components is a way to build a model from predictors to response. They actually generalize linear regressions. A simple network model is represented by:

$y = f(x) = \phi\left(\alpha + \sum_{h=1}^{s} w_h\, \phi\!\left(\alpha_h + \sum_{i=1}^{p} w_{ih} x_i\right)\right)$

where x is the vector of explanatory variable values, y is the target value, p is the number of variables, s is the number of nodes in the single hidden layer, and $\phi$ is a fixed function, usually a linear or logistic function. This model has a single hidden layer and univariate output values.


Neural networks arrived with a lot of hype, making them seem magical and mysterious, but this is just another model; in this simplest case a neural network is similar to projection pursuit regression. The network is fit by minimizing a squared error, $\sum_{i=1}^{n}(y_i - f(x_i))^2$, or an entropy criterion, $\sum_{i=1}^{n} y_i \log f(x_i)$; the method commonly used for the minimization is a gradient descent method, computed by sweeping forward and backward over the network. The response variable can be multivariate, and a simple linear regression model, $y = w_0 + \sum_{j=1}^{p} w_j x_j$, can be represented as a feed-forward neural network. It is important to have enough data points to adequately estimate the parameters, and to maintain predictive accuracy by avoiding overfitting the data; the fit can also depend on the initial estimates and the convergence.
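A sketch of fitting such a single-hidden-layer network in R with nnet() (the function mentioned in the notes), using the blue crabs classification from the earlier examples:

library(MASS)                                      # crabs data
library(nnet)
blue <- subset(crabs, sp == "B")
set.seed(407)
nn <- nnet(sex ~ FL + RW, data = blue, size = 2,   # s = 2 hidden nodes
           decay = 0.01, maxit = 500, trace = FALSE)
table(Actual = blue$sex, Predicted = predict(nn, blue, type = "class"))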

SUPPORT VECTOR MACHINES

SVMs are binary classifiers. SVM algorithms work by finding the subset of points in the two groups that can be separated which gives the largest separation between the two classes; these points are then known as the support vectors. The support vectors can be used to define a separating hyperplane,

$w \cdot x + b = 0, \qquad w = \sum_{i=1}^{N_S} \alpha_i y_i x_i,$

where $N_S$ is the number of support vectors and the coefficients satisfy the constraint $\sum_{i=1}^{N_S} \alpha_i y_i = 0$. The algorithm finds the hyperplane that maximizes the margin (gap) between the two classes. Modifications allow non-linear kernel functions to be incorporated, giving non-linear separations, and allow SVMs to perform well when the classes are not separable.
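A sketch with the svm() function in the e1071 package (an assumption here; the package is not part of these notes), again on the blue crabs:

library(MASS)
library(e1071)
blue <- subset(crabs, sp == "B")
sv <- svm(sex ~ FL + RW, data = blue, kernel = "linear")
sv$tot.nSV                                         # number of support vectors
table(Actual = blue$sex, Predicted = predict(sv, blue))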

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http:// creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

