7.1

Introduction to Cluster Analysis

We use cluster analysis when we have no idea regarding what the data is all about.

While we often think of statistics as giving definitive answers to well-posed questions, there
are some statistical techniques that are used simply to gain further insight into a group of
observations. One such technique (which encompasses lots of different methods) is cluster
analysis. The idea of cluster analysis is that we have a set of observations, on which we
have available several measurements. Using these measurements, we want to find out if the
observations naturally group together in some predictable way. For example, we may have
recorded physical measurements on many animals, and we want to know if there's a natural
grouping (based, perhaps, on species) that distinguishes the animals from one another. (This use
of cluster analysis is sometimes called numerical taxonomy.) As another example, suppose
we have information on the demographics and buying habits of many consumers. We could
use cluster analysis on the data to see if there are distinct groups of consumers with similar
demographics and buying habits (market segmentation).

We basically use this algorithm to investigate the data and to see if there's any relation between the data, i.e., whether observations naturally group together in some predictable way.

It's important to remember that cluster analysis isn't about finding the right answer;
it's about finding ways to look at data that allow us to understand the data better. For
example, suppose we have a deck of playing cards, and we want to see if they form some
natural groupings. One person may separate the black cards from the red; another may
break the cards up into hearts, clubs, diamonds and spades; a third person might separate
cards with pictures from cards with no pictures, and a fourth might make one pile of aces,
one of twos, and so on. Each person is right in their own way; in cluster analysis, there's
really not a single correct answer.
Another aspect of cluster analysis is that there are an enormous number of possible ways
of dividing a set of observations into groups. Even if we specify the number of groups,
the number of possibilities is still enormous. For example, consider the task of dividing 25
observations into 5 groups. (25 observations is considered very small in the world of cluster
analysis.) It turns out there are about 2.4 × 10^15 different ways to arrange those observations into
5 groups. If, as is often the case, we don't know the number of groups ahead of time, and
we need to consider all possible numbers of groups (from 1 to 25), the number is more than
4 × 10^18! So any technique that simply tries all the different possibilities is doomed to failure.
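
The counts quoted above can be checked with a short calculation. The sketch below is not part of the original notes. It assumes the groups are unlabeled and non-empty, so the count for exactly k groups is a Stirling number of the second kind, built up with the recurrence S(n, k) = k * S(n-1, k) + S(n-1, k-1). The function name stirling2_table is made up for this illustration.

```r
# Number of ways to split n observations into exactly k non-empty,
# unlabeled groups (Stirling numbers of the second kind), using
#   S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1)
# stirling2_table is a hypothetical helper name, not from the notes.
stirling2_table <- function(n) {
  S <- matrix(0, nrow = n + 1, ncol = n + 1)  # S[i + 1, j + 1] holds S(i, j)
  S[1, 1] <- 1                                # S(0, 0) = 1
  for (i in 1:n) {
    for (j in 1:i) {
      S[i + 1, j + 1] <- j * S[i, j + 1] + S[i, j]
    }
  }
  S
}

S <- stirling2_table(25)
five_groups <- S[26, 6]          # S(25, 5): exactly 5 groups
any_groups  <- sum(S[26, 2:26])  # any number of groups from 1 to 25

format(five_groups, scientific = TRUE, digits = 3)
format(any_groups, scientific = TRUE, digits = 3)
```

The values are stored as ordinary R doubles, so the total over all group counts is only approximate at this size, but that is enough to confirm the orders of magnitude.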

7.2

Standardization

There are two very important decisions that need to be made whenever you are carrying out
a cluster analysis. The first regards the relative scales of the variables being measured. We'll
see that the available cluster analysis algorithms all depend on the concept of measuring the
distance (or some other measure of similarity) between the different observations we're trying
to cluster. If one of the variables is measured on a much larger scale than the other variables,
then whatever measure we use will be overly influenced by that variable. For example, recall
the world data set that we used earlier in the semester. Here's a quick summary of the mean
values of the variables in that data set:
> apply(world1[-c(1,6)],2,mean,na.rm=TRUE)
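
As a separate illustration of the scaling problem described above, here is a minimal sketch using made-up data (not the world data set). It assumes Euclidean distance via dist() and uses scale() as one possible way to standardize, by centering each column and dividing by its standard deviation. The data frame toy and its column names are hypothetical.

```r
# Hypothetical data: two variables measured on very different scales
toy <- data.frame(
  income = c(20000, 21000, 50000, 51000),  # large scale
  rate   = c(0.10, 0.90, 0.12, 0.88)       # small scale
)
rownames(toy) <- c("A", "B", "C", "D")

# Euclidean distances on the raw values: income dominates
raw_dist <- as.matrix(dist(toy))

# Distances after centering each column and scaling it to unit
# standard deviation: both variables now contribute
std_dist <- as.matrix(dist(scale(toy)))

raw_ratio <- raw_dist["A", "C"] / raw_dist["A", "B"]
std_ratio <- std_dist["A", "C"] / std_dist["A", "B"]
raw_ratio
std_ratio
```

On the raw values, A looks far closer to B than to C because only income matters. After standardizing, the two distances are of similar size, since the large difference in rate between A and B is no longer swamped.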