
Cluster Analysis Tool

Created: 6/24/2002

Description:
This sample contains a command that uses a cluster analysis engine (available as a separate sample) to classify data into a predetermined number of classes using multiple numeric attributes. The command adds a field containing the classification to the source data table and generates a dendrogram to help interpret the classification. A detailed discussion of cluster analysis and the use of this sample follows.

Cluster Analysis

Cluster analysis is a technique for classifying numerical data using multiple attributes. It is an exploratory technique whose goal is to help you better understand what patterns exist in a given data set and to propose explanations for those patterns. Cluster analysis can be particularly useful when combined with mapping, because the clusters that emerge may form geographic patterns that lead to insights about connections between patterns in attribute data and the spatial context within which those patterns formed.

To understand how cluster analysis works, it's useful to think about the technique spatially. If cluster analysis is used on two variables, it can be thought of as finding clusters in two-dimensional space. For example, in the small map below, cluster analysis is performed on the 50 states and the District of Columbia using the two variables of median rent and median home value.

The Scatterplot

The graph below is called a scatterplot. A scatterplot contains a point for each record in the dataset (in this case, a point for each state). The X-axis of the scatterplot represents median home value and the Y-axis represents median rent, creating an attribute space within which we can identify clusters. Points in the upper right-hand corner of the scatterplot are states that have high median rent and high median home value, while points in the lower left are states that have low median rent and low median home value. (Note: The scatterplot pictured here was generated using the Scatterplot Tool, which can be found in ArcObjects Developer Help under Samples > Analysis and Visualization > Scatterplot Tool.)

Cluster analysis looks at the patterns that these points form in data space. The type of cluster analysis that we used here starts out by placing each point in the scatterplot in its own cluster. It then looks to see which two points are closest to one another (in data space). Those two points are added together to create a new, larger cluster. The process then repeats itself, finding the two clusters that are closest to one another and "lumping" them together into a larger cluster. The process is complete when all individuals in the data set have been lumped together into one big cluster.

The Dendrogram

One product of cluster analysis is a tree diagram representing the entire process of going from individual points to one big cluster. This diagram is called a dendrogram, and is illustrated below. Once the cluster analysis algorithm has been run, the user must decide how many clusters he or she wants to explore (this is sometimes referred to as "pruning" the dendrogram). In this example we have chosen to look at four clusters (symbolized in red, yellow, green, and blue).
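The merging procedure described above can be sketched in a few lines of Python. This is only an illustration, not the Clustering Engine's code: the function names, the sample points, and the choice of Euclidean distance with the Minimum Distance (Single Link) rule are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points in data space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, merges). Each cluster is a list of point indices;
    each merge is (members_a, members_b, distance). Uses single link:
    the distance between two clusters is the shortest distance between
    any two of their members.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > max(n_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical one-dimensional data, written as 2-D points for clarity.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (50.0, 0.0)]

everything, merges = agglomerate(points)            # run to one big cluster
three, _ = agglomerate(points, n_clusters=3)        # stop early ("pruning")

for a, b, d in merges:
    print(a, "+", b, "at distance", d)
print("three clusters:", sorted(three))
```

Stopping the loop when a chosen number of clusters remains gives the same grouping as running the full process and then pruning the dendrogram at that number of clusters.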

Deciding the number of clusters to map can be aided by looking at the dendrogram. There are three key pieces of information that you can get from the dendrogram. In the dendrogram above, the yellow cluster is labeled so that you can see the parts of it that represent these pieces of information. They are:

- Weight - the rough percentage of all individuals that fall within each cluster
- Compactness - how similar to one another the elements of a cluster are
- Distinctness - how different one cluster is from its closest neighbor

The weight of each cluster is represented by the number of leaves that its branch of the dendrogram leads to. Because the leaves are equally spaced along the Y-axis of the dendrogram, the weight of a cluster is its percentage of the total height of the dendrogram.

The compactness of a cluster represents the minimum distance at which the cluster comes into existence. The horizontal axis of the dendrogram measures the distance between clusters. If a cluster contains only one observation, its compactness is 0. This is why all the leaves line up on the left-hand side of the dendrogram. The relative compactness of the yellow cluster can be estimated by finding the point at which all of its branches merge together and noting how far that point lies from the left-hand side of the dendrogram.

The distinctness of a cluster is the distance along the X-axis from the point at which it comes into existence to the point at which it is aggregated into a larger cluster. Distinctness can be seen on the dendrogram as the length of a branch along the horizontal axis.
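The three measures can be derived directly from a merge history. The sketch below assumes a hypothetical input format (leaves numbered from 0, each merge given as a pair of child node IDs plus the merge distance) and expresses weight as a fraction of all leaves rather than a percentage; none of these names come from the sample itself.

```python
def describe_clusters(n_leaves, merges):
    """Compute weight, compactness, and distinctness for every node.

    merges is a list of (child_a, child_b, distance). Leaves have IDs
    0..n_leaves-1; the node created by merge number k gets ID n_leaves + k.
    """
    nodes = {
        i: {"leaf_count": 1, "compactness": 0.0, "parent": None}
        for i in range(n_leaves)
    }
    for step, (a, b, dist) in enumerate(merges):
        nid = n_leaves + step
        nodes[nid] = {
            "leaf_count": nodes[a]["leaf_count"] + nodes[b]["leaf_count"],
            "compactness": dist,   # distance at which the cluster forms
            "parent": None,
        }
        nodes[a]["parent"] = nid
        nodes[b]["parent"] = nid
    for node in nodes.values():
        node["weight"] = node["leaf_count"] / n_leaves
        parent = node["parent"]
        # Distinctness: distance at which the node is absorbed into its
        # parent minus the distance at which it formed. The root is never
        # absorbed, so its distinctness is left undefined (None).
        node["distinctness"] = (
            None if parent is None
            else nodes[parent]["compactness"] - node["compactness"]
        )
    return nodes

# Hypothetical merge history for five leaves.
merges = [(0, 1, 1.0), (2, 3, 1.0), (5, 6, 9.0), (4, 7, 39.0)]
nodes = describe_clusters(5, merges)
print(nodes[7])  # the cluster holding leaves 0-3
```

In this example node 7 forms at distance 9 and is absorbed at distance 39, so it is both heavy (four of five leaves) and distinct (a branch of length 30), which is the kind of cluster the dendrogram makes easy to spot.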

When choosing a classification, you will want to choose clusters that are as compact and distinct as possible. In the example above, there are four very distinct clusters. However, we could have chosen to break the green cluster into its two components, which are both fairly distinct clusters as well, and are more compact. You may wish to run the cluster analysis a number of times, choosing different numbers of clusters, and exploring how the mapped patterns of those clusters are distributed.

Interpreting the Clusters

Once you have chosen the clusters that you will use to symbolize your map, go back to the scatterplot to gain a better understanding of what those clusters mean. In this case, the red cluster represents states with high median rent and high median home value. The blue cluster represents states with low median rent and low median home value. The yellow and green clusters fall somewhere in between.

While it is fairly easy to interpret cluster analysis when it is performed on just two variables, it becomes much more difficult as more variables are added. One tool that you can use for interpreting a more complex run of cluster analysis is a scatterplot matrix (pictured below). The Scatterplot Tool, which can be found in ArcObjects Developer Help under Samples > Analysis and Visualization > Scatterplot Tool, can also be used to generate a scatterplot matrix. The example below shows a scatterplot matrix of cities in the United States with population greater than fifteen thousand, classified by the ethnicity variables from the 2000 census.

The User Interface

The user interface for the cluster analysis sample consists of a context command that must be placed in the Feature Layer context menu. When executed, that command opens a form that allows you to specify parameters and then to run the cluster analysis algorithm. Running the algorithm produces a dendrogram and a new field in the source table storing the new classification.

The dendrogram consists of a dataframe with three layers: two point layers (the first contains the nodes and the second the leaves) and a line layer (contains the branches). Each dendrogram has one leaf corresponding to each feature in the map (in the example above, each leaf represents a state). The leaf feature layer contains all of the data associated with the source feature layer. The nodes and branches both have the following fields containing results from the cluster analysis:

- NID - A unique Node ID
- ParentID - The unique Node ID of this node's parent node
- LeafCount - The number of leaves under this node in the dendrogram (referred to as weight above)
- Compactness - The minimum scale at which the cluster exists as a unit (described above)
- Distinctness - The range of scales throughout which the cluster exists as a distinct unit, calculated as maximum scale minus minimum scale (described above)
- Cluster - The identifier of the cluster the node falls in (can be null if the node is above the scale at which the tree is "pruned")
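To make the relationship between ParentID, Compactness, and Cluster concrete, here is a minimal sketch of how a Cluster value could be assigned when the tree is pruned to a given number of clusters. The pruning rule used (repeatedly split the remaining node with the largest Compactness) is an assumption for illustration, as are the function name and the dictionary layout; the sample's own implementation may differ.

```python
def assign_clusters(nodes, n_clusters):
    """Fill in a "Cluster" value for each node record.

    nodes maps NID -> {"ParentID": ..., "Compactness": ...}.
    Nodes above the pruning level keep Cluster = None.
    """
    children = {}
    root = None
    for nid, node in nodes.items():
        pid = node["ParentID"]
        if pid is None:
            root = nid
        else:
            children.setdefault(pid, []).append(nid)

    # Start with the whole tree as one cluster and split the node that
    # formed at the largest distance until enough clusters exist.
    tops = [root]
    while len(tops) < n_clusters:
        splittable = [n for n in tops if n in children]
        if not splittable:
            break  # only leaves remain; cannot split further
        widest = max(splittable, key=lambda n: nodes[n]["Compactness"])
        tops.remove(widest)
        tops.extend(children[widest])

    for node in nodes.values():
        node["Cluster"] = None
    for label, top in enumerate(tops, start=1):
        stack = [top]
        while stack:  # label the top node and everything beneath it
            nid = stack.pop()
            nodes[nid]["Cluster"] = label
            stack.extend(children.get(nid, []))
    return nodes

# Hypothetical node table: leaves 0-4, internal nodes 5-8 (8 is the root).
nodes = {
    0: {"ParentID": 5, "Compactness": 0.0},
    1: {"ParentID": 5, "Compactness": 0.0},
    2: {"ParentID": 6, "Compactness": 0.0},
    3: {"ParentID": 6, "Compactness": 0.0},
    4: {"ParentID": 8, "Compactness": 0.0},
    5: {"ParentID": 7, "Compactness": 1.0},
    6: {"ParentID": 7, "Compactness": 1.0},
    7: {"ParentID": 8, "Compactness": 9.0},
    8: {"ParentID": None, "Compactness": 39.0},
}
assign_clusters(nodes, 3)
for nid in sorted(nodes):
    print(nid, nodes[nid]["Cluster"])
```

With three clusters requested, nodes 7 and 8 sit above the pruning level and therefore have no Cluster value, matching the "can be null" note in the field description.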

The parameters in the form include the following:

- A list of fields to be used by the cluster analysis algorithm
- The size and position of the dendrogram in the layout
- A "null" value that can be excluded from all fields
- The option for distance measure. Possible settings include:
  o Absolute Maximum (Chebychev) - the variable with the largest distance is used
  o Block (Manhattan) - the sum of orthogonal distances
  o Euclidean - the square root of the sum of the squared distances
  o Squared Euclidean - the sum of the squared distances
- The option for clustering algorithm. Possible settings include:
  o Minimum Distance (Single Link) - the distance between clusters is determined by the shortest distance between any two objects in the different clusters
  o Mean Distance (Average Link) - the distance between clusters is determined by the mean of the distances between all the objects in the clusters
  o Median Distance - the distance between clusters is determined by the median of the distances between all the objects in the clusters
  o Maximum Distance (Complete Link) - the distance between clusters is determined by the greatest distance between any two objects in the different clusters
  o Centroid Method - the distance between clusters is based on the mean location of the points in each cluster
  o Weighted Mean Distance - same as mean distance, but weighted by the total number of objects in each cluster
  o Minimum Variance (Ward's) - this method attempts to minimize the Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at each step
- The number of output clusters
- An option to standardize variables to the same scale range - if this box is checked, the algorithm standardizes all variables to a range of 0-100
- An option to choose V-branches or L-branches in the dendrogram - below left is a V-branch dendrogram, and below right is an L-branch dendrogram
- The directory and personal GeoDatabase name for storing the features for the dendrogram
- The name of the field that will store the classification in the source table
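The four distance measures and the standardization option listed above can be written out as follows. This is a sketch only: the function names are made up for the example, and the standardization is assumed to be a simple min-max rescaling of each variable, which the description does not confirm.

```python
import math

def chebyshev(a, b):
    """Absolute Maximum: the variable with the largest difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def manhattan(a, b):
    """Block: the sum of the orthogonal (per-variable) differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def squared_euclidean(a, b):
    """The sum of the squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """The square root of the sum of the squared differences."""
    return math.sqrt(squared_euclidean(a, b))

def standardize(rows, lo=0.0, hi=100.0):
    """Rescale every column of rows to the range lo..hi (assumed min-max)."""
    scaled_columns = []
    for column in zip(*rows):
        cmin, cmax = min(column), max(column)
        span = cmax - cmin
        scaled_columns.append([
            lo if span == 0 else lo + (v - cmin) * (hi - lo) / span
            for v in column
        ])
    return [tuple(row) for row in zip(*scaled_columns)]

a, b = (1, 2), (4, 6)
print(chebyshev(a, b), manhattan(a, b), euclidean(a, b), squared_euclidean(a, b))

# Hypothetical (rent, home value) records on very different scales.
records = [(400, 100000), (600, 150000), (800, 300000)]
print(standardize(records))
```

Without standardization, a variable measured in large units (home value here) dominates every distance measure; rescaling both variables to 0-100 lets each contribute comparably.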

Additional Information

This sample consists of a form and two classes (listed below). It also requires that you install the Clustering Engine sample, which is used to perform the cluster analysis. The Clustering Engine sample can be found in ArcObjects Developer Help under Samples > Analysis and Visualization > Clustering Engine.
