
Clustering Categorical Data

Steven X. Wang
Department of Mathematics and Statistics
York University

April 11, 2005

Presentation Outline
- Brief literature review
- Some new algorithms for categorical data
- Challenges in clustering categorical data
- Future work and discussions

Algorithms for Continuous Data


There are many clustering algorithms proposed in the literature:
1. K-means
2. EM algorithm
3. Hierarchical clustering
4. CLARANS
5. OPTICS

Algorithms for Categorical Data


- K-modes (a modification of K-means)
- AutoClass (based on the EM algorithm)
- ROCK and CLOPE
There are only a handful of algorithms for clustering categorical data.

Categorical Data Structure


- Categorical data has a different structure from continuous data.
- Distance functions designed for continuous data may not be applicable to categorical data.
- Algorithms for clustering continuous data therefore cannot be applied directly to categorical data.

K-means for Clustering Continuous Data


K-means is one of the oldest and most widely used algorithms for clustering continuous data:
1) Choose the number of clusters and initialize the cluster centers.
2) Iterate until a selected convergence criterion is reached.
3) Computational complexity: O(n).
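For concreteness, a minimal K-means loop in NumPy; the random initialization, iteration cap, and convergence test below are choices made for this sketch, not part of the slide:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means on an (n, p) array of continuous data.
    Empty clusters are not handled in this sketch."""
    rng = np.random.default_rng(seed)
    # 1) initialize the centers with k distinct random observations
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) assign each point to its nearest center (squared Euclidean)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence criterion
            break
        centers = new_centers
    return labels, centers
```

Each pass over the data is linear in the number of observations n, which is where the O(n) complexity quoted above comes from.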

Categorical Sample Space


- Assume that the data set is stored in an n × p matrix, where n is the number of observations and p the number of categorical variables.
- The sample space consists of all possible combinations generated by the p variables.
- The sample space is discrete and has no natural origin.
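As a small illustration of this discrete sample space (the category counts here are hypothetical):

```python
import math

# hypothetical numbers of categories for p = 3 variables
levels = [3, 4, 2]          # e.g., colour, shape, size
cells = math.prod(levels)   # 24 distinct combinations in the sample space
```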

K-modes for Categorical Data


- K-modes has exactly the same structure as K-means, i.e., choose k cluster modes and iterate until convergence.
- K-modes has a fundamental flaw: the partition is sensitive to the input order, i.e., the same data set can yield different clustering results if the observations are presented in a different order.
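A sketch of one K-modes iteration, assuming integer-coded categories; the helper name and the tie-breaking rule are choices made for the sketch (tie-breaking in the mode update is one place where order dependence can creep in):

```python
import numpy as np

def column_modes(A):
    """Per-column most frequent value of an integer-coded array
    (ties are broken toward the smallest code by argmax)."""
    return np.array([np.bincount(col).argmax() for col in A.T])

def kmodes_step(X, modes):
    """One K-modes iteration: Hamming assignment, then mode update.
    Empty clusters are not handled in this sketch."""
    # simple-matching (Hamming) dissimilarity of every row to every mode
    d = np.stack([(X != m).sum(axis=1) for m in modes], axis=1)
    labels = d.argmin(axis=1)
    new_modes = np.array([column_modes(X[labels == j])
                          for j in range(len(modes))])
    return labels, new_modes
```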

AutoClass Algorithm
- An algorithm applicable to both continuous and categorical data.
- Model-based; does not require the number of clusters as input.
- Computational complexity: O(n log n).
- The underlying EM algorithm converges slowly and is sensitive to initial values.
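AutoClass itself is more elaborate (it searches over the number of clusters with Bayesian criteria), but its model family can be sketched as EM for a mixture of independent categorical attributes. In the sketch below, k is fixed rather than searched over, and the smoothing constant and iteration cap are arbitrary assumptions:

```python
import numpy as np

def categorical_mixture_em(X, k, n_iter=50, seed=0):
    """Minimal EM for a mixture of independent categorical attributes,
    on an (n, p) array of integer category codes."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    L = int(X.max()) + 1                    # categories coded 0..L-1
    R = rng.dirichlet(np.ones(k), size=n)   # random initial responsibilities
    for _ in range(n_iter):                 # EM can converge slowly
        # M-step: mixing weights and per-attribute category probabilities
        w = R.mean(axis=0)
        theta = np.array([[np.bincount(X[:, j], weights=R[:, c], minlength=L)
                           for j in range(p)] for c in range(k)]) + 1e-9
        theta /= theta.sum(axis=2, keepdims=True)
        # E-step: posterior responsibilities, computed in log space
        logp = np.stack([np.log(w[c]) +
                         np.log(theta[c])[np.arange(p), X].sum(axis=1)
                         for c in range(k)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
    return R.argmax(axis=1)                 # hard labels from soft ones
```

The random initialization of the responsibilities is exactly where the sensitivity to initial values noted above enters.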

Hamming Distance and CD vector


- Hamming distance counts the number of attributes on which two categorical observations differ.
- Hamming distance has been used for clustering categorical data in algorithms similar to K-modes.
- We construct the Categorical Distance (CD) vector to project the sample space onto a one-dimensional space.
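In code, the Hamming distance and one plausible reading of the CD vector construction; the precise definition is in Zhang, Wang and Song (2005), and here the CD vector is taken to be the count of observations at each Hamming distance from a chosen origin:

```python
import numpy as np

def hamming(x, y):
    """Number of attributes on which two categorical records differ."""
    return int((np.asarray(x) != np.asarray(y)).sum())

def cd_vector(X, origin):
    """Project the (n, p) categorical data onto one dimension:
    entry d is the number of observations at Hamming distance d
    from the chosen origin."""
    dists = (X != origin).sum(axis=1)
    return np.bincount(dists, minlength=X.shape[1] + 1)
```

Because the sample space has no natural origin, a different choice of origin yields a different CD vector, which is the point made on the next slide.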

Example of a CD vector
[Figure: an example CD vector]
More on CD vector
- A dense region of the CD vector is not necessarily a cluster!
- We can construct many CD vectors from one data set by choosing different origins.
[Figure: CD vector (as on the previous slide)]

UCD: Expected CD Vector under the Null


[Figure: the observed CD vector (top panel, "CD Vector") compared with the UCD vector, its expectation under the null (bottom panel, "UCD Vector")]
CD Algorithm
1. Find a cluster center.
2. Construct the CD vector given the current center.
3. Perform a modified chi-square test.
4. If the null is rejected, determine the radius of the current cluster.
5. Extract the cluster.
6. Repeat until the null is no longer rejected.
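The control flow might be sketched as follows. This is not the paper's implementation: `find_center`, `cluster_radius`, and `ucd_for` (returning the expected UCD counts for the remaining data) are hypothetical placeholders for the details in Zhang, Wang and Song (2005), and a plain chi-square test stands in for the modified one:

```python
import numpy as np
from scipy.stats import chisquare

def cd_algorithm(X, ucd_for, alpha=0.01):
    """Control-flow sketch of the CD algorithm; helpers are hypothetical."""
    clusters, remaining = [], X
    while len(remaining) > 0:
        center = find_center(remaining)            # hypothetical helper
        cd = cd_vector(remaining, center)          # from the earlier sketch
        _, pval = chisquare(cd, f_exp=ucd_for(remaining))
        if pval >= alpha:                          # fail to reject the null:
            break                                  # no cluster structure left
        r = cluster_radius(cd)                     # hypothetical helper
        inside = (remaining != center).sum(axis=1) <= r
        clusters.append(remaining[inside])         # extract the cluster
        remaining = remaining[~inside]             # and repeat on the rest
    return clusters, remaining
```

Note that there is no iterate-to-convergence step: each cluster is extracted once, which is the source of the complexity advantage discussed below.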

Numerical Comparison with K-modes and AutoClass


                  CD    AutoClass          K-modes
                                    [3]     [4]     [5]
-------------------------------------------------------
No. of clusters    4        4
Classif. rate    100%     100%      75%     84%     82%
  Variation       0%       0%       6%     15%     10%
Inform. gain     100%     100%      67%     84%     93%
  Variation       0%       0%      10%     15%     11%
-------------------------------------------------------
Soybean data: n = 47, p = 35; number of clusters = 4.

Numerical Comparison with K-modes and AutoClass


                  CD    AutoClass          K-modes
                                    [6]     [7]     [8]
-------------------------------------------------------
No. of clusters    7        3
Classif. rate     95%      73%      74%     72%     71%
  Variation       0%       0%       6%     15%     10%
Inform. gain      92%      60%      75%     79%     81%
  Variation       0%       0%       7%      6%      6%
-------------------------------------------------------
Zoo data: n = 101, p = 16; number of clusters = 7.

Computational Complexity
- The upper bound on the computational complexity of our algorithm is O(kpn).
- It is much less computationally intensive than K-modes and AutoClass since it does not require iterating to convergence.

CD Algorithm
- It is based on the Hamming distance.
- It does not require the input of parameters.
- It has no convergence criterion.
Ref: Zhang, Wang and Song (2005). JASA. To appear.

Difficulties in Clustering Categorical Data


- Distance function
- Similarity measure to organize clusters
- Scalability / computational complexity

Challenge 1: Distance Function


- Hamming distance is natural and reasonable when the categorical scale has no natural order (nominal data).
- Applying a method designed for nominal data, such as the CD algorithm, to ordinal data can cause a serious loss of information, since the ordering is ignored.
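A toy contrast between the nominal and ordinal views; the rank-difference distance below is just one simple option chosen for the illustration, not a proposal from the slide:

```python
# levels of an ordinal attribute coded by rank: low=0, medium=1, high=2
low, medium, high = 0, 1, 2

# Hamming view: "medium" is exactly as far from "low" as "high" is
d_lm = int(low != medium)   # 1
d_lh = int(low != high)     # 1 -- the ordering low < medium < high is lost

# one ordinal alternative: normalized rank difference
def ordinal_dist(x, y, n_levels=3):
    return abs(x - y) / (n_levels - 1)

ordinal_dist(low, medium)   # 0.5
ordinal_dist(low, high)     # 1.0 -- respects the ordering
```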

Challenge 2: Organization of Clusters


- Organization of clusters is crucial in clustering large data sets.
- Similarity measures are needed to organize clusters in hierarchical clustering.
- Different similarity measures will yield different results.

Challenge 3: Scalability
- In practice, an approximate answer is much better than no answer at all.
- Complexity O(n); scalability O(mn).
- How many variables are we dealing with?

Challenge 1:
- What should we do about the ordering?
- Proposing a reasonable distance function for ordinal data may require a careful examination of the dependence structure.
- We need to look into different measures of association for categorical data.

Challenge 2:
- A naïve measure of similarity would be the distance between two clusters.
- Entropy might be a good one to try, even though it is not a distance function.
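One plausible reading of the entropy idea, as a sketch: score a candidate merge by the growth in total size-weighted attribute entropy, so that similar clusters produce a small gap. The function names are made up for the illustration:

```python
import numpy as np

def column_entropy(A):
    """Sum over attributes of the Shannon entropy (bits) of the
    empirical category distribution in cluster A (an (n, p) array)."""
    H = 0.0
    for col in A.T:
        _, counts = np.unique(col, return_counts=True)
        pr = counts / counts.sum()
        H -= (pr * np.log2(pr)).sum()
    return H

def merge_entropy_gap(C1, C2):
    """Size-weighted entropy increase caused by merging two clusters;
    small values suggest similar clusters. Nonnegative by concavity
    of entropy, but not a distance function, as the slide notes."""
    merged = np.vstack([C1, C2])
    return (len(merged) * column_entropy(merged)
            - len(C1) * column_entropy(C1)
            - len(C2) * column_entropy(C2))
```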

Challenge 3:
- There are many hierarchical clustering algorithms available.
- Any clustering algorithm could be integrated into them if the distance function and similarity measure are defined appropriately.

Beyond Categorical Data


- The ultimate goal is to cluster any data set with a complex data structure.
- Mixed data types are next on the list.
- The challenge there is again the distance function (the dependence structure between the continuous and categorical portions).

More Challenges
- Measures of uncertainty
- Hard clustering vs. soft clustering
- Parallel computing

Thank you!
