
Clustering Categorical Data

Steven X. Wang
Department of Mathematics and Statistics
York University

April 11, 2005

Presentation Outline
- Brief literature review
- Some new algorithms for categorical data
- Challenges in clustering categorical data
- Future work and discussions

Algorithms for Continuous Data


There are many clustering algorithms proposed in the literature:
1. K-means
2. EM algorithm
3. Hierarchical clustering
4. CLARANS
5. OPTICS

Algorithms for Categorical Data


- K-modes (a modification of K-means)
- AutoClass (based on the EM algorithm)
- ROCK and CLOPE
There are only a handful of algorithms for clustering categorical data.

Categorical Data Structure


- Categorical data has a different structure from continuous data.
- Distance functions designed for continuous data may not be applicable to categorical data.
- Algorithms for clustering continuous data therefore cannot be applied directly to categorical data.

K-means for Clustering Continuous Data


K-means is one of the oldest and most widely used algorithms for clustering continuous data:
1) Choose the number of clusters and initialize the cluster centers.
2) Iterate until a selected convergence criterion is reached.
3) Computational complexity: O(n).
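For concreteness, a minimal K-means loop in NumPy; the random initialization, iteration cap, and convergence test below are choices made for this sketch, not part of the slide:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means on an (n, p) array of continuous data.
    Empty clusters are not handled in this sketch."""
    rng = np.random.default_rng(seed)
    # 1) initialize the centers with k distinct random observations
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) assign each point to its nearest center (squared Euclidean)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence criterion
            break
        centers = new_centers
    return labels, centers
```

Each pass over the data is linear in the number of observations n, which is where the O(n) complexity quoted above comes from.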

Categorical Sample Space


- Assume that the data set is stored in an n × p matrix, where n is the number of observations and p the number of categorical variables.
- The sample space consists of all possible combinations generated by the p variables.
- The sample space is discrete and has no natural origin.
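As a small illustration of this discrete sample space (the category counts here are hypothetical):

```python
import math

# hypothetical numbers of categories for p = 3 variables
levels = [3, 4, 2]          # e.g., colour, shape, size
cells = math.prod(levels)   # 24 distinct combinations in the sample space
```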

K-modes for Categorical Data


- K-modes has exactly the same structure as K-means, i.e., choose k cluster modes and iterate until convergence.
- K-modes has a fundamental flaw: the partition is sensitive to the input order, i.e., the same data set can yield different clustering results if the observations are presented in a different order.
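A sketch of one K-modes iteration, assuming integer-coded categories; the helper name and the tie-breaking rule are choices made for the sketch (tie-breaking in the mode update is one place where order dependence can creep in):

```python
import numpy as np

def column_modes(A):
    """Per-column most frequent value of an integer-coded array
    (ties are broken toward the smallest code by argmax)."""
    return np.array([np.bincount(col).argmax() for col in A.T])

def kmodes_step(X, modes):
    """One K-modes iteration: Hamming assignment, then mode update.
    Empty clusters are not handled in this sketch."""
    # simple-matching (Hamming) dissimilarity of every row to every mode
    d = np.stack([(X != m).sum(axis=1) for m in modes], axis=1)
    labels = d.argmin(axis=1)
    new_modes = np.array([column_modes(X[labels == j])
                          for j in range(len(modes))])
    return labels, new_modes
```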

AutoClass Algorithm
- An algorithm applicable to both continuous and categorical data.
- Model-based; does not require the number of clusters as input.
- Computational complexity: O(n log n).
- The underlying EM algorithm converges slowly and is sensitive to initial values.
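AutoClass itself is more elaborate (it searches over the number of clusters with Bayesian criteria), but its model family can be sketched as EM for a mixture of independent categorical attributes. In the sketch below, k is fixed rather than searched over, and the smoothing constant and iteration cap are arbitrary assumptions:

```python
import numpy as np

def categorical_mixture_em(X, k, n_iter=50, seed=0):
    """Minimal EM for a mixture of independent categorical attributes,
    on an (n, p) array of integer category codes."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    L = int(X.max()) + 1                    # categories coded 0..L-1
    R = rng.dirichlet(np.ones(k), size=n)   # random initial responsibilities
    for _ in range(n_iter):                 # EM can converge slowly
        # M-step: mixing weights and per-attribute category probabilities
        w = R.mean(axis=0)
        theta = np.array([[np.bincount(X[:, j], weights=R[:, c], minlength=L)
                           for j in range(p)] for c in range(k)]) + 1e-9
        theta /= theta.sum(axis=2, keepdims=True)
        # E-step: posterior responsibilities, computed in log space
        logp = np.stack([np.log(w[c]) +
                         np.log(theta[c])[np.arange(p), X].sum(axis=1)
                         for c in range(k)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
    return R.argmax(axis=1)                 # hard labels from soft ones
```

The random initialization of the responsibilities is exactly where the sensitivity to initial values noted above enters.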

Hamming Distance and CD vector


- Hamming distance counts the number of attributes on which two categorical observations differ.
- Hamming distance has been used for clustering categorical data in algorithms similar to K-modes.
- We construct the Categorical Distance (CD) vector to project the sample space onto a one-dimensional space.
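In code, the Hamming distance and one plausible reading of the CD vector construction; the precise definition is in Zhang, Wang and Song (2005), and here the CD vector is taken to be the count of observations at each Hamming distance from a chosen origin:

```python
import numpy as np

def hamming(x, y):
    """Number of attributes on which two categorical records differ."""
    return int((np.asarray(x) != np.asarray(y)).sum())

def cd_vector(X, origin):
    """Project the (n, p) categorical data onto one dimension:
    entry d is the number of observations at Hamming distance d
    from the chosen origin."""
    dists = (X != origin).sum(axis=1)
    return np.bincount(dists, minlength=X.shape[1] + 1)
```

Because the sample space has no natural origin, a different choice of origin yields a different CD vector, which is the point made on the next slide.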

Example of a CD vector
[Figure: an example CD vector]
More on CD vector
- A dense region of the CD vector is not necessarily a cluster!
- We can construct many CD vectors from one data set by choosing different origins.
[Figure: CD vector (as on the previous slide)]

UCD: Expected CD Vector under the Null


[Figure: the observed CD vector (top panel, "CD Vector") compared with the UCD vector, its expectation under the null (bottom panel, "UCD Vector")]
CD Algorithm
1. Find a cluster center.
2. Construct the CD vector given the current center.
3. Perform a modified chi-square test.
4. If the null is rejected, determine the radius of the current cluster.
5. Extract the cluster.
6. Repeat until the null is no longer rejected.
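The control flow might be sketched as follows. This is not the paper's implementation: `find_center`, `cluster_radius`, and `ucd_for` (returning the expected UCD counts for the remaining data) are hypothetical placeholders for the details in Zhang, Wang and Song (2005), and a plain chi-square test stands in for the modified one:

```python
import numpy as np
from scipy.stats import chisquare

def cd_algorithm(X, ucd_for, alpha=0.01):
    """Control-flow sketch of the CD algorithm; helpers are hypothetical."""
    clusters, remaining = [], X
    while len(remaining) > 0:
        center = find_center(remaining)            # hypothetical helper
        cd = cd_vector(remaining, center)          # from the earlier sketch
        _, pval = chisquare(cd, f_exp=ucd_for(remaining))
        if pval >= alpha:                          # fail to reject the null:
            break                                  # no cluster structure left
        r = cluster_radius(cd)                     # hypothetical helper
        inside = (remaining != center).sum(axis=1) <= r
        clusters.append(remaining[inside])         # extract the cluster
        remaining = remaining[~inside]             # and repeat on the rest
    return clusters, remaining
```

Note that there is no iterate-to-convergence step: each cluster is extracted once, which is the source of the complexity advantage discussed below.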

Numerical Comparison with K-modes and AutoClass


                  CD    AutoClass          K-modes
                                    [3]     [4]     [5]
-------------------------------------------------------
No. of clusters    4        4
Classif. rate    100%     100%      75%     84%     82%
  Variation       0%       0%       6%     15%     10%
Inform. gain     100%     100%      67%     84%     93%
  Variation       0%       0%      10%     15%     11%
-------------------------------------------------------
Soybean data: n = 47, p = 35; number of clusters = 4.

Numerical Comparison with K-modes and AutoClass


                  CD    AutoClass          K-modes
                                    [6]     [7]     [8]
-------------------------------------------------------
No. of clusters    7        3
Classif. rate     95%      73%      74%     72%     71%
  Variation       0%       0%       6%     15%     10%
Inform. gain      92%      60%      75%     79%     81%
  Variation       0%       0%       7%      6%      6%
-------------------------------------------------------
Zoo data: n = 101, p = 16; number of clusters = 7.

Computational Complexity
- The upper bound on the computational complexity of our algorithm is O(kpn).
- It is much less computationally intensive than K-modes and AutoClass since it does not require iterating to convergence.

CD Algorithm
- It is based on the Hamming distance.
- It does not require the input of parameters.
- It has no convergence criterion.
Ref: Zhang, Wang and Song (2005). JASA. To appear.

Difficulties in Clustering Categorical Data


- Distance function
- Similarity measure to organize clusters
- Scalability / computational complexity

Challenge 1: Distance Function


- Hamming distance is natural and reasonable when the categorical scale has no natural order (nominal data).
- Applying a method designed for nominal data, such as the CD algorithm, to ordinal data can cause a serious loss of information, since the ordering is ignored.
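A toy contrast between the nominal and ordinal views; the rank-difference distance below is just one simple option chosen for the illustration, not a proposal from the slide:

```python
# levels of an ordinal attribute coded by rank: low=0, medium=1, high=2
low, medium, high = 0, 1, 2

# Hamming view: "medium" is exactly as far from "low" as "high" is
d_lm = int(low != medium)   # 1
d_lh = int(low != high)     # 1 -- the ordering low < medium < high is lost

# one ordinal alternative: normalized rank difference
def ordinal_dist(x, y, n_levels=3):
    return abs(x - y) / (n_levels - 1)

ordinal_dist(low, medium)   # 0.5
ordinal_dist(low, high)     # 1.0 -- respects the ordering
```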

Challenge 2: Organization of Clusters


- Organization of clusters is crucial in clustering large data sets.
- Similarity measures are needed to organize clusters in hierarchical clustering.
- Different similarity measures will yield different results.

Challenge 3: Scalability
- In practice, an approximate answer is much better than no answer at all.
- Complexity O(n); scalability O(mn).
- How many variables are we dealing with?

Challenge 1:
- What should we do about the ordering?
- Proposing a reasonable distance function for ordinal data may require a careful examination of the dependence structure.
- We need to look into different measures of association for categorical data.

Challenge 2:
- A naïve measure of similarity would be the distance between two clusters.
- Entropy might be a good one to try, even though it is not a distance function.
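One plausible reading of the entropy idea, as a sketch: score a candidate merge by the growth in total size-weighted attribute entropy, so that similar clusters produce a small gap. The function names are made up for the illustration:

```python
import numpy as np

def column_entropy(A):
    """Sum over attributes of the Shannon entropy (bits) of the
    empirical category distribution in cluster A (an (n, p) array)."""
    H = 0.0
    for col in A.T:
        _, counts = np.unique(col, return_counts=True)
        pr = counts / counts.sum()
        H -= (pr * np.log2(pr)).sum()
    return H

def merge_entropy_gap(C1, C2):
    """Size-weighted entropy increase caused by merging two clusters;
    small values suggest similar clusters. Nonnegative by concavity
    of entropy, but not a distance function, as the slide notes."""
    merged = np.vstack([C1, C2])
    return (len(merged) * column_entropy(merged)
            - len(C1) * column_entropy(C1)
            - len(C2) * column_entropy(C2))
```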

Challenge 3:
- There are many hierarchical clustering algorithms available.
- Any clustering algorithm could be integrated into them if the distance function and similarity measure are defined appropriately.

Beyond Categorical Data


- The ultimate goal is to cluster any data set with a complex data structure.
- Mixed data types are next on the list.
- The challenge there is again the distance function (the dependence structure between the continuous and categorical portions).

More Challenges
- Measures of uncertainty
- Hard clustering vs. soft clustering
- Parallel computing

Thank you!
