Presentation Outline
Brief literature review
Some new algorithms for categorical data
Challenges in clustering categorical data
Future work and discussions
AutoClass Algorithm
This algorithm is applicable to both continuous and categorical data.
It is a model-based algorithm that does not require the number of clusters as input.
Computational complexity: O(n log n).
The EM algorithm converges slowly and is sensitive to the initial values.
Example of a CD vector
[Figure: plot of an example CD vector]
More on CD vector
The dense region of the CD vector is not necessarily a cluster!
We can construct many CD vectors on one data set by choosing different origins.
[Figure: CD vector plots]
[Figure: CD Vector and UCD Vector panels]
CD Algorithm
1. Find a cluster center.
2. Construct the CD vector given the current center.
3. Perform the modified chi-square test.
4. If the null hypothesis is rejected, determine the radius of the current cluster and extract the cluster.
5. Repeat until the null hypothesis is no longer rejected.
Computational Complexity
The upper bound on the computational complexity of our algorithm is O(kpn).
It is much less computationally intensive than K-modes and AutoClass, since it does not require iterating to convergence.
CD Algorithm
It is based on Hamming distance.
It does not require any input parameters.
It has no convergence criterion.
Ref: Zhang, Wang and Song (2005). JASA. To appear.
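To make the role of the Hamming distance concrete, here is a minimal Python sketch. It assumes that a CD vector is the list of counts of records at each Hamming distance 0, 1, ..., p from a chosen center, where p is the number of attributes. That definition is inferred from the slides, not quoted from the paper, and the function names `hamming` and `cd_vector` and the toy data are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def cd_vector(data, center):
    """Counts of records at Hamming distance 0..p from `center`.

    ASSUMPTION: this histogram is what the slides call a CD vector.
    """
    p = len(center)
    counts = Counter(hamming(row, center) for row in data)
    return [counts.get(d, 0) for d in range(p + 1)]


# Hypothetical toy data set: 4 records, 3 categorical attributes.
data = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]

# Two different origins give two different vectors on the same data.
vec_a = cd_vector(data, ("red", "small", "round"))
vec_b = cd_vector(data, ("blue", "large", "square"))
print(vec_a)  # [1, 2, 0, 1]
print(vec_b)  # [1, 0, 2, 1]
```

Under this reading, one call compares every attribute of every record with the center once, so its cost is proportional to the number of attributes times the number of records.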
Challenge 3: Scalability
In practice, an approximate answer is much better than no answer at all.
Complexity: O(n).
Scalability: O(mn).
How many variables are we dealing with?
Challenge 1:
What should we do about the ordering?
Proposing a reasonable distance function for ordinal data might require a careful examination of the dependence structure.
We need to look into different measures of association for categorical data.
Challenge 2:
A naive measure of similarity would be the distance between two clusters.
Entropy might be a good one to try, even though it is not a distance function.
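One way entropy could serve as a similarity measure between clusters is sketched below in Python. The construction is an assumption made for illustration, not something proposed in the slides: the entropy of a cluster is taken as the sum of the Shannon entropies of its attributes, and two clusters are treated as similar when merging them adds little size-weighted entropy. The names `attribute_entropy`, `cluster_entropy`, and `merge_cost` are hypothetical.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def cluster_entropy(cluster):
    """Sum of per-attribute entropies (ASSUMPTION: attributes treated separately)."""
    return sum(attribute_entropy(list(col)) for col in zip(*cluster))


def merge_cost(a, b):
    """Increase in size-weighted entropy caused by merging clusters a and b.

    Smaller values suggest more similar clusters. This is a dissimilarity
    score, not a distance function.
    """
    merged = a + b
    return (len(merged) * cluster_entropy(merged)
            - len(a) * cluster_entropy(a)
            - len(b) * cluster_entropy(b))


# Hypothetical toy clusters with two attributes each.
a = [("red", "small"), ("red", "small")]
b = [("blue", "large"), ("blue", "large")]
c = [("red", "small")]

print(merge_cost(a, c))  # merging identical records adds no entropy
print(merge_cost(a, b))  # merging disjoint clusters adds entropy
```

In this toy case the cost of merging `a` with `c` is zero, while merging `a` with `b` costs 8 bits, so `a` would be judged closer to `c` than to `b`.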
Challenge 3:
There are many hierarchical clustering algorithms available. Any clustering algorithm could be integrated into those algorithms if the distance function and similarity measure could be defined appropriately.
More Challenges
Measure of uncertainty
Hard clustering vs. soft clustering
Parallel computing
Thank you!