
K-means clustering

CS281B Winter 02, Yan Wang and Lihua Lin


What are clustering algorithms?

What is clustering?

Clustering is a method by which a large set of data is grouped into clusters of smaller sets of similar data.

Example: the balls of the same color are clustered into a group, as shown below.

Thus, clustering means grouping data, or dividing a large data set into smaller data sets with some similarity.



What is a clustering algorithm?

 A clustering algorithm attempts to find natural groups of components (or data) based on some similarity.

 The clustering algorithm also finds the centroid of a group of data sets.

 The centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster.
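The centroid definition above can be sketched directly: it is just the coordinate-wise mean of the cluster's points (a minimal illustration; the sample cluster below is made up):

```python
# Centroid of a cluster: the coordinate-wise mean of its points.
def centroid(points):
    dims = len(points[0])
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(centroid(cluster))  # (3.0, 4.0)
```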



What is the common metric for clustering techniques?

 Generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ..., pk) and q = (q1, q2, ..., qk) as:

d = sqrt( Σ_{i=1}^{k} (p_i − q_i)^2 )
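The Euclidean metric is straightforward to compute; as a minimal sketch:

```python
import math

# Euclidean distance between p = (p1, ..., pk) and q = (q1, ..., qk):
# d = sqrt(sum over i of (p_i - q_i)^2).
def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```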



Uses of clustering algorithms

 Engineering sciences: pattern recognition, artificial intelligence, cybernetics, etc. Typical examples to which clustering has been applied include handwritten characters, samples of speech, fingerprints, and pictures.

 Life sciences (biology, botany, zoology, entomology, cytology, microbiology): the objects of analysis are life forms such as plants, animals, and insects.

 Information, policy and decision sciences: the various applications of clustering analysis to documents include votes on political issues, survey of markets, survey of products, survey of sales programs, and R & D.



Types of clustering algorithms

 The various clustering concepts available can be grouped into two broad categories:

 Hierarchical methods, e.g. the Minimal Spanning Tree method (Fig)

 Nonhierarchical methods, e.g. the k-means algorithm



K-Means Clustering Algorithm

Definition:

This nonhierarchical method initially takes a number of components of the population equal to the final required number of clusters. In this first step, these initial points are chosen such that they are mutually farthest apart. Next, the algorithm examines each component in the population and assigns it to one of the clusters depending on the minimum distance. The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters.
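The procedure described above can be sketched as follows. This is a minimal illustration, not the exact implementation used in the course: it uses random initialization rather than mutually-farthest-apart seeding, and batch reassignment rather than updating the centroid after every single addition; the sample data and k are made up.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    # Initialize centroids by sampling k distinct data points.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: euclidean(p, centroids[j]))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if the class is "dead" (empty)
                new_centroids.append(tuple(sum(x) / len(c) for x in zip(*c)))
            else:
                new_centroids.append(centroids[i])
        if new_centroids == centroids:  # terminate when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # [(1.25, 1.5), (8.5, 8.5)]
```

With two well-separated blobs, the two centroids converge to the blob means regardless of which points are sampled as seeds.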



K-Means Clustering Algorithm



The parameters and options for the k-means algorithm

• Initialization: different initialization methods can be used.
• Distance measure: different distance measures can be used (e.g. Manhattan distance and Euclidean distance).
• Termination: k-means should terminate when no more pixels change classes.
• Quality: the quality of the results provided by k-means classification.
• Parallelism: there are several ways to parallelize the k-means algorithm.
• Dead classes: a class is "dead" if no pixels belong to it.
• Variants: e.g. one-pass, on-the-fly calculation of means.
• Number of classes: the number of classes is usually given as an input variable.
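Two of the initialization options demoed later might look like the following sketch. Note the "linear" reading here, k seeds evenly spaced between the data minimum and maximum, is an assumption on my part; the slides do not define it.

```python
import random

# "random" init: sample k distinct data points as seeds.
def init_random(points, k, seed=0):
    return random.Random(seed).sample(points, k)

# "linear" init (assumed interpretation): k seeds evenly spaced
# along the line from the per-dimension minimum to the maximum.
def init_linear(points, k):
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    return [tuple(lo[d] + (hi[d] - lo[d]) * i / (k - 1) for d in range(dims))
            for i in range(k)]

points = [(0, 0), (2, 2), (4, 4)]
print(init_linear(points, 3))  # [(0.0, 0.0), (2.0, 2.0), (4.0, 4.0)]
```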
Comments on the K-means Method

Strengths of k-means:
• Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
• Often terminates at a local optimum.

Weaknesses of k-means:
• Applicable only when a mean is defined; what about categorical data?
• Need to specify k, the number of clusters, in advance.
• Unable to handle noisy data and outliers.
• Not suitable for discovering clusters with non-convex shapes.
Direct k-means clustering algorithm



Demo (I)

2 Initial Clusters



Demo (I)

2-means Clustering



Demo (II) – Init Method: Random



Demo (II) – Init Method: Linear



Demo (II) – Init Method: Cube



Demo (II) – Init Method: Statistics



Demo (II) – Init Method: Possibility

