
A STUDY OF DATA CLUSTERING AND THE K-MEANS ALGORITHM

DATA CLUSTERING
Data clustering is a task in data mining. It organizes data into a systematic structure so that the records are no longer scattered. For a large, fragmented database, clustering is very useful and in practice almost indispensable.

PURPOSE OF CLUSTERING
The purpose of data clustering is to discover the structure of the data, so that groups (subsets) of data can be formed from a large data collection.

REQUIREMENTS OF DATA CLUSTERING
Clustering should place similar objects in the same cluster and dissimilar objects in different clusters. The notion of similarity is defined by the user and is determined from the attributes that describe the objects. In practice, it is usually measured as a distance between objects.

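REQUIREMENTS OF DATA CLUSTERING (CONTINUED)
- Scalability with respect to the size of the data set.
- Ability to handle different types of attributes.
- Ability to discover clusters of arbitrary shape.
- Minimal requirements for domain knowledge when determining input parameters.
- Ability to handle noisy data.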

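REQUIREMENTS OF DATA CLUSTERING (CONTINUED)
- Ability to cluster incrementally and independently of the order of the input data.
- Ability to handle high-dimensional data.
- Ability to cluster under constraints.
- Interpretability and usability.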

CLASSIFICATION OF CLUSTERING METHODS

- Partitioning: partitions are created and then evaluated according to some criterion.
- Hierarchical: the set of data objects is decomposed hierarchically according to some criterion.
- Density-based: based on connectivity and density functions.
- Grid-based: based on a multiple-level granularity structure.
- Model-based: a model is hypothesized for each cluster, and the model parameters are then adjusted to best fit the data objects of that cluster.

METHODS FOR EVALUATING CLUSTERING

External validation
Evaluates the clustering result against a structure specified in advance for the data set. Measures: Rand statistic, Jaccard coefficient, Fowlkes and Mallows index.

Internal validation
Evaluates the clustering result using quantities derived from the vectors of the data set itself (the proximity matrix). Measures: Hubert's Γ statistic, Silhouette index, Dunn's index.

Relative validation
Evaluates a clustering result by comparing it with other clustering results obtained under different parameter settings.

Criteria for evaluating and selecting the best clustering result
- Compactness: objects within a cluster should be close to one another.
- Separation: clusters should be far apart from one another.
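The external measures named above are commonly computed from counts of object pairs. The following is a minimal sketch under that assumption; the function names and the toy labels are illustrative and do not come from the slides. Here `a` counts pairs placed together in both the reference classes and the clustering, `b` pairs together only in the reference classes, `c` pairs together only in the clustering, and `d` pairs separated in both.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_true, labels_pred):
    """Count object pairs by agreement between reference classes and clusters."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            a += 1
        elif same_class:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d


def external_indices(labels_true, labels_pred):
    """Return (Rand statistic, Jaccard coefficient, Fowlkes-Mallows index)."""
    a, b, c, d = pair_counts(labels_true, labels_pred)
    rand = (a + d) / (a + b + c + d)
    jaccard = a / (a + b + c) if (a + b + c) else 1.0
    fm = a / sqrt((a + b) * (a + c)) if (a + b) and (a + c) else 0.0
    return rand, jaccard, fm


# Hypothetical example: six objects, two reference classes, two found clusters.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
rand, jaccard, fm = external_indices(truth, found)
print(rand, jaccard, fm)
```

All three values lie between 0 and 1, and larger values indicate closer agreement with the reference structure.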

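METHODS FOR EVALUATING CLUSTERING (CONTINUED)

Evaluation by entropy (the value is small when the clustering quality is good):

$$\mathrm{Entropy}(I) = -\sum_i p_i \sum_j \frac{p_{ij}}{p_i} \log \frac{p_{ij}}{p_i} = -\sum_i \frac{n_i}{n} \sum_j \frac{n_{ij}}{n_i} \log \frac{n_{ij}}{n_i}$$

In the usual reading of this measure, $n$ is the total number of objects, $n_i$ is the number of objects in cluster $i$, and $n_{ij}$ is the number of objects in cluster $i$ that belong to class $j$, with $p_i = n_i / n$ and $p_{ij} = n_{ij} / n$.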
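As an illustration of the entropy measure, the sketch below computes the size-weighted average of the per-cluster class entropies, which is the count-based form of the formula. It assumes that each object carries a known class label and uses the natural logarithm; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from math import log


def clustering_entropy(clusters, classes):
    """Size-weighted average of the class entropy inside each cluster."""
    n = len(clusters)
    members = {}
    for cluster_id, class_id in zip(clusters, classes):
        members.setdefault(cluster_id, []).append(class_id)

    total = 0.0
    for class_ids in members.values():
        n_i = len(class_ids)
        # Entropy of the class distribution within this cluster.
        h_i = -sum((n_ij / n_i) * log(n_ij / n_i)
                   for n_ij in Counter(class_ids).values())
        total += (n_i / n) * h_i
    return total


# Pure clusters give the smallest possible value; evenly mixed clusters give log(2).
pure = clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
print(pure, mixed)
```

This matches the remark on the slide: the value shrinks as each cluster becomes dominated by a single class.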

ISSUES TO BE ADDRESSED
Representing data types
+ We are only concerned with the data types that matter for clustering.
+ We define d(i,j) as the distance between two objects i and j, which satisfies:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)

where k is any object other than i and j.

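ISSUES TO BE ADDRESSED (CONTINUED)
Objects i and j are represented by the vectors x and y. The similarity between i and j is computed as the cosine of the angle between the two vectors:

$$x = (x_1, \dots, x_p), \qquad y = (y_1, \dots, y_p)$$

$$s(x, y) = \frac{x_1 y_1 + \dots + x_p y_p}{\sqrt{x_1^2 + \dots + x_p^2}\,\sqrt{y_1^2 + \dots + y_p^2}}$$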
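The similarity formula translates directly into code. This is a small sketch with an illustrative function name; raising an error for zero vectors is an assumption, since the slides do not say how that case is handled.

```python
from math import sqrt


def cosine_similarity(x, y):
    """s(x, y) = (x . y) / (|x| * |y|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


parallel = cosine_similarity([1, 2, 3], [2, 4, 6])   # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])       # perpendicular
print(parallel, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while perpendicular vectors score 0.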

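ISSUES TO BE ADDRESSED (CONTINUED)
Interval-scaled variables/attributes
+ Mean:
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$$

+ Mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$

+ Z-score measurement:
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$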
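A short sketch of this standardization for a single attribute f, using the mean absolute deviation rather than the standard deviation, as in the formulas. The function name and sample values are illustrative; rejecting a constant attribute is an assumption.

```python
def standardize(values):
    """Return the z-scores z_if = (x_if - m_f) / s_f of one attribute."""
    n = len(values)
    m_f = sum(values) / n                          # mean
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("attribute is constant; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values: mean 5, mean absolute deviation 2.
z = standardize([2, 4, 6, 8])
print(z)  # [-1.5, -0.5, 0.5, 1.5]
```

After this step, attributes measured in different units contribute on a comparable scale to a distance computation.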

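ISSUES TO BE ADDRESSED (CONTINUED)
Distance measures
+ Minkowski distance (for an order $q \ge 1$):

$$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$

+ Manhattan distance ($q = 1$):

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

+ Euclidean distance ($q = 2$):

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$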
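The three distances can be written as one function, because Manhattan and Euclidean are the Minkowski distance of order q = 1 and q = 2 respectively. This is a minimal sketch; the function names and the parameter name `q` are illustrative.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q (q >= 1) between two equal-length vectors."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


def manhattan(x, y):
    return minkowski(x, y, 1)


def euclidean(x, y):
    return minkowski(x, y, 2)


p1, p2 = (0, 0), (3, 4)
print(manhattan(p1, p2))  # 7.0
print(euclidean(p1, p2))  # 5.0
```

For the same pair of points, the Manhattan distance is never smaller than the Euclidean distance.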

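ISSUES TO BE ADDRESSED (CONTINUED)
Binary variables/attributes

Contingency table for objects i and j:

                     Obj j
                 1       0      sum
Obj i     1      a       b      a+b
          0      c       d      c+d
        sum     a+c     b+d      p

Simple matching coefficient (if the variable is symmetric):

$$d(i, j) = \frac{b + c}{a + b + c + d}$$

Jaccard coefficient (if the variable is asymmetric):

$$d(i, j) = \frac{b + c}{a + b + c}$$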
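The sketch below builds the four contingency counts for two 0/1 vectors and then applies both coefficients. The function names and the sample vectors are illustrative; returning 0 when two objects have no positive attribute at all is an assumption.

```python
def contingency(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity for symmetric binary attributes: (b + c) / (a + b + c + d)."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(x, y):
    """Dissimilarity for asymmetric binary attributes: (b + c) / (a + b + c)."""
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c) if (a + b + c) else 0.0


obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 0, 0, 1, 0]
print(contingency(obj_i, obj_j))            # (1, 1, 1, 3)
print(simple_matching(obj_i, obj_j))        # 2/6
print(jaccard_dissimilarity(obj_i, obj_j))  # 2/3
```

The Jaccard form ignores the `d` count (both values 0), which is why it suits attributes where a shared absence carries little information.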

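ISSUES TO BE ADDRESSED (CONTINUED)
Variables/attributes of mixed types

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

If $x_{if}$ or $x_{jf}$ is missing, then $\delta_{ij}^{(f)} = 0$, so that attribute is skipped.

- f is binary (nominal): $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise.
- f is interval-scaled: use a distance such as Minkowski, Manhattan or Euclidean.
- f is ordinal or ratio-scaled: compute the ranks $r_{if}$ and

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

where $M_f$ is the number of ranks of f; $z_{if}$ is then treated as interval-scaled.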
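The combined formula can be sketched as a weighted average over the attributes that are present in both objects. Several details here are assumptions made for illustration and are not taken from the slides: missing values are encoded as `None`, interval-scaled values are taken to be already scaled to [0, 1] so that the absolute difference serves as the per-attribute distance, ordinal values are given as ranks 1..M_f, and the names `kinds` and `num_states` are hypothetical.

```python
def ordinal_to_interval(rank, num_states):
    """Map a rank r in 1..M onto [0, 1] with z = (r - 1) / (M - 1)."""
    return (rank - 1) / (num_states - 1)


def mixed_dissimilarity(x, y, kinds, num_states=None):
    """Average the per-attribute distances, skipping missing attributes."""
    num_states = num_states or {}
    weighted_sum = 0.0
    weight_total = 0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue  # delta = 0: the attribute does not contribute
        if kind == "nominal":
            d_f = 0.0 if x[f] == y[f] else 1.0
        elif kind == "ordinal":
            d_f = abs(ordinal_to_interval(x[f], num_states[f])
                      - ordinal_to_interval(y[f], num_states[f]))
        elif kind == "interval":
            d_f = abs(x[f] - y[f])  # assumes values already scaled to [0, 1]
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        weighted_sum += d_f
        weight_total += 1  # delta = 1
    if weight_total == 0:
        raise ValueError("no attribute is present in both objects")
    return weighted_sum / weight_total


kinds = ["nominal", "interval", "ordinal"]
states = {2: 3}  # attribute 2 has three ordered states
full = mixed_dissimilarity(["red", 0.2, 1], ["blue", 0.5, 3], kinds, states)
partial = mixed_dissimilarity(["red", 0.2, 1], ["blue", None, 3], kinds, states)
print(full, partial)
```

In the second call the interval attribute is missing for one object, so the average is taken over the remaining two attributes only.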

SIGNIFICANCE OF CLUSTERING

Once the data has been clustered, each cluster can be analyzed in depth to discover hidden information that supports decision making.

DATA CLUSTERING ALGORITHMS

There are many data clustering algorithms; typical examples are the k-means algorithm and hierarchical clustering algorithms. Here we study the k-means algorithm.

THE K-MEANS ALGORITHM

INPUT: A database of n objects and the number of clusters k.
OUTPUT: The clusters Ci (i = 1, ..., k) such that the criterion function E is minimized.
Step 1: Initialization. Choose k objects mj (j = 1, ..., k) from the data set as the initial centroids of the k clusters (the choice may be random or based on experience).
Step 2: Distance computation. For each object Xi (1 ≤ i ≤ n), compute its distance to every centroid mj, j = 1, ..., k, then find the nearest centroid for that object.
Step 3: Centroid update. For each j = 1, ..., k, update the cluster centroid mj to the mean of the data object vectors assigned to that cluster.
Step 4: Stopping condition. Repeat steps 2 and 3 until the cluster centroids no longer change.
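The four steps can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: it assumes points are tuples of numbers, uses squared Euclidean distance, picks the initial centroids at random with a fixed seed, keeps the old centroid if a cluster becomes empty, and caps the number of iterations with a hypothetical `max_iter` parameter (at least 1).

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_vector(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: choose k objects from the data set as the initial centroids.
    centroids = rng.sample(data, k)
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in data
        ]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [x for x, c in zip(data, assignment) if c == j]
            new_centroids.append(mean_vector(members) if members else centroids[j])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = k_means(data, 2)
print(centroids)
print(assignment)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialization.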

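THE K-MEANS ALGORITHM (CONTINUED)

The computational complexity is O(n·k·d·t·T), where:
- n is the number of data objects
- k is the number of clusters
- d is the number of dimensions
- t is the number of iterations
- T is the time needed for one basic operation such as addition, subtraction, multiplication or division.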

THE K-MEANS ALGORITHM (CONTINUED)

Advantages: K-means is simple to analyze and to run, so it can be applied to large data sets.
Disadvantages: K-means applies only to data with numeric attributes and tends to discover spherical clusters. It is also very sensitive to noise and outliers in the data. In addition, it depends heavily on its input parameters.

THE K-MEANS ALGORITHM (CONTINUED)

When the initial centroids lie far from the natural cluster centroids, the quality of the k-means result is very low; that is, the discovered clusters deviate greatly from the actual clusters. In practice there is no optimal method for choosing the input parameters. The most common approach is to experiment with different values of the input k and then choose the best result.
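One common way to reduce the sensitivity to the initial centroids is to run k-means several times from different random starts and keep the run with the lowest criterion value. The sketch below assumes that the criterion function E is the sum of squared distances from each object to its cluster centroid, which the slides do not spell out; the function names and the `restarts` parameter are hypothetical. It varies the starting centroids for a fixed k, which is a related but different experiment from trying several values of k.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def lloyd(data, centroids, max_iter=100):
    """Run the k-means iterations from the given starting centroids."""
    k = len(centroids)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in data]
        updated = []
        for j in range(k):
            members = [x for x, c in zip(data, labels) if c == j]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def criterion(data, centroids, labels):
    """Assumed form of E: total squared distance to the assigned centroid."""
    return sum(sq_dist(x, centroids[c]) for x, c in zip(data, labels))


def best_of_restarts(data, k, restarts=10, seed=0):
    """Keep the run with the smallest criterion value over several random starts."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centroids, labels = lloyd(data, rng.sample(data, k))
        error = criterion(data, centroids, labels)
        if best is None or error < best[0]:
            best = (error, centroids, labels)
    return best


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
error, centers, labels = best_of_restarts(points, 2)
print(error, labels)
```

Restarts multiply the running time by the number of runs, so the number of restarts is a trade-off between cost and robustness.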

THE K-MEANS ALGORITHM (CONTINUED)

To date, many algorithms that inherit the ideas of k-means have been applied in data mining to handle very large data sets. Widely and effectively used examples include k-medoids, PAM, CLARA, CLARANS, k-prototypes, ...
