
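MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF TECHNOLOGY (TRƯỜNG ĐẠI HỌC BÁCH KHOA HÀ NỘI)
------------------------------------

MASTER OF SCIENCE THESIS
FIELD: INFORMATION TECHNOLOGY

RESEARCH AND IMPLEMENTATION OF SEVERAL CLUSTERING AND CLASSIFICATION ALGORITHMS

VŨ LAN PHƯƠNG

HANOI, 2006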

TABLE OF CONTENTS

INTRODUCTION
ABBREVIATIONS AND COMMON TERMS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: OVERVIEW OF KNOWLEDGE DISCOVERY AND DATA MINING
1.1 General introduction
1.2 Data mining techniques
1.3 Advantages of data mining over other methods
1.4 Applications of KDD and challenges for KDD
1.5 Conclusion
CHAPTER 2: CLASSIFICATION TECHNIQUES IN DATA MINING
2.1 What is classification?
2.2 Issues regarding classification
2.3 Classification by decision tree induction
2.4 Bayesian classification
2.5 Classification by backpropagation
2.6 Association-based classification
2.7 Other classification methods
2.8 Classifier accuracy
2.9 Conclusion
CHAPTER 3: CLUSTERING TECHNIQUES IN DATA MINING
3.1 What is clustering?
3.2 Types of data in clustering
3.3 A categorization of the major clustering methods
3.4 Partitioning methods
3.5 Hierarchical methods
3.6 Density-based clustering methods
3.7 Grid-based clustering methods
3.8 Conclusion
CHAPTER 4: EXPERIMENTAL IMPLEMENTATION
4.1 Overall design
4.2 Data preparation
4.3 Program design
4.4 Experimental results and evaluation
4.5 Conclusion
CONCLUSION
REFERENCES

ACKNOWLEDGEMENTS

First of all, I would like to sincerely thank my supervisor, Assoc. Prof. Dr. Nguyễn Ngọc Bình, for his dedicated guidance and advice throughout this work. I am grateful to the lecturers of the Faculty of Information Technology in particular, and of Hanoi University of Technology in general, for teaching me and giving me valuable knowledge during my studies and research at the university. I also thank my family and friends, who have always encouraged, cared for and helped me during my studies and while writing this thesis.

Because of limited time and knowledge, the thesis certainly contains shortcomings. I look forward to receiving valuable comments from my teachers and colleagues.

Hanoi, November 2006
Vũ Lan Phương

INTRODUCTION

Overview

The development of information technology and its application in many areas of social and economic life over the years has meant that the volume of data collected and stored by organizations keeps growing. Organizations store this data because they believe it contains some hidden value. According to statistics, however, only a small portion of it (about 5% to 10%) is ever analyzed. For the rest, organizations do not know what they should or could do, yet they continue to collect it at great expense for fear that something important might be missed and needed later.

At the same time, in a competitive environment, people increasingly need more information, faster, to support decision making, and a growing number of qualitative questions must be answered from the huge volume of existing data. For these reasons, traditional methods of database management and exploitation no longer meet practical needs, and this has given rise to a new technical trend: Knowledge Discovery and Data Mining (KDD).

Knowledge discovery and data mining have been studied and applied in many fields around the world. In Vietnam the technique is still relatively new, but it is being studied and is gradually being put into use. The most important step of the process is data mining (DM), which helps users obtain useful knowledge from databases and other very large data sources. Many businesses and organizations worldwide have applied data mining to their production and business activities and have obtained great benefits. The key to such success is the development of mathematical models and efficient algorithms. This thesis therefore addresses two commonly used data mining techniques: classification and clustering (cluster analysis).

Thesis structure

Apart from the Introduction, Table of Contents, List of Figures, List of Tables, Conclusion and References, the thesis is divided into 4 parts:

Part I: Overview of knowledge discovery and data mining
This part gives a general introduction to the knowledge discovery process and to data mining in particular. It emphasizes the two main techniques studied in the thesis: classification and clustering.

Part II: Classification
This part presents classification in detail. There are many kinds of classification, such as decision tree induction, Bayesian classification, classification by backpropagation networks, association-based classification and other methods. It also covers how to estimate the accuracy of classifiers.

Part III: Clustering
Clustering techniques are likewise divided into several kinds: partitioning methods, hierarchical methods, density-based methods and grid-based methods.

Part IV: Experimental implementation
This part presents the results obtained when applying data mining algorithms to extract information from sample data.

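ABBREVIATIONS AND COMMON TERMS

Term            Vietnamese           Meaning
KDD             Phát hiện tri thức   Knowledge discovery
DM              Khai phá dữ liệu     Data mining
Classification  Phân loại            Classification
Clustering      Phân cụm             Clustering
CSDL            Cơ sở dữ liệu        Database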

LIST OF TABLES

Table 2.1: Training data tuples from the AllElectronics customer database
Table 2.2: Sample data for the class "buys computer"
Table 2.3: Initial input values, weights and biases
Table 2.4: Net input and output calculations
Table 2.5: Error calculation at each node
Table 2.6: Weight and bias update calculations
Table 3.1: Contingency table for binary variables
Table 3.2: A relational table containing mostly binary attributes
Table 4.1: An example *.names data format file
Table 4.2: An example *.data data file
Table 4.3: Classification experiment results
Table 4.4: Results of improving classification quality
Table 4.5: Classification experiment results for Kmeans and Kmedoids
Table 4.6: Classification experiment results for Kmedoids and See5

LIST OF FIGURES

Figure 1.1: The knowledge discovery process
Figure 1.2: A data set with 2 classes: able and unable to repay the loan
Figure 1.3: A classification learned by a neural network for the loan data set
Figure 1.4: Clustering of the loan data set into 3 clusters
Figure 2.1: The data classification process
Figure 2.2: A decision tree for the concept "buys computer"
Figure 2.3: The ID3 algorithm for decision trees
Figure 2.4: The attribute age has the highest information gain
Figure 2.5: Attribute-list and class-list data structures used in SLIQ for the sample data in Table 2.2
Figure 2.6: a) A simple Bayesian belief network, b) Conditional probability table for the values of the variable LungCancer (LC)
Figure 2.7: A multilayer feed-forward neural network
Figure 2.8: The backpropagation algorithm
Figure 2.9: A hidden-layer or output-layer unit
Figure 2.10: An example of a multilayer feed-forward neural network
Figure 2.11: Rules that can be extracted from trained neural networks
Figure 2.12: A rough-set approximation of the set of samples of class C
Figure 2.13: Fuzzy values for income
Figure 2.14: Estimating classifier accuracy with the holdout method
Figure 2.15: Increasing classifier accuracy
Figure 3.1: The k-means algorithm
Figure 3.2: Clustering a set of points with the k-means method
Figure 3.3: The k-medoids algorithm
Figure 3.4: Clustering a set of points with the k-medoids method
Figure 3.5: Clustering a set of points with the agglomerative nesting method
Figure 3.6: Clustering a set of points with CURE
Figure 3.7: CHAMELEON: hierarchical clustering based on k-nearest neighbors and dynamic modeling
Figure 3.8: Density reachability and density connectivity in density-based clustering
Figure 3.9: Cluster ordering in OPTICS
Figure 3.10: Density function and density attractor
Figure 3.11: Center-defined clusters and arbitrary-shape clusters
Figure 3.12: A hierarchical structure for STING clustering
Figure 3.13: The wavelet-based clustering algorithm
Figure 3.14: A sample 2-dimensional feature space
Figure 3.15: Multiresolution of the feature space in Figure 3.14: a) scale 1; b) scale 2; c) scale 3
Figure 4.1: Program design
Figure 4.2: Chart comparing Kmeans and Kmedoids in the classification problem with K=10
Figure 4.3: Chart comparing Kmeans and Kmedoids in the classification problem
Figure 4.4: Chart comparing Kmedoids and See5 in the classification problem

CHAPTER 1: OVERVIEW OF KNOWLEDGE DISCOVERY AND DATA MINING

1.1 General introduction

In recent years, the strong growth of information technology and the hardware industry has made the capacity of information systems to collect and store information increase at a dizzying rate. In addition, the rapid and widespread computerization of production, business and many other activities has produced a huge volume of stored data. Millions of databases are used in production, business and management, and many of them are extremely large, on the order of gigabytes or even terabytes. This explosion has created an urgent need for new techniques and tools that automatically turn these huge volumes of data into useful knowledge. As a result, data mining has become a topical field of information technology worldwide.

1.1.1 The concept of data mining

Data mining is a concept that emerged in the late 1980s. It is the process of extracting valuable information hidden in large volumes of data stored in databases, data warehouses and similar repositories. Besides the term data mining, several other terms with a similar meaning are in use: knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology and data dredging. Many people treat data mining and another popular term, Knowledge Discovery in Databases (KDD), as synonyms. In practice, however, data mining is only one essential step in the process of knowledge discovery in databases. It can be said that data mining is the most important stage of that process, and that the knowledge it produces supports decision making in science and business.

1.1.2 The steps of the knowledge discovery process

The knowledge discovery process goes through 6 stages, as shown in Figure 1.1:

[Figure 1.1 diagram. Stages: data gathering, data selection, data cleaning and preprocessing, data transformation, data mining, rule evaluation. Intermediate products: data (from the Internet, ...), target data, cleaned and preprocessed data, transformed data, model, knowledge.]

Figure 1.1: The knowledge discovery process

The process starts with a store of raw data and ends with extracted knowledge. In theory this looks simple, but in reality it is a difficult process with many obstacles, such as managing the data sets and having to repeat the whole process again and again.

(1) Data gathering: Collecting data is the first step of the data mining process. The data is drawn from a database, a data warehouse or even from Web application sources.

(2) Data selection: At this stage, data is selected or partitioned according to some criteria that serve the mining goal, for example selecting all people aged 25 to 35 who hold a university degree.

(3) Data cleaning, preprocessing and preparation: This third stage is often neglected, but it is in fact a very important step of the data mining process. A common error made during data gathering is a lack of rigor and logical consistency, so the data often contains meaningless values and cannot be linked together. Example: age = 673. This stage handles such inconsistent data, which is regarded as redundant information with no value. It is a very important step because data that is not cleaned, preprocessed and prepared will lead to seriously misleading results.

(4) Data transformation: Next, the data is reorganized so that it can be used and controlled, that is, it is converted into a form suitable for mining by applying grouping or aggregation operations.

(5) Data mining: This is the step of data mining that requires the most thought. At this stage, many different algorithms are used to extract patterns from the data. Commonly used ones include classification rules and association rules.

(6) Rule evaluation and knowledge representation: At this stage, the data patterns have been extracted by the data mining software. Not every pattern is useful, and some are misleading. Evaluation criteria are therefore needed to extract the knowledge that is actually wanted. The usefulness of the patterns that represent knowledge is assessed with a number of measures, after which presentation and data visualization techniques are used to present the mined knowledge to the user.

These are the 6 stages of the knowledge discovery process. Stage 5, data mining, receives the most attention.

1.2 Data mining techniques

Figure 1.2 shows an artificial two-dimensional data set of 23 cases. Each point in the figure represents a person who took a bank loan at some time in the past. The data is divided into two classes: people who were unable to repay the loan, and people whose loans were in good standing (that is, who were able to repay the bank at that time). In practice, the two main goals of data mining are prediction and description.

[Figure 1.2 plot. Vertical axis: Debt. Horizontal axis: Income. Point classes: unable to repay the loan, able to repay the loan.]

Figure 1.2: A data set with 2 classes: able and unable to repay the loan

1.2.1 Predictive data mining

The task of predictive data mining is to make predictions based on inference from current data. It uses variables or fields in the database to predict unknown or future values. It includes techniques such as classification and regression.

1.2.1.1 Classification

The goal of data classification is to predict class labels for data samples. The classification process usually has 2 steps: building a model, and using the model to classify data.

Step 1: Build the model by analyzing given data samples. Each sample belongs to a class, determined by an attribute called the class attribute. These samples are called the training data set. The class labels of the training data must all be known before the model is built, which is why this method is also called supervised learning.

Step 2: Use the model to classify data. First, the accuracy of the model must be estimated. If the accuracy is acceptable, the model is used to predict class labels for other, future data samples.

In other words, classification learns a function that maps a data item into one of several predefined classes. Figure 1.3 shows the loan data partitioned into two class regions. The bank can use the class regions to decide automatically whether future applicants should be given a loan.
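The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (income, debt) values are hypothetical, and a nearest-centroid rule stands in for the model (Figure 1.3 uses a neural network, which is not reproduced here). The function names `fit_centroids`, `predict` and `accuracy` are invented for this sketch.

```python
from math import dist

def fit_centroids(samples, labels):
    """Step 1: build the model, here one mean point (centroid) per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda y: dist(x, model[y]))

def accuracy(model, samples, labels):
    """Fraction of labeled samples that the model classifies correctly."""
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)

# Hypothetical (income, debt) pairs; not taken from the thesis data.
train_x = [(20, 30), (25, 35), (30, 40), (60, 10), (70, 15), (80, 20)]
train_y = ["default", "default", "default", "good", "good", "good"]
test_x = [(22, 33), (75, 12)]
test_y = ["default", "good"]

model = fit_centroids(train_x, train_y)           # Step 1: build the model
test_accuracy = accuracy(model, test_x, test_y)   # Step 2a: estimate accuracy
new_label = predict(model, (65, 18))              # Step 2b: classify a new applicant
```

The accuracy is computed on samples that were not used to build the model, which is the reason for keeping `test_x` separate from `train_x`.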

[Figure 1.3 plot. Vertical axis: Debt. Horizontal axis: Income.]

Figure 1.3: A classification learned by a neural network for the loan data set

1.2.1.2 Regression

Regression differs from classification in that regression predicts continuous values, whereas classification predicts only discrete values. Regression learns a function that maps a data item to a real-valued prediction variable. Regression has many applications, for example estimating the probability that a patient will die given the results of a set of diagnostic tests, or forecasting consumer demand for a new product as a function of advertising activity.

1.2.2 Descriptive data mining

This technique describes the general properties or characteristics of the data in an existing database. It includes techniques such as clustering and association rule analysis.

1.2.2.1 Clustering

The main goal of clustering is to group similar objects in a data set into clusters, so that objects in the same cluster are similar to each other and objects in different clusters are dissimilar. Clustering is an example of unsupervised learning. Unlike classification, clustering does not require predefined training samples. Clustering can therefore be regarded as learning by observation, whereas classification is learning by example. With this method it is not possible to know at the start what the resulting clusters will look like, so a domain expert is usually needed to evaluate the clusters obtained. Clustering is widely used in applications such as market segmentation, customer segmentation, pattern recognition and Web page categorization. It can also be used as a preprocessing step for other data mining algorithms.

Figure 1.4 shows the loan data set clustered into 3 clusters. Note that the clusters overlap, which allows data points to belong to more than one cluster.
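As a rough illustration of grouping unlabeled points, the sketch below uses k-means, one of the partitioning methods treated in Chapter 3. It is a minimal version under stated assumptions: the data points are hypothetical, the initial centroids are simply the first k points, and, unlike the overlapping clusters of Figure 1.4, k-means assigns every point to exactly one cluster.

```python
from math import dist

def kmeans(points, k, max_iter=100):
    """Plain k-means; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
    return centroids, labels

# Hypothetical (income, debt) points forming two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, labels = kmeans(points, k=2)
```

No class labels are supplied: the grouping comes only from the distances between the points, which is what distinguishes this from the classification setting.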
[Figure 1.4 plot. Vertical axis: Debt. Horizontal axis: Income. Regions: Cluster 1, Cluster 2, Cluster 3.]

Figure 1.4: Clustering of the loan data set into 3 clusters

1.2.2.2 Association rules

The goal of this method is to discover and report relationships between data values in the database. The output of the mining algorithm is the set of association rules found. Association rule mining is carried out in 2 steps:

Step 1: Find all frequent itemsets. An itemset is frequent if its computed support satisfies the minimum support.

Step 2: Generate strong association rules from the frequent itemsets. The rules must satisfy both the minimum support and the minimum confidence.

This method is used very effectively in areas such as targeted marketing, decision analysis and business management.

1.3 Advantages of data mining over other methods

Data mining is related to many other disciplines, such as database systems and statistics. Moreover, depending on the approach used, data mining can also apply techniques such as neural networks, rough set or fuzzy set theory, and knowledge representation. Data mining is thus built on well-known basic methods. How, then, does data mining differ from those methods, and why does it have an advantage over them? We examine these questions in turn.

1.3.1 Machine learning

Compared with machine learning, data mining has the advantage that it can work with databases that are typically dynamic, incomplete, noisy and much larger than typical machine learning data sets. Machine learning methods, in contrast, are mainly applied to databases that are complete, change little and are not too large.

In machine learning, the term database mainly refers to a set of samples stored in a file. The samples are usually fixed-length vectors, and information about their features and value ranges is sometimes also stored, as in a data dictionary. A learning algorithm takes the data set and its accompanying information as input, and its output expresses the result of the learning.

Machine learning can be applied to databases, in which case learning takes place over the records of a database rather than over a set of samples. In practice, however, databases are typically dynamic, incomplete and noisy, and much larger than typical machine learning data sets. These factors make most machine learning algorithms ineffective. Data mining addresses problems that are already typical in machine learning but that exceed its capabilities, namely working with databases that contain a lot of noise, incomplete data or continuously changing data.

1.3.2 Expert systems

Expert systems capture the knowledge needed for a particular problem. Knowledge acquisition techniques help to elicit knowledge from human experts.

Each expert system method is a way of inferring rules from examples and solutions that experts provide for a problem. Expert system methods differ from data mining in that the experts' examples are usually of much higher quality than the data in databases, and they usually cover only the important fields. In addition, the experts confirm the validity and usefulness of the patterns discovered.

1.3.3 Statistics

Although statistical methods provide a solid theoretical foundation for data analysis problems, a purely statistical approach is not sufficient, because:
- Statistical methods are not suited to the structured data types found in many databases.
- Statistics works purely on the data and does not use existing knowledge about the domain of interest.
- The results of statistical analysis can be very numerous and hard to interpret.
- Statistical methods need the user's guidance to determine how and where to analyze the data.

Statistics is one of the theoretical foundations of data mining. The basic difference between the two is that data mining is a tool used by end users rather than by statisticians. Data mining overcomes the weaknesses of statistics listed above and automates the statistical process effectively, which reduces the end user's work and yields a tool that is easier to use.

1.4 Applications of KDD and challenges for KDD

1.4.1 Applications of KDD

KDD techniques can be applied in many fields:
- Business information: analysis of marketing and sales data, investment analysis, loan approval, fraud detection, ...
- Manufacturing information: control and scheduling, network management, analysis of experimental results, ...
- Scientific information: geography (for example earthquake detection), ...
- ...

1.4.2 Challenges for KDD

- Much larger databases: Databases with hundreds of fields and tables, millions of records and sizes of many gigabytes are now commonplace, and terabyte (10^12 bytes) databases are beginning to appear.
- High dimensionality: Not only is there often a very large number of records in the database, there can also be a very large number of fields (attributes, variables), so the dimensionality of the problem becomes high. In addition, high dimensionality increases the chance that a data mining algorithm will find invalid patterns. It is therefore necessary to reduce the effective dimensionality of the problem and to use prior knowledge to identify irrelevant variables.
- Overfitting: When an algorithm searches for the best parameters of a particular model using a limited set of data, the resulting model may perform poorly on test data. Possible solutions include cross-validation, regularization and other sophisticated statistical strategies.
- Changing data and knowledge: Rapidly changing (dynamic) data can make previously discovered patterns invalid. In addition, the variables measured in a given application database can be modified, deleted or augmented with new measurements. Reasonable solutions include incremental methods for updating the patterns and handling change.
- Missing and noisy data: This problem is especially acute in business databases. U.S. census data is reported to have error rates as high as 20%. Important attributes can be missing if the database was not designed with discovery in mind. Possible solutions include more sophisticated statistical strategies for identifying hidden variables and dependent variables.

- Complex relationships between fields: Hierarchically structured attributes or values, relations between attributes, and more sophisticated means of representing knowledge about the contents of a database all require algorithms that can use such information effectively. Historically, data mining algorithms were developed for simple attribute-value records, although new techniques for deriving relations between variables are being developed.
- Understandability of patterns: In many applications, it is important that what is discovered be as understandable to humans as possible. Possible solutions include graphical representations, rule structuring with directed graphs, natural language representation, and techniques for visualizing data and knowledge.
- User interaction and prior knowledge: Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem, except in simple ways. The use of domain knowledge is important in all steps of the KDD process.
- Integration with other systems: A stand-alone discovery system may not be very useful. Typical integration issues include integration with a DBMS (for example through a query interface), integration with spreadsheets and visualization tools, and accommodating real-time sensor readings.

1.5 Conclusion

Data mining has become, and continues to be, one of the research directions that attract the attention of many information technology experts around the world. In recent years, many new methods and algorithms have been published continuously, which demonstrates the advantages, benefits and great practical applicability of data mining. This chapter has presented an overview of data mining and the most basic knowledge about clustering, classification and association rule mining.

CHAPTER 2: CLASSIFICATION TECHNIQUES IN DATA MINING

Databases are rich in hidden information that can be used to make intelligent business decisions. Classification is a form of data analysis used to extract models that describe important data classes or to predict future data trends. Classification predicts categorical labels (discrete values). Many classification methods have been proposed by researchers in machine learning, expert systems, statistics and other fields. Most of these algorithms assume a small data size. Recent database mining research has extended classification techniques so that they can handle large, disk-resident data. These techniques often consider parallel and distributed processing.

This chapter examines basic techniques for data classification: decision tree induction, Bayesian classification, Bayesian belief networks, neural networks and association-based classification. Other approaches to classification, such as k-nearest neighbor classifiers, case-based reasoning, genetic algorithms, rough sets and fuzzy logic, are also covered.

2.1 What is classification?

Data classification is a two-step process (Figure 2.1). In the first step, a model is built that describes a predetermined set of data classes. The model is obtained by analyzing database tuples. Each tuple is assumed to belong to a predefined class, given by the value of a designated attribute called the class label attribute. The tuples used to build the model form the training data set. Because the class label of each training sample is known in advance, this step is also known as supervised learning. It contrasts with unsupervised learning, in which the class labels of the training samples are unknown and the number or set of classes to be learned is not known in advance.

The learned model is represented in the form of classification rules, decision trees or mathematical formulas. For example, given a database of customer credit information, classification rules can be learned to identify customers whose credit rating is good or fair (Figure 2.1a). The rules can be used to classify future data samples and also give a better understanding of the database contents.

a) Training data, analyzed by the classification algorithm to produce classification rules:

Name      Age     Income  Credit rating
Sandy     <30     Low     Fair
Bill      <30     Low     Good
Courtney  30-40   High    Good
Susan     >40     Medium  Fair
Claire    >40     Medium  Fair
Andre     30-40   High    Good
...       ...     ...     ...

Classification rules:
IF Age = 30-40 AND Income = High THEN Credit rating = Good

b) Test data and new data, classified with the classification rules:

Name      Age     Income  Credit rating
Frank     >40     High    Fair
Sylvia    <30     Low     Fair
Anne      30-40   High    Good
...       ...     ...     ...

New data: (John, 30-40, High). Credit rating? Good

Figure 2.1: The data classification process

In the second step (Figure 2.1b), the model is used for classification. First, the predictive accuracy of the model (or classifier) is estimated. Section 2.8 of this chapter describes several methods for estimating classifier accuracy. The holdout method is a simple technique that uses a test set of class-labeled samples. These samples are selected randomly and are independent of the training samples. The accuracy of a model on a given test set is the percentage of test set samples that the model classifies correctly. For each test sample, the known class label is compared with the class predicted by the learned model for that sample. If the accuracy of the model were estimated on the training data set, the estimate could be optimistic, because the learned model tends to overfit the data. A separate test set is therefore needed.

If the accuracy of the model is acceptable, the model can be used to classify future data tuples or objects whose class label is unknown. For example, the classification rules learned in Figure 2.1a from the analysis of data on existing customers can be used to predict the credit rating of new customers.

Example 2.1: Suppose we have a database of customers on the AllElectronics mailing list. The mailing list is used to send out promotional material describing new products and announcing price reductions. The database describes customer attributes such as name, age, income, occupation and credit rating. Customers are classified according to whether or not they bought a computer at AllElectronics. Suppose that new customers are added to the database and that we want to inform them of a computer sale. Instead of sending promotional material to every new customer, we send it only to those who are likely to buy a computer, which is more cost-efficient. A classification model is built and used for this purpose.

2.2 Issues regarding classification

2.2.1 Preparing the data for classification: The following data preprocessing steps help to improve the accuracy, efficiency and scalability of classification.

- Data cleaning: This is the preprocessing of data to remove or reduce noise and to handle missing values. This step helps to reduce confusion during learning.

- Relevance analysis: Many attributes in the data may be irrelevant or redundant for classification. Relevance analysis is therefore performed on the data to remove any irrelevant or redundant attributes. In machine learning, this step is called feature selection. It makes classification more efficient and improves scalability.

- Data transformation: The data can be generalized to higher-level concepts. This is very useful for continuous-valued attributes. For example, the numeric values of the attribute income can be generalized to discrete ranges such as low, medium and high. Similarly, nominal-valued attributes such as street can be generalized to higher-level concepts such as city. As a result, fewer input/output operations are needed during learning. The data can also be normalized, particularly when neural networks or methods that use distance measures are involved in the learning step. Normalization scales all values of a given attribute so that they fall within a small specified range, such as [-1.0, 1.0] or [0, 1.0]. This prevents attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes).

2.2.2 Comparing classification methods: Classification methods can be compared and evaluated according to the following criteria:

- Predictive accuracy: the ability of the model to correctly predict the class label of new data.

- Speed: the computational cost of generating and using the model.

- Robustness: the ability of the model to make correct predictions given noisy data or data with missing values.

- Scalability: the ability of the model to perform efficiently on large amounts of data.

- Interpretability: the level of understanding and insight into the data that the model provides.
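The normalization mentioned under data transformation in 2.2.1 can be sketched as min-max scaling. This is a minimal sketch, not the thesis implementation: the function name `min_max_scale` and the income values are hypothetical, and min-max scaling is only one way of mapping an attribute into a range such as [0, 1.0] or [-1.0, 1.0].

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no range information; map it to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

# Hypothetical income values with a much larger range than a binary attribute.
incomes = [12000, 30000, 48000]
scaled_unit = min_max_scale(incomes)                 # range [0, 1.0]
scaled_symmetric = min_max_scale(incomes, -1.0, 1.0) # range [-1.0, 1.0]
```

After scaling, income and a binary attribute both vary over a comparable range, so a distance measure no longer lets income dominate.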

-222.3 Phn loi bng cy quyt nh quy np Tui? <30 Sinh vin? Khng Khng C C 30-40 C >40 tn nhim? Tt C Kh tt Khng

Hnh 2.2: Cy quyt nh cho khi nim mua my tnh "Cy quyt nh l g?" Cy quyt nh l cu trc cy c dng biu lung, mi nt trong l kim nh trn mt thuc tnh, mi nhnh i din cho mt kt qu kim nh, cc nt l i din cho cc lp. Nt cao nht trn cy l nt gc. Hnh 2.2 th hin cy quyt nh biu din khi nim mua my tnh, n d on liu mt khch hng ti AllElectronics c mua my tnh hay khng. Hnh ch nht biu th cc nt trong, hnh elip biu th cc nt l. phn loi mt mu cha bit, cc gi tr thuc tnh ca mu s c kim nh trn cy. ng i t gc ti mt nt l cho bit d on lp i vi mu . Cy quyt nh c th d dng chuyn i thnh cc lut phn loi. Mc 2.3.1 l gii thut hc c bn ca cy quyt nh. Khi cy quyt nh c xy dng, nhiu nhnh c th phn nh nhiu hay cc outlier trong d liu hun luyn. Vic ct ta cy c gng nhn bit v g b cc nhnh ny. Cy ct ta c m t trong mc 2.3.3. Ci tin gii thut cy quyt nh c bn c cp ti trong mc 2.3.4. Cc vn v kh nng m rng cho cy quyt nh quy np t c s d liu ln c cp trong mc 2.3.5. 2.3.1 Cy quyt nh quy np Gii thut 2.3.1 Generate_decision_tree (Sinh cy quyt nh): Xy dng cy quyt nh t d liu hun luyn cho trc. u vo: Cc mu hun luyn samples, l cc gi tr ri rc ca cc thuc tnh;

the set of candidate attributes, attribute-list.
Output: A decision tree.
Algorithm:
1) create a node N;
2) if samples are all of the same class C then
3)     return N as a leaf node labeled with the class C;
4) if attribute-list is empty then
5)     return N as a leaf node labeled with the most common class in samples;
6) select test-attribute, the attribute in attribute-list with the highest information gain;
7) label node N with test-attribute;
8) for each value ai of test-attribute
9)     grow a branch from node N for the condition test-attribute = ai;
10)    let si be the set of samples in samples for which test-attribute = ai;
11)    if si is empty then
12)        attach a leaf labeled with the most common class in samples;
13)    else attach the node returned by Generate_decision_tree(si, attribute-list - test-attribute);

Figure 2.3: The ID3 algorithm for decision tree induction

The basic algorithm for decision tree induction is ID3, a well-known decision tree induction algorithm. Extensions to the algorithm are discussed in Sections 2.3.4 and 2.3.5.

* Attribute selection measure: The information gain measure is used to select the test attribute at each node of the tree. Such a measure is also called an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain (that is, the greatest entropy reduction) is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. This information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple (but not necessarily the simplest) tree is found.

Let S be a set of s data samples. Suppose the class label attribute has m distinct values defining m distinct classes Ci (for i = 1,...,m), and let si be the number of samples of S in class Ci. The expected information needed to classify a given sample is given by equation (2.1):

\[ I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i) \tag{2.1} \]

where pi is the probability that an arbitrary sample belongs to class Ci, estimated by si/s.

Let attribute A have v distinct values, {a1,a2,...,av}. Attribute A can be used to partition S into v subsets {S1,S2,...,Sv}, where Sj contains the samples in S that have value aj of A. If A is selected as the test attribute (i.e., the best attribute for splitting), these subsets correspond to the branches grown from the node containing the set S. Let sij be the number of samples of class Ci in subset Sj. The entropy, or expected information based on the partitioning of the samples into v subsets by A, is:

\[ E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj}) \tag{2.2} \]
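Equations (2.1) and (2.2) can be evaluated directly from class counts. The following is a minimal sketch, not the thesis implementation; the function names `expected_info` and `entropy_of_split` are chosen here for illustration, and the class counts are those the text uses for the attribute age in Example 2.2.

```python
import math

def expected_info(counts):
    """I(s1, ..., sm) of equation (2.1); empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_of_split(partitions):
    """E(A) of equation (2.2); partitions[j] holds the class counts (s1j, ..., smj)."""
    s = sum(sum(p) for p in partitions)
    return sum(sum(p) / s * expected_info(p) for p in partitions)

# Class counts (yes, no) in the three subsets produced by splitting on age.
partitions = [(2, 3), (4, 0), (3, 2)]
e_age = entropy_of_split(partitions)
i_all = expected_info((9, 5))
```

Skipping zero counts implements the usual convention that 0 log 0 is taken as 0, so a pure subset such as (4, 0) contributes no entropy.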

The encoding information that would be gained by branching on A is:

\[ \mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A) \tag{2.3} \]

The algorithm computes the information gain of each attribute. The attribute with the highest information gain is chosen as the test attribute for the set S. A node is created and labeled with that attribute, a branch is created for each value of the attribute, and the samples are partitioned accordingly.

Example 2.2: Induction of a decision tree: Table 2.1 presents a training set of data tuples taken from the AllElectronics customer database. The class label attribute, buys computer, has two distinct values, {yes, no}; therefore there are two distinct classes (m = 2). Let class C1 correspond to yes and class C2 to no. There are 9 samples of class yes and 5 samples of class no. To compute the information gain of each attribute, we first use equation (2.1) to compute the expected information needed to classify a given sample:

\[ I(s_1, s_2) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \]

Next we compute the entropy of each attribute, starting with age. We look at the distribution of yes and no samples for each value of age and compute the expected information for each distribution:

For age = "<30":   s11 = 2, s21 = 3, I(s11,s21) = 0.971
For age = "30-40": s12 = 4, s22 = 0, I(s12,s22) = 0
For age = ">40":   s13 = 3, s23 = 2, I(s13,s23) = 0.971

Table 2.1: Training data tuples from the AllElectronics customer database

No.  Age    Income  Student  Credit rating  Class: buys computer
1    <30    High    No       Fair           No
2    <30    High    No       Excellent      No
3    30-40  High    No       Fair           Yes
4    >40    Medium  No       Fair           Yes
5    >40    Low     Yes      Fair           Yes
6    >40    Low     Yes      Excellent      No
7    30-40  Low     Yes      Excellent      Yes
8    <30    Medium  No       Fair           No
9    <30    Low     Yes      Fair           Yes
10   >40    Medium  Yes      Fair           Yes
11   <30    Medium  Yes      Excellent      Yes
12   30-40  Medium  No       Excellent      Yes
13   30-40  High    Yes      Fair           Yes
14   >40    Medium  No       Excellent      No

Using equation (2.2), the expected information needed to classify a given sample if the samples are partitioned according to age is:

\[ E(\text{age}) = \frac{5}{14} I(s_{11}, s_{21}) + \frac{4}{14} I(s_{12}, s_{22}) + \frac{5}{14} I(s_{13}, s_{23}) = 0.694 \]

Hence the information gained from this partitioning is:

Gain(age) = I(s1,s2) - E(age) = 0.246

Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit rating) = 0.048. Since age has the highest information gain, it is selected as the test attribute. A node is created and labeled with age, and a branch is grown for each of the attribute's values. The samples are then partitioned accordingly, as shown in Figure 2.4. The samples falling into the branch age = 30-40 all belong to class yes, so a leaf labeled yes is created at the end of this branch. The final decision tree returned by the algorithm is shown in Figure 2.2.

[Figure 2.4: root node age? with three branches (<30, 30-40, >40); each branch lists the income, student, credit rating, and class values of the training samples that fall into it.]

Figure 2.4: The attribute age has the highest information gain
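The gain computation of Example 2.2 and the recursion of Algorithm 2.3.1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis code: the nested-dictionary tree encoding and the names `info`, `gain`, and `id3` are choices made here, and the data are the rows of Table 2.1.

```python
import math
from collections import Counter

ATTRS = ["age", "income", "student", "credit"]
ROWS = [  # Table 2.1: age, income, student, credit rating, class
    ("<30", "high", "no", "fair", "no"), ("<30", "high", "no", "excellent", "no"),
    ("30-40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("30-40", "low", "yes", "excellent", "yes"), ("<30", "medium", "no", "fair", "no"),
    ("<30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<30", "medium", "yes", "excellent", "yes"), ("30-40", "medium", "no", "excellent", "yes"),
    ("30-40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
SAMPLES = [dict(zip(ATTRS + ["cls"], r)) for r in ROWS]

def info(samples):
    """Expected information, equation (2.1), from the class distribution."""
    n = len(samples)
    counts = Counter(s["cls"] for s in samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(samples, attr):
    """Information gain, equation (2.3), of splitting samples on attr."""
    n = len(samples)
    values = Counter(s[attr] for s in samples)
    e = sum(cnt / n * info([s for s in samples if s[attr] == v])
            for v, cnt in values.items())
    return info(samples) - e

def id3(samples, attrs):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    classes = Counter(s["cls"] for s in samples)
    if len(classes) == 1 or not attrs:          # steps 2-5 of the algorithm
        return classes.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(samples, a))
    rest = [a for a in attrs if a != best]
    # Only values present in the samples are branched on, so the
    # empty-subset case (steps 11-12) never arises in this sketch.
    return {best: {v: id3([s for s in samples if s[best] == v], rest)
                   for v in sorted({s[best] for s in samples})}}

gains = {a: gain(SAMPLES, a) for a in ATTRS}
tree = id3(SAMPLES, ATTRS)
```

Inspecting `gains` shows age with the largest value, so `tree` has age at its root, in line with Figure 2.4.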

Age becomes the test attribute at the root node of the decision tree. A branch is grown for each value of age, and the samples are partitioned along the branches.

2.3.2 Tree pruning

When a decision tree is built, many of the branches reflect anomalies in the training data caused by noise or outliers. Tree pruning methods address this problem. They use statistical measures to remove the least reliable branches, which generally results in faster classification and a better ability to classify independent test data correctly.

There are two common approaches to tree pruning.

In the prepruning approach, a tree is pruned by halting its construction early (i.e., by deciding not to further split or partition the subset of training samples at a given node). Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset samples. When constructing a tree, measures such as statistical significance, the χ2 statistic, information gain, and so on can be used to assess the goodness of a split. If partitioning the samples at a node would result in a split that falls below a prespecified threshold, further partitioning of the given subset is halted. Choosing an appropriate threshold is difficult: a threshold that is too high yields an oversimplified tree, while one that is too low yields very little simplification.

The postpruning approach removes branches from a "fully grown" tree. A tree node is pruned by removing its branches. Prepruning and postpruning may be interleaved in a combined approach. Postpruning requires more computation than prepruning, but generally leads to a more reliable tree.

2.3.3 Extracting classification rules from decision trees

The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules. One rule is created for each path from the root to a leaf node. Each attribute-value pair along a given path forms a conjunct in the rule antecedent (the "IF" part). The leaf node holds the class prediction, forming the rule consequent (the "THEN" part). IF-THEN rules may be easier for humans to understand, particularly if the given tree is very large.

Example 2.3: Generating classification rules from a decision tree: The decision tree of Figure 2.2 can be converted into classification IF-THEN rules by tracing the path from the root node to each leaf node of the tree. The rules extracted from Figure 2.2, with the leaf labels of the credit rating branch taken from the training data of Table 2.1, are:

IF age = "<30" AND student = no THEN buys computer = no
IF age = "<30" AND student = yes THEN buys computer = yes
IF age = "30-40" THEN buys computer = yes
IF age = ">40" AND credit rating = excellent THEN buys computer = no
IF age = ">40" AND credit rating = fair THEN buys computer = yes
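Tracing every root-to-leaf path, as Example 2.3 does by hand, is a short recursion. The sketch below assumes a nested-dictionary encoding of the tree (an assumption of this example, not a structure defined in the thesis); the leaf labels follow the training data of Table 2.1, and `extract_rules` and `format_rule` are illustrative names.

```python
def extract_rules(tree, conditions=()):
    """Yield one (antecedent, class) pair per root-to-leaf path."""
    if not isinstance(tree, dict):            # leaf: emit the accumulated path
        yield list(conditions), tree
        return
    (attr, branches), = tree.items()          # one test attribute per internal node
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + ((attr, value),))

def format_rule(antecedent, label, target="buys computer"):
    """Render a path as an IF-THEN rule string."""
    cond = " AND ".join(f'{a} = "{v}"' for a, v in antecedent)
    return f'IF {cond} THEN {target} = "{label}"'

# Assumed encoding: {attribute: {value: subtree-or-class-label}}.
tree = {"age": {"<30": {"student": {"no": "no", "yes": "yes"}},
                "30-40": "yes",
                ">40": {"credit rating": {"excellent": "no", "fair": "yes"}}}}

rules = [format_rule(a, c) for a, c in extract_rules(tree)]
for rule in rules:
    print(rule)
```

Each internal node contributes one conjunct to the antecedent, so the number of rules equals the number of leaves.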

A rule can be pruned by removing any condition in its antecedent whose removal does not reduce the accuracy of the rule. For each class, the rules within the class may then be ranked according to their accuracy. Since a given test sample may not satisfy any rule antecedent, a default rule assigning the majority class is added to the resulting rule set.

2.3.4 Enhancements to basic decision tree induction

The basic decision tree induction algorithm of Section 2.3.1 requires all attributes to be categorical or discretized. The algorithm can be modified to allow continuous-valued attributes. A test on a continuous-valued attribute A results in two branches, corresponding to the conditions A ≤ V and A > V for some numeric value V of A. If A has v values, then v-1 possible splits are considered in determining V. Typically, the midpoints between each pair of adjacent values are considered. If the values are sorted in advance, only one pass through the values is required.

The basic decision tree induction algorithm creates one branch for each value of a test attribute and then distributes the samples accordingly. This partitioning can result in a large number of small subsets. As the subsets become smaller and smaller, the partitioning process may end up using sample sizes that are statistically insufficient. Detecting useful patterns in such subsets becomes impossible because there is too little data. One remedy is to group the values of categorical attributes, or to build binary decision trees, in which each branch holds a boolean test on an attribute. Binary trees result in less fragmentation of the data. Several studies have found that binary decision trees tend to be more accurate than traditional decision trees.

Many methods have been proposed for handling missing attribute values. For example, a missing value of attribute A can be replaced by the most common value of A.

2.3.5 Scalability and decision tree induction

Decision tree algorithms such as ID3 and C4.5 were designed for relatively small data sets. Efficiency and scalability become issues of concern when these algorithms are applied to the mining of very large, real-world databases. Most decision tree algorithms have the restriction that the training samples must reside in main memory. In data mining applications, very large training sets of millions of samples are common. This restriction therefore limits the scalability of such algorithms: decision tree construction can become inefficient because the training samples are swapped in and out of main and cache memory.

Early strategies for inducing decision trees from large databases include discretizing continuous attributes; these still assume that the training set fits in memory. To scale up, the data can first be partitioned into subsets that individually fit into memory, and a decision tree is then built from each subset. The final output classifier combines the classifiers obtained from the subsets. Although this method allows the classification of large data sets, its classification accuracy is not as high as that of a single classifier built using all of the data at once.

One of the more recent decision tree algorithms proposed to address scalability is SLIQ, which can handle both categorical and continuous-valued attributes. It proposes presorting techniques for disk-resident data sets that are too large to fit in memory, and it defines new data structures that facilitate tree construction. SLIQ uses disk-resident attribute lists and a single memory-resident class list. The attribute lists and the class list generated by SLIQ for the sample data of Table 2.2 are shown in Figure 2.5. Each attribute has an associated attribute list, indexed by record number (No.). Each tuple is represented by a link from one entry in each attribute list to an entry in the class list, which in turn is linked to its corresponding leaf node in the decision tree. The class list remains in memory because it is frequently accessed and modified during the building and pruning phases. The size of the class list grows in proportion to the number of tuples in the training set. When the class list cannot fit in memory, the performance of SLIQ degrades.

Table 2.2: Sample data for the class buys computer

No.  Credit rating  Age  Buys computer
1    Excellent      38   Yes
2    Excellent      26   Yes
3    Fair           35   No
4    Excellent      49   No
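The attribute lists and class list described above can be built for Table 2.2 in a few lines. This is a simplified in-memory sketch, not SLIQ itself (which keeps the attribute lists on disk); the names `build_lists` and `candidate_splits` and the initial node id 0 are assumptions made for illustration. The midpoint helper shows how a presorted list yields the candidate split points of Section 2.3.4 in one pass.

```python
# Table 2.2 rows: (record number, credit rating, age, class label)
TABLE = [(1, "excellent", 38, "yes"), (2, "excellent", 26, "yes"),
         (3, "fair", 35, "no"), (4, "excellent", 49, "no")]

def build_lists(table):
    """Return (credit_list, age_list, class_list) in the spirit of SLIQ."""
    credit_list = [(credit, rid) for rid, credit, _, _ in table]
    # The continuous attribute is presorted once, before tree construction.
    age_list = sorted((age, rid) for rid, _, age, _ in table)
    # Every record starts at the root; node id 0 is a hypothetical choice.
    class_list = {rid: {"label": label, "node": 0} for rid, _, _, label in table}
    return credit_list, age_list, class_list

def candidate_splits(sorted_list):
    """Midpoints between adjacent distinct values of a presorted attribute list."""
    values = [v for v, _ in sorted_list]
    return [(a + b) / 2 for a, b in zip(values, values[1:]) if a != b]

credit_list, age_list, class_list = build_lists(TABLE)
splits = candidate_splits(age_list)
```

Because each attribute-list entry carries the record number, a scan of `age_list` can look up the class and current node of every record in `class_list` without touching the other attributes.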

[Figure 2.5 content]
Attribute list for credit rating (value, No.): (Excellent, 1), (Excellent, 2), (Fair, 3), (Excellent, 4)
Attribute list for age (value, No.), sorted by age: (26, 2), (35, 3), (38, 1), (49, 4)
Class list (No., buys computer, node): (1, Yes, 5), (2, Yes, 2), (3, No, 3), (4, No, 6)
Decision tree with nodes numbered 0 to 6

Figure 2.5: The attribute list and class list data structures used in SLIQ for the sample data of Table 2.2

2.4 Bayesian classification

Bayesian classifiers are statistical classifiers. Bayesian classification is based on Bayes' theorem. A simple Bayesian classifier is the naive Bayesian classifier. Compared with decision tree and neural network classifiers, Bayesian classifiers have shown high accuracy and speed when applied to large databases.

Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved, and in this sense it is considered "naive". Bayesian belief networks are graphical models which, unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.

Section 2.4.1 reviews basic probability notation and Bayes' theorem. Naive Bayesian classification is then presented in Section 2.4.2, and Bayesian belief networks are described in Section 2.4.3.

2.4.1 Bayes' theorem

Let X be a data sample whose class label is unknown, and let H be a hypothesis, such as that the data sample X belongs to a class C. For classification problems, we want to determine P(H|X), the probability that hypothesis H holds given the observed data sample X. P(H|X) is the posterior probability of H conditioned on X. For example, suppose the data samples are fruits described by their color and shape. Suppose that X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round. In contrast, P(H) is the prior probability of H. In the example, this is the probability that any given data sample is an apple, regardless of how it looks. The posterior probability P(H|X) is based on more information (such as background knowledge) than the prior probability P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that X is red and round given that we know that X is an apple. P(X) is the prior probability of X. In the example, it is the probability that a data sample from the set of fruits is red and round.

P(X), P(H), and P(X|H) can be estimated from the given data. Bayes' theorem is useful because it provides a way of calculating the posterior probability P(H|X) from P(X), P(H), and P(X|H). Bayes' theorem is:

\[ P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \tag{2.4} \]
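A small numeric illustration of equation (2.4) for the fruit example follows. The three probabilities are hypothetical values chosen for this sketch; the thesis gives no numbers for this example.

```python
def posterior(p_x_given_h, p_h, p_x):
    """P(H|X) from equation (2.4)."""
    if p_x <= 0:
        raise ValueError("P(X) must be positive")
    return p_x_given_h * p_h / p_x

# Hypothetical estimates, not taken from the thesis:
p_h = 0.30          # P(H): a fruit in the set is an apple
p_x_given_h = 0.60  # P(X|H): an apple is red and round
p_x = 0.25          # P(X): a fruit in the set is red and round

p_h_given_x = posterior(p_x_given_h, p_h, p_x)
```

With these assumed inputs, observing that the fruit is red and round raises the probability of "apple" above its prior, which is the effect the text describes.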

In the next section we will see how Bayes' theorem is used in the naive Bayesian classifier.

2.4.2 Naive Bayesian classification

The naive Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Each data sample is represented by an n-dimensional feature vector, X = (x1,x2,...,xn), describing n measurements made on the sample from n attributes A1, A2,..., An, respectively.

2. Suppose that there are m classes C1,C2,...,Cm. Given an unknown data sample X (one with no class label), the classifier predicts that X belongs to the class with the highest posterior probability, conditioned on X. That is, the naive Bayesian classifier assigns an unknown sample X to the class Ci if and only if:

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i

Thus we need to find the class that maximizes P(Ci|X). By Bayes' theorem (equation 2.4):

\[ P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \tag{2.5} \]

3. Since P(X) is the same for all classes, only P(X|Ci)P(Ci) needs to be maximized. The class prior probabilities are estimated by P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples.

4. For data sets with many attributes, computing P(X|Ci) would be very expensive. To reduce the computation, the naive assumption of class conditional independence is made. It presumes that the values of the attributes are conditionally independent of one another given the class label of the sample, i.e., that there are no dependence relationships among the attributes. Thus,

\[ P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) \tag{2.6} \]

The probabilities P(x1|Ci), P(x2|Ci),..., P(xn|Ci) are estimated from the training samples as follows:

(a) If Ak is categorical, then P(xk|Ci) = sik/si, where sik is the number of training samples of class Ci having the value xk for Ak, and si is the number of training samples belonging to Ci.

(b) If Ak is continuous-valued, the attribute is assumed to have a Gaussian distribution, so that

\[ P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\!\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right) \tag{2.7} \]

where g(xk, μCi, σCi) is the Gaussian (normal) density function for attribute Ak, and μCi and σCi are the mean and standard deviation, respectively, of attribute Ak for the training samples of class Ci.

5. To classify an unknown sample X, P(X|Ci)P(Ci) is evaluated for each class Ci. Sample X is assigned to the class Ci if and only if:

P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i

In other words, it is assigned to the class Ci for which P(X|Ci)P(Ci) is the maximum.

Example 2.4: Predicting a class label using naive Bayesian classification: We wish to predict the class label of an unknown sample using naive Bayesian classification, given the same training data as in Example 2.2 for decision tree induction. The training data are in Table 2.1. The data samples are described by the attributes age, income, student, and credit rating. The class label attribute, buys computer, has two distinct values, {yes, no}. Let C1 correspond to the class buys computer = yes and C2 to the class buys computer = no. The unknown sample we wish to classify is:

X = (age = "<30", income = medium, student = yes, credit rating = fair)

We need to maximize P(X|Ci)P(Ci) for i = 1, 2. P(Ci), the prior probability of each class, can be computed from the training samples:

P(buys computer = yes) = 9/14 = 0.643
P(buys computer = no) = 5/14 = 0.357

To compute P(X|Ci) for i = 1, 2, we compute the following conditional probabilities:

P(age = "<30" | buys computer = yes) = 2/9 = 0.222
P(age = "<30" | buys computer = no) = 3/5 = 0.600
P(income = medium | buys computer = yes) = 4/9 = 0.444
P(income = medium | buys computer = no) = 2/5 = 0.400
P(student = yes | buys computer = yes) = 6/9 = 0.667
P(student = yes | buys computer = no) = 1/5 = 0.200
P(credit rating = fair | buys computer = yes) = 6/9 = 0.667

P(credit rating = fair | buys computer = no) = 2/5 = 0.400

Using the above probabilities, we obtain:

P(X | buys computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys computer = no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
P(X | buys computer = yes) P(buys computer = yes) = 0.044 x 0.643 = 0.028
P(X | buys computer = no) P(buys computer = no) = 0.019 x 0.357 = 0.007

Therefore, the naive Bayesian classifier predicts "buys computer = yes" for sample X.

2.4.3 Bayesian belief networks

The naive Bayesian classifier makes the assumption of class conditional independence: given the class label of a sample, the values of the attributes are conditionally independent of one another. This assumption simplifies computation. When the assumption holds, the naive Bayesian classifier is the most accurate of all classifiers. In practice, however, dependencies can exist between variables. Bayesian belief networks specify joint conditional probability distributions. They provide a graphical model of causal relationships, on which learning can be performed.

A belief network is defined by two components. The first is a directed acyclic graph, in which each node represents a random variable and each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a parent (immediate predecessor) of Z, and Z is a descendant of Y. Each variable is conditionally independent of its nondescendants in the graph, given its parents. The variables may be discrete or continuous-valued. These networks are also known as belief networks, Bayesian networks, or probabilistic networks. For brevity, we refer to them as belief networks.
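Returning to Example 2.4, the naive Bayesian computation of steps 3 to 5 can be reproduced with relative-frequency counts over Table 2.1. The following is a minimal sketch rather than the thesis implementation; `naive_bayes_scores` is an illustrative name, and no smoothing is applied, so an unseen attribute value would drive a class score to zero.

```python
from collections import Counter

ROWS = [  # Table 2.1: age, income, student, credit rating, class
    ("<30", "high", "no", "fair", "no"), ("<30", "high", "no", "excellent", "no"),
    ("30-40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("30-40", "low", "yes", "excellent", "yes"), ("<30", "medium", "no", "fair", "no"),
    ("<30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<30", "medium", "yes", "excellent", "yes"), ("30-40", "medium", "no", "excellent", "yes"),
    ("30-40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_scores(rows, x):
    """Return {class: P(X|Ci) * P(Ci)} using the estimates of steps 3 and 4(a)."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for cls, s_i in class_counts.items():
        score = s_i / len(rows)                      # P(Ci) = si / s
        for k, value in enumerate(x):
            s_ik = sum(1 for r in rows if r[-1] == cls and r[k] == value)
            score *= s_ik / s_i                      # P(xk|Ci) = sik / si
        scores[cls] = score
    return scores

x = ("<30", "medium", "yes", "fair")                 # the unknown sample X
scores = naive_bayes_scores(ROWS, x)
prediction = max(scores, key=scores.get)
```

Because P(X) is the same for every class, comparing the unnormalized products is enough to choose the class, exactly as step 3 argues.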

a) [Figure 2.6a: a directed acyclic graph over the six Boolean variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnoea; FamilyHistory and Smoker are the parents of LungCancer]

b)
        FH,S   FH,~S   ~FH,S   ~FH,~S
LC      0.8    0.5     0.7     0.1
~LC     0.2    0.5     0.3     0.9

Figure 2.6: a) A simple Bayesian belief network; b) the conditional probability table for the values of the variable LungCancer (LC)

Figure 2.6a) shows a simple belief network, taken from [Russell et al. 1995a], for six Boolean variables. The arcs allow a representation of causal knowledge. For example, whether a person has lung cancer is influenced by the family's history of lung cancer, as well as by whether or not the person is a smoker. Furthermore, the arcs show that the variable LungCancer is conditionally independent of Emphysema, given its parents, FamilyHistory and Smoker. This means that once the values of FamilyHistory and Smoker are known, the variable Emphysema provides no additional information about LungCancer.

The second component defining a belief network is a conditional probability table (CPT) for each variable. The CPT for a variable Z specifies the conditional distribution P(Z|Parents(Z)), where Parents(Z) are the parents of Z. Figure 2.6b) shows a CPT for LungCancer. The conditional probability of each value of LungCancer is given for each possible combination of the values of its parents. For instance, the upper leftmost and bottom rightmost entries correspond to:

P(LungCancer = yes | FamilyHistory = yes, Smoker = yes) = 0.8, and
P(LungCancer = no | FamilyHistory = no, Smoker = no) = 0.9.

The joint probability of any tuple (z1,z2,...,zn) corresponding to the variables or attributes Z1,Z2,...,Zn is computed by:

\[ P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid \mathrm{Parents}(Z_i)) \tag{2.8} \]

where the values P(zi|Parents(Zi)) correspond to the entries in the CPT for Zi.

A node within the network can be selected as an "output" node, representing a class label attribute. There may be more than one output node. Inference algorithms for learning can be applied to the network. The classification process can return a single class label or a probability distribution for the class label attribute, that is, a prediction of the probability of each class.

2.4.4 Training Bayesian belief networks

In learning or training a belief network, the network structure may be given in advance or inferred from the data. The network variables may be observable or hidden in all or some of the training samples. Hidden data are also referred to as missing values or incomplete data.

If the network structure is known and the variables are observable, learning the network is straightforward: it consists of computing the CPT entries, as is done for naive Bayesian classification.

When the network structure is given and some of the variables are hidden, a gradient descent method can be used to train the belief network. The objective is to learn the values of the CPT entries. Let S be a set of s training samples X1,X2,...,Xs. Let wijk be a CPT entry for the variable Yi = yij having the parents Ui = uik. For example, if wijk is the upper leftmost CPT entry of Figure 2.6b), then Yi is LungCancer; its value yij is yes; the list of parent nodes of Yi is Ui = {FamilyHistory, Smoker}; and the list of values of the parent nodes is uik = {yes, yes}. The wijk are viewed as weights, analogous to the weights in the hidden units of neural networks (Section 2.5). The weights wijk are initialized to random probability values. The gradient descent strategy performs greedy hill-climbing. At each iteration the weights are updated, and they eventually converge to a locally optimal solution.

The method aims to maximize P(S|H). Given the network structure and the initial wijk, the algorithm proceeds as follows:

1. Compute the gradients: For each i, j, k, compute

\[ \frac{\partial \ln P(S \mid H)}{\partial w_{ijk}} = \sum_{d=1}^{s} \frac{P(Y_i = y_{ij}, U_i = u_{ik} \mid X_d)}{w_{ijk}} \tag{2.9} \]

The probability on the right-hand side of equation (2.9) is calculated for each training sample Xd in S; for brevity, call it p. When the variables represented by Yi and Ui are hidden for some Xd, the corresponding probability p can be computed from the observed variables of the sample using standard algorithms for Bayesian network inference.

2. Take a small step in the direction of the gradient: The weights are updated by

\[ w_{ijk} \leftarrow w_{ijk} + (l)\,\frac{\partial \ln P(S \mid H)}{\partial w_{ijk}} \tag{2.10} \]

where l is the learning rate, representing the step size, and the partial derivative is computed from equation (2.9). The learning rate is a small constant.

3. Renormalize the weights: Because the weights wijk are probability values, they must lie between 0 and 1.0, and the sum of wijk over j must equal 1 for all i, k. These criteria are met by renormalizing the weights after they have been updated by equation (2.10).

2.5 Classification by backpropagation

Backpropagation is a neural network learning algorithm. Roughly speaking, a neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so that it can predict the correct class label of the input samples.

Neural networks require long training times and are therefore more suitable for applications where this is feasible. They require a number of parameters that are best determined empirically, such as the network topology or "structure". Neural networks have poor interpretability, since it is difficult to interpret the symbolic meaning behind the learned weights. These characteristics initially made neural networks less attractive for data mining.

The advantages of neural networks include their high tolerance to noisy data and their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for extracting rules from trained neural networks. These factors contribute to the usefulness of neural networks for classification in data mining.

The most popular neural network algorithm is the backpropagation algorithm, proposed in the 1980s. Section 2.5.1 introduces multilayer feed-forward networks, the type of neural network on which the backpropagation algorithm operates. Section 2.5.2 discusses defining a network topology. The backpropagation algorithm is described in Section 2.5.3. Rule extraction from trained neural networks is discussed in Section 2.5.4.

2.5.1 A multilayer feed-forward neural network

[Figure 2.7: inputs x1, x2, ..., xi enter the input layer; weights wij connect the input layer to the hidden layer (outputs oj); weights wjk connect the hidden layer to the output layer (outputs ok)]

Figure 2.7: A multilayer feed-forward neural network

The backpropagation algorithm performs learning on a multilayer feed-forward neural network. In the example of Figure 2.7, the inputs correspond to the attributes measured for each training sample and are fed simultaneously into a layer of units making up the input layer. The weighted outputs of these units are, in turn, fed simultaneously to a second layer of units, known as a hidden layer. The weighted outputs of the hidden layer can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice usually only one is used. The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction for the given samples.

The units in the hidden layers and the output layer are referred to as output units. The multilayer neural network shown in Figure 2.7 has two layers of output units; we therefore say that it is a two-layer neural network. Similarly, a network containing two hidden layers is called a three-layer neural network, and so on. The network is feed-forward if none of the weights cycles back to an input unit or to an output unit of a previous layer. It is fully connected if each unit provides input to every unit in the next layer. Given enough hidden units, multilayer feed-forward networks of linear threshold functions can closely approximate any function.

2.5.2 Defining a network topology

"How can I design the topology of a neural network?"

- Before training can begin, the user must decide on the network topology by specifying the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer.

- Normalizing the input values for each attribute measured in the training samples helps speed up the learning phase. Input values are normalized to fall in the range [0,1]. Discrete-valued attributes may be encoded so that there is one input unit per domain value. For example, if the domain of an attribute A is {a0,a1,a2}, we may assign three input units to represent A, say I0, I1, I2. Each unit is initialized to 0. If A=a0 then I0 is set to 1, if A=a1 then I1 is set to 1, and so on. One output unit may be used to represent two classes (1 represents one class, 0 represents the other). If there are more than two classes, one output unit per class is used.
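The two input preparations described above, scaling numeric attributes into [0,1] and giving each domain value of a discrete attribute its own input unit, can be sketched as follows. The helper names `min_max` and `one_hot` and the income values are assumptions made for illustration; they do not appear in the thesis.

```python
def min_max(values, lo=0.0, hi=1.0):
    """Rescale numeric values linearly into [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:                        # constant attribute: map everything to lo
        return [lo for _ in values]
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in values]

def one_hot(value, domain):
    """One input unit per domain value; only the matching unit is set to 1."""
    return [1 if value == d else 0 for d in domain]

incomes = [20000, 35000, 50000]               # hypothetical income values
scaled = min_max(incomes)

units = one_hot("a1", ["a0", "a1", "a2"])     # input units I0, I1, I2 for A = a1
```

Min-max scaling is only one way to normalize; it is used here because the text asks for inputs in the range [0,1].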

There are no clear rules for the best number of hidden layer units. Network design is a trial-and-error process, and the design affects the accuracy of the resulting trained network. The initial values of the weights also affect the resulting accuracy. Once a network has been trained, if its accuracy is not acceptable, the training process is usually repeated with a different network topology or a different set of initial weights.

2.5.3 Backpropagation

Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the "backward" direction, that is, from the output layer, through each hidden layer, down to the first hidden layer (hence the name backpropagation). Although convergence is not guaranteed, in general the weights eventually converge and the learning process stops. The algorithm is summarized in Figure 2.8.

Algorithm 2.5.1 (Backpropagation): Neural network learning for classification, using the backpropagation algorithm.
Input: The training samples, samples; the learning rate, l; a multilayer feed-forward network, network.
Output: A neural network trained to classify the samples.
Algorithm:
1) Initialize all weights and biases in network;
2) while terminating condition is not satisfied {
3)   for each training sample X in samples {
4)     // Propagate the inputs forward
5)     for each hidden or output layer unit j
6)       Ij = Σi wij Oi + θj; // compute the net input of unit j
7)     for each hidden or output layer unit j
8)       Oj = 1 / (1 + e^(-Ij)); // compute the output of each unit j
9)     // Backpropagate the errors
10)    for each unit j in the output layer
11)      Errj = Oj (1 - Oj)(Tj - Oj); // compute the error
12)    for each unit j in the hidden layers
13)      Errj = Oj (1 - Oj) Σk Errk wjk; // compute the error
14)    for each weight wij in network {
15)      Δwij = (l) Errj Oi; // weight increment
16)      wij = wij + Δwij; } // weight update
17)    for each bias θj in network {
18)      Δθj = (l) Errj; // bias increment
19)      θj = θj + Δθj; } // bias update
20) }}

Figure 2.8: The backpropagation algorithm

* Initializing the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or from -0.5 to 0.5). Each unit has a bias associated with it; the biases are likewise initialized to small random numbers.

* Each training sample X is processed by the following steps:

- Propagate the inputs forward: In this step, the net input and output of each unit in the hidden and output layers are computed.
+ The training sample is fed to the input layer.
+ The net input to each unit in the hidden and output layers is computed as a linear combination of its inputs. To illustrate this, a hidden or output layer unit is shown in Figure 2.9. The inputs to the unit are in fact the outputs of the units connected to it in the previous layer. To compute the net input to the unit, each input is multiplied by its corresponding weight, and the results are summed. Given a unit j in a hidden or output layer, the net input Ij to unit j is:

\[ I_j = \sum_i w_{ij} O_i + \theta_j \tag{2.11} \]
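where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i; and θj is the bias of unit j. The bias acts as a threshold that serves to vary the activity of the unit.

[Figure 2.9: the inputs x0, x1, ..., xn (input vector X) are multiplied by the weights w0, w1, ..., wn; the weighted sum is added to the bias and passed through the activation function to produce the output]

Figure 2.9: A hidden or output layer unit

In Figure 2.9, the inputs are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. A nonlinear activation function is then applied to the net input. Each unit in the hidden and output layers takes its net input and applies an activation function to it, as illustrated in Figure 2.9. The function symbolizes the activation of the neuron represented by the unit; the logistic, or sigmoid, function is used. Given the net input Ij to unit j, the output Oj of unit j is computed as:

\[ O_j = \frac{1}{1 + e^{-I_j}} \tag{2.12} \]

This function is also called a squashing function, since it maps a large input domain onto the smaller range [0,1]. The logistic function is nonlinear and differentiable, which allows the backpropagation algorithm to model classification problems that are not linearly separable.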

- Backpropagate the error: The error is propagated backward by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, the error Errj is computed by:

\[ Err_j = O_j (1 - O_j)(T_j - O_j) \tag{2.13} \]

where Oj is the actual output of unit j, and Tj is the true output, based on the known class label of the given training sample. Note that Oj(1 - Oj) is the derivative of the logistic function.

To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer unit j is:

\[ Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk} \tag{2.14} \]

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the error of unit k.

The weights and biases are updated to reflect the propagated errors. Weights are updated by equations (2.15) and (2.16) below, where Δwij is the change in weight wij:

\[ \Delta w_{ij} = (l)\,Err_j\,O_i \tag{2.15} \]
\[ w_{ij} = w_{ij} + \Delta w_{ij} \tag{2.16} \]

* What is 'l' in equation (2.15)?: The variable l is the learning rate, a constant with a value between 0 and 1.0. Backpropagation learns using a method of gradient descent to search for a set of weights that can model the given classification problem so as to minimize the mean squared distance between the network's class predictions and the actual class labels of the samples. The learning rate helps to avoid getting stuck at a local minimum in decision space (i.e., where the weights appear to converge but are not the optimal solution) and helps to find the global minimum. If the learning rate is too small, learning will be very slow. If the learning rate is too large, oscillation between inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far.

Biases are updated by equations (2.17) and (2.18) below, where Δθj is the change in bias θj:

\[ \Delta\theta_j = (l)\,Err_j \tag{2.17} \]
\[ \theta_j = \theta_j + \Delta\theta_j \tag{2.18} \]

Here we update the weights and biases after the presentation of each sample. This is referred to as case updating. Alternatively, the weight and bias increments can be accumulated in variables, so that the weights and biases are updated after all of the samples in the training set have been presented. This latter strategy is called epoch updating, where one iteration through the training set is an epoch. In theory, the mathematical derivation of backpropagation uses epoch updating, yet in practice case updating is more common, since it tends to yield more accurate results.

* Terminating condition: Training stops when one of the following conditions holds:
1. All Δwij in the previous epoch were smaller than or equal to a specified threshold.
2. The percentage of samples misclassified in the previous epoch is below some threshold.
3. A prespecified number of epochs has expired.
In practice, several hundred thousand epochs may be required before the weights converge.

[Figure 2.10: the inputs x1, x2, x3 enter input units 1, 2, 3; weights w14, w15, w24, w25, w34, w35 connect them to hidden units 4 and 5; weights w46 and w56 connect the hidden units to output unit 6]

Figure 2.10: An example of a multilayer feed-forward neural network

Example 2.5: Sample calculations for learning by the backpropagation algorithm. Figure 2.10 shows a multilayer feed-forward neural network. The initial weight and bias values of the network are given in Table 2.3, along with the first training sample, X = (1,0,1). The calculations below use a learning rate l = 0.9 and a true output of 1 for unit 6.

Table 2.3: Initial input, weight, and bias values

x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   θ4    θ5   θ6
1   0   1   0.2  -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1

This example shows the backpropagation calculations for the first training sample X. The sample is fed into the network, and the net input and output of each unit are computed. The results are shown in Table 2.4.

Table 2.4: The net input and output calculations

Unit j  Net input Ij                                   Output Oj
4       0.2 + 0 - 0.5 - 0.4 = -0.7                     1/(1 + e^0.7) = 0.332
5       -0.3 + 0 + 0.2 + 0.2 = 0.1                     1/(1 + e^-0.1) = 0.525
6       (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105    1/(1 + e^0.105) = 0.474

The error of each unit is computed and propagated backward. The error values are shown in Table 2.5. The weight and bias updates are shown in Table 2.6.

Table 2.5: Calculation of the error at each node

Unit j  Errj
6       (0.474)(1 - 0.474)(1 - 0.474) = 0.1311
5       (0.525)(1 - 0.525)(0.1311)(-0.2) = -0.0065
4       (0.332)(1 - 0.332)(0.1311)(-0.3) = -0.0087

Table 2.6: Calculations for weight and bias updating

Weight or bias  New value
w46             -0.3 + (0.9)(0.1311)(0.332) = -0.261
w56             -0.2 + (0.9)(0.1311)(0.525) = -0.138
w14             0.2 + (0.9)(-0.0087)(1) = 0.192
w15             -0.3 + (0.9)(-0.0065)(1) = -0.306
w24             0.4 + (0.9)(-0.0087)(0) = 0.4
w25             0.1 + (0.9)(-0.0065)(0) = 0.1
w34             -0.5 + (0.9)(-0.0087)(1) = -0.508
w35             0.2 + (0.9)(-0.0065)(1) = 0.194
θ6              0.1 + (0.9)(0.1311) = 0.218
θ5              0.2 + (0.9)(-0.0065) = 0.194
θ4              -0.4 + (0.9)(-0.0087) = -0.408
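One forward and backward pass for the network of Figure 2.10 can be checked by evaluating equations (2.11) to (2.18) directly on the initial values of Table 2.3. This is a sketch under stated assumptions: the learning rate 0.9 and the true output 1 for unit 6 are read off the arithmetic of the example, and the dictionary layout is a choice made here for illustration.

```python
import math

def sigmoid(net):
    """Logistic function of equation (2.12)."""
    return 1.0 / (1.0 + math.exp(-net))

x = {1: 1.0, 2: 0.0, 3: 1.0}                       # training sample X = (1, 0, 1)
w = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
     (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}
l, target = 0.9, 1.0                               # assumed learning rate and T6

# Forward pass, equations (2.11) and (2.12): hidden units first, then the output unit.
out = dict(x)
for j, inputs in ((4, (1, 2, 3)), (5, (1, 2, 3)), (6, (4, 5))):
    net = sum(w[(i, j)] * out[i] for i in inputs) + theta[j]
    out[j] = sigmoid(net)

# Backward pass, equations (2.13) and (2.14).
err = {6: out[6] * (1 - out[6]) * (target - out[6])}
for j in (4, 5):
    err[j] = out[j] * (1 - out[j]) * err[6] * w[(j, 6)]

# Case updating, equations (2.15) to (2.18), using the old weights throughout.
new_w = {(i, j): w_ij + l * err[j] * out[i] for (i, j), w_ij in w.items()}
new_theta = {j: t + l * err[j] for j, t in theta.items()}
```

The hidden-unit errors are computed from the old weights w46 and w56 before any weight is changed, which is why the updates are collected in `new_w` rather than written back in place.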

A number of variations of and alternatives to the backpropagation algorithm have been proposed for classification in neural networks. These include dynamic adjustment of the network topology, of the learning rate, or of other parameters, as well as the use of different error functions.

2.5.4 Backpropagation and interpretability

A major disadvantage of neural networks lies in their knowledge representation. Knowledge acquired in the form of a network of units connected by weighted links is difficult for humans to interpret. This has motivated research into extracting the knowledge embedded in trained neural networks and representing that knowledge symbolically. The methods include rule extraction from networks and sensitivity analysis.

Various algorithms for rule extraction have been proposed. The methods typically impose restrictions on the procedure used to train the given neural network, on the network topology, and on the discretization of the input values.

Fully connected networks are difficult to understand. Hence, the first step toward extracting rules from a neural network is often network pruning. This consists of removing weighted links whose removal does not decrease the classification accuracy of the given network.

Once the trained network has been pruned, several approaches then cluster link values, unit values, or activation values. In one such method, for example, clustering is used to find the set of common activation values for each hidden unit in a given trained two-layer neural network (Figure 2.11). The combinations of these activation values for each hidden unit are analyzed. Rules are derived that relate combinations of activation values to the corresponding output unit values. Similarly, the sets of input values and activation values are studied to derive rules describing the relationship between the input layer and the hidden layer. Finally, the two sets of rules can be combined into IF-THEN rules. Other algorithms may derive rules of other forms, including M-of-N rules (where M out of N given conditions in the rule antecedent must be true for the rule consequent to apply), decision trees with M-of-N tests, fuzzy rules, and finite automata.

[Figure 2.11 network diagram: input units I1 to I7, hidden units H1 to H3, output units O1 and O2]

Identify the sets of common activation values for each hidden node Hi:
for H1: (-1, 0, 1)
for H2: (0, 1)
for H3: (-1, 0.24, 1)

Derive rules relating the common activation values to the output nodes Oj:
IF (a2 = 0 AND a3 = -1) OR (a1 = -1 AND a2 = 1 AND a3 = -1) OR (a1 = -1 AND a2 = 0 AND a3 = 0.24) THEN O1 = 1, O2 = 0 ELSE O1 = 0, O2 = 1

Derive rules relating the input nodes Ii to the output nodes Oj:
IF (I2 = 0 AND I7 = 0) THEN a2 = 0
IF (I4 = 1 AND I6 = 1) THEN a3 = -1
IF (I5 = 0) THEN a3 = -1
...

Obtain rules relating the inputs to the output class:
IF (I2 = 0 AND I7 = 0 AND I4 = 1 AND I6 = 1) THEN class = 1
IF (I2 = 0 AND I7 = 0 AND I5 = 0) THEN class = 1

Figure 2.11: Rules can be extracted from trained neural networks

* Sensitivity analysis is used to assess the impact that a given input variable has on a network output. The input to that variable is varied while the remaining input variables are held fixed at some value, and the resulting changes in the network output are monitored. The knowledge gained from this form of analysis can be represented as rules such as "IF X decreases by 5% THEN Y increases by 8%".

2.6 Classification based on association

"Can association rule mining be used for classification?" Association rule mining is an important and highly practical area of data mining research. Data mining techniques that apply association rule mining to classification problems have been developed. In this section, we study classification based on association.

One method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.

Let D be the training data and Y the set of all classes in D. The algorithm maps categorical attributes to consecutive positive integers. Continuous attributes are discretized and then mapped in the same way. Each data sample d in D is then represented by a set of (attribute, integer value) pairs, called items, and a class label y. Let I be the set of all items in D. A class association rule (CAR) has the form condset ⇒ y, where condset is a set of items (condset ⊆ I) and y ∈ Y. Such rules are represented by ruleitems of the form <condset, y>.

A CAR has confidence c if c% of the samples in D that contain condset belong to class y. A CAR has support s if s% of the samples in D contain condset and belong to class y. The support count of a condset (condsupCount) is the number of samples in D that contain the condset.
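The support and confidence of a class association rule, as defined above, can be computed directly from two counts. The sketch below is illustrative only: the function name, the variable names, and the small data set are hypothetical, and the results are returned as fractions (multiply by 100 for the percentages used in the text).

```python
def car_support_confidence(samples, condset, label):
    """Return (support, confidence) of the rule condset => label.

    samples is a list of (items, class_label) pairs, where items is a set
    of (attribute, integer value) pairs.
    """
    # Number of samples that contain every item of the condset.
    cond_count = sum(1 for items, _ in samples if condset <= items)
    # Number of those samples that also carry the class label.
    rule_count = sum(
        1 for items, y in samples if condset <= items and y == label
    )
    support = rule_count / len(samples)
    confidence = rule_count / cond_count if cond_count else 0.0
    return support, confidence


# Hypothetical training data: attributes already mapped to integers.
D = [
    ({("age", 1), ("income", 2)}, "yes"),
    ({("age", 1), ("income", 1)}, "yes"),
    ({("age", 1), ("income", 2)}, "no"),
    ({("age", 2), ("income", 2)}, "no"),
]

support, confidence = car_support_confidence(D, {("age", 1)}, "yes")
print(support, confidence)
```

Here three of the four samples contain the condset and two of those are labeled "yes", so the rule covers half of D and holds for two thirds of the samples that match its condset.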

The rule support count of a ruleitem (rulesupCount) is the number of samples in D that contain the condset and are labeled with class y. Ruleitems that satisfy the minimum support are frequent ruleitems. If a set of ruleitems share the same condset, the rule with the highest confidence is selected as the possible rule (PR) representing the set. A rule that satisfies the minimum confidence is called accurate.

"How does associative classification work?" First, the associative classification method finds the set of all PRs that are both frequent and accurate. These are the class association rules (CARs). A ruleitem whose condset contains k items is a k-ruleitem. The algorithm uses an iterative approach in which ruleitems, rather than itemsets, are processed. It scans the database, searching for the frequent k-ruleitems for k = 1, 2, ..., until all frequent k-ruleitems have been found. One scan is made for each value of k, and the k-ruleitems are used to explore the (k+1)-ruleitems.

In the first scan of the database, the support counts of the 1-ruleitems are determined and the frequent 1-ruleitems are retained. The frequent 1-ruleitems, referred to as the set F1, are used to generate the candidate 2-ruleitems, C2. Knowledge of the properties of frequent ruleitems is used to prune candidate ruleitems that cannot be frequent: all nonempty subsets of a frequent ruleitem must also be frequent. The database is scanned a second time to compute the support count of each candidate, so that the frequent 2-ruleitems (F2) can be determined. This process repeats, with Fk used to generate Ck+1, until no more frequent ruleitems are found. The frequent ruleitems that satisfy the minimum confidence form the set of CARs. Pruning may be applied to this rule set.

The second step of the associative classification method processes the generated CARs in order to build the classifier. Because the total number of rule subsets that would have to be examined to determine the most accurate set of rules can be huge, a heuristic method is used. A precedence ordering among rules is defined, in which a rule ri has higher precedence than a rule rj (written ri ≻ rj) if:

(1) the confidence of ri is greater than that of rj, or
(2) the confidences are the same, but ri has greater support, or
(3) the confidences and supports of ri and rj are the same, but ri was generated earlier than rj.

In general, the algorithm selects a set of high-precedence CARs that cover the samples in D. The classifier maintains the selected rules in order of precedence, from high to low. When a new sample is classified, the first rule that the sample satisfies is used to classify it. The classifier also contains a default rule with the lowest precedence, which specifies a default class for any new sample that is not satisfied by any other rule in the classifier.

Association rule mining is therefore an important strategy for generating accurate and scalable classifiers.

2.7 Other classification methods

In this section, we briefly describe several other classification methods: k-nearest neighbors, case-based reasoning, genetic algorithms, rough sets, and fuzzy sets. In commercial data mining systems, these methods are generally used less often for classification than the methods described in the preceding sections. For example, nearest neighbor classification stores all training samples, which causes difficulties when learning from very large data sets; and many applications of case-based reasoning, genetic algorithms, and rough sets to classification are still in the prototype phase. Nevertheless, these methods are growing in popularity, and we examine each of them in turn below.

2.7.1 k-nearest neighbor classifiers

Nearest neighbor classifiers are based on learning by analogy. The training samples are described by n numeric attributes, so each sample represents a point in an n-dimensional space. In this way, all of the training samples are stored in an n-dimensional pattern space. Given an unknown sample, a k-nearest neighbor classifier searches the pattern space for the k training samples that are closest to the unknown sample. These k training samples are the k "nearest neighbors" of the unknown sample. "Closeness" is defined in terms of Euclidean distance, where the Euclidean distance between two points X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is:
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2.19)$$
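The neighbor search with the Euclidean distance (2.19) can be sketched in a few lines. This is a minimal illustration rather than an efficient implementation: it scans every training sample, it assumes that the unknown sample receives the most common class among its k neighbors, and the function names and sample points are hypothetical.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two points of equal dimension, as in (2.19)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_classify(train, unknown, k):
    """Return the most common class among the k training samples nearest to unknown.

    train is a list of (point, class_label) pairs.
    """
    neighbors = sorted(train, key=lambda s: euclidean(s[0], unknown))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training samples.
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.3), "A"),
    ((4.0, 4.2), "B"),
    ((4.1, 3.9), "B"),
]

print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (4.0, 4.0), k=3))
```

Sorting the whole training set costs O(N log N) per query, which is one reason indexing techniques matter when the number of training samples is large.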

The unknown sample is assigned the most common class among its k nearest neighbors. When k = 1, the unknown sample is assigned the class of the training sample that is closest to it in pattern space.

Nearest neighbor classifiers are instance-based, since they store all of the training samples. Efficient indexing techniques are used when the number of training samples is very large. Unlike decision tree induction and backpropagation, nearest neighbor classifiers assign equal weight to every attribute. This may cause confusion when the data contain many irrelevant attributes.

Nearest neighbor classifiers can also be used for prediction, that is, to return a real-valued prediction for a given unknown sample. In this case, the classifier returns the average of the real-valued labels associated with the k nearest neighbors of the unknown sample.

2.7.2 Case-based reasoning

Case-based reasoning (CBR) classifiers are instance-based. Unlike k-nearest neighbor classifiers, which store training samples as points in Euclidean space, the samples or "cases" stored by CBR are complex symbolic descriptions. Commercial applications of CBR include problem resolution for customer service help desks, for example, where the cases describe product-related diagnostic problems. CBR has also been applied to areas such as engineering and law, where the cases are technical designs or legal rulings, respectively.

When given a new case to classify, a case-based reasoner first checks whether an identical training case exists. If one is found, the solution accompanying that case is returned. If no identical case is found, the case-based reasoner searches for training cases whose components are similar to those of the new case. Conceptually, these training cases can be regarded as neighbors of the new case. If cases are represented as graphs, this involves searching for subgraphs that are similar to subgraphs within the new case. The case-based reasoner tries to combine the solutions of the neighboring training cases in order to propose a solution for the new case. If incompatibilities arise among the individual solutions, backtracking to search for other solutions may be necessary. The case-based reasoner may use background knowledge and problem-solving strategies in order to propose a feasible combined solution.

The challenges in case-based reasoning include finding a good similarity metric (for example, for matching subgraphs), developing efficient techniques for indexing training cases, and developing methods for combining solutions.

2.7.3 Genetic algorithms

Genetic algorithms attempt to incorporate ideas from natural evolution. In general, genetic learning starts as follows. An initial population is created, consisting of randomly generated rules. Each rule is represented by a string of bits. For example, suppose that the samples in a given training set are described by two Boolean attributes, A1 and A2, and that there are two classes, C1 and C2. The rule "IF A1 and not A2 THEN C2" is encoded as the bit string "100", where the two leftmost bits represent the attributes A1 and A2 and the rightmost bit represents the class. Similarly, the rule "IF not A1 and not A2 THEN C1" is encoded as "001". If an attribute has k values, where k > 2, then k bits are used to encode the attribute's values. Classes can be encoded in a similar way.
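The bit-string encoding of the two example rules can be sketched as follows. The function names are hypothetical, and the sketch only covers the case described above: two Boolean attributes and two classes, with C1 encoded as 1 and C2 as 0.

```python
# Class-to-bit mapping implied by the two example rules.
CLASS_BIT = {"C1": "1", "C2": "0"}
BIT_CLASS = {bit: name for name, bit in CLASS_BIT.items()}


def encode_rule(a1, a2, cls):
    """Encode 'IF [not] A1 and [not] A2 THEN cls' as a 3-bit string."""
    return f"{int(a1)}{int(a2)}{CLASS_BIT[cls]}"


def decode_rule(bits):
    """Turn a 3-bit string back into a readable rule."""
    a1 = "A1" if bits[0] == "1" else "not A1"
    a2 = "A2" if bits[1] == "1" else "not A2"
    return f"IF {a1} and {a2} THEN {BIT_CLASS[bits[2]]}"


print(encode_rule(True, False, "C2"))   # 100
print(encode_rule(False, False, "C1"))  # 001
print(decode_rule("100"))
```

Because every rule has the same fixed-length representation, operators that swap or flip bits always yield a string that can be decoded back into a rule.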

Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules in the current population, as well as offspring of these rules. The fitness of a rule is assessed by its classification accuracy on a set of training samples.

Offspring are created by applying genetic operators such as crossover and mutation. In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule's string are inverted.

The process of generating new populations from prior populations of rules continues until a population P "evolves" in which every rule satisfies a prespecified fitness threshold.

Genetic algorithms are easy to parallelize and have been used for classification as well as for other optimization problems. In data mining, they may also be used to evaluate the fitness of other algorithms.

2.7.4 Rough set theory

Rough set theory can be used for classification to discover structural relationships within imprecise or noisy data. It applies to discrete-valued attributes; continuous-valued attributes must therefore be discretized before it is used.

Rough set theory is based on the establishment of equivalence classes within the given training data. All of the data samples forming an equivalence class are indiscernible, that is, the samples are identical with respect to the attributes describing the data. In real-world data, it is common that some classes cannot be distinguished in terms of the available attributes. Rough sets can be used to approximately, or "roughly", define such classes. A rough set definition for a given class C is approximated by two sets: a lower approximation of C and an upper approximation of C. The lower approximation of C consists of all of the data samples that, based on the knowledge of the attributes, certainly belong to C without ambiguity. The upper approximation of C consists of all of the data samples that, based on the knowledge of the attributes, cannot be described as not belonging to C. The lower and upper approximations of a class C are shown in Figure 2.12, where each rectangular region represents an equivalence class. Decision rules can be generated for each class, and a decision table is used to represent the rules.
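The lower and upper approximations can be obtained by grouping samples into equivalence classes and checking the class labels inside each group. The sketch below is a minimal illustration with hypothetical names and data; each equivalence class is identified by the tuple of attribute values that its samples share.

```python
from collections import defaultdict


def rough_approximations(samples, target_class):
    """Return (lower, upper) approximations of target_class.

    samples is a list of (attribute_values, class_label) pairs. Both results
    are sets of attribute-value tuples, one tuple per equivalence class.
    """
    # Samples with identical attribute values are indiscernible.
    equivalence = defaultdict(list)
    for attrs, label in samples:
        equivalence[attrs].append(label)

    lower, upper = set(), set()
    for attrs, labels in equivalence.items():
        if all(label == target_class for label in labels):
            lower.add(attrs)   # certainly in the class
        if any(label == target_class for label in labels):
            upper.add(attrs)   # cannot be ruled out of the class
    return lower, upper


# Hypothetical discretized data with two attributes.
samples = [
    (("high", "yes"), "C"),
    (("high", "yes"), "C"),
    (("low", "yes"), "C"),
    (("low", "yes"), "other"),
    (("low", "no"), "other"),
]

lower, upper = rough_approximations(samples, "C")
print(sorted(lower), sorted(upper))
```

The equivalence class ("low", "yes") contains samples of both classes, so it belongs to the upper approximation only; the difference between the two approximations is the region in which the attributes cannot decide membership.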

Figure 2.12: A rough set approximation of the set of samples of class C

Rough sets can also be used for feature reduction (where attributes that do not contribute to the classification of the given training data can be identified and removed) and for relevance analysis (where the contribution or significance of each attribute is assessed with respect to the classification task). The problem of finding the minimal subsets (reducts) of attributes that can describe all of the concepts in the given data set is NP-hard. However, algorithms that reduce the amount of computation have been proposed. One method, for example, uses a discernibility matrix that stores the differences between attribute values for each pair of data samples. This matrix, rather than the entire training set, is then searched to detect redundant attributes.

2.7.5 Fuzzy set approaches

Rule-based systems for classification have the disadvantage that they require sharp cutoffs for continuous attributes. For example, consider rule (2.20) below for approving customer loan applications. The rule essentially says that applications from customers who have been employed for at least two years and who have an income above $50K are approved.

IF (years employed ≥ 2) ∧ (income > 50K) THEN decision = approved (2.20)

By rule (2.20), a customer who has been employed for at least two years will receive the loan if her income is $51K, but not if it is $50K. Such a harsh threshold seems unfair. Fuzzy logic overcomes this drawback by defining "fuzzy" thresholds or boundaries. Rather than having a precise cutoff between sets or categories, fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership of a given value in a given category. With fuzzy logic, we can therefore capture the notion that an income of $50K is, to some degree, high, although not as high as an income of $51K.

Fuzzy logic is useful for data mining systems that perform classification. It offers the advantage of working at a high level of abstraction. In general, the use of fuzzy logic in rule-based systems involves the following.

Attribute values are converted to fuzzy values. Figure 2.13 shows how the values of the continuous attribute income are mapped to the discrete categories {low, medium, high}, and how the fuzzy membership or truth values are calculated. Fuzzy logic systems provide graphical tools to assist users in this step.

For a given new sample, more than one fuzzy rule may apply. Each applicable rule contributes a vote for membership in the categories. Typically, the truth values for each predicted category are summed.
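The mapping from a crisp income to degrees of membership can be sketched with simple piecewise-linear functions. The breakpoints below (20K, 40K, 60K) and the function names are assumptions chosen for illustration only; they are not taken from the thesis.

```python
def ramp_up(x, lo, hi):
    """Rise linearly from 0.0 at lo to 1.0 at hi, clamped outside that range."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def income_memberships(income):
    """Return fuzzy membership degrees of an income in {low, medium, high}.

    Assumed breakpoints: 'low' fades out between 20K and 40K,
    'high' fades in between 40K and 60K, 'medium' takes the remainder.
    """
    high = ramp_up(income, 40_000, 60_000)
    low = 1.0 - ramp_up(income, 20_000, 40_000)
    medium = max(0.0, 1.0 - low - high)
    return {"low": low, "medium": medium, "high": high}


for income in (30_000, 50_000, 51_000):
    print(income, income_memberships(income))
```

Under these assumed breakpoints, an income of $50K is high to degree 0.5 and an income of $51K is high to a slightly larger degree, which is the graded behavior that the sharp threshold in rule (2.20) cannot express.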
[Figure 2.13 plot: fuzzy membership (0 to 1.0) against income (10K to 70K), with membership curves for low, somewhat low, medium, borderline high, and high]

Figure 2.13: Fuzzy values for income

The sums obtained above are combined into a single value that is returned by the system. This can be done by weighting each category by its truth sum and multiplying by the mean truth value of each category. The calculations involved may be more complex, depending on the complexity of the fuzzy membership graphs.

Fuzzy logic systems have been used for classification in many areas, such as health care and finance.

2.8 Classifier accuracy
[Figure 2.14 diagram: the data are split into a training set, from which the classifier is derived, and a test set, on which its accuracy is estimated]

Figure 2.14: Estimating classifier accuracy with the holdout method

Estimating classifier accuracy is important. The evaluation data are data that were not used to train the classifier, so the estimated accuracy indicates how well the classifier will label future data. For example, after training a classifier on sales data to predict customer shopping behavior, we need an estimate of how accurately the classifier can predict the shopping behavior of future customers. Such accuracy estimates also help in comparing different classifiers. Section 2.8.1 discusses techniques for estimating classifier accuracy, such as the holdout method and k-fold cross-validation. Section 2.8.2 describes two strategies for increasing classifier accuracy: bagging and boosting. Section 2.8.3 covers issues related to classifier selection.

2.8.1 Estimating classifier accuracy

Holdout and cross-validation are two common techniques for estimating classifier accuracy, both based on randomly sampled partitions of the given data.

In the holdout method, the given data are randomly partitioned into two independent sets: a training set and a test set. Typically, two thirds of the data are allocated to the training set and the remaining one third to the test set. The training set is used to derive the classifier, whose accuracy is then estimated on the test set (Figure 2.14). The estimate is pessimistic, because only a portion of the initial data is used to derive the classifier. Random subsampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is the average of the accuracies obtained in each iteration.

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets ("folds") S1, S2, ..., Sk of approximately equal size. Training and testing are performed k times. In iteration i, the subset Si serves as the test set, and the remaining subsets are used together to train the classifier. That is, the classifier of the first iteration is trained on the subsets S2, S3, ..., Sk and tested on S1; the classifier of the second iteration is trained on the subsets S1, S3, ..., Sk and tested on S2; and so on. The accuracy estimate is the total number of correct classifications over the k iterations, divided by the total number of samples in the initial data. In stratified cross-validation, the folds are stratified so that the class distribution of the samples in each fold is approximately the same as in the initial data.

In general, stratified 10-fold cross-validation is recommended for estimating classifier accuracy, even if the available computing power would allow more folds. Using these techniques to estimate classifier accuracy increases the overall computation time, but it is useful for selecting among classifiers.

2.8.2 Increasing classifier accuracy
[Figure 2.15 diagram: classifiers C_1, C_2, ..., C_T are learned from the data; for a new data sample, their votes are combined into a class prediction]

Figure 2.15: Increasing classifier accuracy

In the previous section, we studied methods for estimating classifier accuracy. In Section 2.3.2, we saw that pruning can be applied to decision trees to help improve the accuracy of the resulting trees. Bagging (or bootstrap aggregation) and boosting are two more general techniques (Figure 2.15). Each combines a series of T learned classifiers, C1, C2, ..., CT, to create an improved composite classifier, C*.

"How do these methods work?" Suppose that you are a patient and need a diagnosis based on your symptoms. Instead of asking one doctor, you may choose to ask several. If a certain diagnosis occurs more often than the others, you may choose it as the final or best diagnosis. Now replace each doctor by a classifier, and you have the intuition behind bagging. Suppose instead that you assign weights to the value or "worth" of each doctor's diagnosis, based on the accuracy of the diagnoses they have made before. The final diagnosis is then a combination of the weighted diagnoses. This is the essence of boosting. We now take a closer look at these two techniques.

Given a set S of s samples, bagging works as follows. In iteration t (t = 1, 2, ..., T), a training set St is sampled with replacement from the original set of samples S. Because sampling with replacement is used, some of the samples of S may not appear in St, while others may appear more than once. A classifier Ct is learned for each training set St. To classify an unknown sample X, each classifier Ct returns its class prediction, which counts as one vote. The bagged classifier C* counts the votes and assigns to X the class with the most votes. Bagging can also be applied to the prediction of continuous values by taking the average of the votes rather than the majority.

In boosting, weights are assigned to each training sample. A series of classifiers is learned. After a classifier Ct has been learned, the weights are updated so that the subsequent classifier Ct+1 "pays more attention" to the misclassification errors made by Ct. The final boosted classifier C* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy. The boosting algorithm can be extended to the prediction of continuous values.

2.8.3 Is accuracy enough to judge a classifier?

In addition to accuracy, classifiers can be compared with respect to their speed, robustness (for example, accuracy on noisy data), scalability, and interpretability. Scalability can be assessed by measuring the number of I/O operations that a given classification algorithm requires on data sets of increasing size.

In classification problems, it is commonly assumed that all objects are uniquely classifiable, that is, that each training sample belongs to exactly one class. As discussed above, classification algorithms can then be compared according to their accuracy. However, because of the diversity of data in large databases, it is not always reasonable to assume that all objects are uniquely classifiable. Rather, it is more plausible to assume that an object may belong to more than one class. Returning a probability distribution over the classes is then more useful than returning a single class label. Accuracy measures can then use a second-guess heuristic, whereby a class prediction is judged correct if it agrees with the first or the second most probable class. Although this does take into account, to some degree, that the classification of objects is not unique, it is not a complete solution.

2.9 Conclusion

Chapter 2 has presented the concept, the steps, and the methods of classification, together with methods for estimating classification accuracy and for increasing it.

CHAPTER 3: CLUSTERING TECHNIQUES IN DATA MINING

3.1 What is clustering?

The process of grouping a set of objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

Cluster analysis is an important human activity. Early in childhood, one learns how to distinguish between cats and dogs, or between animals and plants, by continuously refining subconscious classification schemes. Cluster analysis is widely used in numerous applications, including pattern recognition, data analysis, image processing, and market research. By clustering, one can identify dense and sparse regions and thereby discover overall distribution patterns and interesting correlations among data attributes.

In business, clustering can help market researchers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, to categorize genes with similar functionality, and to gain insight into structures inherent in populations. Clustering can also be used to identify areas of similar land use in an earth observation database, to identify groups of automobile insurance policy holders with a high average claim cost, and to identify groups of houses in a city according to house type, value, and geographical location. It can also help classify documents on the Web for information discovery.

As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of the data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it can serve as a preprocessing step for other algorithms, such as classification and characterization, which then operate on the detected clusters.

Data clustering is a young scientific discipline under vigorous development. There are a large number of research papers in many conference proceedings, mostly in the fields of data mining, statistics, machine learning, spatial databases, biology, and business, with varying emphases and techniques. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research.

As a branch of statistics, cluster analysis has been studied extensively for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, cluster analysis is usually a form of unsupervised learning. Unlike classification, clustering does not rely on predefined classes and class-labeled training samples. For this reason, it is a form of learning by observation rather than learning by examples. In conceptual clustering, a group of objects forms a class only if it can be described by a concept. This differs from conventional clustering, which measures similarity based on geometric distance. Conceptual clustering consists of two components: (1) it discovers the appropriate classes, and (2) it forms descriptions for each class, as in classification. The guiding principle remains that the similarity within a class should be high and the similarity between classes should be low.

In data mining, research has concentrated on increasingly efficient and effective clustering methods for large databases. Active research themes focus on the scalability of clustering methods, the effectiveness of methods for clustering data of complex shapes and types, clustering techniques for high-dimensional data, and methods for clustering mixed numerical and categorical data in large databases.

Clustering is a challenging field of research in which its potential applications pose their own special requirements. The following are typical requirements of clustering in data mining:

1. Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. How can we develop clustering algorithms that are highly scalable in large databases?

2. Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

3. Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters of similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.

4. Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters for the cluster analysis (such as the desired number of clusters). The clustering results are often very sensitive to the input parameters. Many parameters are hard to determine, especially for data sets containing high-dimensional objects. This not only burdens users but also makes the quality of the clustering difficult to control.

5. Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Many clustering algorithms are sensitive to such data and may produce clusters of poor quality.

6. Insensitivity to the order of input records: Many clustering algorithms are sensitive to the order of the input data; for example, the same data set, when presented in different orders to the same algorithm, may produce dramatically different clusters. It is therefore important to develop algorithms that are insensitive to the order of the input.

7. High dimensionality: A database or a data warehouse can contain many dimensions or attributes. Many clustering algorithms perform very well on low-dimensional data, involving only two or three dimensions. Human beings are good at judging the quality of a clustering for up to three dimensions. Clustering data objects in a high-dimensional space is challenging, especially considering that data in such a space can be very sparse and highly skewed.

8. Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic teller machines (ATMs) in a city. To decide on this, you may cluster households while treating the city's rivers and highway network and the customer requirements of each region as constraints. The task is to find groups of data that have good clustering quality and satisfy the various constraints.

9. Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. Clustering may need to be tied to specific semantic interpretations and specific applications. It is important to study how the goal of an application influences the choice of clustering methods.

With these requirements in mind, we study cluster analysis as follows. First, we study the different types of data and how they influence clustering methods. Second, we present a general categorization of clustering methods. We then study each clustering method in detail, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. We also examine clustering in high-dimensional space and discuss the differences among the various methods.

3.2 Types of data in cluster analysis

In this section, we study the types of data that often occur in cluster analysis and how to preprocess them for such an analysis. Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries, and so on. Main-memory-based clustering algorithms operate on one of the following two data structures:

1. Data matrix (or object-by-variable structure): This represents n objects, such as persons, by p variables (also called measurements or attributes), such as age, height, gender, and so on. The structure has the form of a relational table, or an n × p matrix (n objects × p variables), as in (3.1).
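$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix} \qquad (3.1)$$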

2. Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities (in space, in time, and so on) for all pairs of the n objects. It is usually represented by an n × n table, as in (3.2).
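$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix} \qquad (3.2)$$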

where d(i, j) is the measured difference or dissimilarity between objects i and j. Since d(i, j) = d(j, i) and d(i, i) = 0, we obtain the matrix in (3.2). Measures of dissimilarity are discussed throughout this section.

The data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are given in the form of a data matrix, they can be transformed into a dissimilarity matrix before such clustering algorithms are applied.

Clusters of objects are computed based on their similarities or dissimilarities. In this section, we first discuss how the quality of a clustering can be assessed based on correlation coefficients, which can be converted to dissimilarity or similarity coefficients. We then discuss how to compute the dissimilarity of objects described by interval-scaled variables, by binary variables, by nominal, ordinal, and ratio-scaled variables, or by combinations of these variable types.

3.2.1 Dissimilarities and similarities: measuring the quality of clustering

Measures of dissimilarity or similarity coefficients are used to assess the quality of a clustering. The dissimilarity d(i, j) is a nonnegative number that is close to 0 when i and j are near each other and becomes larger the more they differ.

Dissimilarities can be obtained from simple subjective ratings made by a set of observers or experts on how much certain objects differ from each other. Dissimilarities can also be computed from correlation coefficients. Given n objects for clustering, the Pearson product-moment correlation between two variables f and g is defined in (3.3), where f and g are variables describing the objects, m_f and m_g are the mean values of f and g, x_if is the value of f for the i-th object, and x_ig is the value of g for the i-th object.
$$R(f, g) = \frac{\sum_{i=1}^{n} (x_{if} - m_f)(x_{ig} - m_g)}{\sqrt{\sum_{i=1}^{n} (x_{if} - m_f)^2}\,\sqrt{\sum_{i=1}^{n} (x_{ig} - m_g)^2}} \qquad (3.3)$$
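The correlation in (3.3) translates almost term by term into code. The sketch below uses a hypothetical function name and assumes that both variables are given as equally long lists of numbers with nonzero variance; a constant variable would make the denominator zero.

```python
import math


def pearson(f, g):
    """Pearson product-moment correlation between two variables, as in (3.3)."""
    n = len(f)
    m_f = sum(f) / n
    m_g = sum(g) / n
    # Numerator: sum of products of the deviations from the means.
    numerator = sum((x - m_f) * (y - m_g) for x, y in zip(f, g))
    # Denominator: product of the square roots of the summed squared deviations.
    denominator = math.sqrt(sum((x - m_f) ** 2 for x in f)) * math.sqrt(
        sum((y - m_g) ** 2 for y in g)
    )
    return numerator / denominator


# Hypothetical measurements of two variables over four objects.
height = [1.0, 2.0, 3.0, 4.0]
weight = [2.0, 4.0, 6.0, 8.0]
print(pearson(height, weight))
```

The result always lies between -1 and 1, which is what allows a correlation to be rescaled into a coefficient between 0 and 1.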

The conversion formula (3.4) is used to compute the dissimilarity coefficient d(f, g) from the correlation coefficient R(f, g):

$$d(f, g) = \frac{1 - R(f, g)}{2} \qquad (3.4)$$

Variables with a high positive correlation are assigned a dissimilarity coefficient close to 0. Variables with a strong negative correlation are assigned a dissimilarity coefficient close to 1 (meaning that the variables are very different).

In many applications, users prefer the conversion formula (3.5), in which variables with a high positive or a high negative correlation are assigned the same high similarity value:

$$d(f, g) = 1 - |R(f, g)| \qquad (3.5)$$

Users may work with a similarity coefficient s(i, j) instead of a dissimilarity coefficient. Formula (3.6) is used to convert between the two coefficients:

$$s(i, j) = 1 - d(i, j) \qquad (3.6)$$

Note that not all variables need to be included in a cluster analysis. A variable that is meaningless for a given clustering is worse than useless, because it hides the useful information provided by the other variables. For example, a person's telephone number is useless when clustering people according to descriptors such as age, height, weight, and so on. Such a "trash" variable should be given zero weight, which excludes it from the clustering process.

3.2.2 Interval-scaled variables

This section discusses interval-scaled variables and their standardization. It then describes the distance measures that are commonly used to compute the dissimilarity of objects described by interval-scaled variables. These measures include the Euclidean, Manhattan, and Minkowski distances.

Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (for example, when clustering houses), and weather temperature.

The measurement unit used can affect the cluster analysis. For example, changing the measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure. In general, expressing a variable in smaller units leads to a larger range for that variable and thus to a larger effect on the resulting clustering structure. To avoid dependence on the choice of measurement units, the data should be standardized. Standardizing the measurements attempts to give all variables an equal weight. However, in many applications, users may want to give more weight to a certain set of variables than to others. For example, when clustering basketball players, one may prefer to give more weight to the variable height.

To standardize the measurements, one choice is to convert the original measurements into unitless variables. Given the measurements for a variable f, this can be done as follows:

1. Calculate the mean absolute deviation, s_f:
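$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \qquad (3.7)$$

where x_1f, ..., x_nf are the n measurements of f and m_f is the mean value of f, that is,

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

2. Calculate the standardized measurement, called the z-score, as follows:

$$z_{if} = \frac{x_{if} - m_f}{s_f} \qquad (3.8)$$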
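The two standardization steps, (3.7) and (3.8), can be sketched as one short function. The function name and the sample heights are hypothetical, and the sketch assumes that the measurements are not all identical, since s_f would then be zero.

```python
def standardize(values):
    """Return the z-scores of a variable using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                        # mean value of f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation (3.7)
    return [(x - m_f) / s_f for x in values]     # z-scores (3.8)


# Hypothetical heights in centimeters and the same heights in inches.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_in = [h / 2.54 for h in heights_cm]

z_cm = standardize(heights_cm)
z_in = standardize(heights_in)
print([round(z, 3) for z in z_cm])
print([round(z, 3) for z in z_in])
```

Both lists of z-scores agree up to rounding error, which illustrates why standardization removes the dependence on the measurement unit.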

The advantage of using the mean absolute deviation is that the z-scores of outliers do not become too small, so the outliers remain detectable. Whether to standardize, and how, is nevertheless a choice left to the user.

After standardization, or without standardization in certain applications, the dissimilarity (or similarity) between the objects is computed. For interval-scaled variables, it is based on the distance between each pair of objects. There are several approaches to defining the distance between objects. The most popular distance measure is the Euclidean distance, which is defined as follows:

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \qquad (3.9)$$

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects. Another well-known metric is the Manhattan (or city block) distance, defined by:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \qquad (3.10)$$

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:

1. d(i, j) ≥ 0: the distance is a nonnegative number.
2. d(i, i) = 0: the distance of an object to itself is 0.
3. d(i, j) = d(j, i): the distance is a symmetric function.
4. d(i, j) ≤ d(i, h) + d(h, j): this triangle inequality states that the direct distance from i to j is no larger than the distance of a detour through any other point h.

The Minkowski distance is a generalization of both the Euclidean distance and the Manhattan distance. It is defined as follows:
$$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \qquad (3.11)$$

where q is a positive integer. It represents the Manhattan distance when q = 1 and the Euclidean distance when q = 2. If each variable is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as follows:
$$d(i, j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_p |x_{ip} - x_{jp}|^2} \qquad (3.12)$$
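Because the Manhattan and Euclidean distances are the special cases q = 1 and q = 2 of the Minkowski distance, a single function can cover (3.9) to (3.12). The sketch below uses a hypothetical function name; the optional per-variable weights default to 1, and with q = 2 they give the weighted Euclidean distance of (3.12).

```python
def minkowski(i, j, q, weights=None):
    """Minkowski distance of order q between two p-dimensional objects.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    weights, if given, holds one nonnegative weight per variable.
    """
    if weights is None:
        weights = [1.0] * len(i)
    total = sum(w * abs(a - b) ** q for w, a, b in zip(weights, i, j))
    return total ** (1.0 / q)


# Two hypothetical two-dimensional objects.
obj_i = (1.0, 2.0)
obj_j = (4.0, 6.0)

print(minkowski(obj_i, obj_j, q=1))                      # Manhattan: 7.0
print(minkowski(obj_i, obj_j, q=2))                      # Euclidean: 5.0
print(minkowski(obj_i, obj_j, q=2, weights=[4.0, 1.0]))  # weighted Euclidean
```

Passing the order q as a parameter keeps the three measures consistent with one another and makes it easy to check properties such as symmetry on sample data.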

Weighting can also be applied to the Manhattan and Minkowski distances.

3.2.3 Binary variables

This section describes how to compute the dissimilarity between objects described by symmetric or asymmetric binary variables.

A binary variable has only two states, 0 or 1, where 0 means that the variable is absent and 1 means that it is present. Given the variable smoker describing a patient, for example, 1 indicates that the patient smokes and 0 indicates that the patient does not. Treating binary variables as if they were interval-scaled can lead to misleading clustering results. Therefore, methods specific to binary data are needed to compute the dissimilarities.

One approach is to compute a dissimilarity matrix from the given binary data. If all binary variables are regarded as having the same weight, we have the 2 × 2 contingency table of Table 3.1, where a is the number of variables that equal 1 for both objects i and j, b is the number of variables that equal 1 for object i and 0 for object j, c is the number of variables that equal 0 for object i and 1 for object j, and d is the number of variables that equal 0 for both objects i and j. The total number of variables is p, where p = a + b + c + d.

Table 3.1: Contingency table for binary variables

| | object j: 1 | object j: 0 | sum |
| object i: 1 | a | b | a + b |
| object i: 0 | c | d | c + d |
| sum | a + c | b + d | p |

A binary variable is symmetric if both of its states are equally valuable and carry the same weight, so that there is no preference as to which outcome is coded 0 or 1. For example, gender can be male or female. Similarity based on symmetric binary variables is called invariant similarity, in that the result does not change when some or all of the binary variables are coded differently. For invariant similarities, the best-known coefficient is the simple matching coefficient, defined in (3.13).
$$d(i,j) = \frac{b+c}{a+b+c+d} \tag{3.13}$$

A binary variable is asymmetric if the outcomes of its states are not equally important. We encode it as follows: the most important outcome is coded 1 (for example, HIV positive) and the other is coded 0 (for example, HIV negative). Such a binary variable is regarded as a "unary" variable. Similarity based on such variables is called noninvariant similarity. For noninvariant similarities, the best-known coefficient is the Jaccard coefficient, defined in (3.14), in which the number of negative matches d is considered unimportant and is therefore ignored in the computation.
$$d(i,j) = \frac{b+c}{a+b+c} \tag{3.14}$$

When both symmetric and asymmetric binary variables occur in the same data set, the mixed-variable approach described in Section 3.2.5 can be applied.

Example 3.1 Dissimilarity between binary variables: Suppose that a table of patient records, Table 3.2, contains the attributes name, gender, fever, cough, test-1, test-2, test-3 and test-4 (test: a laboratory test), where name is an object-id, gender is a symmetric attribute and the remaining attributes are asymmetric.

Table 3.2: A relational table containing mostly binary attributes

Name   Gender  Fever  Cough  test-1  test-2  test-3  test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      P      N       N       N       N
...    ...     ...    ...    ...     ...     ...     ...

For the asymmetric attribute values, let the values Y and P be coded 1 and the value N be coded 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric variables. According to the Jaccard coefficient formula (3.14), the distance between each pair of the three patients Jack, Mary and Jim is:
$$d(\text{jack},\text{mary}) = \frac{b+c}{a+b+c} = \frac{0+1}{2+0+1} = 0.33 \tag{3.15}$$

$$d(\text{jack},\text{jim}) = \frac{b+c}{a+b+c} = \frac{1+1}{1+1+1} = 0.67 \tag{3.16}$$

$$d(\text{jim},\text{mary}) = \frac{b+c}{a+b+c} = \frac{1+2}{1+1+2} = 0.75 \tag{3.17}$$
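The three values above can be reproduced with a short Python sketch of the contingency counts of Table 3.1 and the coefficients (3.13) and (3.14). The helper names `contingency`, `simple_matching` and `jaccard` are assumptions introduced here for illustration, as is the convention of returning 0 when a + b + c = 0.

```python
def contingency(x, y):
    """Return the counts (a, b, c, d) of Table 3.1 for two 0/1 vectors."""
    a = b = c = d = 0
    for u, v in zip(x, y):
        if u == 1 and v == 1:
            a += 1
        elif u == 1 and v == 0:
            b += 1
        elif u == 0 and v == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity (3.13) for symmetric binary variables."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard(x, y):
    """Dissimilarity (3.14) for asymmetric binary variables."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (b + c) / (a + b + c)


def encode(values):
    """Code Y and P as 1 and N as 0, as in Example 3.1."""
    return [1 if v in ("Y", "P") else 0 for v in values]


# fever, cough, test-1, test-2, test-3, test-4 from Table 3.2
jack = encode("YNPNNN")
mary = encode("YNPNPN")
jim = encode("YPNNNN")

d_jack_mary = round(jaccard(jack, mary), 2)   # 0.33
d_jack_jim = round(jaccard(jack, jim), 2)     # 0.67
d_jim_mary = round(jaccard(jim, mary), 2)     # 0.75
```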

These measurements suggest that Jim and Mary are unlikely to have a similar disease. Of the three patients, Jack and Mary are the most likely to have a similar disease.

3.2.4 Nominal, ordinal and ratio-scaled variables

This section discusses how to compute the dissimilarity between objects described by nominal, ordinal and ratio-scaled variables.

Nominal variables: A nominal variable is a generalization of the binary variable in that it can take more than two states. For example, map_color is a nominal variable that may have five states: red, yellow, green, pink and blue. Let the number of states of a nominal variable be M. The states can be denoted by letters, symbols or a set of integers such as 1, 2, ..., M. Note that such integers are used only for data handling and do not represent any specific ordering. The dissimilarity between two objects i and j can be computed using the simple matching approach as in (3.18):
$$d(i,j) = \frac{p-m}{p} \tag{3.18}$$

where m is the number of matches (that is, the number of variables for which i and j are in the same state) and p is the total number of variables. Weights can be assigned to increase the effect of m, or to assign greater weight to the matches in variables having a larger number of states.

Nominal variables can be encoded by a large number of asymmetric binary variables by creating a new binary variable for each nominal state. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, to encode the nominal variable map_color, a binary variable can be created for each of the five colors listed above. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0. The dissimilarity coefficient for this form of encoding can be calculated using the methods in Section 3.2.3.
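Both ideas, the simple matching dissimilarity (3.18) and the encoding of a nominal variable by one binary variable per state, can be sketched as follows. The names `nominal_dissimilarity`, `one_hot` and `MAP_COLORS` are assumptions chosen for this illustration, as are the two sample objects.

```python
MAP_COLORS = ["red", "yellow", "green", "pink", "blue"]


def nominal_dissimilarity(x, y):
    """Simple matching dissimilarity (3.18): d = (p - m) / p."""
    if len(x) != len(y) or not x:
        raise ValueError("objects must have the same, non-zero number of variables")
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)  # number of matching variables
    return (p - m) / p


def one_hot(value, states):
    """Encode one nominal value as a list of binary variables, one per state."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if value == s else 0 for s in states]


# Two hypothetical objects described by three nominal variables.
obj_i = ("yellow", "A", "round")
obj_j = ("yellow", "B", "round")
d_ij = nominal_dissimilarity(obj_i, obj_j)   # one mismatch out of three variables

yellow = one_hot("yellow", MAP_COLORS)       # [0, 1, 0, 0, 0]
```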

Ordinal variables: A discrete ordinal variable resembles a nominal variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential, but their actual magnitude is not. For example, the relative ranking in a particular sport is often more essential than the actual values of a particular measure. Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes. The values of an ordinal variable can be mapped to ranks. Suppose that an ordinal variable f has $M_f$ states. These ordered states define the ranking $1, \dots, M_f$.

The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between objects. Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the following steps:

1. The value of f for the i-th object is $x_{if}$, and f has $M_f$ ordered states, representing the ranking $1, \dots, M_f$. Replace each $x_{if}$ by its corresponding rank $r_{if} \in \{1, \dots, M_f\}$.
2. Since each ordinal variable can have a different number of states, map the range of each variable onto [0, 1] by replacing the rank $r_{if}$ of the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1} \tag{3.19}$$

3. Compute the dissimilarity using any of the distance measures described in Section 3.2.2, using $z_{if}$ to represent the f value for the i-th object.

Ratio-scaled variables:

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

$$Ae^{Bt} \quad \text{or} \quad Ae^{-Bt} \tag{3.20}$$

where A and B are positive constants. There are three methods of handling ratio-scaled variables when computing the dissimilarity between objects.

1. Treat ratio-scaled variables like interval-scaled variables. This is not always a good choice, however, because the scale may be distorted.
2. Apply a logarithmic transformation to a ratio-scaled variable f having value $x_{if}$ for object i, using the formula $y_{if} = \log(x_{if})$. The $y_{if}$ values are then treated as interval-scaled values as in Section 3.2.2. Note that for some ratio-scaled variables, a logarithmic or other transformation may be applied, depending on the definition and the application.
3. Treat $x_{if}$ as continuous ordinal data and treat their ranks as interval-scaled values.

The latter two methods are the most effective, although the choice of method also depends on the given application.

3.2.5 Variables of mixed types

Sections 3.2.2 to 3.2.4 showed how to compute the dissimilarity between objects described by variables of the same type, where the type may be interval-scaled, symmetric binary, asymmetric binary, nominal, ordinal or ratio-scaled. In many real databases, however, objects are described by a mixture of variable types. In general, a database can contain all six of the variable types listed above. We need a method to compute the dissimilarity between objects of mixed variable types.

One approach is to group each kind of variable together and perform a separate cluster analysis for each variable type. This is feasible if these analyses derive compatible results. In real applications, however, it is unlikely that a separate cluster analysis per variable type will generate compatible results.

A more preferable approach is to process all variable types together, performing a single cluster analysis. One such technique, proposed by (Ducker et al. 1965) and extended by (Kaufman and Rousseeuw 1990), combines the different variables into a single dissimilarity matrix and brings all of the meaningful variables onto a common scale of the interval [0,1].

Suppose that the data set contains p variables of mixed type. The dissimilarity d(i,j) between objects i and j is defined as:

$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \tag{3.21}$$

where the indicator $\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing (that is, there is no measurement of variable f for object i or object j), or (2) $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$. The contribution $d_{ij}^{(f)}$ of variable f is computed depending on its type:

1. If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$.
2. If f is interval-scaled: $d_{ij}^{(f)} = \dfrac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where h runs over all objects for which variable f is not missing.
3. If f is ordinal or ratio-scaled: compute the ranks $r_{if}$ and $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled.

Thus, the dissimilarity between objects can be computed even when the variables describing the objects are of different types.

3.3 A categorization of major clustering methods

A large number of clustering algorithms exist in the literature. The choice of clustering algorithm depends on the type of data available and on the particular purpose and application.

If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose.

In general, the major clustering methods can be classified into the following categories:

1. Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Note that the second requirement is relaxed in some fuzzy partitioning techniques, which are discussed briefly in this chapter.

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different. There are various other criteria for judging the quality of partitions.

In partitioning-based clustering, most applications adopt one of two popular heuristic methods: (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster; (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and to cluster very large data sets, partitioning-based methods need to be extended.

Partitioning-based clustering methods are studied in more depth in Section 3.4.

2. Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the "bottom-up" approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the "top-down" approach, starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in a cluster of its own, or until a termination condition holds.

Combining iterative relocation with hierarchical decomposition is advantageous: a hierarchical decomposition algorithm is applied first, and the result is then refined using iterative relocation. Several scalable clustering algorithms, such as BIRCH and CURE, have been developed based on such an integrated approach. Hierarchical clustering methods are studied in Section 3.5.

3. Density-based methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shape. Other clustering methods have been developed based on the notion of density. The general idea is to continue growing a given cluster as long as the density (the number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Such a method can be used to filter out noise (outliers) and to discover clusters of arbitrary shape. DBSCAN is a typical density-based method that grows clusters according to a density threshold. OPTICS is a density-based method that computes an augmented clustering ordering for automatic and interactive cluster analysis. Density-based clustering methods are studied in Section 3.6.

4. Grid-based methods: A grid-based method quantizes the object space into a finite number of cells that form a grid structure. It then performs all of the clustering operations on the grid structure (that is, on the quantized space). The main advantage of this approach is its fast processing time, which is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. STING is a typical example of a grid-based method. WaveCluster and CLIQUE are two clustering algorithms that are both grid-based and density-based. Grid-based clustering methods are studied in Section 3.7.

Many clustering algorithms integrate the ideas of several clustering methods, so that it is sometimes difficult to classify a given algorithm as belonging uniquely to one category of clustering method. Furthermore, some applications may have clustering criteria that require the integration of several clustering techniques.

In the following sections, we examine each of the above clustering methods in detail. Algorithms that integrate the ideas of several clustering methods are also introduced.

3.4 Partitioning methods

Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.

The clusters are formed according to an objective partitioning criterion, often called a similarity function, such as distance, so that the objects within a cluster are "similar," whereas the objects of different clusters are "dissimilar" in terms of the database attributes.

3.4.1 Classical partitioning methods: k-means and k-medoids

The most well-known and commonly used partitioning methods are k-means (MacQueen 1967), k-medoids (Kaufman and Rousseeuw 1987) and their variations.

3.4.1.1 Centroid-based technique: the k-means method

The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high while the intercluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's "center of gravity."

The algorithm proceeds as follows: first, it randomly selects k objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the squared-error criterion is used, defined as follows:
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} |x - m_i|^2 \tag{3.22}$$

where x is the point in space representing a given object and $m_i$ is the mean of cluster $C_i$ (both x and $m_i$ are multidimensional). This criterion tries to make the resulting k clusters as compact and as separate as possible. The k-means procedure is summarized in Figure 3.1.

The algorithm attempts to determine k partitions that minimize the squared-error function. It works well when the clusters are compact clouds that are rather well separated from one another. The method is relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where n is the number of objects, k is the number of clusters and t is the number of iterations. Normally, k << n and t << n. The method often terminates at a local optimum.

Algorithm 3.4.1 (k-means): The k-means algorithm for partitioning based on the mean value of the objects in the cluster.
Input: The number of clusters k and a database containing n objects.
Output: A set of k clusters that minimizes the squared-error criterion.
Method:
1) arbitrarily choose k objects as the initial cluster centers;
2) repeat
3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
4) update the cluster means, that is, calculate the mean value of the objects in each cluster;
5) until no change;

Figure 3.1: The k-means algorithm
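Algorithm 3.4.1 can be sketched in plain Python as below. This is a minimal illustration rather than a tuned implementation: the function names, the `seed` and `max_iter` parameters and the rule of keeping the old center when a cluster becomes empty are assumptions introduced here. The returned error is the squared-error criterion (3.22).

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (assignment, centers, squared_error) for the given points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 1: arbitrary initial centers
    assignment = None
    for _ in range(max_iter):                  # step 2: repeat
        new_assignment = [                     # step 3: (re)assign to nearest mean
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:       # step 5: until no change
            break
        assignment = new_assignment
        for c in range(k):                     # step 4: update the cluster means
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # assumption: empty cluster keeps its center
                centers[c] = mean(members)
    error = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return assignment, centers, error


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, error = k_means(data, k=2)
```

On this small data set with two well-separated groups, the first three points end up in one cluster and the last three in the other. For less clearly separated data the result depends on the initial centers, in line with the remark that the method often terminates at a local optimum.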

Figure 3.2: Clustering of a set of points based on the k-means method

The k-means method, however, can be applied only when the mean of a cluster is defined. This may not be the case in some applications, such as when data with categorical attributes are involved. The need for users to specify k, the number of clusters, in advance can be seen as a disadvantage. The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes. Moreover, it is sensitive to noise and outlier data points, since a small number of such data can substantially influence the mean value.

Example 3.2: Suppose that there is a set of objects located in a rectangle as shown in Figure 3.2. Let k = 3; that is, the user would like to cluster the objects into three clusters.

According to Algorithm 3.4.1, we arbitrarily choose three objects (marked by "+") as the three initial cluster centers. Each object is then distributed to the chosen clusters based on the nearest cluster center. Such a distribution forms silhouettes encircled by dotted curves, as in Figure 3.2 a).

The cluster centers are then updated; that is, the mean value of each cluster is recalculated based on the objects in the cluster. Relative to these new centers, the objects are redistributed to the chosen clusters based on the nearest cluster center. Such a redistribution forms new silhouettes encircled by dashed curves, as in Figure 3.2 b).

This process repeats, leading to Figure 3.2 c). Eventually, no redistribution of the objects in any cluster occurs and the process terminates. The final clusters are the result of the clustering process.

A variant of k-means is the k-modes method (Huang 1998), which extends the k-means paradigm to cluster categorical data by replacing the means of clusters with modes, using new dissimilarity measures to deal with categorical objects, and using a frequency-based method to update the modes of clusters. The k-means and k-modes methods can be integrated to cluster data with mixed numeric and categorical values, resulting in the k-prototypes method.

Another variant of k-means is the EM (Expectation Maximization) algorithm (Lauritzen 1995), which extends the k-means paradigm in a different way: instead of assigning each point to a dedicated cluster, it assigns each point to a cluster according to a weight representing the probability of membership. In other words, there are no strict boundaries between clusters. New means are therefore computed based on weighted measures.

3.4.1.2 Representative object-based technique: the k-medoids method

The k-means algorithm is very sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. Instead of taking the mean value of the objects in a cluster as a reference point, k-medoids uses a representative object in the cluster, called a medoid, which is the most centrally located object in the cluster. The partitioning method is then still performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. This forms the basis of the k-medoids method.

PAM (Partitioning Around Medoids): This is a k-medoids type clustering algorithm. It finds k clusters in n objects by first finding a representative object (medoid) for each cluster. The initial set of medoids is chosen arbitrarily. It then iteratively replaces one of the medoids by one of the non-medoids as long as the total distance of the resulting clustering is improved.

The detailed PAM algorithm is presented in Figure 3.3. The algorithm attempts to determine k partitions for n objects. After an initial selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids by analyzing all of the possible pairs of objects such that one object is a medoid and the other is not. The measure of clustering quality is calculated for each such combination. The best choice of points in one iteration is chosen as the medoids for the next iteration. The cost of a single iteration is $O(k(n-k)^2)$. For large values of n and k, such computation can be very costly.

Algorithm 3.4.2: The k-medoids algorithm for partitioning based on central objects.
Input: The number of clusters k and a database containing n objects.
Output: A set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.

Method:
1) arbitrarily choose k objects as the initial medoids;
2) repeat
3) assign each object to the cluster with the nearest medoid;
4) calculate the objective function, which is the sum of the dissimilarities of all the objects to their nearest medoid;
5) swap the medoid x with an object y if such a swap reduces the objective function;
6) until no change;

Figure 3.3: The k-medoids algorithm

Example 3.3: Suppose that there is a set of objects located in a rectangle as shown in Figure 3.4. Let k = 3; that is, the user would like to cluster the objects into three clusters.

Figure 3.4: Clustering of a set of points based on the k-medoids method

According to Algorithm 3.4.2, we arbitrarily choose three objects (marked by "+") as the three initial cluster centers. Each object is then distributed to the chosen clusters based on the nearest cluster center. Such a distribution forms silhouettes encircled by dotted curves, as in Figure 3.4 a).

This grouping updates the cluster centers; that is, the medoid of each cluster is recalculated based on the objects in the cluster. With the new centers, the objects are redistributed to the chosen clusters based on the nearest cluster center. This redistribution forms new silhouettes encircled by dashed curves, as in Figure 3.4 b).

Repeating this process leads to Figure 3.4 c). Eventually, no redistribution of the objects in any cluster occurs and the process terminates. The final clusters are the result of the clustering process.
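The swap-based search of Algorithm 3.4.2 can be sketched as follows. This is a simplified illustration under stated assumptions: the names `pam` and `total_cost`, the `initial` parameter and the choice of the first k objects as default initial medoids are introduced here, and each iteration applies the single best improving swap. `math.dist` requires Python 3.8 or later.

```python
import itertools
import math


def total_cost(points, medoids, dist):
    """Objective function: sum of dissimilarities to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist=math.dist, initial=None):
    """Return (medoid_indices, labels, cost); labels hold the medoid index per object."""
    medoids = list(initial) if initial is not None else list(range(k))  # step 1
    cost = total_cost(points, medoids, dist)
    improved = True
    while improved:                                   # steps 2 and 6: repeat until no change
        improved = False
        best_cost, best_medoids = cost, None
        # examine every (medoid, non-medoid) pair as a candidate swap
        for pos, candidate in itertools.product(range(k), range(len(points))):
            if candidate in medoids:
                continue
            trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
            trial_cost = total_cost(points, trial, dist)   # step 4
            if trial_cost < best_cost:
                best_cost, best_medoids = trial_cost, trial
        if best_medoids is not None:                  # step 5: swap if it reduces the objective
            cost, medoids = best_cost, best_medoids
            improved = True
    # step 3: assign each object to the cluster of its nearest medoid
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels, cost


data = [(1,), (2,), (3,), (10,), (11,), (12,)]
medoids, labels, cost = pam(data, k=2)
```

Because every candidate swap requires a full recomputation of the objective, the sketch makes the high per-iteration cost noted above directly visible.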

When noise and outliers are present, the k-medoids method is more robust than k-means because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than that of the k-means method, and it also requires the user to specify k, the number of clusters.

3.4.2 Partitioning methods in large databases: from k-medoids to CLARANS

A typical k-medoids partitioning algorithm such as PAM works effectively for small data sets but does not scale well to large data sets. To deal with larger data sets, a sampling-based method called CLARA (Clustering LARge Applications) was developed by Kaufman and Rousseeuw (1990).

The idea behind CLARA is as follows: instead of taking the whole data set into consideration, only a small portion of the actual data is chosen as a representative of the data, and the medoids are chosen from this sample using PAM. If the sample is selected in a fairly random manner, it should closely represent the whole data set, and the representative objects (medoids) chosen will likely be similar to those that would have been chosen from the whole data set. CLARA draws multiple samples of the data set, applies PAM to each sample and returns the best clustering as the output. As expected, CLARA can deal with larger data sets than PAM. The complexity of each iteration now becomes $O(kS^2 + k(n-k))$, where S is the size of the sample, k is the number of clusters and n is the total number of points.

The effectiveness of CLARA depends on the sample size. Note that PAM searches for the best k medoids among a given data set, whereas CLARA searches for the best k medoids among the selected samples of the data set. CLARA cannot find the best clustering if any sampled medoid is not among the best k medoids. For example, if an object $O_i$ is one of the medoids in the best k medoids but is not selected during sampling, CLARA will never find the best clustering.

A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.

To improve the quality and scalability of CLARA, another clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed by Ng and Han (1994). It is also a k-medoids type algorithm and combines the sampling technique with PAM. Unlike CLARA, however, CLARANS does not confine itself to any sample at any given time. While CLARA has a fixed sample at each stage of the search, CLARANS draws a sample with some randomness in each step of the search. The clustering process can be presented as searching a graph in which every node is a potential solution, that is, a set of k medoids. The clustering obtained after replacing a single medoid is called a neighbor of the current clustering. The number of neighbors to be randomly tried is restricted by a parameter. If a better neighbor is found, CLARANS moves to the neighbor's node and the process starts again; otherwise, the current clustering produces a local optimum. If a local optimum is found, CLARANS starts with new randomly selected nodes in search of a new local optimum.

CLARANS has been shown experimentally to be more effective than both PAM and CLARA. The computational complexity of each iteration in CLARANS is linearly proportional to the number of objects. CLARANS can be used to find the most natural number of clusters using the silhouette coefficient, a property of an object that indicates how well the object belongs to its cluster. It can also detect outliers, that is, points that do not belong to any cluster. The performance of the CLARANS algorithm can be improved further by exploring spatial data structures, such as R*-trees, and the focusing techniques presented in the papers of Ester, Kriegel and Xu (1995).

3.5 Hierarchical methods

A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical clustering methods can be further classified into agglomerative and divisive hierarchical clustering, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion. Recent studies often address the integration of hierarchical agglomeration with iterative relocation methods.

3.5.1 Agglomerative and divisive hierarchical clustering

In general, there are two types of hierarchical clustering methods:

1. Agglomerative hierarchical clustering: It starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until a given termination condition is satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity. An example is AGNES (Agglomerative Nesting) (Kaufman and Rousseeuw 1990). This method uses the single-link method, in which each cluster is represented by all of the data points in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. AGNES merges the nodes (that is, individual objects or clusters) that have the least dissimilarity, possibly until everything is merged into a single cluster.

2. Divisive hierarchical clustering: It does the reverse by starting with all of the objects in one cluster and subdividing it into smaller and smaller pieces, until each object forms a cluster on its own or until a given termination condition is satisfied, such as a desired number of clusters being obtained or the distance between the two closest clusters reaching a given threshold. Divisive methods are in general few in number and rarely applied, owing to the difficulty of making a right splitting decision at a high level. An example of a divisive hierarchical clustering method is DIANA (Divisive Analysis) (Kaufman and Rousseeuw 1990).

The merging of clusters is usually based on the distance between clusters. The widely used measures for the distance between clusters are as follows, where $m_i$ is the mean of cluster $C_i$, $n_i$ is the number of points in $C_i$, and $|p - p'|$ is the distance between two points p and p'.
$$d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$$

$$d_{mean}(C_i, C_j) = |m_i - m_j|$$

$$d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$$

$$d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$$
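The four intercluster distances can be written down almost literally in Python. In this sketch the function names and the representation of a cluster as a list of coordinate tuples are assumptions made for illustration; `math.dist` (Python 3.8 or later) plays the role of |p - p'|.

```python
import math


def centroid(cluster):
    """Mean m_i of a cluster given as a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def d_min(ci, cj):
    """Minimum distance between any point of ci and any point of cj."""
    return min(math.dist(p, q) for p in ci for q in cj)


def d_max(ci, cj):
    """Maximum distance between any point of ci and any point of cj."""
    return max(math.dist(p, q) for p in ci for q in cj)


def d_mean(ci, cj):
    """Distance between the two cluster means."""
    return math.dist(centroid(ci), centroid(cj))


def d_avg(ci, cj):
    """Average of all pairwise distances between the two clusters."""
    return sum(math.dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))


c1 = [(0, 0), (0, 2)]
c2 = [(3, 0), (3, 2)]
distances = (d_min(c1, c2), d_mean(c1, c2), d_avg(c1, c2), d_max(c1, c2))
```

For these two clusters the minimum and mean distances are both 3, the maximum is the diagonal of length sqrt(13), and the average lies between the two.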

Example 3.4: Suppose that there is a set of objects located in a rectangle as shown in Figure 3.5. The agglomerative hierarchical clustering method AGNES works as follows: initially, every object is placed in a cluster of its own. The clusters are then merged step by step according to some principle, such as merging the clusters with the minimum Euclidean distance between the closest objects in the clusters. Figure 3.5 a) shows that the closest single-object clusters (that is, those with the minimum Euclidean distance) are first merged into two-object clusters. This cluster merging process repeats, and the closest clusters are merged further, as shown in Figure 3.5 b) and c). Eventually, all of the objects are merged into one big cluster.
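The merging process of Example 3.4 can be sketched as a naive single-link procedure. This is an illustrative sketch only, not the original AGNES implementation: the function names, the `num_clusters` stopping parameter and the brute-force search for the closest pair are assumptions introduced here.

```python
import math


def single_link(ci, cj):
    """Distance between the closest pair of points from the two clusters."""
    return min(math.dist(p, q) for p in ci for q in cj)


def agglomerate(points, num_clusters=1):
    """Merge the two closest clusters until num_clusters remain.

    Returns the final clusters and the list of merge distances.
    """
    clusters = [[p] for p in points]          # every object starts in its own cluster
    merge_distances = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merge_distances.append(d)
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return clusters, merge_distances


data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merge_distances = agglomerate(data, num_clusters=2)
```

With `num_clusters=2` the isolated point (20, 20) remains a cluster of its own, which illustrates the use of a desired number of clusters as a termination condition.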

Figure 3.5: Clustering of a set of points based on the "Agglomerative Nesting" method

The divisive hierarchical clustering method DIANA works in the reverse order. That is, all of the objects are initially placed in one cluster. The cluster is then split according to some principle, such as splitting the clusters according to the maximum Euclidean distance between the closest neighboring objects in the cluster. Figure 3.5 c) can be viewed as the result of the first split. This cluster splitting process repeats, and each cluster is split further according to the same criterion.

Figures 3.5 b) and a) can be viewed as snapshots of such splitting. Eventually, every cluster will contain only a single object.

In either agglomerative or divisive hierarchical clustering, one can specify the desired number of clusters as a termination condition, so that the hierarchical clustering process stops when it reaches the desired number of clusters.

The hierarchical clustering method, though simple, often encounters difficulties regarding the selection of merge or split points. Such a decision is critical because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters. It will neither undo what was done previously nor perform object swapping between clusters. Thus, merge or split decisions, if not well chosen at some step, may lead to low-quality clusters. Moreover, the method does not scale well, since the decision to merge or split needs to examine and evaluate a good number of objects or clusters.

One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques for multiple-phase clustering. A few such methods are introduced in the following subsections. The first, BIRCH, first partitions objects hierarchically using tree structures and then applies other clustering algorithms to form refined clusters. The second, CURE, represents each cluster by a certain fixed number of representative points and then shrinks them toward the center of the cluster by a specified fraction. The third, ROCK, merges clusters based on their interconnectivity. The fourth, CHAMELEON, explores dynamic modeling in hierarchical clustering.

3.5.2 BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies

An interesting integrated hierarchical clustering method is BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) (Zhang, Ramakrishnan and Livny 1996). It introduces two concepts, the clustering feature (CF) and the clustering feature tree (CF tree), and uses the CF tree as a summarized cluster representation to achieve good speed and clustering scalability in large databases. It is also effective for incremental and dynamic clustering of incoming data points.

A clustering feature CF is a triple summarizing information about a subcluster of points. Given N points (vectors) $\{\vec{X}_i\}$ in a subcluster, CF is defined as follows:
$$CF = (N, \vec{LS}, SS) \tag{3.23}$$

where N is the number of points in the subcluster, $\vec{LS}$ is the linear sum of the N points, $\sum_{i=1}^{N} \vec{X}_i$, and SS is the square sum of the data points, $\sum_{i=1}^{N} \vec{X}_i^{\,2}$.
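A clustering feature is easy to maintain incrementally because the three components of two disjoint subclusters simply add up. The sketch below illustrates this; the class name `ClusteringFeature`, its methods and the helper `summarize` are assumptions introduced here, and SS is kept as a single scalar (the sum of squared norms).

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ClusteringFeature:
    n: int                    # N: number of points in the subcluster
    ls: Tuple[float, ...]     # LS: linear sum of the points
    ss: float                 # SS: sum of the squared norms of the points

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "ClusteringFeature":
        return cls(1, tuple(float(v) for v in x), float(sum(v * v for v in x)))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        """CF of the union of two disjoint subclusters (component-wise sum)."""
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(v / self.n for v in self.ls)


def summarize(points):
    """Build the CF of a non-empty list of points by inserting them one by one."""
    cf = ClusteringFeature.from_point(points[0])
    for x in points[1:]:
        cf = cf.merge(ClusteringFeature.from_point(x))
    return cf


cf1 = summarize([(2, 5), (3, 2), (4, 3)])     # N = 3, LS = (9, 10), SS = 67
cf2 = summarize([(6, 6), (8, 4)])
cf_union = cf1.merge(cf2)                      # same as summarizing all five points
```

Only the triple has to be stored per subcluster; the individual points are not needed to merge subclusters or to compute a centroid.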

A CF tree is a height-balanced tree that stores the clustering features. It has two parameters: the branching factor B and the threshold T. The branching factor specifies the maximum number of children per node. The threshold parameter specifies the maximum diameter of the subclusters stored at the leaf nodes. Changing the threshold value changes the size of the tree. The nonleaf nodes store the sums of the CFs of their children and thus summarize the information about their children.

The BIRCH algorithm has the following two phases:

Phase 1: Scan the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve the inherent clustering structure of the data.

Phase 2: Apply a (selected) clustering algorithm to cluster the leaf nodes of the CF tree.

In Phase 1, the CF tree is built dynamically as data points are inserted. The method is therefore incremental. A point is inserted into the closest leaf entry (subcluster). If the diameter of the subcluster stored in the leaf node after insertion is larger than the threshold value, then the leaf node and possibly other nodes are split. After the insertion of a new point, the information about it is passed toward the root of the tree. The size of the CF tree can be changed by modifying the threshold.

If the size of the memory needed to store the CF tree is larger than the size of the main memory, then a larger threshold value is specified and the CF tree is rebuilt, since a larger threshold lets each leaf entry absorb more points and therefore yields a smaller tree. The rebuilding process is performed by building a new tree from the leaf nodes of the old tree. The tree is thus rebuilt without rereading all of the points. Therefore, to build the tree, the data has to be read only once. Several heuristics and methods have also been introduced to deal with outliers and to improve the quality of CF trees through additional scans of the data.

After the CF tree is built, any clustering algorithm, such as a typical partitioning algorithm, can be used with the CF tree in Phase 2.

BIRCH tries to produce the best clusters with the available resources. With a limited amount of main memory, an important consideration is to minimize the time required for I/O. It applies a multiphase clustering technique: a single scan of the data set yields a basic good clustering, and one or more additional scans can (optionally) be used to improve the quality further. The computational complexity of the algorithm is therefore O(N), where N is the number of objects to be clustered.

Experiments have shown the linear scalability of the algorithm with respect to the number of points and the good quality of the resulting clustering. However, since each node in a CF tree can hold only a limited number of entries because of its size, a CF tree node does not always correspond to a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well because it uses the notion of radius or diameter to control the boundary of a cluster.

3.5.3 CURE: Clustering Using REpresentatives

Most clustering algorithms either favor clusters with spherical shapes and similar sizes, or are fragile in the presence of outliers. An interesting method called CURE (Clustering Using REpresentatives) (Guha, Rastogi and Shim 1998) integrates hierarchical and partitioning algorithms and overcomes the problem of favoring clusters with spherical shapes and similar sizes.

CURE provides a novel hierarchical clustering algorithm that adopts a middle ground between the centroid-based and the all-points approaches. Instead of using a single centroid to represent a cluster, CURE fixes a number of representative points to describe a cluster. These representative points are generated by first selecting well-scattered points in the cluster and then shrinking them toward the center of the cluster by a fraction (the shrinking factor). The clusters with the closest pair of representative points are merged at each step of the algorithm.

Having more than one representative point per cluster allows CURE to adjust well to the geometry of nonspherical shapes. The shrinking helps dampen the effects of outliers. CURE is therefore more robust to outliers and identifies clusters having nonspherical shapes and wide variations in size.

To handle large databases, CURE employs a combination of random sampling and partitioning: a random sample is first partitioned, and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters.

The major steps of the CURE algorithm are briefly outlined as follows: (1) draw a random sample s; (2) partition the sample s into p partitions, each of size s/p; (3) partially cluster the partitions into s/pq clusters, with q > 1; (4) eliminate outliers by random sampling: if a cluster grows too slowly, remove it; (5) cluster the partial clusters, a process that shrinks multiple representative points toward the centroid by a fraction specified by the user, where the representatives capture the shape of the cluster; (6) mark the data with the corresponding cluster labels.

The following example shows how CURE works.

Example 3.5: Suppose that there is a set of objects located in a rectangle. Let p = 2; the user would like to cluster the objects into two clusters.


Figure 3.6: Clustering of a set of points by CURE

First, 50 objects are sampled as shown in Figure 3.6 a). These objects are then initially distributed into two partitions, each containing 25 points. We partially cluster these partitions into 10 subclusters based on minimum mean distance. Each cluster representative is marked by a small cross, as in Figure 3.6 b). These representatives are moved toward the centroid by a fraction α, as in Figure 3.6 c). We capture the shape of the clusters and form two clusters. The objects are thus partitioned into two clusters with the outliers removed, as shown in Figure 3.6 d).

CURE produces high-quality clusters in the presence of outliers, complex cluster shapes and different cluster sizes. It scales well to large databases without sacrificing clustering quality. CURE needs a few user-specified parameters, such as the size of the random sample, the number of desired clusters and the shrinking factor α. A sensitivity analysis of the clustering has been provided based on the effect of changing the parameters. Although some parameters can be varied without affecting the clustering quality, the parameter setting in general does have a significant influence.

Another agglomerative hierarchical clustering algorithm, developed by (Guha, Rastogi and Shim 1999) and called ROCK, is suited for clustering categorical attributes. It measures the similarity of two clusters by comparing the aggregate interconnectivity of the two clusters against a user-specified static interconnectivity model, where the interconnectivity of two clusters $C_1$ and $C_2$ is defined by the number of cross links between the two clusters, and $link(p_i, p_j)$ is the number of common neighbors between two points $p_i$ and $p_j$.

ROCK first constructs a sparse graph from a given data similarity matrix, using a similarity threshold and the concept of shared neighbors, and then performs a hierarchical clustering algorithm on the sparse graph.

3.5.4 CHAMELEON: A hierarchical clustering algorithm using dynamic modeling

Another interesting clustering algorithm called CHAMELEON, which explores dynamic modeling in hierarchical clustering, was developed by Karypis, Han and Kumar (1999). In its clustering process, two clusters are merged if the interconnectivity and closeness (proximity) between the two clusters are highly related to the internal interconnectivity and closeness of the objects within the clusters. The merge process based on the dynamic model facilitates the discovery of natural and homogeneous clusters, and it applies to all types of data as long as a similarity function is specified.

CHAMELEON is derived from the observed weaknesses of two hierarchical clustering algorithms: CURE and ROCK. CURE and related schemes ignore information about the aggregate interconnectivity of the objects in two clusters, whereas ROCK and related schemes ignore information about the closeness of two clusters while emphasizing their interconnectivity.

CHAMELEON first uses a graph partitioning algorithm to cluster the data items into a large number of relatively small subclusters. It then uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these subclusters. To determine the pairs of most similar subclusters, it takes into account both the interconnectivity and the closeness of the clusters, especially the internal characteristics of the clusters themselves. It therefore does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the clusters being merged.


Figure 3.7: CHAMELEON: Hierarchical clustering based on k-nearest neighbors and dynamic modeling

As shown in Figure 3.7, CHAMELEON represents its objects using a commonly used graph approach: the k-nearest neighbor graph. Each vertex of the k-nearest neighbor graph represents a data object, and an edge exists between two vertices (objects) if one object is among the k most similar objects of the other. The k-nearest neighbor graph Gk captures the notion of a dynamic neighborhood: the neighborhood radius of a data point is determined by the density of the region in which the object resides. In a dense region the neighborhood is defined narrowly, and in a sparse region it is defined more widely. Compared with the model of a density-based method such as DBSCAN (introduced in the next section), which uses a global neighborhood density, Gk captures a more natural neighborhood. Moreover, the density of a region is recorded as the weight of the edges: an edge in a dense region tends to weigh more than one in a sparse region.

CHAMELEON determines the similarity between each pair of clusters Ci and Cj from their relative interconnectivity RI(Ci,Cj) and their relative closeness RC(Ci,Cj). The relative interconnectivity RI(Ci,Cj) between two clusters Ci and Cj is defined as the absolute interconnectivity between Ci and Cj, normalized with respect to the internal interconnectivity of the two clusters Ci and Cj. That is:

\[ RI(C_i, C_j) = \frac{\left| EC_{\{C_i, C_j\}} \right|}{\frac{1}{2}\left( \left| EC_{C_i} \right| + \left| EC_{C_j} \right| \right)} \tag{3.24} \]

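where EC{Ci,Cj} is the edge cut of the cluster containing both Ci and Cj, that is, the set of edges that must be cut to split this cluster into Ci and Cj, and, similarly, EC_Ci (or EC_Cj) is the size of the min-cut bisector of Ci (or Cj), i.e. the weighted sum of the edges that partition the graph into two roughly equal parts.

The relative closeness RC(Ci,Cj) between a pair of clusters Ci and Cj is defined as the absolute closeness between Ci and Cj, normalized with respect to the internal closeness of the two clusters Ci and Cj. That is:

\[ RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}} \tag{3.25} \]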

where S̄_EC{Ci,Cj} is the average weight of the edges that connect vertices in Ci to vertices in Cj, and S̄_EC_Ci (or S̄_EC_Cj) is the average weight of the edges that belong to the min-cut bisector of cluster Ci (or Cj).

CHAMELEON is therefore more capable of discovering arbitrarily shaped clusters of high quality than DBSCAN and CURE. However, the processing cost for high-dimensional data may be O(n^2) for n objects in the worst case.

3.6 Density-based clustering methods

To discover clusters of arbitrary shape, density-based clustering methods have been developed. They either connect regions of sufficiently high density into clusters or cluster objects based on the distribution of a density function.

3.6.1 DBSCAN: A density-based clustering method based on connected regions with high density

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm developed by Ester, Kriegel, Sander and Xu (1996). The algorithm grows regions of sufficiently high density into clusters and finds clusters of arbitrary shape in spatial databases with noise. A cluster is defined as a maximal set of density-connected points.

The basic idea of density-based clustering is as follows: for each object of a cluster, the neighborhood within a given radius ε (called the ε-neighborhood) must contain at least a minimum number of objects (MinPts).

An object whose ε-neighborhood contains no fewer than MinPts objects is called a core object (with respect to the radius ε and the minimum number of points MinPts).

An object p is directly density-reachable from an object q with respect to ε and MinPts in a set of objects D if p lies within the ε-neighborhood of q and the ε-neighborhood of q contains at least MinPts objects.

An object p is density-reachable from an object q with respect to ε and MinPts in a set of objects D if there is a chain of objects p1, p2, ..., pn with p1 = q and pn = p such that, for i = 1, ..., n-1, pi ∈ D and pi+1 is directly density-reachable from pi with respect to ε and MinPts.

An object p is density-connected to an object q with respect to ε and MinPts in a set of objects D if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.

Example 3.6: In Figure 3.8, let ε be the radius of the circles and let MinPts = 3. M is directly density-reachable from P; Q is (indirectly) density-reachable from P. However, P is not density-reachable from Q. Similarly, R and S are density-reachable from O, and O, R and S are all density-connected.

Figure 3.8: Density-reachability and density-connectivity in density-based clustering
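The definitions above can be made concrete with a small sketch. It is only an illustration, not the thesis implementation: the function names, the brute-force neighborhood search and the four sample points are assumptions, and an object is counted as a member of its own ε-neighborhood.

```python
from collections import deque
from math import dist

def eps_neighborhood(points, q, eps):
    """Indices of all objects within distance eps of points[q] (q itself included)."""
    return [i for i, p in enumerate(points) if dist(p, points[q]) <= eps]

def is_core(points, q, eps, min_pts):
    """q is a core object if its eps-neighborhood holds at least min_pts objects."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def directly_reachable(points, p, q, eps, min_pts):
    """p is directly density-reachable from q: q is core and p is in its eps-neighborhood."""
    return is_core(points, q, eps, min_pts) and dist(points[p], points[q]) <= eps

def density_reachable(points, p, q, eps, min_pts):
    """p is density-reachable from q: a chain of direct reachability leads from q to p."""
    if p == q:
        return is_core(points, q, eps, min_pts)
    seen, queue = {q}, deque([q])
    while queue:
        cur = queue.popleft()
        if not is_core(points, cur, eps, min_pts):
            continue  # only core objects can extend the chain
        for nxt in eps_neighborhood(points, cur, eps):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return p in seen

def density_connected(points, p, q, eps, min_pts):
    """p and q are density-connected if both are density-reachable from some object o."""
    return any(density_reachable(points, p, o, eps, min_pts)
               and density_reachable(points, q, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical data: four points on a line; with eps = 1 and MinPts = 3
# the two middle points are core objects and the two end points are not.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
```

With these points, the last object is density-reachable from the second one but not the other way round, which illustrates the asymmetry noted in Example 3.6, while the two end points are density-connected.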

Note that density-reachability is the transitive closure of direct density-reachability, and that this relation is asymmetric. Only core objects are mutually density-reachable. Density-connectivity is a symmetric relation.

A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability; every object not contained in any cluster is noise.

Based on the notion of density-reachability, the density-based clustering algorithm DBSCAN was developed to cluster data in a database. It checks the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains at least MinPts objects, a new cluster with p as a core object is created. The algorithm then repeatedly collects the objects that are directly density-reachable from these core objects, which may involve merging several density-reachable clusters. The process terminates when no new point can be added to any cluster.

3.6.2 OPTICS: Ordering points to identify the clustering structure

Although the density-based clustering algorithm DBSCAN can find clusters of objects given input parameters such as ε and MinPts, the user remains responsible for choosing parameter values that lead to correct clusters. In fact, this problem is shared by many other clustering algorithms. Such parameter settings are usually hard to determine, especially for real-world, high-dimensional data sets. Most algorithms are very sensitive to the parameter values: slightly different settings may lead to very different partitions of the data. Moreover, real high-dimensional data sets often have very skewed distributions, so that there may not even exist a single global parameter setting for which the result of a clustering algorithm describes the intrinsic clustering structure accurately.

To overcome this difficulty, a cluster ordering method called OPTICS (Ordering Points To Identify the Clustering Structure) was developed by (Ankerst, Breunig, Kriegel and Sander 1999). It computes an augmented cluster ordering for automatic and interactive cluster analysis. This cluster ordering contains information equivalent to the density-based clusterings obtained for a broad range of parameter settings.

By examining the density-based clustering algorithm DBSCAN, one can easily see that, for a constant MinPts value, density-based clusters with respect to a higher density (i.e. a lower value of ε) are completely contained in density-connected sets with respect to a lower density. Therefore, to produce density-based clusters for a set of distance parameters, the algorithm needs to process the objects in a specific order, so that the objects that are density-reachable with respect to the lowest ε value are finished first.

Based on this idea, two values need to be stored for each object: the core-distance and the reachability-distance.

The core-distance of an object p is the smallest distance ε' between p and an object in its ε-neighborhood such that p would be a core object with respect to ε', provided that this neighbor is contained in the ε-neighborhood of p. Otherwise, the core-distance is undefined.

The reachability-distance of an object p with respect to another object o is the smallest distance such that p is directly density-reachable from o if o is a core object. If o is not a core object, even at the generating distance ε, the reachability-distance of p with respect to o is undefined.

The OPTICS algorithm creates an ordering of a database and additionally stores the core-distance and a suitable reachability-distance for each object. This information is sufficient to extract, from the ordering, all density-based clusterings with respect to any distance ε' smaller than the generating distance ε.

The cluster ordering of a data set can be presented and understood graphically. For example, Figure 3.9 is the reachability plot for a simple two-dimensional data set; it gives a general overview of how the data are structured and clustered. Methods have also been developed for viewing the clustering structure of high-dimensional data.
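The two stored values can be sketched as follows. This is a minimal illustration under stated assumptions rather than the full OPTICS ordering procedure: the function names and sample points are hypothetical, distances are Euclidean, an object counts as a member of its own neighborhood, and an undefined value is represented by None.

```python
from math import dist

def core_distance(points, p, eps, min_pts):
    """Smallest radius eps' <= eps at which points[p] is a core object, else None."""
    # Sorted distances from p to every object; index 0 is the distance 0 to p itself.
    d = sorted(dist(points[p], x) for x in points)
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None  # fewer than min_pts objects within eps: undefined
    return d[min_pts - 1]

def reachability_distance(points, p, o, eps, min_pts):
    """max(core-distance of o, d(o, p)), or None if o is not a core object within eps."""
    cd = core_distance(points, o, eps, min_pts)
    if cd is None:
        return None
    return max(cd, dist(points[o], points[p]))

# Hypothetical data: three nearby points and one isolated point.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
```

Here the isolated point has no core-distance at the generating distance ε = 2, so no reachability-distance is defined with respect to it.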

Figure 3.9: Cluster ordering in OPTICS

Because OPTICS is structurally equivalent to DBSCAN, it has the same run-time complexity as DBSCAN. Structures such as spatial indexes can be used to improve its performance.

3.6.3 DENCLUE: Clustering based on density distribution functions

DENCLUE (DENsity-based CLUstEring) (Hinneburg and Keim 1998) is a clustering method based on a set of density distribution functions. The method is built on the following ideas: (1) the influence of each data point can be modeled formally using a mathematical function, called an influence function, which describes the impact of a data point within its neighborhood; (2) the overall density of the data space can be modeled analytically as the sum of the influence functions of all data points; (3) clusters can then be determined mathematically by identifying density attractors, where density attractors are local maxima of the overall density function.

The influence function of a data point y ∈ F^d, where F^d is a d-dimensional feature space, is a basic function f_B^y : F^d → R_0^+, defined in terms of a basic influence function f_B:

\[ f_B^y(x) = f_B(x, y) \tag{3.26} \]

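In principle, the influence function can be an arbitrary function, but the distance function d(x, y) on which it is based should be reflexive and symmetric, for example the Euclidean distance. It can be a square wave influence function:

\[ f_{Square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{otherwise} \end{cases} \tag{3.27} \]

or a Gaussian influence function:

\[ f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}} \tag{3.28} \]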

Figure 3.10: Density function and density attractors

A density function is defined as the sum of the influence functions of all data points. Given N data objects described by a set of feature vectors D = {x1,...,xN} ⊂ F^d, the density function is defined as follows:

\[ f_B^D(x) = \sum_{i=1}^{N} f_B^{x_i}(x) \tag{3.29} \]

For example, the density function that results from the Gaussian influence function (3.28) is:

\[ f_{Gauss}^D(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}} \tag{3.30} \]

From the density function, one can define the gradient of the function and the density attractors (a density attractor is a local maximum of the overall density function). For a continuous and differentiable influence function, a hill-climbing algorithm guided by the gradient can be used to determine the density attractor of a set of data points.
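A rough sketch of the Gaussian density function (3.30) and of gradient-guided hill climbing is given below. It illustrates the idea only: the fixed step size, the stopping rule, the function names and the sample data are assumptions, not the DENCLUE implementation.

```python
from math import dist, exp

def gaussian_density(x, data, sigma):
    """Equation (3.30): sum of the Gaussian influences of all data points at x."""
    return sum(exp(-dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in data)

def gradient(x, data, sigma):
    """Gradient of the Gaussian density at x, up to the constant factor 1/sigma^2."""
    g = [0.0] * len(x)
    for xi in data:
        w = exp(-dist(x, xi) ** 2 / (2 * sigma ** 2))
        for k in range(len(x)):
            g[k] += w * (xi[k] - x[k])
    return g

def hill_climb(x, data, sigma, step=0.1, iters=200):
    """Move x uphill in fixed-size steps until the density stops increasing."""
    x = list(x)
    for _ in range(iters):
        g = gradient(x, data, sigma)
        norm = sum(v * v for v in g) ** 0.5
        if norm < 1e-9:
            break  # flat region: no direction to follow
        cand = [x[k] + step * g[k] / norm for k in range(len(x))]
        if gaussian_density(cand, data, sigma) <= gaussian_density(x, data, sigma):
            break  # the next step would not increase the density
        x = cand
    return x

# Hypothetical data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
attractor_a = hill_climb((1.0, 1.0), data, sigma=0.5)
attractor_b = hill_climb((4.0, 4.0), data, sigma=0.5)
```

Starting points that climb to the same attractor would be placed in the same cluster; in this example the two starting points end near different groups.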

Based on these notions, both center-defined clusters and arbitrary-shape clusters can be defined formally. A center-defined cluster is a subset C whose objects are density-attracted by a density attractor at which the density function is no less than a threshold ξ; otherwise (i.e. if the density function is less than the threshold ξ) the objects are outliers. An arbitrary-shape cluster is a set of such subsets C, each density-attracted, with density function no less than the threshold ξ, for which there exists a path P from each region to the others such that the density function at every point along the path is no less than ξ.

DENCLUE has the following major advantages compared with other clustering algorithms: (1) it has a solid mathematical foundation and generalizes other clustering methods, including partitioning, hierarchical and locality-based methods; (2) it has good clustering properties for data sets with large amounts of noise; (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets; (4) it uses grid cells but keeps information only about the cells that actually contain data points, and it manages these cells in a tree-based access structure, so it is significantly faster than some influential algorithms; for example, it is up to 45 times faster than DBSCAN. However, the method requires a careful choice of its parameters, the density parameter σ and the noise threshold ξ, since the choice of these parameters can significantly influence the quality of the clustering results.

Figure 3.11: Center-defined clusters and arbitrary-shape clusters

3.7 Grid-based clustering methods

A grid-based approach uses a multi-resolution grid data structure. It first quantizes the space into a finite number of cells that form a grid structure, and then performs all operations on that grid structure. The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.

Typical examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform; and CLIQUE, which represents a grid-based and density-based approach to clustering in high-dimensional data space.

3.7.1 STING: A statistical information grid approach

STING (STatistical INformation Grid) (Wang, Yang and Muntz 1997) is a grid-based multi-resolution approach. In this approach, the spatial area is divided into rectangular cells. There are usually several levels of rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned into a number of cells at the next lower level. Moreover, important pieces of statistical information about the attribute values in each grid cell, such as mean, max, min, count and standard deviation, are precomputed and stored before a query is submitted to the system. Figure 3.12 shows a hierarchical structure for STING clustering.

Figure 3.12: A hierarchical structure for STING clustering
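Because each cell stores only summary statistics, the statistics of a parent cell can be derived from those of its child cells without revisiting the data. The sketch below shows one way to do this for count, mean, standard deviation, min and max. The class and function names are hypothetical, and the population standard deviation is assumed.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class CellStats:
    n: int      # count of objects in the cell
    m: float    # mean of the attribute values
    s: float    # population standard deviation
    lo: float   # minimum value
    hi: float   # maximum value

def leaf_stats(values):
    """Statistics of a bottom-level cell, computed directly from its (non-empty) data."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    return CellStats(n, m, sqrt(var), min(values), max(values))

def merge(children):
    """Statistics of a parent cell, computed only from its children's statistics."""
    children = [c for c in children if c.n > 0]
    n = sum(c.n for c in children)
    m = sum(c.n * c.m for c in children) / n
    # The mean of squares of each child is s^2 + m^2; combine them weighted by count.
    mean_sq = sum(c.n * (c.s ** 2 + c.m ** 2) for c in children) / n
    s = sqrt(max(mean_sq - m * m, 0.0))
    return CellStats(n, m, s, min(c.lo for c in children), max(c.hi for c in children))

# Hypothetical attribute values in two child cells of the same parent.
a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]
parent = merge([leaf_stats(a), leaf_stats(b)])
direct = leaf_stats(a + b)
```

In this example the merged statistics agree with those computed directly from the pooled values, which is what allows the hierarchy to be filled in bottom-up after a single pass over the data.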

The set of statistics-based parameters includes the attribute-independent parameter n (count) and the attribute-dependent parameters m (mean), s (standard deviation), min (minimum), max (maximum) and the type of distribution that the attribute values in the cell follow, such as normal, uniform, exponential or none (if the distribution is unknown). When the data are loaded into the database, the parameters n, m, s, min and max of the bottom-level cells are computed directly from the data. The distribution value may be assigned by the user if the distribution type is known beforehand, or obtained by hypothesis tests such as the χ² test. The parameters of higher-level cells can easily be computed from the parameters of the lower-level cells. The distribution type of a higher-level cell can be computed from the majority of the distribution types of its corresponding lower-level cells, together with a threshold filtering process. If the distributions of the lower-level cells disagree with each other and fail the threshold test, the distribution type of the higher-level cell is set to "none".

The statistical information obtained is very useful for answering queries. A top-down query answering method based on the statistical information grid can be outlined as follows. First, a layer is chosen to start from; it usually contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated range) of the probability that this cell is relevant to the query. Irrelevant cells are removed from further consideration, and processing at the next deeper level examines only the relevant cells. This process is repeated until the bottom layer is reached. At this point, if the query specification is met, the regions of relevant cells that satisfy the query are returned; otherwise, the data that fall into the relevant cells are retrieved and processed further, and the results that satisfy the query are returned.

This approach offers several advantages over other clustering methods: (1) the grid-based computation is query-independent, since the statistical information stored in each cell is a summary of the data in that grid cell and does not depend on the query; (2) the grid structure facilitates parallel processing and incremental updating; (3) the main advantage of the method is its efficiency: STING passes through the data once to compute the statistical parameters of the cells, so the time complexity of generating the clusters is O(N), where N is the total number of objects. After the hierarchical structure has been generated, the query processing time is O(G), where G is the total number of grid cells at the lowest level, which is usually much smaller than N, the total number of objects.

However, since STING uses a multi-resolution approach to cluster analysis, the quality of STING clustering depends on the granularity of the lowest level of the grid structure. If the granularity is very fine, the processing cost increases substantially; if the bottom level of the grid structure is too coarse, the quality (fineness) of the cluster analysis may be reduced. Moreover, STING does not consider the spatial relationship between the child cells and their neighboring cells when constructing a parent cell. As a result, the shapes of the resulting clusters are isothetic: all cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected. This may lower the quality and accuracy of the clusters in exchange for the faster processing time.

3.7.2 WaveCluster: Clustering using the wavelet transform

WaveCluster (Sheikholeslami, Chatterjee and Zhang 1998) is a multi-resolution clustering approach that first summarizes the data by imposing a multi-resolution grid structure on the data space, then transforms the original feature space with a wavelet transform and finds dense regions in the transformed space.

In this approach, each grid cell summarizes the information of a group of points, and this summary information typically fits into main memory for the multi-resolution wavelet transform and the subsequent cluster analysis. In the grid structure, the numerical attributes of a spatial object can be represented by a feature vector, where each element of the vector corresponds to one numerical attribute, or feature. For an object with n numerical attributes, the feature vector is a point in the n-dimensional feature space.

The wavelet transform is a signal processing technique that decomposes a signal into frequency subbands. The wavelet model also works on n-dimensional signals by applying the one-dimensional transform n times.

In the wavelet transform, spatial data are converted into the frequency domain. Convolution with an appropriate kernel function yields a transformed space in which the natural clusters in the data become more distinguishable. Clusters can then be identified by finding the dense regions in the transformed domain.

The wavelet transform offers the following interesting features. First, it provides unsupervised clustering. Hat-shaped filters emphasize the regions where points cluster, and at the same time tend to suppress weaker information at their boundaries. Dense regions in the original feature space therefore act as attractors for nearby points and as inhibitors for points that are not close enough. This means that the clusters in the data automatically stand out and clear the regions around them. Second, the low-pass filters used in the wavelet transform automatically remove outliers. Moreover, the multi-resolution property of the wavelet transform can help detect clusters at different levels of accuracy. Finally, applying the wavelet transform is very fast, and the processing can also be carried out in parallel.

The wavelet-based clustering algorithm is outlined as follows:

Algorithm 3.7.1: Wavelet-based clustering algorithm for multi-resolution clustering by wavelet transform.
Input: The feature vectors of multidimensional data objects
Output: The clustered objects
Method:
1) Quantize the feature space, then assign the objects to the units;
2) Apply the wavelet transform to the feature space;
3) Find the connected components (clusters) in the subbands of the transformed feature space at different levels;
4) Assign labels to the units;
5) Build the lookup tables and map the objects to the clusters.

Figure 3.13: Wavelet-based clustering algorithm

The computational complexity of this algorithm is O(N), where N is the number of objects in the database.
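The five steps of Algorithm 3.7.1 can be sketched for two-dimensional data as follows. This is a simplified illustration, not the WaveCluster implementation: a single level of 2x2 block averaging stands in for the wavelet transform (it equals the Haar low-pass LL subband up to a constant factor), and the grid size, threshold, function name and sample points are assumptions.

```python
from collections import deque

def wavecluster_sketch(points, grid=8, threshold=1.0):
    """Return one cluster label per 2-D point (-1 marks noise). grid must be even."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    wx, wy = (max(xs) - x0) or 1.0, (max(ys) - y0) or 1.0

    def unit(p):  # step 1: quantize the feature space into grid x grid units
        i = min(int((p[0] - x0) / wx * grid), grid - 1)
        j = min(int((p[1] - y0) / wy * grid), grid - 1)
        return i, j

    counts = [[0] * grid for _ in range(grid)]
    for p in points:
        i, j = unit(p)
        counts[i][j] += 1

    # Step 2: low-pass subband, here the average of each 2x2 block of units.
    half = grid // 2
    ll = [[(counts[2 * i][2 * j] + counts[2 * i + 1][2 * j]
            + counts[2 * i][2 * j + 1] + counts[2 * i + 1][2 * j + 1]) / 4
           for j in range(half)] for i in range(half)]

    # Steps 3 and 4: label connected components of significant transformed units.
    labels, next_label = {}, 0
    for i in range(half):
        for j in range(half):
            if ll[i][j] >= threshold and (i, j) not in labels:
                labels[(i, j)] = next_label
                queue = deque([(i, j)])
                while queue:
                    a, b = queue.popleft()
                    for c, d in ((a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)):
                        if (0 <= c < half and 0 <= d < half
                                and ll[c][d] >= threshold and (c, d) not in labels):
                            labels[(c, d)] = next_label
                            queue.append((c, d))
                next_label += 1

    # Step 5: lookup from each object's unit to the label of its transformed unit.
    result = []
    for p in points:
        i, j = unit(p)
        result.append(labels.get((i // 2, j // 2), -1))
    return result

# Hypothetical data: two dense groups and one isolated point.
sample_points = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0), (0.2, 0.8), (0.8, 0.2), (0.4, 0.1),
                 (10.0, 10.0), (9.5, 9.5), (9.0, 9.0), (9.2, 9.8), (9.8, 9.2), (9.6, 9.9),
                 (5.0, 5.0)]
sample_labels = wavecluster_sketch(sample_points, grid=8, threshold=1.0)
```

The isolated point falls in a unit whose smoothed value is below the threshold, so it is left unlabeled, which mirrors the way the low-pass filter suppresses outliers.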

Figure 3.14: A sample two-dimensional feature space

Example: Figure 3.14 (from Sheikholeslami, Chatterjee and Zhang (1998)) shows a sample two-dimensional feature space, in which each point in the image represents the feature values of one object in the spatial data sets. Figure 3.15 (from Sheikholeslami, Chatterjee and Zhang (1998)) shows the result of wavelet transforms at different scales, from fine (scale 1) to coarse (scale 3). At each level, subband LL (average) is shown in the upper left quadrant, subband LH (horizontal edges) in the upper right quadrant, subband HL (vertical edges) in the lower left quadrant and subband HH (corners) in the lower right quadrant.

WaveCluster is a grid-based and density-based algorithm. WaveCluster satisfies the requirements of a good clustering algorithm: it handles large data sets efficiently, finds clusters of arbitrary shape, handles outliers successfully and is insensitive to the order of the input. Compared with BIRCH, CLARANS and DBSCAN, WaveCluster outperforms these methods in both efficiency and clustering quality.

Figure 3.15: Multi-resolution of the feature space in Figure 3.14. a) scale 1; b) scale 2; c) scale 3

3.7.3 CLIQUE: Clustering high-dimensional space

Another clustering algorithm, CLIQUE (Agrawal et al. 1998), integrates grid-based and density-based clustering in a different way. It is very useful for clustering high-dimensional data in large databases.

Given a large set of multidimensional data points, the data points are usually not distributed uniformly in the data space. Data clustering identifies the sparse and the crowded regions, and thereby discovers the overall distribution patterns of the data set.

A unit is dense if the fraction of the data points contained in the unit exceeds an input model parameter. A cluster is a maximal set of connected dense units.

CLIQUE partitions the m-dimensional data space into non-overlapping rectangular units, identifies the dense units and finds clusters in all subspaces of the original data space, using a candidate generation method similar to the Apriori algorithm for mining association rules.

CLIQUE performs multidimensional clustering in two steps. First, CLIQUE identifies clusters by determining the dense units in all subspaces of interest and then determining the connected dense units in all subspaces of interest.
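The first step can be sketched for one- and two-dimensional subspaces. This is a hedged illustration rather than the CLIQUE implementation: the function name, the equal-width intervals over a known range [lo, hi], the representation of a unit as a tuple of (dimension, interval) pairs and the sample data are all assumptions.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, intervals, tau, lo=0.0, hi=1.0):
    """Return (dense 1-D units, dense 2-D units) for points in [lo, hi]^m."""
    n, m = len(points), len(points[0])

    def cell(v):  # index of the interval that contains value v
        return min(int((v - lo) / (hi - lo) * intervals), intervals - 1)

    cells = [tuple(cell(v) for v in p) for p in points]

    # Dense 1-D units: intervals holding more than a fraction tau of all points.
    dense1 = set()
    for d in range(m):
        counts = Counter(c[d] for c in cells)
        dense1 |= {((d, i),) for i, k in counts.items() if k / n > tau}

    # Apriori-style candidates: join dense 1-D units from two different dimensions.
    candidates = [u + v for u, v in combinations(sorted(dense1), 2)
                  if u[0][0] != v[0][0]]

    dense2 = set()
    for cand in candidates:
        (d1, i1), (d2, i2) = cand
        k = sum(1 for c in cells if c[d1] == i1 and c[d2] == i2)
        if k / n > tau:
            dense2.add(cand)
    return dense1, dense2

# Hypothetical data: six points in the lower-left and four in the upper-right corner.
sample = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (0.4, 0.4),
          (0.9, 0.9), (0.8, 0.7), (0.7, 0.9), (0.6, 0.8)]
d1, d2 = dense_units(sample, intervals=2, tau=0.3)
```

Only combinations of dense one-dimensional units are counted in two dimensions, so units that cannot be dense are never examined.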

An important heuristic that CLIQUE adopts is the Apriori principle in high-dimensional clustering: if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space. That is, if any (k-1)-dimensional projection of a unit is not dense, then the corresponding k-dimensional unit cannot be a candidate dense unit. Therefore, all candidate k-dimensional dense units can be generated from the (k-1)-dimensional dense units.

Second, CLIQUE generates a minimal description for the clusters as follows: it first determines the maximal regions that cover the connected dense units of each cluster, and then determines a minimal cover for each cluster.

CLIQUE automatically finds the subspaces of highest dimensionality in which high-density clusters exist. It is insensitive to the order of the records in the input and does not presume any canonical data distribution. It scales linearly with the size of the input and scales well as the number of dimensions in the data increases. However, the accuracy of the clustering result may be degraded as the price of the simplicity of the method.

3.8 Conclusion

This chapter covered the traditional clustering methods and improvements on them. It also covered the notion of dissimilarity (or similarity) between objects. From this we can see the clustering capability of each method and its applicability to practical problems.

CHAPTER 4: EXPERIMENTAL IMPLEMENTATION

This chapter presents the results of an experimental implementation of the Kmeans and Kmedoids algorithms on UCI data sets and evaluates the experimental results.

4.1 Overall design

The program consists of the following main functional blocks:
- Preprocessing block
- Clustering block

4.1.1 Preprocessing block

The task of this block is to read the data and determine the number of samples, the number of attributes, the number of classes and the attribute values of each data sample.

4.1.2 Clustering block

This block clusters the data samples. The data are learned in an unsupervised manner (unsupervised learning) with two different algorithms: Kmeans and Kmedoids. Finally, class labels are assigned to the clusters. After the clusters have been labeled, the effectiveness of class assignment and classification is determined.

4.2 Data preparation

The input data of the program are text files of two kinds:
- Data format file (*.names): defines the class names, the attribute names, the values of each attribute and the attribute types.
- Data sample file (*.data): contains the data samples, with complete information on the attribute values and the class value.

4.2.1 Data format file

- Line 1: lists the class values. The values are separated by commas "," and the line ends with a period ".".
- From line 2:
+ One attribute per line

+ Each line begins with the attribute name and a colon ":", followed by the discrete values of the attribute (if the attribute is categorical or binary) or by the attribute type (if the attribute is continuous).
- All comments are placed after the character "|".

Table 4.1: An example data format file *.names

    1, 2, 3.
    1: continuous.
    2: 1, 2, 3, 4.     |categorical
    3: continuous.
    4: 0, 1.           |binary

4.2.2 Data sample file

Each sample occupies one line. The attribute values of the sample come first and the class value comes last. The values are separated by commas ",".

Table 4.2: An example data file *.data

    0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4
    0,0,0,1,0,1,1,1,1,1,0,1,0,1,0,1,1
    0,1,1,0,1,0,0,0,1,1,0,0,2,1,1,0,2
    0,1,1,0,1,1,0,0,1,1,0,0,2,1,0,0,2
    1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
    0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,1,2

4.2.3 Data source

Within the scope of this thesis, the data were taken from the following site:
- ftp://ftp.ics.uci.edu/pub/

4.3 Program design

With the functional blocks and data described above, the program is designed as follows:

[Block diagram. Boxes: Data format file; Module GetNames; Data sample file; Module GetData; Information (number and names of the classes; number, names and types or discrete values of the attributes; number of samples, with the attribute values and class name of each sample); Clustering modules; Class assignment; Class assignment improvement; Classification; Display of results; Class assignment and classification results.]

Figure 4.1: Program design
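The two input formats of Section 4.2 can be read with a few lines of code. The sketch below is an illustration only and does not reproduce the GetNames and GetData modules of the thesis: the function names and the returned structures are assumptions, and every value is kept as a string.

```python
def parse_names(text):
    """Parse a *.names file into (class values, attribute descriptions)."""
    lines = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()   # drop comments placed after "|"
        if line:
            lines.append(line.rstrip("."))    # drop the terminating period
    classes = [c.strip() for c in lines[0].split(",")]
    attributes = []
    for line in lines[1:]:
        name, spec = line.split(":", 1)
        spec = spec.strip()
        if spec == "continuous":
            attributes.append((name.strip(), "continuous", None))
        else:
            values = [v.strip() for v in spec.split(",")]
            attributes.append((name.strip(), "discrete", values))
    return classes, attributes

def parse_data(text):
    """Parse a *.data file into a list of (attribute values, class value) pairs."""
    samples = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()
        if not line:
            continue
        *values, label = [v.strip() for v in line.split(",")]
        samples.append((values, label))
    return samples

# The example of Table 4.1 and the first row of Table 4.2.
names_text = """1, 2, 3.
1: continuous.
2: 1, 2, 3, 4.     |categorical
3: continuous.
4: 0, 1.           |binary
"""
data_text = "0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4\n"
classes, attributes = parse_names(names_text)
samples = parse_data(data_text)
```

The number of classes, attributes and samples required by the preprocessing block then follows from the lengths of the returned lists.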

4.4 Experimental results and evaluation

4.4.1 Experimental procedure

- Cluster the data with the Kmeans and Kmedoids algorithms.
- Label the clusters, then evaluate and compare the labeling performance of the two algorithms on the UCI data sets (using only data with continuous attributes).
- Label the clusters and evaluate the labeling performance on data with mixed attributes.
- Improve the class assignment performance.
- Compare the classification quality with the See5 program.

The See5 program (version 2.03), written by Ross Quinlan, is a data classification tool that uses the decision tree technique with the C5.0 algorithm. Its effectiveness is widely recognized. The thesis therefore uses it as a baseline for comparison with the classification results obtained here. A limitation of See5 (version 2.03) is that it can be used with at most 400 data samples.
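The cluster labeling and evaluation steps listed above could be carried out as in the sketch below. The thesis does not state its labeling rule, so the majority vote used here is an assumption, as are the function names and the sample data.

```python
from collections import Counter

def label_clusters(assignments, true_labels):
    """Give each cluster the most frequent true class among its members (assumed rule)."""
    by_cluster = {}
    for cluster, label in zip(assignments, true_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}

def correct_count(assignments, true_labels):
    """Number of samples whose true class equals the label of their cluster."""
    labels = label_clusters(assignments, true_labels)
    return sum(1 for c, y in zip(assignments, true_labels) if labels[c] == y)

# Hypothetical result: 7 samples assigned to 2 clusters.
assignments = [0, 0, 0, 1, 1, 1, 1]
true_labels = ["a", "a", "b", "b", "b", "b", "a"]
cluster_labels = label_clusters(assignments, true_labels)
n_correct = correct_count(assignments, true_labels)
```

Dividing the count by the number of samples gives the percentage of correctly assigned samples reported in the tables of Section 4.4.2.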

4.4.2 Experiments

The results obtained are given below.

4.4.2.1 Class assignment problem: carried out with the number of clusters K = 2, 4, 6, 8, 10, 16. (Kmeans: ma; Kmedoids: md)

Table 4.3: Class assignment experiment results (number of correctly assigned samples)

    Dataset        Samples   K=2        K=4        K=6        K=8        K=10       K=16
                             ma   md    ma   md    ma   md    ma   md    ma   md    ma   md
    Breast cancer  500       328  480   485  481   484  481   481  482   481  482   481  482
    Haberman       306       225  225   229  226   230  231   228  232   233  234   237  240
    Iris           150       100   51   126   53   126   55   125   57   121   59   142   65
    Pima           768       532  504   539  537   541  528   525  554   554  558   558  561
    Glass          214        78   82   105   84   117   86   117   88   125   90   140   96
    Wine           178       107   72   173   80   169   85   168   84   166   87   173   93
    Balance        625       293  369   407  423   448  441   503  451   438  453   483  459

[Bar chart "Comparison of Kmeans and Kmedoids". Vertical axis: percentage of correctly assigned samples (0 to 120). Horizontal axis: data sets (breast cancer, haberman, balance, glass, pima, wine, iris). Series: Kmeans, Kmedoids.]

Figure 4.2: Comparison of Kmeans and Kmedoids in the class assignment problem with K=10

The chart above shows that, for continuous data, the class assignment ability of Kmedoids on the UCI data sets is usually lower than that of Kmeans. The reason is that the representative point in Kmedoids is an object located near the cluster center, whereas the cluster center in Kmeans is the mean of the elements in the cluster. If the data contain little noise, Kmeans gives better results than Kmedoids. In the opposite case, a noise point with an extremely large value substantially distorts the data distribution under Kmeans, and Kmedoids is then more effective. From the comparison chart above we can see that the data contain little noise. However, the similarity measure between objects used in Kmedoids does not seem very effective yet, so the percentage of correctly assigned samples is not high.

To improve the class assignment accuracy, the thesis proposes the following method. Each sample that is assigned incorrectly within a cluster is moved to a suitable cluster (say cluster A) if the following conditions hold:
+ The distance from the sample to its current cluster equals its distance to cluster A
+ The class label of cluster A is the same as the class label of the sample
+ If the sample is added to cluster A, the cluster center does not change (or changes by no more than a small given distance epsilon).

Experiments show that the class assignment accuracy increases. The following data sets serve as examples:

Table 4.4: Results of improving class assignment quality (Old: before improvement; New: after improvement)

    Dataset    K=4        K=6        K=8        K=10       K=16       K=20       K=25
               Old  New   Old  New   Old  New   Old  New   Old  New   Old  New   Old  New
    Iris        53   54    55   57    57   61    59   65    65   77    69   85    74   95
    Wine        80   89    85   85    84   86    87   90    93  102    97  110   102  120
    Balance    423  447   441  459   451  475   453  477   459  483   463  487   468  492
    Haberman   226  226   231  231   232  233   234  237   240  249   244  257   249  267
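The three conditions of the proposed reassignment can be checked as in the following sketch. It is one possible reading, not the thesis implementation: cluster centers are taken to be means, "equal distance" is tested with a small tolerance `tol`, and the function names and sample clusters are hypothetical.

```python
from math import dist

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def may_move(x, label_x, current, target, label_target, eps, tol=1e-9):
    """Check the three conditions for moving sample x from `current` to `target`.

    `current` and `target` are lists of points; x is a member of `current`.
    """
    c_cur, c_tgt = mean(current), mean(target)
    # Condition 1: x is equally far from its current center and the target center.
    same_distance = abs(dist(x, c_cur) - dist(x, c_tgt)) <= tol
    # Condition 2: the target cluster carries the class label of x.
    same_label = label_target == label_x
    # Condition 3: adding x shifts the target center by at most eps.
    shift = dist(c_tgt, mean(target + [x]))
    return same_distance and same_label and shift <= eps

# Hypothetical clusters: x = (2, 0) is at distance 1 from both centers.
x = (2.0, 0.0)
current = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
target = [(3.0, 1.0), (3.0, -1.0)]
```

In this example the move is allowed for eps = 0.5 but not for eps = 0.1, because adding x shifts the target center by one third of a unit.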

4.4.2.2 Classification problem

Table 4.5: Classification experiment results for Kmeans (ma) and Kmedoids (md)

    Dataset        Samples   Correctly classified   Correctly classified (%)
                             ma     md              ma        md
    Breast cancer  500       318    480             63.6      96
    Haberman       306       179    115             58.4967   50.6536
    Iris           150       125     52             83.3333   34.6667
    Pima           768       532    504             69.2708   65.625
    Glass          214        93     72             43.4579   33.6449
    Soybean         47        32     22             68.0851   46.8085
    Wine           178       172     70             96.6292   39.3258
    Balance        625       313    336             50.08     53.76

[Bar chart "Comparison of Kmeans and Kmedoids". Vertical axis: percentage of correctly classified samples (0 to 120). Horizontal axis: data sets (breast cancer, haberman, soybean, balance, glass, pima, wine, iris). Series: Kmeans, Kmedoids.]

Figure 4.3: Comparison of Kmeans and Kmedoids in the classification problem

Table 4.6: Classification experiment results for Kmedoids (md) and See5

    Dataset        Samples   Correctly classified   Correctly classified (%)
                             See5   md              See5       md
    Breast cancer  400       391    344             97.75      86
    Haberman       306       236    115             77.12418   50.6536
    Iris           150       106     52             83.3333    34.6667
    Pima           400       307    262             76.75      65.5
    Car            298       289    202             72.25      67.7852
    Balance        400       336    238             84         64.8501

[Bar chart "Comparison of Kmedoids and See5". Vertical axis: percentage of correctly classified samples (0 to 120). Horizontal axis: data sets (Breast cancer, Haberman, Car, Balance, Pima, Iris). Series: Kmedoids, See5.]

Figure 4.4: Comparison of Kmedoids and See5 in the classification problem

The chart above shows that See5 classifies better, because it has a genuinely effective tree-based classification model; this model limits the branches that reflect noise, so its classification quality is high. Kmedoids can handle mixed data, but the quality of its similarity computation between objects is not yet high, so its classification ability is weaker than that of See5.

4.5 Conclusion

After running experiments on several UCI data sets, we observe that the class assignment and classification results of Kmeans on data with continuous attributes are better than those of Kmedoids. Kmeans cannot handle data with mixed attributes. Kmedoids, using the method for computing the similarity between two samples proposed by Ducker (1965) and improved by Kaufman and Rousseeuw (1990), handles such data with above-average accuracy and a computational cost of O(k(n-k)^2).


For large values of n and k, this cost becomes high. Improving the accuracy and the computation speed is therefore a direction for future work.

CONCLUSION

The thesis focused on studying the theory of several data mining techniques and applying them to UCI data sets. This is a first step in exploring the issues that need attention when solving data mining problems in practice.

Within the scope of the thesis, the techniques have not yet been applied to any real database; the work is limited to UCI data sets, so the experimental results do not yet carry practical significance. Nevertheless, some initial results in discovering knowledge from these data sets were obtained.

The results achieved by the thesis are:
+ In theory, the thesis studied the traditional classification and clustering techniques and the methods that improve on them.
+ In practice, the thesis presented the results of an experimental implementation on UCI data sets, including classification results, class assignment results and the improvement of class assignment quality.

From the experiments and the theoretical study, the following conclusions can be drawn:
- Each classification or clustering algorithm applies to certain goals and certain types of data.
- Each algorithm has its own level of accuracy, and its ability to run on data of different sizes varies. This depends on how the algorithm organizes the data in main memory, external memory and so on.
- Data mining is more effective when preprocessing, attribute selection and model selection are handled well.

Based on what the thesis has accomplished, the directions for future development are as follows:

- The accuracy of class assignment and classification depends on many factors, such as the quality of the data, the algorithm implemented and the method for computing the similarity between data objects. In addition, missing values and redundant attributes also affect it to some extent. Future work therefore includes handling missing values, detecting and removing redundant attributes and improving the similarity computation, in order to raise the quality and speed of class assignment and classification.
- Implement and continue to study more data mining techniques, and in particular apply them to specific practical problems.

REFERENCES

1. Anil K. Jain and Richard C. Dubes (1988), Algorithms for clustering data, Prentice-Hall, Inc., USA.
2. Ho Tu Bao (1998), Introduction to knowledge discovery and data mining.
3. Jiawei Han and Micheline Kamber (2000), Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
4. Joydeep Ghosh (2003), Scalable Clustering, Chapter 10, pp. 247-278, Formal version appears in: The Handbook of Data Mining, Nong Ye (Ed).
5. J. Ross Quinlan (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
6. Mercer (2003), Clustering large datasets, Linacre College.
7. Pavel Berkhin, Survey of Clustering Data Mining Techniques, Accrue Software, Inc., San Jose.
