
Web Data Mining Using Clustering Techniques

Hoàng Văn Dũng

MINISTRY OF EDUCATION AND TRAINING
HANOI NATIONAL UNIVERSITY OF EDUCATION

Hoàng Văn Dũng

WEB DATA MINING
USING CLUSTERING TECHNIQUES

Master of Science thesis

Hanoi, 2007
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
PREFACE
Chapter 1. OVERVIEW OF DATA MINING
1.1. Data mining and knowledge discovery
1.1.1. Data mining
1.1.2. The knowledge discovery process
1.1.3. Data mining and related fields
1.1.4. Techniques applied in data mining
1.1.5. Main functions of data mining
1.1.6. Applications of data mining
1.2. Clustering techniques in data mining
1.2.1. Overview of clustering techniques
1.2.2. Applications of data clustering
1.2.3. Requirements for data clustering techniques
1.2.4. Data types and similarity measures
1.2.4.1. Classifying data types by domain size
1.2.4.2. Classifying data types by measurement scale
1.2.4.3. Concepts and measures of similarity and dissimilarity
1.3. Web mining
1.3.1. Benefits of Web mining
1.3.2. Web mining
1.3.3. Types of Web data
1.4. Text data processing applied to Web data mining
1.4.1. Text data
1.4.2. Some issues in text data processing
1.4.2.1. Stop-word removal
1.4.2.2. Zipf's law
1.4.3. Text data representation models
1.4.3.1. Boolean model
1.4.3.2. Frequency model
1.5. Summary of Chapter 1
Chapter 2. SOME DATA CLUSTERING TECHNIQUES
2.1. Partitioning clustering
2.1.1. The k-means algorithm
2.1.2. The PAM algorithm
2.1.3. The CLARA algorithm
2.1.4. The CLARANS algorithm
2.2. Hierarchical clustering
2.2.1. The BIRCH algorithm
2.2.2. The CURE algorithm
2.3. Density-based clustering
2.3.1. The DBSCAN algorithm
2.3.2. The OPTICS algorithm
2.3.3. The DENCLUE algorithm
2.4. Grid-based clustering
2.4.1. The STING algorithm
2.4.2. The CLIQUE algorithm
2.5. Model-based data clustering
2.5.1. The EM algorithm
2.5.2. The COBWEB algorithm
2.6. Fuzzy data clustering
2.7. Summary of Chapter 2
Chapter 3. WEB DATA MINING
3.1. Web content mining
3.1.1. Mining search results
3.1.2. Web text mining
3.1.2.1. Data selection
3.1.2.2. Data preprocessing
3.1.2.3. Text representation
3.1.2.4. Feature term extraction
3.1.2.5. Text mining
3.1.3. Evaluating pattern quality
3.2. Web usage mining
3.2.1. Applications of Web usage mining
3.2.2. Techniques used in Web usage mining
3.2.3. Issues in Web usage mining
3.2.3.1. User session identification
3.2.3.2. Web login and identifying user navigation sessions
3.2.3.3. Issues in processing Web logs
3.2.3.4. Methods for identifying work sessions and Web accesses
3.2.4. The Web usage mining process
3.2.4.1. Data preprocessing
3.2.4.2. Data mining
3.2.4.3. Analysis and evaluation
3.2.5. An example of Web usage mining
3.3. Web structure mining
3.3.1. Similarity evaluation criteria
3.3.2. Mining and managing Web communities
3.3.2.1. The PageRank algorithm
3.3.2.2. Clustering with the HITS algorithm
3.4. Applying data clustering algorithms to Web search and Web data clustering
3.4.1. A clustering-based approach
3.4.2. The document search and clustering process
3.4.2.1. Searching for data on the Web
3.4.2.2. Data preprocessing
3.4.2.3. Building the dictionary
3.4.2.4. Tokenization, text digitization, and document representation
3.4.2.5. Document clustering
3.4.6. Experimental results
3.5. Summary of Chapter 3
CONCLUSION AND FUTURE WORK
APPENDIX
REFERENCES


LIST OF FIGURES
Figure 1.1. The knowledge discovery process
Figure 1.2. Fields related to knowledge discovery in databases
Figure 1.3. Visualization of data mining results in Oracle
Figure 1.4. Illustration of data clustering
Figure 1.5. Classification of Web data
Figure 1.6. Word frequency plot according to Zipf's law
Figure 1.7. Common similarity measures
Figure 2.1. The k-means algorithm
Figure 2.2. Cluster shapes discovered by k-means
Figure 2.3. Case $C_{jmp} = d(O_j, O_{m,2}) - d(O_j, O_m)$, non-negative
Figure 2.4. Case $C_{jmp} = d(O_j, O_p) - d(O_j, O_m)$, which can be negative or positive
Figure 2.5. Case $C_{jmp}$ equal to zero
Figure 2.6. Case $C_{jmp} = d(O_j, O_p) - d(O_j, O_{m,2})$, always negative
Figure 2.7. The PAM algorithm
Figure 2.8. The CLARA algorithm
Figure 2.9. The CLARANS algorithm
Figure 2.10. Hierarchical clustering strategies
Figure 2.11. The CF tree used by the BIRCH algorithm
Figure 2.12. The BIRCH algorithm
Figure 2.13. Example of clustering results from the BIRCH algorithm
Figure 2.14. Clusters discovered by CURE
Figure 2.15. The CURE algorithm
Figure 2.16. Some shapes discovered by density-based clustering
Figure 2.17. Neighborhood of P with threshold Eps
Figure 2.18. Directly density-reachable
Figure 2.19. Density-reachable
Figure 2.20. Density-connected
Figure 2.21. Clusters and noise
Figure 2.22. The DBSCAN algorithm
Figure 2.23. Cluster ordering of objects under OPTICS
Figure 2.24. DENCLUE with a Gaussian distribution function
Figure 2.25. Grid data structure model
Figure 2.26. The CLIQUE algorithm
Figure 2.27. Cell identification process in CLIQUE
Figure 3.1. Classification of Web mining
Figure 3.2. The Web text mining process
Figure 3.3. The K-Nearest Neighbor classification algorithm
Figure 3.4. Hierarchical clustering algorithm
Figure 3.5. Partitioning clustering algorithm
Figure 3.6. General architecture of Web usage mining
Figure 3.7. Illustration of log file contents
Figure 3.8. Analysis of Web user accesses
Figure 3.9. Web link graph
Figure 3.10. Direct relationship between two pages
Figure 3.11. Co-citation similarity
Figure 3.12. Index similarity
Figure 3.13. Web community
Figure 3.14. Results of the PageRank algorithm
Figure 3.15. Bipartite graph of Hubs and Authorities
Figure 3.16. Combination of Hubs and Authorities
Figure 3.17. Hub-Authority graph
Figure 3.18. Weight values of Hubs and Authorities
Figure 3.19. Algorithm for weighting clusters and pages
Figure 3.20. Steps for clustering Web search results
Figure 3.21. The k-means algorithm for clustering Web document content

LIST OF TABLES
Table 1.1. Parameter table for binary attributes
Table 1.2. Statistics of high-frequency words
Table 3.1. Number of users at different times
Table 3.2. Measured running times of the clustering algorithm


ABBREVIATIONS

No.  Abbreviation  English term                      Vietnamese term
1    CNTT          Information Technology            Công nghệ thông tin
2    CSDL          Database                          Cơ sở dữ liệu
3    KDD           Knowledge Discovery in Databases  Khám phá tri thức trong cơ sở dữ liệu
4    KPDL          Data Mining                       Khai phá dữ liệu
5    KPVB          Text Mining                       Khai phá văn bản
6    PCDL          Data Clustering                   Phân cụm dữ liệu




PREFACE

In recent years, the rapid development of science and technology has been accompanied by an explosion of knowledge. Humanity's stores of data and sources of knowledge have become vast and seemingly endless, which makes exploiting them an increasingly pressing problem and a major challenge for information technology worldwide.
Alongside the remarkable advances in information technology, the global information network has grown strongly, and Web data has become an enormous data store. The need to search for and process information, together with the need to exploit it promptly to improve the productivity and quality of management, business, and other activities, has become urgent in modern society.
However, finding and using that knowledge to serve one's own work remains difficult for users. To meet this need in part, search and information processing tools have been built to help users find the information they need, but the sheer size of the data on the Internet leaves users struggling with the results they get back.
Traditional database exploitation methods do not yet meet these requirements. One new direction for addressing the problem is to study and apply data mining and knowledge discovery techniques in the Web environment. Studying new data models and applying data mining methods to Web resources is therefore a natural trend with both scientific and practical significance.
For this reason, the author chose the topic "Web Data Mining Using Clustering Techniques" for this graduation thesis.
The thesis is organized into three chapters:
Chapter 1 gives an overview of the basics of data mining and knowledge discovery and of data mining in the Web environment, along with some issues in representing and processing text data as applied to Web data mining.
Chapter 2 introduces several data clustering techniques that are popular and commonly used in data mining and knowledge discovery.
Chapter 3 presents several research directions in Web data mining, such as Web document mining, Web usage mining, and Web structure mining, and an approach that uses data clustering techniques to solve Web data mining problems. This chapter also presents a model that applies data clustering to searching for and clustering Web documents.
The conclusion summarizes the issues studied, evaluates the results, and outlines directions for further development of the topic.
The appendix presents some of the source code used in the program and some screens of the demonstration program.

Chapter 1. OVERVIEW OF DATA MINING

1.1. Data mining and knowledge discovery
1.1.1. Data mining
In the late 1980s, the widespread growth of databases produced a global information explosion. Around that time, people began to speak of a crisis in analyzing operational data to supply ever higher quality information to decision makers in government, finance, commerce, science, and other organizations.
As John Naisbitt warned, "We are drowning in data but starving for knowledge." This huge volume of data is a truly valuable resource, because information is a key factor in all management, business, production, and service activities. It helps executives and managers understand their organization's environment and operations before making decisions that affect those operations, so that goals are reached effectively and sustainably.
Data mining is a recently studied field that aims to automatically extract new, useful, hidden information and knowledge from large databases for agencies, organizations, and businesses, and thereby to boost their production, business, and competitive capacity. Research results and successful applications of KDD show that data mining is a steadily developing field that brings many benefits, has strong prospects, and clearly outperforms traditional data search and analysis tools. Data mining is now applied ever more widely in areas such as commerce, finance, medicine, telecommunications, and bioinformatics.
The main techniques used in data mining are largely inherited from databases, machine learning, artificial intelligence, information theory, probability and statistics, and high-performance computing.
Data mining can thus be summarized as a process of searching for and discovering new, useful, hidden knowledge in large databases.
KDD is the main goal of data mining, so researchers in both fields treat the two concepts as equivalent. Strictly speaking, however, data mining is one main step of the KDD process.
1.1.2. The knowledge discovery process
The knowledge discovery process can be divided into five steps, as follows [10]:

Figure 1.1. The knowledge discovery process
[Figure flow: raw data -> selection -> selected data -> preprocessing -> preprocessed data -> transformation -> transformed data -> mining -> patterns -> evaluation and presentation -> knowledge]
The data mining process can be divided into the following stages [10]:
Data selection: this step selects the data sets to be mined from the large original data sets according to certain criteria.
Data preprocessing: this step cleans the data (handling incomplete, noisy, or inconsistent data), reduces the data (using grouping and aggregation functions, data compression methods, histograms, sampling, and so on), and discretizes the data (based on histograms, entropy, or interval partitioning). After this step, the data is consistent, complete, reduced, and discretized.
Data transformation: this step normalizes and smooths the data to bring it into the form best suited to the mining step that follows.
Data mining: this step applies analysis techniques (such as machine learning techniques) to exploit the data and extract informative patterns and notable relationships in it. It is considered the most important and most time-consuming step of the whole KDD process.
Knowledge evaluation and presentation: the patterns and relationships discovered in the previous step are transformed and presented in a form familiar to users, such as graphs, trees, tables, or rules. This step also evaluates the discovered knowledge against certain criteria.
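
The five stages above can be sketched as a small pipeline. This is a minimal illustration only: the record fields (`age`, `income`), the selection criterion, the number of bins, and the "pattern" that is mined (a simple count per bin) are all assumptions made for the example and do not come from the thesis.

```python
from collections import Counter

def select(records, min_age=18):
    """Selection: keep only records meeting a criterion (hypothetical: adults)."""
    return [r for r in records if r.get("age") is not None and r["age"] >= min_age]

def preprocess(records):
    """Preprocessing: drop records with a missing income (a simple cleaning rule)."""
    return [r for r in records if r.get("income") is not None]

def transform(records):
    """Transformation: min-max normalize income to the range [0, 1]."""
    incomes = [r["income"] for r in records]
    lo, hi = min(incomes), max(incomes)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [(r["income"] - lo) / span for r in records]

def mine(values, bins=2):
    """Mining: equal-width discretization, then count the values in each bin."""
    return Counter(min(int(v * bins), bins - 1) for v in values)

def present(pattern):
    """Evaluation and presentation: render the counts as readable lines."""
    return [f"bin {b}: {n} record(s)" for b, n in sorted(pattern.items())]

raw = [
    {"age": 25, "income": 1000.0},
    {"age": 40, "income": 3000.0},
    {"age": 15, "income": 500.0},   # removed by selection
    {"age": 33, "income": None},    # removed by preprocessing
    {"age": 51, "income": 2800.0},
]
report = present(mine(transform(preprocess(select(raw)))))
```

Each function stands for one stage, so a real system would swap in its own selection criteria, cleaning rules, and mining algorithm without changing the overall flow.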
1.1.3. Data mining and related fields
Data mining is related to statistics, machine learning, databases, algorithms, parallel computing, knowledge acquisition from expert systems, and data abstraction. A knowledge discovery system characteristically draws on methods, algorithms, and techniques from these different fields to mine data.
Within KDD, machine learning and pattern recognition study the theories and algorithms that let a system extract patterns and models from large data. KDD focuses on extending these theories and algorithms to the problem of finding special patterns (useful ones, or ones from which important knowledge can be drawn) in large databases.
KDD also has much in common with statistics, especially exploratory data analysis (EDA). KDD systems often embed statistical procedures for modeling data and handling noise within the overall knowledge discovery process.
Another related field is data warehouse analysis. The common approach to analyzing data warehouses is OLAP (On-Line Analytical Processing). OLAP tools focus on multidimensional data analysis.
1.1.4. Techniques applied in data mining
KDD is an interdisciplinary field that spans data organization, machine learning, artificial intelligence, and other sciences. This combination can be depicted as follows:

Figure 1.2. Fields related to knowledge discovery in databases
[Figure components: data organization; machine learning and artificial intelligence; other sciences]

From a machine learning point of view, data mining techniques include:
Supervised learning: the process of assigning class labels to elements of a database based on a set of training examples and known class-label information.
Unsupervised learning: the process of partitioning a data set into classes or clusters of similar data without prior class information or training examples.
Semi-supervised learning: the process of partitioning a data set into classes based on a small set of training examples and information about some previously known class labels.
Classified by the kind of problem to be solved, data mining includes the following techniques [10]:
Classification and prediction: assigning an object to one of a set of known classes, for example classifying patient data in medical records. This approach typically uses machine learning techniques such as decision trees and artificial neural networks. Classification and prediction are also called supervised learning.
Association rules: rules that represent knowledge in a fairly simple form. For example: "60% of women who shop at the supermarket buy face powder, and up to 80% of them also buy lipstick." Association rules are widely used in business, medicine, bioinformatics, finance, and the stock market.
Time-series analysis: similar to association rule mining, but with the addition of ordering and time. This approach is widely used in finance and the stock market because of its strong predictive power.
Clustering: grouping objects into natural data clusters. Clustering is also called unsupervised learning.
Concept description and summarization: oriented toward describing, aggregating, and summarizing concepts, for example text summarization.
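
The association-rule example above can be made concrete by computing the two standard measures of a rule, support and confidence, over a list of transactions. The sketch below is illustrative: the 25 baskets are made up so that the rule powder => lipstick reproduces the 60% and 80% figures, under the assumption that 60% is the share of shoppers who buy powder and 80% is the share of those who also buy lipstick.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical baskets: 15 of 25 contain powder, and 12 of those 15 also contain lipstick.
baskets = (
    [{"powder", "lipstick"}] * 12
    + [{"powder"}] * 3
    + [{"lipstick"}] * 4
    + [{"bread"}] * 6
)
powder_support = support(baskets, {"powder"})
rule_confidence = confidence(baskets, {"powder"}, {"lipstick"})
```

With these baskets, `powder_support` comes to about 0.6 and `rule_confidence` to about 0.8; a rule miner would typically keep only rules whose support and confidence exceed user-chosen minimums.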
Because data mining is so widely applied, it can work with many different kinds of data. Typical forms include relational data, multidimensional data, transactional data, object-relational data, spatial and temporal data, time-series data, multimedia data, and text and Web data.
1.1.5. Main functions of data mining
The two main goals of data mining are description and prediction. Prediction uses some variables or fields in the database to predict unknown or future values of other variables of interest. Description focuses on finding human-understandable patterns that describe the data. In KDD, description receives more attention than prediction, in contrast to machine learning and pattern recognition applications, where prediction is usually the main goal. Based on these goals, the main functions of KDD are:
Class and concept description: data can be associated with classes and concepts. For example, in a sales data warehouse for computer equipment, the product classes include computers, printers, and so on, and the customer concepts include wholesale and retail customers. Describing classes and concepts is very useful for summarizing, condensing, and refining the data. Class and concept descriptions are derived from data characterization and data discrimination. Data characterization summarizes the general characteristics or features of a target class of data. Data discrimination compares the target class with other contrasting classes. The target and contrasting classes are specified by the user, and the corresponding data objects are retrieved through queries.
Association analysis: association analysis discovers association rules that express relationships among attribute-value pairs, which are recognized from how frequently they occur together. Association rules have the form $X \Rightarrow Y$, that is, $A_1 \wedge \dots \wedge A_n \Rightarrow B_1 \wedge \dots \wedge B_m$, where $A_i$ ($i = 1, \dots, n$) and $B_j$ ($j = 1, \dots, m$) are attribute-value pairs. A rule $X \Rightarrow Y$ can be read as "data that satisfies the conditions in X will also satisfy the conditions in Y."
Classification and prediction: classification is the process of finding a set of models or functions that describe and distinguish data classes or concepts. The purpose of these models is to predict the class of objects whose label is unknown. The model is built by analyzing a set of training data and can take many forms, such as classification (IF-THEN) rules, decision trees, mathematical formulas, or neural networks. Classification is used to predict the class labels of data objects. In many applications, however, the aim is to predict missing values, usually numeric ones. Before classification and prediction, a relevance analysis may be needed to identify and remove attributes that do not contribute to the process.
Clustering: unlike classification and prediction, clustering analyzes data objects without known class labels. In general, class labels are not present in the training data, and clustering can be used to generate them. Clustering groups data objects on the principle that objects in the same group are more similar to one another than to objects in other groups. Each resulting cluster can be viewed as a class of objects from which rules can be derived. Clusters can also be organized into a hierarchy of classes, in which each class is a group of similar events.
Outlier analysis: a database may contain objects that do not conform to the data model. Such objects are called outliers. Most data mining methods treat outliers as noise and discard them. In some applications, however, such as noise detection, rare events are of more interest than those that occur regularly. The analysis of outlier data is referred to as outlier mining. Methods used to detect outliers include statistical tests that assume a distribution or probability model for the data, distance measures under which objects lying a substantial distance from every cluster are considered outliers, and deviation-based methods that examine differences in the main characteristics of groups of objects.
Evolution analysis: evolution analysis describes and models the regularities or trends of objects whose behavior changes over time. It may include characterization, discrimination, association rule mining, classification, or clustering of time-related data, as well as time-series data analysis, periodicity pattern matching, and similarity-based data analysis.
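
The distance-based approach to outlier detection mentioned above can be sketched as follows. This is a simplified illustration, assuming that the clusters are already known and are summarized by their centroids; the `threshold` value and the sample points are hypothetical.

```python
import math

def centroid(points):
    """Mean point of a cluster."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def find_outliers(candidates, clusters, threshold):
    """Return candidates whose distance to every cluster centroid exceeds `threshold`."""
    centers = [centroid(c) for c in clusters]
    return [
        p for p in candidates
        if all(math.dist(p, c) > threshold for c in centers)
    ]

clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)],
]
candidates = [(0.5, 0.5), (10.5, 10.5), (5.0, 20.0)]
outliers = find_outliers(candidates, clusters, threshold=3.0)
```

Only the point far from both centroids is reported; the choice of threshold decides how aggressive the detection is and would normally be tuned to the data.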
1.1.6. Applications of data mining
Data mining attracts wide interest and is widely applied. Typical applications include data analysis and decision support, medical treatment, text mining, Web mining, bioinformatics, finance and the stock market, and insurance.
Commerce: sales and market data analysis, investment analysis, fraud detection, customer authentication, trend forecasting, and so on.
Manufacturing information: control, planning, management systems, test analysis, and so on.
Scientific information: forecasting of weather, storms and floods, and earthquakes, bioinformatics, and so on.
Database management systems such as SQL Server and Oracle now integrate data mining modules, and by 2007 Microsoft was providing data mining tools integrated even into MS-Word, MS-Excel, and other products.

Figure 1.3. Visualization of data mining results in Oracle
1.2. Clustering techniques in data mining
1.2.1. Overview of clustering techniques
The main purpose of data clustering is to discover the structure of data patterns so as to form groups from a large data set. This lets analysts study each cluster in depth to find hidden, useful information that supports decision making, for example "the group of customers in a bank database with large investments in real estate." Data clustering is thus an important and common information processing method that discovers relationships among data patterns by organizing them into clusters.
The concept can be summarized as follows [10][19]: data clustering is a data mining technique that searches for and discovers natural, hidden, important clusters and patterns in a large data set, and thereby provides useful information and knowledge for decision making.
Data clustering is therefore the process of partitioning an initial data set into clusters such that elements within a cluster are "similar" to one another and elements in different clusters are "dissimilar." The number of clusters may be fixed in advance from experience or determined automatically by the clustering method.
Similarity is determined from the values of the attributes that describe the objects. A distance measure is usually used to assess similarity or dissimilarity.
The clustering problem can be illustrated by the following figure:

Figure 1.4. Illustration of data clustering
In the figure above, clustering yields four clusters, in which "similar" elements are placed in the same cluster and "dissimilar" elements belong to different clusters.
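
Since similarity is usually assessed with a distance measure, one such measure can be sketched directly. The example below uses Euclidean distance, which is an assumption (the text does not fix a particular measure at this point), and compares the average distance within a cluster to the average distance between two clusters; the points are made up.

```python
import math
from itertools import combinations, product

def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_intra(cluster):
    """Average pairwise distance among the members of one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

def mean_inter(c1, c2):
    """Average distance between members of two different clusters."""
    pairs = list(product(c1, c2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
intra_a = mean_intra(cluster_a)
inter_ab = mean_inter(cluster_a, cluster_b)
```

For a good clustering, the within-cluster average (`intra_a`) should be clearly smaller than the between-cluster average (`inter_ab`), which is the "similar inside, dissimilar outside" property described above.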
In conceptual clustering, two or more objects are placed in the same cluster if they share a concept definition or approximate the given descriptive concepts. Conceptual clustering therefore does not use a similarity measure of the kind described above.
In machine learning, data clustering is regarded as an unsupervised learning problem, because it must find structure in a data set with no prior information about classes or a training set. In many cases, if classification is regarded as supervised learning, then clustering is a step within classification: it initializes the classes by determining labels for groups of data.
A common problem in data clustering is that most data to be clustered contains "noise," caused by inaccurate or incomplete collection. A strategy is therefore needed in the preprocessing step to correct or remove noise before cluster analysis begins. "Noise" here may be inaccurate data objects or data objects that lack information for some attributes. One common noise handling technique is to replace the attribute values of a "noisy" object with the corresponding attribute values of the nearest data object.
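
The replacement technique just described can be sketched as a nearest-neighbour fill. This is a minimal illustration, assuming that missing values are marked with `None`, that distance is Euclidean over the attributes the noisy object does have, and that the nearest object is searched only among complete objects; none of these details are specified in the text.

```python
import math

def distance_on_known(a, b):
    """Euclidean distance using only the attributes present in `a`."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b) if x is not None))

def fill_from_nearest(objects):
    """Replace each missing attribute with the value from the nearest complete object."""
    complete = [o for o in objects if None not in o]
    filled = []
    for obj in objects:
        if None in obj:
            nearest = min(complete, key=lambda c: distance_on_known(obj, c))
            obj = tuple(n if v is None else v for v, n in zip(obj, nearest))
        filled.append(obj)
    return filled

# Hypothetical data: the third object is missing its second attribute.
data = [(1.0, 2.0, 3.0), (9.0, 8.0, 7.0), (1.2, None, 2.9)]
cleaned = fill_from_nearest(data)
```

The sketch assumes at least one complete object exists; a production version would also need a policy for data sets in which every object has a gap.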
Outlier detection is also an important research direction in data clustering. Its role is to identify a small group of data objects that are "abnormal" compared with the rest of the database, that is, objects that do not follow the expected behavior or data model, so that they do not distort the clustering process and its results. Outlier discovery has been developed and applied in telecommunications, commercial fraud detection, and other areas.
In short, data clustering is a hard problem because it requires solving the following basic subproblems:
- Representing the data.
- Constructing a similarity function.
- Defining clustering criteria.
- Building a model of the cluster structure.
- Designing the clustering algorithm and setting its initial conditions.
- Designing procedures to present and evaluate the clustering results.
According to the literature, no general clustering method yet handles every form of cluster structure completely. Moreover, clustering methods need different ways of representing cluster structure, and each representation calls for a suitable clustering algorithm. Data clustering remains an open and difficult problem, because the basic subproblems above must be solved thoroughly and in a way that suits many different kinds of data. Mixed data in particular keeps growing in data management systems, and handling it is one of the major challenges for data mining in the coming decades, especially for Web data mining.
1.2.2. Applications of data clustering
Data clustering is one of the main data mining tools and is applied in many fields, including commerce and science. Clustering techniques have been applied to typical applications in the following areas [10][19]:
Commerce: clustering can help merchants discover important groups of customers with similar characteristics and describe them from purchasing patterns in the customer database.
Biology: clustering is used to identify species, classify genes with similar functions, and reveal structures in samples.
Spatial data analysis: the sheer volume of spatial data, such as data from satellite images, medical devices, or geographic information systems (GIS), makes it very hard for users to inspect it in detail. Clustering can help users analyze and process spatial data automatically, for example by identifying and extracting features or patterns of interest that may exist in a spatial database.
Urban planning: identifying groups of houses by type, geographic location, and so on, to provide information for urban planning.
Earth science: clustering to track earthquake epicenters, providing information for identifying hazardous zones.
Geography: classifying animals and plants and describing their characteristics.
Web mining: clustering can discover important, meaningful groups of documents in the Web environment. These document classes support knowledge discovery from Web data, the discovery of access patterns of particular customers, and the discovery of Web communities.
1.2.3. Requirements for data clustering techniques
Building or choosing a clustering algorithm is the key step in solving a clustering problem. The choice depends on the characteristics of the data to be clustered, on the purpose of the application, and on the priority given to cluster quality versus execution speed, among other factors.
Most research on and development of clustering algorithms aims to satisfy the following basic requirements [10][19]:
Scalability: Some algorithms work well on small data sets (about 200 records) but are ineffective on large data sets (about 1 million records).
Adaptability to different data types: The algorithm can be applied effectively to data sets with many different data types, such as numeric, binary, nominal, and categorical data, and can adapt to mixed data types.
Discovery of clusters of arbitrary shape: Most databases contain clusters of different shapes, such as concave, spherical, or rod-shaped clusters. To discover natural clusters, a clustering algorithm must be able to discover clusters of arbitrary shape.
Minimal knowledge needed to determine the input parameters: Input values often have a very large effect on a clustering algorithm, and determining suitable input values is very complex for large databases.
Insensitivity to the order of the input data: For the same data set, presenting the data objects to the algorithm in a different order on different runs should not greatly affect the clustering result.
Robustness to noisy data: Most data to be clustered in data mining contain erroneous, incomplete, or junk data. A clustering algorithm should not only be effective on noisy data but should also avoid producing low-quality clusters because of sensitivity to noise.
Insensitivity to the input parameters: Different values of the input parameters should cause only small changes in the clustering result.
Adaptability to multidimensional data: The algorithm can be applied effectively to data with different numbers of dimensions.
Ease of understanding and implementation, and feasibility.
These requirements are also the criteria for evaluating the effectiveness of clustering methods, and they are challenges for researchers in the field of data clustering.
1.2.4. Data types and similarity measures
This section analyzes the data types commonly used in data clustering. In clustering, the data objects to be analyzed may be people, houses, salaries, software entities, and so on. These objects are usually described by their attributes. The attributes are the parameters needed to solve the clustering problem, and their choice has a significant effect on the clustering result. Classifying the different attribute types is a problem that has to be solved for most data sets in order to provide convenient means of recognizing the differences between data elements. Below is a classification based on two characteristics: domain size and measurement scale.
1.2.4.1. Classification of data types by domain size
- Continuous attribute: the domain is uncountably infinite, that is, between any two values there are infinitely many other values. Examples are attributes for color, temperature, or sound intensity.
- Discrete attribute: the domain is a finite or countable set. Examples are the serial number of a book or the number of members of a family.
1.2.4.2. Classification of data types by measurement scale
Suppose we have two objects $x$, $y$ and the values $x_i$, $y_i$ of their $i$-th attribute. We have the following classes of data types:
- Nominal attribute: a generalization of the binary attribute whose domain is discrete, unordered, and has more than two elements. If $x$ and $y$ are two attribute values, we can only determine whether $x \ne y$ or $x = y$. An example is the place-of-birth attribute.
- Ordinal attribute: a nominal attribute with an added order, but without quantification. If $x$ and $y$ are two ordinal attribute values, we can determine whether $x \ne y$, $x = y$, $x > y$, or $x < y$. An example is the medal attribute of an athlete.
- Interval attribute: measures values on an approximately linear scale. With an interval attribute, we can determine whether one value comes before or after another and by how much. If $x_i > y_i$, we say that $x$ is at a distance $|x_i - y_i|$ from $y$ with respect to the $i$-th attribute. Examples are the serial number of a book title in a library or the channel number on television.
- Ratio attribute: an interval attribute that is determined relative to a meaningful reference point, for example height or weight, which take the value 0 as the reference point.
Among the attributes presented above, nominal and ordinal attributes are collectively called categorical attributes, and interval and ratio attributes are called numeric attributes.
Spatial data also receive special attention. These are data whose attributes are numeric and generalized in a multidimensional space. Spatial data describe information related to the space that contains the objects, such as geometric information. Spatial data may be continuous or discrete:
Discrete spatial data: may be a point in a multidimensional space, which allows us to determine the distance between data objects in that space.
Continuous spatial data: comprises a region of the space.
Numeric attributes are usually measured in specific units, such as kilograms or centimeters. However, the units of measurement affect the clustering results. For example, changing the unit of a weight attribute from kilograms to pounds can lead to different clustering results. To overcome this, the data must be standardized, that is, attributes are used that do not depend on the unit of measurement. How standardization is carried out depends on the application and the user. Typically, data are standardized by replacing each attribute with a numeric attribute or by adding weights to the attributes.
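The effect of the measurement unit can be illustrated with a small sketch. The data values, the function names, and the choice of z-score standardization (mean 0, standard deviation 1 per attribute) are assumptions made for illustration; the text itself does not prescribe a particular standardization formula.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def zscore_columns(rows):
    """Rescale every attribute (column) to mean 0 and population std 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has std 0; fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def nearest_to_first(rows):
    """Index of the object closest to rows[0]."""
    return min(range(1, len(rows)), key=lambda i: euclidean(rows[0], rows[i]))

# Hypothetical objects: (weight in kilograms, height in centimeters).
data_kg = [[60.0, 170.0], [62.0, 180.0], [70.0, 171.0]]
# The same objects with the weight expressed in pounds.
data_lb = [[w * 2.20462, h] for w, h in data_kg]

print(nearest_to_first(data_kg))                  # neighbor of object 0 in kg
print(nearest_to_first(data_lb))                  # neighbor of object 0 in lb
print(nearest_to_first(zscore_columns(data_kg)))  # after standardization
print(nearest_to_first(zscore_columns(data_lb)))  # same answer as the line above
```

On the raw data the nearest neighbor of the first object changes with the unit, whereas the standardized data give the same neighbor for both units.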

1.2.4.3. Similarity and dissimilarity measures
Once the characteristics of the data have been determined, one looks for a suitable way to determine the "distance" between objects (a data similarity measure). These are functions that measure the resemblance between pairs of data objects. Usually they compute either the similarity or the dissimilarity between data objects. The larger the value of a similarity function, the greater the resemblance between the objects, and vice versa; a dissimilarity function varies inversely with the similarity function. Similarity and dissimilarity can be determined in many ways and are often measured by the distance between objects. All similarity measures depend on the type of attribute being analyzed. For example, for categorical attributes one does not use a distance measure but rather a geometric direction of the data.
All the measures below are defined in a metric space. Every metric is a measure, but the converse is not true. To avoid confusion, the term "measure" here refers to a function that computes similarity or dissimilarity. A metric space is a set on which "distances" between pairs of elements are defined, with the usual properties of geometric distance. That is, a set $X$ (whose elements may be arbitrary objects) of data objects in the database $D$ is called a metric space if, for every pair of elements $x, y \in X$, a real number $\delta(x, y)$ is defined, called the distance between $x$ and $y$, that satisfies the following properties: (i) $\delta(x, y) > 0$ if $x \ne y$; (ii) $\delta(x, y) = 0$ if $x = y$; (iii) $\delta(x, y) = \delta(y, x)$ for all $x, y$; (iv) $\delta(x, y) \le \delta(x, z) + \delta(z, y)$.
The function $\delta(x, y)$ is called the metric of the space. The elements of $X$ are called the points of the space.
Some similarity measures that apply to different data types [10][17][27]:
+ Interval attributes: After standardization, the dissimilarity of two data objects $x$, $y$ is determined by the following metrics:
Minkowski distance:
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{q} \right)^{1/q},$$
where $q$ is a positive integer.
Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
(the special case of the Minkowski distance with $q = 2$).
Manhattan distance:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
(the special case of the Minkowski distance with $q = 1$).
Maximum distance:
$$d(x, y) = \max_{i=1}^{n} |x_i - y_i|,$$
which is the limiting case of the Minkowski distance as $q \to \infty$.
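The four metrics above can be sketched in a few lines. The function names are chosen here for illustration only; the maximum distance is implemented directly, since it is the limit of the Minkowski distance rather than a value of $q$.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q (q a positive integer)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    """Special case q = 2."""
    return minkowski(x, y, 2)

def manhattan(x, y):
    """Special case q = 1."""
    return minkowski(x, y, 1)

def maximum_distance(x, y):
    """Limit of the Minkowski distance as q grows without bound."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0.0, 0.0), (3.0, 4.0)
print(manhattan(x, y), euclidean(x, y), maximum_distance(x, y))
# A large q already comes very close to the maximum distance.
print(minkowski(x, y, 50))
```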
+ Binary attributes: First, we build the following parameter table:

            y: 1                 y: 0                 Total
x: 1        $\alpha$             $\beta$              $\alpha + \beta$
x: 0        $\gamma$             $\delta$             $\gamma + \delta$
Total       $\alpha + \gamma$    $\beta + \delta$     $t$

Table 1.1. Parameter table for binary attributes (contingency table)
Here $t = \alpha + \gamma + \beta + \delta$, and $x$, $y$ are objects all of whose attributes are binary, represented by 0 and 1. The table gives the following information:
- $\alpha$ is the number of attributes with value 1 in both objects $x$ and $y$.
- $\beta$ is the number of attributes with value 1 in $x$ and 0 in $y$.
- $\gamma$ is the number of attributes with value 0 in $x$ and 1 in $y$.
- $\delta$ is the number of attributes with value 0 in both $x$ and $y$.
The similarity measures for binary attributes are defined as follows:
- Simple matching coefficient:
$$d(x, y) = \frac{\alpha + \delta}{t}.$$
Here the two objects $x$ and $y$ play the same role, that is, the attributes are symmetric and have equal weight.
- Jaccard coefficient:
$$d(x, y) = \frac{\alpha}{\alpha + \beta + \gamma}.$$
This coefficient ignores the number of 0-0 matches. The formula is used when attributes with value 1 carry much more weight than attributes with value 0, so the binary attributes here are asymmetric.
+ Nominal attributes: The dissimilarity between two objects $x$ and $y$ is defined as
$$d(x, y) = \frac{p - m}{p},$$
where $m$ is the number of attributes on which the two objects match and $p$ is the total number of attributes.
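As an illustration, the coefficients for binary and nominal attributes can be computed as follows. The function names and the sample vectors are hypothetical. Note that, as the formulas are written in the text, the simple matching and Jaccard coefficients grow with the agreement between the two objects, while the nominal formula is a dissimilarity. Returning 0.0 when both binary vectors contain no 1 at all is a convention assumed here, not something stated in the text.

```python
def binary_counts(x, y):
    """Return (alpha, beta, gamma, delta) from the contingency table."""
    alpha = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    beta = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    gamma = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    delta = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return alpha, beta, gamma, delta

def simple_matching(x, y):
    """(alpha + delta) / t: symmetric binary attributes."""
    alpha, beta, gamma, delta = binary_counts(x, y)
    return (alpha + delta) / (alpha + beta + gamma + delta)

def jaccard(x, y):
    """alpha / (alpha + beta + gamma): 0-0 matches are ignored."""
    alpha, beta, gamma, _ = binary_counts(x, y)
    denom = alpha + beta + gamma
    return alpha / denom if denom else 0.0  # assumed convention for all-zero input

def nominal_dissimilarity(x, y):
    """(p - m) / p, with m matching attributes out of p."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 0, 0]
print(binary_counts(x, y), simple_matching(x, y), jaccard(x, y))
print(nominal_dissimilarity(("Hanoi", "red", "A"), ("Hanoi", "blue", "A")))
```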
+ Ordinal attributes: The dissimilarity between data objects with ordinal attributes is computed as follows. Suppose $i$ is an ordinal attribute with $M_i$ values ($M_i$ is the size of its domain):
The $M_i$ states are ordered as $[1 \dots M_i]$, so we can replace each attribute value by its rank $r_i$, with $r_i \in \{1, \dots, M_i\}$.
Each ordinal attribute may have a different domain, so we map all of them to the common range $[0, 1]$ by applying the following transformation to each attribute:
$$z_i^{(j)} = \frac{r_i^{(j)} - 1}{M_i - 1},$$
where $r_i^{(j)} \in \{1, \dots, M_i\}$ is the rank of the $i$-th attribute of the $j$-th object.
The dissimilarity formulas for interval attributes are then applied to the values $z_i^{(j)}$; the result is the dissimilarity for ordinal attributes.
+ Ratio attributes: There are many ways to compute the similarity between ratio attributes. One of them is to apply a logarithmic transformation to each attribute $x_i$, for example $q_i = \log(x_i)$; $q_i$ then plays the role of an interval attribute. This logarithmic transformation is suitable when the attribute values are on an exponential scale.
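The two transformations can be sketched as follows. The ordered domain of medals and the function names are hypothetical, and the natural logarithm is assumed because the text does not fix the base.

```python
import math

def ordinal_to_unit(rank, m):
    """Map a rank r in {1, ..., M} to z = (r - 1) / (M - 1) in [0, 1]."""
    return (rank - 1) / (m - 1)

# Hypothetical ordered domain with M = 3 states.
MEDALS = ["bronze", "silver", "gold"]

def medal_z(medal):
    """z-value of a medal; ranks start at 1."""
    return ordinal_to_unit(MEDALS.index(medal) + 1, len(MEDALS))

def log_transform(values):
    """q_i = log(x_i): turns a ratio attribute into an interval-like one."""
    return [math.log(v) for v in values]

# After the mapping, interval-attribute distances apply directly.
print(abs(medal_z("gold") - medal_z("silver")))
# Values on an exponential scale become evenly spaced after the transform.
print(log_transform([10.0, 100.0, 1000.0]))
```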
In practice, when computing the similarity of data, one considers only a subset of the attributes that are characteristic for the data types, or assigns weights to all attributes. In some cases, the units of measurement are removed by standardizing the attributes or by assigning each attribute a weight based on its mean and standard deviation. These weights can be used in the distance measures above. For example, if each attribute is assigned a corresponding weight $w_i$ ($1 \le i \le n$), the similarity of the data is determined as follows:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}.$$
It is possible to convert between the models for the data types above; for example, categorical data can be converted to binary data and vice versa. Such solutions are computationally very expensive, however, so this approach should be weighed carefully before it is applied.
Depending on the specific data, different similarity models are used. Determining a suitable, accurate, and objective similarity measure is very important and contributes to building clustering algorithms that are effective in terms of both quality and computational cost.
1.3. Web mining
1.3.1. Benefits of Web mining
With the rapid growth of information on the WWW, Web data mining is gradually becoming more important within data mining. People hope to obtain useful knowledge by searching, analyzing, synthesizing, and mining the Web. Such knowledge can help build effective Web sites that serve people better, especially in e-commerce.
Discovering and analyzing useful information on the WWW with data mining techniques has become an important direction in knowledge discovery. Web mining comprises Web structure mining, Web content mining, and the mining of Web access patterns.
The complexity of the content of Web pages differs from that of traditional text documents [16]. Web pages are not uniform in structure, and Web information sources change rapidly, not only in content but also in page structure; examples are news, stock markets, advertisements, and network service centers. All information on the Web changes over time. Page links and access paths also change constantly. The number of users keeps growing, their interest in the Web differs, and their motivations are very diverse. How, then, can the information a user needs be found? How can high-quality Web pages be obtained?
These problems can be handled more effectively by studying data mining techniques applied to the Web environment: first, managing Web sites well; second, mining the content that users are interested in; third, analyzing Web usage patterns.
Based on these basic issues, highly effective methods can be developed to provide useful information to Web users and to help them use Web resources efficiently.
1.3.2. Web mining
There are many different definitions of Web mining, but they can be generalized as follows [5][30]: Web mining is the use of data mining techniques to automate the discovery and extraction of useful information from Web documents, services, and structure. In other words, Web mining is the exploration of important information and potential patterns from Web content, Web access information, page links, and e-commerce resources by means of data mining techniques. It can help people extract knowledge, improve the design of Web sites, and develop e-commerce. The field has attracted the interest of many researchers. The Web mining process can be divided into the following subtasks:
i. Resource finding: searching for and retrieving the Web documents to be mined.
ii. Data selection and preprocessing: automatically selecting and preprocessing the information obtained from the retrieved Web resources.
iii. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple Web sites.
iv. Analysis: evaluating, interpreting, and presenting the mined patterns.
1.3.3. Types of Web data
The types of Web data can be summarized by the following diagram:

Figure 1.5. Classification of Web data
The objects of Web mining include [5][16]: server logs, Web pages, Web hyperlink structures, online market data, and other information.
Web logs: When users browse the Web, the server produces three kinds of log data: server logs, error logs, and cookie logs. By analyzing these logs, access information can be discovered.
Web pages: Most Web data mining methods applied to Web pages work on the HTML standard.
Web hyperlink structure: Web pages are linked to each other by hyperlinks. This is very important for mining information, because hyperlinks are a highly reliable resource.
Online market data: for example, e-commerce information stored in e-commerce sites.
Other information: mainly user registrations, which can help the mining process.
[Figure 1.5 node labels: Web data; Structure data, Usage data, User Profile data, Content data; Free Text, HTML file, XML file, Multimedia, Dynamic link, Static link, Dynamic content]
1.4. Text processing applied to Web data mining
1.4.1. Text data
Among today's data types, text is the most common. It is everywhere, especially on the Web. Text processing problems were therefore posed very early and still attract many researchers; they include text search and extraction, text representation, and text classification.
Text databases can be divided into two main kinds [14][20]:
+ Unstructured: ordinary text documents that we read every day in books, newspapers, on the internet, and so on. This is data in natural human language, and it follows no predefined template.
+ Semi-structured: texts that are organized in a loose structure but still convey the main content of the text, such as HTML documents and email.
1.4.2. Some issues in text processing
Each document is represented by a Boolean vector or a numeric vector. These vectors live in a multidimensional space in which each dimension corresponds to a distinct term in the document collection. Each component of a vector is assigned a value by a function $f$, a number that indicates the density of the corresponding dimension in the document. By changing the function $f$, many different weightings can be created.
Some issues related to representing text with the vector space model:
+ The vector space is a set of words.
+ A word is a string of characters (letters and digits), excluding whitespace (space, tab), newline characters, and punctuation (period, comma, semicolon, exclamation mark, and so on). To simplify processing, upper and lower case are not distinguished (upper case is converted to lower case).
+ Stemming: In many languages, many words share the same stem or are variants of a stem. Using stems significantly reduces the number of words in the text (and thus the dimensionality of the space), but stripping words makes the text much harder to understand.
In addition, to improve the quality of processing, some studies have proposed algorithmic improvements that take the context of words into account by using phrases or grammar rather than individual words alone [31]. These phrases can be identified by examining the frequency with which the whole phrase occurs in the document.
With the vector space representation, the dimensionality of a vector is clearly very large, because it is determined by the number of distinct words in the collection. For example, the number of words can range from $10^3$ to $10^5$ for small document collections. The problem is how to reduce the dimensionality of the vectors while still processing text correctly and accurately, especially in the WWW environment. We now consider some methods for reducing the dimensionality of the vectors.
1.4.2.1. Stop word removal
First, natural language contains many words that only express sentence structure and not content, such as prepositions and conjunctions. Such words appear frequently in texts but are unrelated to the topic or content of the text. We can therefore remove them to reduce the dimensionality of the vector that represents a document. Such words are called stop words.
The following is an example of the high frequency of some (English) words in 336,310 documents with a total of 125,720,891 words, of which 508,209 are distinct.

(statistics by B. Croft, UMass)
Table 1.2. Statistics of words with high frequency
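The word definition of Section 1.4.2 and the removal of stop words can be sketched as follows. The stop word list is a tiny illustrative sample, not the list behind Table 1.2, and the regular expression assumes plain ASCII English text; other languages would need a wider character class.

```python
import re

# Tiny illustrative stop word list (assumption, not taken from the thesis).
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]

def vocabulary(docs):
    """Sorted list of distinct non-stop words: one dimension per word."""
    return sorted({t for d in docs for t in remove_stop_words(tokenize(d))})

docs = ["Mining the Web", "Clustering of Web data"]
print(tokenize("The Web, and the data!"))
print(vocabulary(docs))
```

The length of the vocabulary is the dimensionality of the vector space, so every removed stop word removes one dimension.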

1.4.2.2. Zipf's law
To reduce the dimensionality of the document vectors further, we rely on the following observation: many words occur only a few times in a text. If our goal is to determine the similarity and the difference across the whole document collection, then words that occur only once or twice (low frequency) have very little influence on the documents.
The premise for removing low-frequency words was given by Zipf in 1949. Zipf stated it as an observation, but even at that time the observation was called Zipf's law, although it is not really a law but rather an approximate mathematical phenomenon.
To describe Zipf's law, let $f_t$ be the total number of occurrences of the word $t$ in the document $D$. We then sort all words in the collection by decreasing frequency $f$ and let $r_t$ be the rank of each word $t$.
Zipf's law is stated by the following formula:
$$r_t \cdot f_t \approx K \quad (K \text{ is a constant}).$$
For English, it has been found that $K \approx N/10$, where $N$ is the number of words in the text. Zipf's law can be rewritten as
$$r_t \approx K / f_t.$$
Suppose the word $t_i$ has the lowest rank position among the words with some frequency $b$, and the word $t_j$ has the lowest rank position among the words with frequency $b + 1$. The approximate ranks of these words are $r_{t_i} \approx K/b$ and $r_{t_j} \approx K/(b+1)$. Subtracting the two expressions gives an estimate of the number of distinct words with frequency $b$:
$$r_{t_i} - r_{t_j} \approx \frac{K}{b} - \frac{K}{b+1} = \frac{K}{b(b+1)}.$$
We can also estimate the highest rank in the collection. In general, for a word that occurs only once in the collection, we have $r_{max} = K$.
To obtain the distribution of the distinct words that occur $b$ times in the collection, we divide the two quantities, which gives $1/(b(b+1))$. Zipf's law therefore shows that a notable share of the distinct words in a collection consists of the words that occur least often in the collection.
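A short sketch makes the estimate concrete. Dividing $K/b - K/(b+1) = K/(b(b+1))$ by $r_{max} = K$ gives the expected fraction $1/(b(b+1))$ of distinct words that occur exactly $b$ times; the function names below are chosen for illustration, and the token list is a toy example rather than real corpus data.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (word, rank r, frequency f, r * f), sorted by decreasing f."""
    counts = Counter(tokens).most_common()
    return [(w, r, f, r * f) for r, (w, f) in enumerate(counts, start=1)]

def expected_fraction(b):
    """Fraction of distinct words expected to occur exactly b times."""
    return 1.0 / (b * (b + 1))

# Under the approximation, half of all distinct words occur only once,
# one sixth occur twice, one twelfth occur three times.
print([expected_fraction(b) for b in (1, 2, 3)])
# For text that follows Zipf's law, the last column r * f stays roughly constant.
print(rank_frequency(["a", "a", "a", "b", "b", "c"]))
```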
In 1958, Luhn proposed that both very common words and rare words are unnecessary for processing, as shown in the following figure.

Figure 1.6. Word frequency diagram according to Zipf's law
[Figure 1.6 labels: vertical axis $f$ (frequency of occurrence), horizontal axis $r$ (word rank); regions: high-frequency region, region of meaningful words, low-frequency region]
1.4.3. Text representation models
In text processing problems, the representation of text plays a very important role, especially in search and clustering.
According to studies of different representations in text processing, the best representation uses the distinct words extracted from the original document, and this representation has a relatively small effect on the results.
Different approaches use different mathematical models for the computation. Here we present some common models that have appeared frequently in recent papers [14][22].
1.4.3.1. Boolean model
This is a vector representation in which the function $f$ takes one of two discrete values, true or false. The function $f$ for the term $t_i$ is true if and only if $t_i$ occurs in the document.
Suppose there is a database of $m$ documents, $D = \{d_1, d_2, \dots, d_m\}$. Each document is represented as a vector over $n$ terms $T = \{t_1, t_2, \dots, t_n\}$. Let $W = \{w_{ij}\}$ be the weight matrix, where $w_{ij}$ is the weight of the term $t_i$ in the document $d_j$.
The Boolean model is the simplest model. It is defined as follows:
$$w_{ij} = \begin{cases} 1 & \text{if } t_i \in d_j \\ 0 & \text{if } t_i \notin d_j \end{cases}$$
1.4.3.2. Frequency model
This model assigns positive values to the entries of the matrix $W = (w_{ij})$ based on the frequency of the words in a document or on the frequency of documents in the database. There are two common methods:
1.4.3.2.1. Model based on term frequency
In the term frequency (TF) model, the value of a word is computed from the number of times it occurs in the document. Let $tf_{ij}$ be the number of occurrences of the word $t_i$ in the document $d_j$; then $w_{ij}$ can be computed by one of the following formulas [31]:
- $w_{ij} = tf_{ij}$
- $w_{ij} = 1 + \log(tf_{ij})$
- $w_{ij} = \sqrt{tf_{ij}}$
In this model, the weight $w_{ij}$ increases with the number of occurrences of the term $t_i$ in the document $d_j$. The more often $t_i$ occurs in $d_j$, the more $d_j$ depends on $t_i$; in other words, the term $t_i$ carries more information in the document $d_j$.
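The Boolean weighting of Section 1.4.3.1 and the three TF formulas can be sketched with one helper that builds the matrix $W$. The function names and the sample documents are hypothetical. Two assumptions are made: the natural logarithm is used, since the text does not fix the base, and the logarithmic weight of a term that does not occur in the document is set to 0.

```python
import math
from collections import Counter

def boolean_weight(tf):
    """Boolean model: 1 if the term occurs in the document, else 0."""
    return 1 if tf > 0 else 0

def tf_raw(tf):
    """w_ij = tf_ij."""
    return float(tf)

def tf_log(tf):
    """w_ij = 1 + log(tf_ij); 0 for an absent term (assumed convention)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def tf_sqrt(tf):
    """w_ij = sqrt(tf_ij)."""
    return math.sqrt(tf)

def weight_matrix(docs, terms, scheme):
    """W[i][j] is the weight of term i in document j under the given scheme."""
    counts = [Counter(d) for d in docs]  # a Counter returns 0 for missing terms
    return [[scheme(c[t]) for c in counts] for t in terms]

docs = [["web", "web", "data"], ["data"]]
terms = ["data", "web"]
print(weight_matrix(docs, terms, boolean_weight))
print(weight_matrix(docs, terms, tf_raw))
print(weight_matrix(docs, terms, tf_log))
```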
1.4.3.2.2. Method based on inverse document frequency
In the inverse document frequency (IDF) model, the weight of a word is computed by the following formula [31]:
$$w_{ij} = \begin{cases} \log\left(\dfrac{n}{h_i}\right) = \log(n) - \log(h_i) & \text{if } t_i \in d_j \\ 0 & \text{otherwise } (t_i \notin d_j) \end{cases}$$
Here $n$ is the total number of documents in the database, and $h_i$ is the number of documents that contain the term $t_i$.
The weight $w_{ij}$ in this formula reflects the importance of the term $t_i$ in the document $d_j$. The fewer documents $t_i$ occurs in, the more important it is. So if $t_i$ occurs in $d_j$, its weight is larger, which means that it is more important for distinguishing $d_j$ from other documents and that it carries more information.
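A minimal sketch of the IDF weighting follows. The function name and the sample documents are hypothetical, the natural logarithm is assumed, and a term that occurs in no document is given weight 0 to avoid a division by zero (this case is not covered by the formula).

```python
import math

def idf_weights(docs, terms):
    """W[i][j] = log(n / h_i) if term i occurs in document j, else 0."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    weights = []
    for t in terms:
        h = sum(1 for s in doc_sets if t in s)   # documents containing t
        idf = math.log(n / h) if h else 0.0      # assumed convention for h = 0
        weights.append([idf if t in s else 0.0 for s in doc_sets])
    return weights

docs = [["web", "data"], ["web"], ["web", "mining"], ["web"]]
# "web" occurs in every document, so its weight is log(4/4) = 0;
# "data" occurs in one document only, so its weight there is log(4).
print(idf_weights(docs, ["web", "data"]))
```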
1.4.3.2.3. Combined TF-IDF model
In the TF-IDF model [31], each document $d_j$ under consideration is represented by a feature $(t_1, t_2, \dots, t_n)$, where $t_i$ is a word or phrase in $d_j$. The order of the $t_i$ is based on the weight of each word. Parameters can be added to optimize the grouping process. The weight component is determined by the following formula, which combines the tf weight and the idf weight.
The formula for the TF-IDF weight is:
$$w_{ij} = \begin{cases} tf_{ij} \times idf_{ij} = [1 + \log(f_{ij})] \times \log\left(\dfrac{n}{h_i}\right) & \text{if } t_i \in d_j \\ 0 & \text{otherwise } (t_i \notin d_j) \end{cases}$$
where:
$tf_{ij}$ is the frequency of $t_i$ in the document $d_j$,
$idf_{ij}$ is the inverse document frequency of $t_i$ in the document $d_j$,
$h_i$ is the number of documents in the database in which $t_i$ occurs,
$n$ is the total number of documents in the database.
From this formula we can see that the weight of each element is based on the inverse of the document frequency of $t_i$ in the database and on the frequency of this element in the document.
Typically, one builds a dictionary of words in which very common words and words with low frequency are dropped. In addition, one has to choose the $m$ elements with the highest weights as the feature words (Zamir uses 500).
This method combines the advantages of the two methods above. The weight $w_{ij}$ is computed from the frequency of the term $t_i$ in the document $d_j$ and from the rarity of the term $t_i$ in the whole database. Depending on the specific constraints of the problem, an appropriate text representation model is used.
Computing the similarity between two vectors
Consider two vectors $X = \{x_1, x_2, \dots, x_m\}$ and $Y = \{y_1, y_2, \dots, y_m\}$.
In the TF-IDF model, any of several formulas can be chosen to compute the similarity between pairs of documents or clusters. The following are common similarity measures [5][14][31]:
$$\text{Dice:} \quad Sim(X, Y) = \frac{2 \sum_{i=1}^{m} x_i y_i}{\sum_{i=1}^{m} x_i^2 + \sum_{i=1}^{m} y_i^2}$$
$$\text{Jaccard:} \quad Sim(X, Y) = \frac{\sum_{i=1}^{m} x_i y_i}{\sum_{i=1}^{m} x_i^2 + \sum_{i=1}^{m} y_i^2 - \sum_{i=1}^{m} x_i y_i}$$
$$\text{Cosine:} \quad Sim(X, Y) = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2 \cdot \sum_{i=1}^{m} y_i^2}}$$
$$\text{Overlap:} \quad Sim(X, Y) = \frac{\sum_{i=1}^{m} x_i y_i}{\min\left(\sum_{i=1}^{m} x_i^2, \sum_{i=1}^{m} y_i^2\right)}$$
$$\text{Euclidean:} \quad Sim(X, Y) = Dis(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
$$\text{Manhattan:} \quad Sim(X, Y) = Dis(X, Y) = \sum_{i=1}^{m} |x_i - y_i|$$
Figure 1.7. Common similarity measures
Here $x_i$ and $y_i$ represent a pair of words or phrases in the documents. Using these formulas with a suitable threshold, the degree of similarity of the documents in the database can be determined easily. The idea behind using the TF-IDF model to represent documents is that two documents that share many common words are likely to be similar.
Hierarchical clustering and partitioning clustering (k-means) are the two clustering techniques that are commonly used for document clustering with the TF-IDF model.
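The TF-IDF weighting and the first four similarity measures of Figure 1.7 can be sketched together. The function names and the sample documents are hypothetical; the natural logarithm is assumed, and the similarity functions assume non-zero vectors, since a zero vector would lead to a division by zero.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (terms, vectors) with w = (1 + log f_ij) * log(n / h_i)."""
    terms = sorted({t for d in docs for t in d})
    n = len(docs)
    h = {t: sum(1 for d in docs if t in d) for t in terms}
    vectors = []
    for d in docs:
        c = Counter(d)
        vectors.append([(1.0 + math.log(c[t])) * math.log(n / h[t]) if c[t] else 0.0
                        for t in terms])
    return terms, vectors

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))

def jaccard(x, y):
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

def overlap(x, y):
    return dot(x, y) / min(dot(x, x), dot(y, y))

docs = [["web", "mining", "web"], ["web", "clustering"], ["data", "clustering"]]
terms, v = tfidf_vectors(docs)
print(terms)
print(cosine(v[0], v[1]))  # documents 0 and 1 share the term "web"
print(cosine(v[0], v[2]))  # documents 0 and 2 share no term
```

A threshold on one of these values is then enough to decide whether two documents are treated as similar.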
1.5. Summary of Chapter 1
Chapter 1 presented the basics of data mining and knowledge discovery in databases, the techniques applied in data mining, its main functions, and its applications in society.
The chapter also presented one direction of research and application in data mining, namely data clustering, including an overview of clustering techniques, the applications of clustering, the requirements for clustering techniques, and data types and similarity measures.
A newer approach in data mining is mining data in the Web environment. This part presented the concept and benefits of Web mining and some models for representing and processing text data that are applied in Web mining, such as the Boolean model, the term frequency (TF) model, the inverse document frequency (IDF) model, the combined TF-IDF model, and measures for determining the similarity of texts.




Chapter 2. SOME DATA CLUSTERING TECHNIQUES

The techniques used to solve the data clustering problem all aim at two general goals: the quality of the discovered clusters and the execution speed of the algorithm. Clustering techniques can be classified into several basic types according to their approach, as follows [10][19]:
2.1. Partitioning clustering
The main idea of this technique is to divide a given data set of $n$ elements into $k$ groups such that each data element belongs to exactly one group and each group contains at least one data element. Partitioning algorithms have very high complexity when they seek the globally optimal solution of the clustering problem, because they have to search all possible partitions. In practice, therefore, one usually looks for a locally optimal solution by using a criterion function to evaluate the quality of the clusters and to guide the search for a partition. With this strategy, one usually starts with an initial partition of the data set, chosen at random or by a heuristic, and refines it repeatedly until a desired partition is obtained that satisfies the given constraints. Partitioning algorithms try to improve the clustering criterion by computing the similarity values between data objects and sorting these values; the algorithm then selects a value in the sorted sequence such that the criterion function reaches a minimum. The main idea of locally optimal partitioning algorithms is thus to use a greedy strategy to search for a solution.
The class of partitioning algorithms includes the first algorithms proposed in data mining, which are also widely applied in practice, such as k-means, PAM, CLARA, and CLARANS. The following are some classical algorithms that have been widely inherited and used.
2.1.1. The k-means algorithm
The k-means clustering algorithm was proposed by MacQueen in the field of statistics in 1967. Its purpose is to produce $k$ clusters $\{C_1, C_2, \dots, C_k\}$ from an initial data set of $n$ objects in a $d$-dimensional space, $X_i = (x_{i1}, x_{i2}, \dots, x_{id})$ ($i = 1, \dots, n$), such that the criterion function
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} D^2(x, m_i)$$
reaches a minimum. Here $m_i$ is the centroid of the cluster $C_i$, and $D$ is the distance between two objects.
The centroid of a cluster is a vector in which each component is the arithmetic mean of the corresponding components of the data vectors in the cluster. The input parameters of the algorithm are the number of clusters $k$ and the database of $n$ elements; the output is the centroids of the clusters. The distance $D$ between data objects is usually the Euclidean distance, because it is easy to differentiate and its minima are easy to determine. The criterion function and the distance measure can be defined more specifically depending on the application or the point of view of the user.
The k-means algorithm consists of the following basic steps:
INPUT: A database of $n$ objects and the number of clusters $k$.
OUTPUT: The clusters $C_i$ ($i = 1, \dots, k$) such that the criterion function $E$ reaches a minimum.
Step 1: Initialization
Choose $k$ objects $m_j$ ($j = 1, \dots, k$) from the data set as the initial centroids of the $k$ clusters (the choice can be random or based on experience).
Step 2: Computing distances
For each object $X_i$ ($1 \le i \le n$), compute its distance to each centroid $m_j$, $j = 1, \dots, k$, and then find the nearest centroid for each object.
Step 3: Updating the centroids
For each $j = 1, \dots, k$, update the centroid $m_j$ of the cluster by taking the arithmetic mean of the data vectors assigned to it.
Step 4: Stopping condition
Repeat steps 2 and 3 until the centroids of the clusters no longer change.
Figure 2.1. The k-means algorithm
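The four steps of Figure 2.1 can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names, the `seed` and `max_iter` parameters, and the sample points are assumptions, the initial centroids are chosen at random, and a cluster that becomes empty simply keeps its previous centroid.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, labels) following the four steps of Figure 2.1."""
    rng = random.Random(seed)
    # Step 1: choose k objects as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centroid.
        labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                  for p in points]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old centroid
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```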
The k-means algorithm has been proved to converge, and its computational complexity is $O(n \cdot k \cdot d \cdot t \cdot T_{flop})$, where $n$ is the number of data objects, $k$ is the number of clusters, $d$ is the number of dimensions, $t$ is the number of iterations, and $T_{flop}$ is the time needed for one basic operation such as a multiplication or a division. Because k-means performs a simple cluster analysis, it can be applied to large data sets. Its drawbacks, however, are that it applies only to data with numeric attributes and that it discovers only spherical clusters. k-means is also very sensitive to noise and to outliers in the data. The following figure shows a simulation of some cluster shapes discovered by k-means:

Figure 2.2. Shapes of the clusters discovered by k-means
[Figure 2.2 panel labels: K=2; choose k objects arbitrarily as the initial centers; assign each object to a cluster; update the centroids; reassign the objects; update the centroids; reassign the objects. Both axes range from 0 to 10.]

Moreover, the clustering quality of k-means depends heavily on its input parameters: the number of clusters $k$ and the $k$ initial centroids. If the initial centroids are far from the centroids of the natural clusters, the quality of the k-means result is very low, that is, the discovered clusters deviate strongly from the real clusters. In practice, there is no optimal solution for choosing the input parameters. The most common solution is to try different input values of $k$ and then choose the best solution.
To date, many algorithms that inherit the idea of k-means have been applied in data mining to handle very large data sets, and they are used effectively and widely; examples are k-medoid, PAM, CLARA, CLARANS, and k-prototypes.

2.1.2. The PAM algorithm
The PAM (Partitioning Around Medoids) algorithm was proposed by Kaufman and Rousseeuw in 1987. It extends the k-means algorithm in order to handle noisy data and outliers effectively. Instead of centroids, as in k-means, PAM uses medoid objects to represent the clusters; a medoid is the object located most centrally within its cluster. Medoids are therefore little affected by objects that lie far from the center, whereas the centroids of k-means are strongly affected by such distant points. Initially, PAM chooses $k$ medoids and assigns each remaining object to the cluster whose medoid it is most similar to.
To determine the medoids, PAM starts by choosing $k$ arbitrary medoid objects. After each step, PAM tries to swap a medoid $O_m$ with a non-medoid object $O_p$, provided that the swap improves the quality of the clustering. The process ends when the clustering quality no longer changes. The clustering quality is evaluated by a criterion function; the quality is best when the criterion function reaches a minimum.
To decide whether to swap the two objects $O_m$ and $O_p$, PAM uses the swap cost $C_{jmp}$ as its basis, with the following notation:
- $O_m$: the current medoid that is to be replaced;
- $O_p$: the new medoid that replaces $O_m$;
- $O_j$: a (non-medoid) data object that may be moved to another cluster;
- $O_{m,2}$: the current medoid other than $O_m$ that is closest to the object $O_j$.
PAM computes the swap cost $C_{jmp}$ for every object $O_j$. $C_{jmp}$ serves as the basis for deciding whether to swap $O_m$ and $O_p$. Depending on which of the following four cases applies, $C_{jmp}$ is computed in one of four ways:
- Case 1: Suppose $O_j$ currently belongs to the cluster represented by $O_m$, and $O_j$ is more similar to $O_{m,2}$ than to $O_p$ ($d(O_j, O_p) \ge d(O_j, O_{m,2})$). Here $O_{m,2}$ is the medoid that is second most similar to $O_j$ among the medoids. In this case, if we replace $O_m$ by the new medoid $O_p$, $O_j$ will belong to the cluster represented by $O_{m,2}$. The swap cost is therefore $C_{jmp} = d(O_j, O_{m,2}) - d(O_j, O_m)$. The value of $C_{jmp}$ is non-negative.

Figure 2.3. The case $C_{jmp} = d(O_j, O_{m,2}) - d(O_j, O_m)$, which is non-negative
- Case 2: $O_j$ currently belongs to the cluster represented by $O_m$, but $O_j$ is less similar to $O_{m,2}$ than to $O_p$ ($d(O_j, O_p) < d(O_j, O_{m,2})$). If $O_m$ is replaced by $O_p$, $O_j$ will belong to the cluster represented by $O_p$. The swap cost is therefore $C_{jmp} = d(O_j, O_p) - d(O_j, O_m)$. Here $C_{jmp}$ can be negative or positive.

Figure 2.4. The case $C_{jmp} = d(O_j, O_p) - d(O_j, O_m)$, which can be negative or positive
- Case 3: Suppose $O_j$ currently does not belong to the cluster represented by $O_m$ but to the cluster represented by $O_{m,2}$. Suppose also that $O_j$ is more similar to $O_{m,2}$ than to $O_p$. Then, if $O_m$ is replaced by $O_p$, $O_j$ stays in the cluster represented by $O_{m,2}$. Therefore $C_{jmp} = 0$.

Figure 2.5. The case $C_{jmp} = 0$
- Case 4: $O_j$ currently belongs to the cluster represented by $O_{m,2}$, but $O_j$ is less similar to $O_{m,2}$ than to $O_p$. So if we replace $O_m$ by $O_p$, $O_j$ moves from the cluster of $O_{m,2}$ to the cluster of $O_p$. The swap cost is therefore $C_{jmp} = d(O_j, O_p) - d(O_j, O_{m,2})$. Here $C_{jmp}$ is always negative.

Figure 2.6. The case $C_{jmp} = d(O_j, O_p) - d(O_j, O_{m,2})$, which is always negative
- Combining the four cases above, the total cost of swapping $O_m$ with $O_p$ is defined as
$$TC_{mp} = \sum_{j} C_{jmp}.$$
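The four cases can be written as one small function. The names `swap_cost` and `total_swap_cost` and the one-dimensional sample data are assumptions for illustration, and at least two medoids are required so that $O_{m,2}$ exists. In this sketch the sum runs over all objects: a medoid other than $O_m$ falls under Case 3 and contributes 0, while $O_m$ and $O_p$ themselves fall under Cases 1, 2, or 4 with a zero distance, so the total equals the exact change of the criterion function.

```python
def swap_cost(o_j, o_m, o_p, medoids, dist):
    """C_jmp: change in the cost of O_j if medoid o_m is replaced by o_p."""
    d_m = dist(o_j, o_m)
    d_p = dist(o_j, o_p)
    # Distance to O_{m,2}: the closest current medoid other than o_m.
    d_m2 = min(dist(o_j, m) for m in medoids if m != o_m)
    if d_m <= d_m2:              # O_j currently belongs to the cluster of o_m
        if d_p >= d_m2:
            return d_m2 - d_m    # Case 1: O_j moves to O_{m,2}
        return d_p - d_m         # Case 2: O_j moves to o_p
    if d_p >= d_m2:
        return 0.0               # Case 3: O_j stays with O_{m,2}
    return d_p - d_m2            # Case 4: O_j moves from O_{m,2} to o_p

def total_swap_cost(points, o_m, o_p, medoids, dist):
    """TC_mp: sum of C_jmp over all objects."""
    return sum(swap_cost(o, o_m, o_p, medoids, dist) for o in points)

def abs_dist(a, b):
    return abs(a - b)

points = [0, 1, 4, 6, 9, 10]
medoids = [0, 10]
print(swap_cost(1, 0, 4, medoids, abs_dist))              # Case 2
print(swap_cost(9, 0, 4, medoids, abs_dist))              # Case 3
print(swap_cost(6, 0, 4, medoids, abs_dist))              # Case 4
print(total_swap_cost(points, 0, 4, medoids, abs_dist))   # TC_mp for the swap 0 -> 4
```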
The PAM algorithm consists of the following main steps:
INPUT: A data set of $n$ elements and the number of clusters $k$.
OUTPUT: $k$ clusters such that the quality of the partition is best.
Step 1: Choose $k$ arbitrary medoid objects.
Step 2: Compute $TC_{mp}$ for all pairs of objects $O_m$, $O_p$, where $O_m$ is a medoid and $O_p$ is a non-medoid object.
Step 3: Select the pair $O_m$, $O_p$ with the minimum value of $TC_{mp}$. If this $TC_{mp}$ is negative, replace $O_m$ by $O_p$ and go back to Step 2. If it is positive, go to Step 4.
Step 4: For each non-medoid object, determine the medoid that is most similar to it and assign the object the label of that cluster.
Figure 2.7. The PAM algorithm

In Steps 2 and 3, PAM must examine all k(n-k) pairs $O_m$, $O_p$. For each pair, computing $TC_{mp}$ requires examining n-k objects. The computational complexity of PAM is therefore $O(I k (n-k)^2)$, where I is the number of iterations. PAM is thus inefficient in computation time when k and n are large.
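The swap cost and the PAM loop described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes objects are numeric tuples compared with the Euclidean distance, it takes the first k objects as the initial medoids, and it computes $TC_{mp}$ as the difference of two total costs rather than case by case. The function names `total_cost`, `swap_cost` and `pam` are hypothetical.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def swap_cost(points, medoids, m, p):
    """TC_mp: change in total cost when medoid m is replaced by non-medoid p."""
    swapped = [p if x == m else x for x in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)

def pam(points, k):
    medoids = list(range(k))  # Step 1 (assumption: take the first k objects)
    while True:
        # Steps 2-3: evaluate TC_mp for all k(n-k) pairs and keep the smallest.
        candidates = [(swap_cost(points, medoids, m, p), m, p)
                      for m in medoids
                      for p in range(len(points)) if p not in medoids]
        if not candidates:
            break
        tc, m, p = min(candidates)
        if tc >= 0:  # no swap improves the partition any more
            break
        medoids = [p if x == m else x for x in medoids]
    # Step 4: assign every object to its most similar medoid.
    labels = [min(medoids, key=lambda m: math.dist(pt, points[m])) for pt in points]
    return medoids, labels
```

Summing the four cases of $C_{jmp}$ over all objects gives the same value as the cost difference used here; the direct difference is simply shorter to write.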
2.1.3. CLARA algorithm
CLARA (Clustering LARge Applications) was proposed by Kaufman and Rousseeuw in 1990 to overcome the weakness of PAM when k and n are large. CLARA draws a sample from the data set of n objects, applies PAM to that sample, and finds the medoids of the sample. It has been observed that if the sample is drawn randomly, its medoids approximate the medoids of the whole data set. To obtain a better approximation, CLARA draws several samples, clusters each of them, and keeps the best clustering obtained over these samples. For accuracy, cluster quality is measured by the average dissimilarity of all objects in the original data set, not only those in the sample. Experiments indicate that 5 samples of size 40+2k give good results. The steps of CLARA are as follows:
INPUT: A database of n objects, the number of clusters k.
OUTPUT: k clusters
1. For i = 1 to 5 do
Begin
2. Draw a random sample of 40 + 2k objects from the data set and apply PAM to this sample to find the medoids representing the clusters.
3. For each object $O_j$ in the original data set, determine the most similar of the k medoids.
4. Compute the average dissimilarity of the partition obtained in the previous step. If this value is smaller than the current minimum, use it as the new minimum and keep the k medoids found in this iteration as the best set so far.
End;
Figure 2.8. The CLARA algorithm
The computational complexity of the algorithm is $O(k(40+k)^2 + k(n-k))$, so CLARA can handle large data sets. Note, regarding the sampling technique in data clustering: the clustering result may not depend on the initial data set, but it reaches only a local optimum.
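The sampling loop of CLARA can be sketched as below. The sketch assumes numeric tuples with the Euclidean distance and uses a deliberately small first-improvement variant of PAM on each sample. The helper names `cost`, `pam` and `clara`, and the `seed` parameter, are hypothetical and serve only this illustration.

```python
import math
import random

def cost(points, medoids):
    """Average dissimilarity of the points to their nearest medoid (medoids are points)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points) / len(points)

def pam(sample, k):
    """Small PAM variant: accept improving swaps until a full pass finds none."""
    medoids = sample[:k]
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                if cost(sample, trial) < cost(sample, medoids):
                    medoids, improved = trial, True
    return medoids

def clara(points, k, n_samples=5, seed=0):
    rng = random.Random(seed)
    size = min(len(points), 40 + 2 * k)      # sample size suggested in the text
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, size)
        medoids = pam(sample, k)
        c = cost(points, medoids)            # quality measured on the whole data set
        if c < best_cost:
            best, best_cost = medoids, c
    return best, best_cost
```

Evaluating `cost` on the full data set rather than on the sample is what lets the best of the five samples be chosen fairly.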
2.1.4. CLARANS algorithm
CLARANS (A Clustering Algorithm based on RANdomized Search) was proposed by Ng and Han in 1994 to improve clustering quality and to extend the approach to large data sets. CLARANS is a data clustering algorithm that combines PAM with a new heuristic search strategy. Its basic idea is not to examine every possible replacement of a medoid by another object; instead, it replaces a medoid immediately whenever the replacement improves clustering quality, without determining the optimal replacement. A partition obtained after replacing one medoid is called a neighbor of the previous partition. The number of neighbors examined is limited by the user-supplied parameter Maxneighbor, and the neighbors are selected completely at random. The parameter Numlocal lets the user specify the number of local optima to search for. Not all neighbors are examined, only Maxneighbor of them.
Some concepts used in CLARANS are defined as follows:
Let O be a set of n objects, $M \subseteq O$ the set of medoids, and $NM = O - M$ the set of non-medoid objects. The data objects used in CLARANS are polyhedra. Each object is described by a set of edges, and each edge is defined by 2 vertices. Let $P \subseteq R^3$ be the set of all points. In general, the objects here are spatial data objects, and we define the center of an object as the arithmetic mean of all its vertices, also called its centroid:
$center: O \to P$
Let dist be a distance function, usually the Euclidean distance: $dist: P \times P \to R_0^+$
The distance function dist can be extended to polyhedra through the center function: $dist: O \times O \to R_0^+$ such that $dist(o_i, o_j) = dist(center(o_i), center(o_j))$
Each object is assigned to the medoid of a cluster if the distance from the object's centroid to that medoid is the smallest. We therefore define the medoid function as follows:
$medoid: O \to M$
such that $medoid(o) = m_i$, $m_i \in M$, $\forall m_j \in M$: $dist(o, m_i) \le dist(o, m_j)$, $o \in O$.
Finally, we define the cluster of medoid $m_i$ as the subset of objects in O with $medoid(o) = m_i$.
Let $C_0$ be the set of all partitions of O. The total function that evaluates the quality of a partition is defined as $total\_distance: C_0 \to R_0^+$ such that $total\_distance(c) = \sum_{m_i \in M} \sum_{o \in cluster(m_i)} dist(o, m_i)$.
The CLARANS algorithm can be described as follows [10][19]:
INPUT: A data set O of n objects, the number of clusters k, dist, numlocal, maxneighbor;
OUTPUT: k clusters;
For i = 1 to numlocal do
Begin
    Randomly initialize k medoids, giving the current partition S;
    j = 1;
    while j < maxneighbor do
    Begin
        Randomly select a neighbor R of S.
        Compute the difference in dissimilarity (total distance) between the two neighbors S and R.
        If R has a lower cost, replace S with R and set j = 1;
        otherwise j++;
    End;
    Check whether the total distance of partition S is smaller than the smallest distance found so far; if so, update the smallest distance with this value and record S as the best partition so far.
End.
Figure 2.9. The CLARANS algorithm
The operation of CLARANS is thus similar to that of CLARA. However, when selecting the cluster medoids, CLARANS searches for a better solution by randomly picking one of the k medoids and trying to replace it with an object chosen at random from the remaining (n-k) objects. If no better solution is found after a fixed number of random attempts, the algorithm stops and returns a locally optimal clustering.
In the worst case, CLARANS compares an object with all the medoids. The computational complexity of CLARANS is therefore $O(kn^2)$, so CLARANS is not suitable for large data sets. The advantage of CLARANS is that its search space is not restricted as it is for CLARA, and for the same amount of time the quality of the clusters it produces is higher than that of CLARA.
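The randomized search of Figure 2.9 can be sketched as follows. The sketch assumes point objects (numeric tuples) instead of polyhedra, so the center function is the identity, and it uses the Euclidean distance. The names `total_distance` and `clarans`, the default parameter values and the `seed` argument are hypothetical.

```python
import math
import random

def total_distance(points, medoids):
    """Sum of distances from each object to its nearest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, numlocal=2, maxneighbor=20, seed=0):
    rng = random.Random(seed)
    n = len(points)
    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(range(n), k)            # random initial partition S
        current_cost = total_distance(points, current)
        j = 1
        while j < maxneighbor:
            # A neighbor R differs from S in exactly one medoid.
            out_pos = rng.randrange(k)
            candidate = rng.choice([i for i in range(n) if i not in current])
            neighbor = current[:out_pos] + [candidate] + current[out_pos + 1:]
            neighbor_cost = total_distance(points, neighbor)
            if neighbor_cost < current_cost:
                current, current_cost, j = neighbor, neighbor_cost, 1
            else:
                j += 1
        if current_cost < best_cost:                 # keep the best local optimum
            best, best_cost = current, current_cost
    return best, best_cost
```

Because only `maxneighbor` random neighbors are tried before a partition is accepted as a local optimum, the result depends on the random seed and is not guaranteed to be the global optimum.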
2.2. Hierarchical clustering
Hierarchical clustering arranges a given data set into a tree structure; this hierarchy is built recursively. The cluster tree can be built by two general methods: top-down and bottom-up.
Bottom-up method: This method starts with each object placed in its own cluster, then groups objects according to a similarity measure (such as the distance between the centers of two groups). The process continues until all groups are merged into one group (the top level of the hierarchy) or until a termination condition holds. This approach therefore uses a greedy strategy during clustering.
Top-down method: This method starts with all objects placed in a single cluster. At each iteration, a cluster is split into smaller clusters according to some similarity measure, until every object is a cluster of its own or until a stopping condition holds. This approach uses a divide-and-conquer strategy during clustering.
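The bottom-up strategy can be illustrated with a short sketch. It assumes numeric tuples, measures the similarity of two clusters by the Euclidean distance between their centroids (the example measure mentioned above), and stops when k clusters remain. The names `centroid` and `agglomerative` are hypothetical.

```python
import math

def centroid(cluster):
    """Arithmetic mean of the points in a cluster."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))

def agglomerative(points, k):
    """Greedily merge the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]              # every object starts as its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters
```

Each merge is final, which is what makes the strategy greedy: an early merge is never undone later.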
The bottom-up and top-down hierarchical clustering strategies are illustrated below.

Figure 2.10. Hierarchical clustering strategies (bottom-up merges the objects a, b, c, d, e from step 0 to step 4 through {a, b}, {d, e}, {c, d, e} into {a, b, c, d, e}; top-down runs the same steps in reverse)
In practice, partitioning and hierarchical clustering are often combined: the result obtained by the hierarchical method can be improved by a partitioning step. Partitioning clustering and hierarchical clustering are the two classical data clustering methods, and many improved algorithms based on them are now widely used in data mining. Typical hierarchical clustering algorithms include CURE, BIRCH, Chameleon, AGNES and DIANA.
2.2.1. BIRCH algorithm
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) was proposed by Tian Zhang, Ramakrishnan and Livny in 1996 and is a hierarchical clustering algorithm that uses the top-down strategy. The idea is that the algorithm does not need to keep all data objects of the clusters in memory, only summary statistics. For each cluster, BIRCH stores only a triple (n, LS, SS), where n is the number of objects in the cluster, LS is the sum of the attribute values of the objects in the cluster, and SS is the sum of the squared attribute values of the objects in the cluster. These triples are called cluster features, CF = (n, LS, SS), and are stored in a tree called the CF tree. The following figure shows an example of a CF tree. All internal nodes of the tree store the sum of the cluster features of their child nodes, while the leaf nodes store the features of the data clusters.

Figure 2.11. The CF tree used by BIRCH (root, non-leaf and leaf nodes, with leaves linked by prev and next pointers; in the example B = 7, L = 6)
The CF tree is a balanced tree that stores cluster features. It contains internal nodes and leaf nodes. An internal node stores the sum of the cluster features of its children. A CF tree is characterized by two parameters:
- Branching factor (B): the maximum number of children of each internal node of the tree;
- Threshold (T): the maximum distance between any pair of objects in a leaf node of the tree; this distance is also called the diameter of the subclusters stored at the leaf nodes.
These two parameters strongly affect the size of the CF tree.
BIRCH proceeds through the following phases:
INPUT: A database of n objects, threshold T
OUTPUT: k clusters
Step 1: Scan all objects in the database and build an initial CF tree. An object is inserted into the nearest leaf node, forming a subcluster. If the diameter of this subcluster is larger than T, the leaf node is split. When an object has been inserted into a leaf node, all nodes on the path to the root of the tree are updated with the necessary information.
Step 2: If the current CF tree does not fit in main memory, build a smaller CF tree by adjusting the parameter T (increasing T merges some subclusters into one cluster, which makes the CF tree smaller). This step does not require rereading the data from the beginning, yet it still yields a smaller tree.
Step 3: Perform clustering: the leaf nodes of the CF tree hold the summary statistics of the subclusters. In this step, BIRCH uses these statistics to apply a clustering technique such as k-means and produce an initial clustering.
Step 4: Redistribute the data objects using the centroids of the clusters found in Step 3. This is an optional step that rescans the data set and reassigns each data object to its nearest centroid. It relabels the initial data and removes outliers.
Figure 2.12. The BIRCH algorithm
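When two clusters are merged, we have $CF = CF_1 + CF_2 = (n_1 + n_2, LS_1 + LS_2, SS_1 + SS_2)$.
The distance between clusters can be measured with the Euclidean distance, the Manhattan distance, and so on.
Example: $CF = (n, \vec{LS}, SS)$, where n is the number of data objects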
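The additivity of cluster features can be illustrated with a small sketch. It assumes that LS is stored as a per-attribute vector and SS as a single scalar (the sum of squared attribute values over all objects). The class `CF` and its methods are hypothetical names, and the `radius` method is an addition that the text does not define: it uses the identity $R^2 = SS/n - \lVert LS/n \rVert^2$ to show that a cluster's spread can be derived from the triple alone.

```python
import math
from dataclasses import dataclass

@dataclass
class CF:
    n: int        # number of objects in the cluster
    ls: tuple     # linear sum of the objects, one entry per attribute
    ss: float     # sum of squared attribute values over all objects

    @classmethod
    def from_point(cls, p):
        """Cluster feature of a cluster containing the single object p."""
        return cls(1, tuple(p), sum(x * x for x in p))

    def __add__(self, other):
        """CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root-mean-square distance of the members from the centroid."""
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))
```

For example, `CF.from_point((1, 2)) + CF.from_point((3, 4))` yields n = 2, LS = (4, 6) and SS = 30, from which the centroid (2, 3) follows without access to the original objects.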

Figure 2.13. Example of a clustering result produced by BIRCH
Thanks to the CF tree, BIRCH clusters data quickly and can be applied to large data sets; it is especially effective for data sets that grow over time. BIRCH scans the whole data set once, with one optional additional scan, so its complexity is O(n), where n is the number of data objects. Its weakness is that the quality of the discovered clusters is not very good. If BIRCH uses the Euclidean distance, it works well only with numeric data. In addition, the input parameter T strongly affects the size and naturalness of the clusters. Forcing data objects into subclusters means that objects of one cluster may end up in another cluster, while objects that are close together may be absorbed by different clusters if they are presented to the algorithm in a different order. BIRCH is not suitable for high-dimensional data.
2.2.2. CURE algorithm
Choosing a good representation for clusters can improve clustering quality. CURE (Clustering Using REpresentatives), proposed by Sudipto Guha, Rajeev Rastogi and Kyuseok Shim in 1998 [19], uses the bottom-up strategy of hierarchical clustering.
Instead of using centroids or a single central object to represent a cluster, CURE uses several objects to describe each cluster. These representative objects are initially chosen to be well scattered over different positions; they are then moved by shrinking them toward the cluster center by a given fraction. At each step of the algorithm, the two clusters with the closest pair of representative objects are merged into one cluster.
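The shrinking of representatives and the distance between two clusters can be sketched as follows. The sketch assumes numeric tuples, the Euclidean distance and a shrink fraction `alpha` between 0 and 1. The function names and the default value of `alpha` are hypothetical, and the selection of well-scattered representatives is not shown.

```python
import math

def shrink_representatives(representatives, members, alpha=0.3):
    """Move each representative a fraction alpha of the way toward the cluster centroid."""
    dims = len(members[0])
    centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
    return [tuple(r[d] + alpha * (centroid[d] - r[d]) for d in range(dims))
            for r in representatives]

def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: the closest pair of their representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)
```

With `alpha = 0` the representatives stay where they are, and with `alpha = 1` they all collapse onto the centroid, so the parameter controls how strongly outlying representatives are pulled in.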
By using more than one representative point per cluster, CURE can discover clusters of different shapes and sizes in large databases. Shrinking the representative objects reduces the influence of outliers, so CURE can handle outliers. The following figure shows examples of the cluster shapes and sizes discovered by CURE:

Figure 2.14. Data clusters discovered by CURE
To handle large databases, CURE uses random sampling and partitioning. A randomly drawn sample is partitioned first, and CURE clusters each partition. This process is repeated until a sufficiently good partition is obtained. The resulting clusters are then clustered again to obtain the final clusters of interest. The basic steps of CURE are as follows:
Step 1. Draw a random sample from the original data set;
Step 2. Partition this sample into groups of equal size: the main idea is to split the sample into p equal groups, each of size n'/p (where n' is the size of the sample);
Step 3. Cluster the points of each group: each group is clustered until it has been divided into n'/(pq) clusters (with q > 1);
Step 4. Remove outliers: first, clusters are formed until the number of clusters has dropped to a fraction of the initial number. Then, if outliers were drawn during the initial sampling phase, the algorithm automatically removes the small groups.
Step 5. Cluster the partial clusters: the representative objects of the clusters are moved toward the cluster center, that is, they are replaced by objects closer to the center.
Step 6. Label the data with the corresponding cluster labels.
Figure 2.15. The CURE algorithm

The computational complexity of CURE is $O(n^2 \log n)$. CURE is reliable at discovering clusters of arbitrary shape and works well on two-dimensional data sets. However, it is very sensitive to parameters such as the number of representative objects and the shrink factor of the representatives. Overall, BIRCH is better than CURE in complexity but worse in clustering quality. Both algorithms handle outliers well.

2.3. Density-based clustering
This method groups objects according to a given density function. Density is defined as the number of objects in the neighborhood of a data object within some threshold. In this approach, once a cluster has been identified, it keeps growing with new data objects as long as the number of neighbors of these objects exceeds a predefined threshold. Density-based clustering uses the density of objects to determine clusters and can detect clusters of arbitrary shape. However, determining the density parameters of the algorithm is very difficult, even though these parameters strongly affect the clustering result. The following figure illustrates density-based clusters of different shapes discovered in 3 different databases:

Figure 2.16. Some cluster shapes discovered by density-based clustering (Database 1, Database 2, Database 3)

Clusters can be viewed as regions of high density separated by regions of low or no density. Density here is understood as the number of neighboring objects.
Typical density-based clustering algorithms include [2][3][13][20]: DBSCAN, OPTICS, DENCLUE and SNN.
2.3.1. DBSCAN algorithm
The most widely used density-based clustering algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), proposed by Ester, Kriegel and Sander in 1996. The algorithm searches for objects whose number of neighbors exceeds a minimum threshold. A cluster is defined as the set of all objects that are density-connected to their neighbors. Because it relies on density concepts, DBSCAN can be applied to large multidimensional spatial data sets. The definitions and lemmas used in DBSCAN follow.
Definition 1: The Eps-neighborhood of a point p, denoted $N_{Eps}(p)$, is defined as $N_{Eps}(p) = \{q \in D \mid Dist(p, q) \le Eps\}$, where D is the given data set.

Figure 2.17. Neighborhood of p with threshold Eps
For a point p to lie in some cluster C, $N_{Eps}(p)$ must contain at least MinPts points.
Under this definition, only points truly inside a cluster satisfy the membership condition. Points on the border of a cluster do not, because the Eps-neighborhood of a border point typically contains fewer points than the Eps-neighborhood of a core point.
To avoid this, we can use another criterion for a point to belong to a cluster: for a point p to belong to a cluster C, there must exist a point q such that $p \in N_{Eps}(q)$ and the number of points in $N_{Eps}(q)$ is at least the minimum number of points. This can be defined formally as follows:
Definition 2: Directly density-reachable
A point p is directly density-reachable from a point q with respect to Eps and MinPts in the object set D if:
1) $p \in N_{Eps}(q)$, where $N_{Eps}(q)$ is a subset of D
2) $|N_{Eps}(q)| \ge MinPts$ (core object condition).
The point q is called a core point. This relation is reflexive and symmetric for two core points, and asymmetric if one of the two points is not a core point.

Figure 2.18. Directly density-reachable
Definition 3: Density-reachable
A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points $p_1, p_2, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$ for $i = 1, \ldots, n-1$.

Figure 2.19. Density-reachable
Two border points of a cluster C may not be density-reachable from each other, because neither may satisfy the core condition. Nevertheless, there must be a core point in C from which both points are density-reachable.
Definition 4: Density-connected
An object p is density-connected to a point q with respect to Eps and MinPts if there is an object o such that both p and q are density-reachable from o with respect to Eps and MinPts.

Figure 2.20. Density-connected
Definition 5: Cluster and noise
Let D be a set of data objects. A non-empty subset C of D is called a cluster with respect to Eps and MinPts if it satisfies two conditions:
1) Maximality: $\forall p, q \in D$, if $p \in C$ and q is density-reachable from p with respect to Eps and MinPts, then $q \in C$.
2) Connectivity: $\forall p, q \in C$, p is density-connected to q with respect to Eps and MinPts.
Every object that belongs to no cluster is called noise.

Figure 2.21. Cluster and noise (core, border and noise points; Eps = 1 cm, MinPts = 5)
Given the two parameters Eps and MinPts, clusters can be discovered in two steps:
Step 1: Choose any point from the original data set that satisfies the core condition.
Step 2: Collect all points that are density-reachable from the chosen core point to form a cluster.
These two steps can be stated more formally as the following two lemmas:
Lemma 1: Let p be an object in D with $|N_{Eps}(p)| \ge MinPts$. Then the set $O = \{\, o \in D \mid o \text{ is density-reachable from } p \text{ with respect to Eps and MinPts} \,\}$ is a cluster with respect to Eps and MinPts.
A cluster C is thus not determined by a unique point; however, every point in C is density-reachable from any core point of C, so C contains exactly the points that are density-connected to an arbitrary core point.
Lemma 2: Let C be a cluster with respect to Eps and MinPts, and let p be any point in C with $|N_{Eps}(p)| \ge MinPts$. Then C equals the set $O = \{\, o \in D \mid o \text{ is density-reachable from } p \text{ with respect to Eps and MinPts} \,\}$.
The steps of the DBSCAN algorithm are as follows:
Step 1: Select an arbitrary object p.
Step 2: Retrieve all objects density-reachable from p with respect to Eps and MinPts.
Step 3: If p is a core point, a cluster with respect to Eps and MinPts is formed.
Step 4: If p is a border point, no point is density-reachable from p, and DBSCAN visits the next point of the data set.
Step 5: The process continues until all objects have been processed.
Figure 2.22. The DBSCAN algorithm
When global values of Eps and MinPts are used, DBSCAN may merge two clusters into one if the densities of the two clusters are nearly equal. The distance between two data sets $S_1$ and $S_2$ is defined as $Dist(S_1, S_2) = \min\{dist(p, q) \mid p \in S_1, q \in S_2\}$.
DBSCAN can find clusters of arbitrary shape and at the same time is little affected by the order in which the data objects are input. When an object is inserted, it affects only a specific neighborhood. On the other hand, DBSCAN requires the user to specify the neighborhood radius Eps and the minimum number of neighbors MinPts; these parameters are usually chosen at random or from experience.
Apart from a few exceptional cases, the result of DBSCAN is independent of the order in which the data objects are visited. Eps and MinPts are two global parameters set manually or from experience. When Eps is small compared with the size of the data space, the average cost of a single neighborhood query is O(log n), which gives an overall complexity of O(n log n).
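The definitions above translate into a compact sketch of DBSCAN. It assumes numeric tuples and the Euclidean distance, and it answers each neighborhood query by a linear scan, so it runs in $O(n^2)$ time rather than the O(n log n) obtainable with a spatial index. The names `region_query`, `dbscan` and `NOISE` are hypothetical.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """N_Eps(p): indices of all points within distance eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None means "not yet visited"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # not a core point: provisionally noise
            labels[i] = NOISE
            continue
        labels[i] = cluster_id             # a core point starts a new cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:         # border point reached from a core point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:    # j is also a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```

A point first marked as noise can later be relabeled as a border point of a cluster, which corresponds to the distinction between border points and noise in Definition 5.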
2.3.2. OPTICS algorithm
OPTICS (Ordering Points To Identify the Clustering Structure), proposed by Ankerst, Breunig, Kriegel and Sander in 1999, extends DBSCAN by reducing the input parameters.
The algorithm computes and sorts the objects into an ordering intended for automatic and interactive cluster analysis, rather than producing an explicit clustering of the data set. This ordering describes the density-based clustering structure of the data and contains information equivalent to density-based clusterings over a range of input parameter settings. OPTICS considers the minimum radius needed to determine the appropriate neighbors.
DBSCAN and OPTICS are structurally similar and have the same complexity: O(n log n), where n is the size of the data set. The following figure shows an example of clustering with OPTICS:

Figure 2.23. Cluster ordering of objects produced by OPTICS (vertical axis: reachability distance, including undefined values and the thresholds Eps and Eps'; horizontal axis: order of the data set; Eps = 10, MinPts = 10)
2.3.3. DENCLUE algorithm
DENCLUE (DENsity-based CLUstEring), proposed by Hinneburg and Keim in 1998, is a data clustering algorithm based on a set of density distribution functions. Its main ideas are as follows [19]:
- The influence of an object on its neighborhood is described by an influence function.
- The overall density of the data space is modeled analytically as the sum of the influence functions of all objects.
- Clusters are determined by high-density objects, where the high-density points are the maxima of the overall density function.
Definition of the influence function: Let x and y be two objects in a d-dimensional space denoted $F^d$. The influence function of y on x is a function $f_B^y : F^d \to R_0^+$, defined in terms of a basic influence function $f_b$: $f_B^y(x) = f_b(x, y)$.
The influence function can be chosen freely, provided it is determined by the distance d(x, y) between the objects, for example the Euclidean distance.
Some examples of influence functions are:
Square wave function: $f_{square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{if } d(x, y) \le \sigma \end{cases}$, where $\sigma$ is a threshold.
Gaussian function: $f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
The density function at an object $x \in F^d$ is the sum of all influence functions acting on x. Given a data set $D = \{x_1, x_2, \ldots, x_n\}$, the density function at x is:
$f_B^D(x) = \sum_{i=1}^{n} f_B^{x_i}(x)$;
The density function based on the Gaussian influence function is:
$f_{Gauss}^D(x) = \sum_{i=1}^{n} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$.
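The influence and density functions above can be evaluated directly. The sketch below assumes numeric tuples and the Euclidean distance for d(x, y). The function names are hypothetical, and the search for density maxima that DENCLUE performs on top of this density is not shown.

```python
import math

def square_influence(x, y, sigma):
    """Square wave influence: 1 if y lies within distance sigma of x, else 0."""
    return 1.0 if math.dist(x, y) <= sigma else 0.0

def gauss_influence(x, y, sigma):
    """Gaussian influence: exp(-d(x, y)^2 / (2 sigma^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma, influence=gauss_influence):
    """Density at x: the sum of the influences of all data objects on x."""
    return sum(influence(x, xi, sigma) for xi in data)
```

With the square wave function the density at x is simply the number of objects within distance $\sigma$ of x, which connects this definition to the neighborhood counts used by DBSCAN.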
An example of a DENCLUE clustering result with the Gaussian influence function is shown below. The density maxima are the values at the peaks of the surface. The cluster for a density maximum $x^*$ is a subset C for which the density function at $x^*$ is not smaller than $\xi$:

Figure 2.24. DENCLUE with the Gaussian distribution function

DENCLUE depends heavily on the noise threshold $\xi$ and the density parameter $\sigma$, but it has the following advantages:
- It has a solid mathematical foundation.
- It handles outliers well.
- It can discover clusters of arbitrary shape, even in high-dimensional data.
The computational complexity of DENCLUE is O(n log n). Density-based algorithms do not sample the data set as partitioning algorithms do, because sampling could increase complexity owing to the difference between the density of the objects in the sample and the density of the whole data set.
2.4. Grid-based clustering
Density-based clustering is not well suited to high-dimensional data; grid-based clustering is used to address this need. The method clusters data using a grid data structure and is applied mainly to spatial data, for example data represented as the geometric structure of objects in space together with their relations, attributes and operations. Its goal is to quantize the data set into cells; these cells form a grid data structure, and the clustering operations then work on the objects in each cell. The grid-based approach does not move objects between cells; instead it builds several hierarchical levels of groups of objects within a cell. In this respect the method resembles hierarchical clustering, except that it does not merge cells. Clusters are therefore not based on a distance measure (also called a similarity measure for spatial data) but are determined by a predefined parameter. The advantage of grid-based clustering is its fast processing time, which is independent of the number of data objects in the original data set and depends instead on the number of cells in each dimension of the grid space. An example of a grid data structure containing cells in space is shown in the following figure:

Figure 2.25. Grid data structure model (level 1, the highest level, may contain a single cell; a cell at level i-1 may correspond to 4 cells at level i)

Typical clustering algorithms based on a grid structure include [13][20]: STING, WaveCluster and CLIQUE.

2.4.1. STING algorithm
STING (STatistical INformation Grid), proposed by Wang, Yang and Muntz in 1997, decomposes the spatial data set into a finite number of cells using a rectangular hierarchical structure. The cells exist at several levels of the grid structure and form a hierarchy: each cell at a higher level is partitioned into cells at the next lower level.
The values of the statistical parameters (such as mean, minimum and maximum) of the attributes of the data objects are computed and stored from the statistical parameters of the lower-level cells (this resembles the CF tree). These parameters include the count, the mean, the maximum max, the minimum min, the standard deviation s, and so on.
Data objects are inserted into the grid one by one, and the statistical parameters above are computed directly from these data objects. Spatial queries are answered by examining the appropriate cells at each level of the hierarchy. A spatial query is defined as the retrieval of information about spatial data and their relationships. STING is highly scalable, but because it uses a multiresolution method it depends strongly on the granularity of the lowest level. Multiresolution is the ability to decompose the data set into different levels of detail. When cells of the grid structure are merged to form clusters, the child nodes are not merged appropriately (because they correspond only to their parents), and the discovered clusters have horizontal and vertical boundaries that follow the cell boundaries. The grid data structure used by STING allows parallel processing. STING scans the whole data set once, so the complexity of computing the statistics for the cells is O(n), where n is the total number of objects. Once the hierarchical data structure has been built, the processing time for a query is O(g), where g is the total number of cells at the lowest level (g << n).
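The way a higher-level cell derives its statistics from its child cells, without revisiting the data objects, can be sketched as follows. The sketch covers a single numeric attribute and only the count, mean, minimum and maximum; the standard deviation is left out for brevity. The class `CellStats` and both function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CellStats:
    count: int
    mean: float
    minimum: float
    maximum: float

def cell_from_values(values):
    """Statistics of a lowest-level cell, computed directly from its data objects."""
    return CellStats(len(values), sum(values) / len(values), min(values), max(values))

def merge_cells(children):
    """Statistics of a parent cell, computed only from its non-empty child cells."""
    count = sum(c.count for c in children)
    mean = sum(c.mean * c.count for c in children) / count
    return CellStats(count, mean,
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))
```

Because the parent's statistics depend only on those of its children, a query can be answered from the cell hierarchy alone, which is why query time depends on the number of cells rather than on the number of objects.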

2.4.2. CLIQUE algorithm
CLIQUE, proposed by Agrawal, Gehrke, Gunopulos and Raghavan in 1998, automatically clusters subspaces of high-dimensional data, which allows better clustering than in the original full-dimensional space. Its main steps are as follows:
Step 1: Partition the data set into rectangular units and find the dense units (that is, units that contain at least a given number of neighboring data objects).
Step 2: Identify the subspaces that contain clusters, using the Apriori principle.
Step 3: Merge these units to form the data clusters.
Step 4: Determine the clusters: first find the dense one-dimensional cells, then the dense two-dimensional rectangles, then the three-dimensional ones, and so on, until the dense k-dimensional hyper-rectangles are found.
Figure 2.26. The CLIQUE algorithm

Figure 2.27. Identification of cells by CLIQUE (example axes: salary (10,000), vacation (weeks) and age; t = 3)
CLIQUE works well on high-dimensional data, but it is very sensitive to the order of the input data. The computational complexity of CLIQUE is O(n).

2.5. Model-based clustering
This method tries to find good approximations of the model parameters so that the model fits the data as well as possible. Model-based methods may use either a partitioning strategy or a hierarchical strategy, depending on the structure or model they assume about the data set and on how they refine the model to identify the partitions.
Model-based clustering tries to fit the data to a mathematical model, on the assumption that the data are generated by a mixture of underlying probability distributions. Model-based clustering algorithms follow two main approaches: statistical models and neural networks. Typical algorithms include EM and COBWEB.
2.5.1. EM algorithm
The EM (Expectation-Maximization) algorithm was first studied by Hartley in 1958 and developed fully by Dempster, Laird and Rubin, who published it in 1977. It finds maximum likelihood estimates of the parameters of a probabilistic model (a model that depends on unobserved latent variables). It is regarded as a model-based algorithm or as an extension of k-means. EM assigns each object to the given clusters according to the probability that the object belongs to each component distribution. The distribution most often used is the Gaussian; the aim is to find good values for its parameters iteratively, using as criterion the log-likelihood of the data objects, which is a suitable function for fitting a probabilistic model to the data.
The algorithm consists of 2 steps: estimating the unlabeled data (E step) and estimating the model parameters that maximize the likelihood (M step).
Specifically, at iteration t the EM algorithm performs the following:
1) E step: Compute the values of the indicator variables from the current model and the data:
$z_{ij}^{(t)} = E(z_{ij} \mid x) = \Pr(z_{ij} = 1 \mid x) = \frac{\tau_j^{(t)} f_j(x_i)}{\sum_{g=1}^{k} \tau_g^{(t)} f_g(x_i)}$
Here $z_{ij}$ indicates whether object $x_i$ belongs to cluster j, $\tau_j$ is the mixing probability of cluster j, and $f_j$ is the density of cluster j.
2) M step: Estimate the probabilities $\tau$: $\tau_j^{(t+1)} = \sum_{i=1}^{n} z_{ij}^{(t)} / n$

EM can discover clusters of many different shapes; however, because it needs many iterations to determine good parameters, its computational cost is rather high. Several improvements to EM have been proposed based on properties of the data: objects may be compressed, retained in memory, or discarded. In these improvements, an object is discarded when its cluster label is known with certainty; it is compressed when it is not discarded and belongs to a cluster too large for memory; and it is retained in all other cases.
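The E and M steps can be sketched for a one-dimensional Gaussian mixture. The update of the mixing probabilities follows the M step formula of this section; the updates of the means and standard deviations are the standard Gaussian mixture updates and are an assumption added here, since the text gives only the update for $\tau$. All function names, the fixed iteration count and the lower bound on the standard deviation are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def e_step(xs, tau, mus, sigmas):
    """z[i][j] = tau_j f_j(x_i) / sum_g tau_g f_g(x_i)."""
    z = []
    for x in xs:
        weighted = [t * normal_pdf(x, m, s) for t, m, s in zip(tau, mus, sigmas)]
        total = sum(weighted)
        z.append([w / total for w in weighted])
    return z

def m_step(xs, z):
    """Re-estimate tau_j = sum_i z_ij / n, plus the mean and deviation of each component."""
    n, k = len(xs), len(z[0])
    tau, mus, sigmas = [], [], []
    for j in range(k):
        nj = sum(z[i][j] for i in range(n))
        mu = sum(z[i][j] * xs[i] for i in range(n)) / nj
        var = sum(z[i][j] * (xs[i] - mu) ** 2 for i in range(n)) / nj
        tau.append(nj / n)
        mus.append(mu)
        sigmas.append(max(math.sqrt(var), 1e-6))   # guard against a collapsing component
    return tau, mus, sigmas

def em(xs, mus, sigmas, iterations=50):
    tau = [1.0 / len(mus)] * len(mus)              # start from equal mixing probabilities
    for _ in range(iterations):
        z = e_step(xs, tau, mus, sigmas)
        tau, mus, sigmas = m_step(xs, z)
    return tau, mus, sigmas
```

Unlike k-means, each object contributes to every cluster in proportion to $z_{ij}$, so the assignment is soft rather than hard.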
2.5.2. COBWEB algorithm
COBWEB was proposed by Fisher in 1987. The input objects of the algorithm are described by attribute-value pairs, and the algorithm performs hierarchical clustering by building a classification tree, for which different tree structures are possible.
The algorithm uses a heuristic evaluation measure called category utility (CU) to guide the construction of the tree. The tree is built from a measure of intra-class similarity and inter-class dissimilarity, both of which can be described in terms of how attribute values are distributed among the nodes of a class. The tree can be merged or split when a new node is inserted.
The main steps of the algorithm are:
1) Initialize the tree with an empty node.
2) Add the nodes one at a time, updating the tree appropriately each time.
3) Update the tree starting from the rightmost leaf in each case, then restructure the tree.
4) Decide each update on the basis of the partition and the classification criterion functions.
At each node, the algorithm considers 4 possible operations (Insert, Create, Merge, Split) and chooses the one that yields the best CU value.
Limitations of COBWEB are that it assumes the probability distributions of the individual attributes are statistically independent, and that computing, updating and storing the probability distributions of the clusters is rather costly.
Improved variants of COBWEB include CLASSIT and AutoClass.
2.6. Fuzzy clustering
Conventional data clustering methods divide an initial data set into natural clusters in which each data object belongs to exactly one cluster; this suits only the discovery of dense, well-separated clusters. In practice, however, clusters may overlap (some data objects belong to several clusters). Fuzzy set theory has been applied to data clustering to handle this case, and the combination is called fuzzy clustering. In fuzzy clustering, the membership degree $u_{ik}$ of data object $x_k$ in cluster i takes a value in the interval [0,1]. This idea was introduced by Ruspini (1969) and applied by Dunn in 1973 to build a fuzzy clustering method based on minimizing a criterion function. Bezdek (1982) generalized this method into the fuzzy c-means algorithm, which uses an exponent weight [10][13][20].
c-means is the fuzzy counterpart of k-means. The fuzzy c-means algorithm, abbreviated FCM, has been applied successfully to a large number of clustering problems in pattern recognition, image processing, medicine and other fields. However, the greatest weakness of FCM is its sensitivity to noise and outliers, which means the computed cluster centers may lie far from the true centers of the clusters.
Many methods have been proposed to remedy this weakness of FCM, including possibility-based (possibilistic) clustering (Keller, 1993), fuzzy noise clustering (Dave, 1991), clustering based on the $L_p$ norm (Kersten, 1999), and the $\varepsilon$-insensitive fuzzy c-means algorithm ($\varepsilon$FCM).
2.7. Summary of Chapter 2
This chapter presented several common data clustering methods: partitioning clustering, hierarchical clustering, density-based clustering, grid-based clustering, model-based clustering, and a newer approach in data clustering, fuzzy clustering.
Partitioning clustering starts by creating k partitions and then repeatedly reassigns data objects among clusters to improve clustering quality. Typical algorithms include k-means, PAM, CLARA and CLARANS.
Hierarchical clustering uses a hierarchy tree to cluster the data. There are two approaches: bottom-up and top-down. Typical algorithms include BIRCH and CURE.
Density-based clustering uses the density function of the data objects to determine the cluster of each object. Typical algorithms include DBSCAN, DENCLUE and OPTICS.
Grid-based clustering first quantizes the object space into a finite number of cells arranged in a grid structure and then clusters on that grid structure. Representative algorithms include STING and CLIQUE.
Model-based clustering hypothesizes a model for each cluster and searches for the best fit of the data objects to that model; the models follow statistical or neural network approaches. Typical algorithms include EM and COBWEB.
Another approach in data clustering is the fuzzy approach; notable fuzzy clustering algorithms include FCM and $\varepsilon$FCM.



Chapter 3. WEB DATA MINING

Corresponding to the types of Web data, the approaches to Web mining can be classified as follows:

Figure 3.1. Classification of Web mining: Web content mining (Web page content mining, search result mining), Web structure mining, and Web usage mining (general access pattern tracking, customized usage tracking)

3.1. Web content mining
Web content mining focuses on automatically discovering valuable information sources online. Unlike Web usage mining and Web structure mining, Web content mining concentrates on the content of Web pages, which is not only plain text but may also be multimedia data such as audio, images, metadata and hyperlinks.
In the field of Web mining, Web content mining is regarded as the counterpart of data mining techniques for relational databases, because it can discover similar kinds of knowledge from the unstructured data in Web documents. Many Web documents are semi-structured (such as HTML) or contain structured data (such as data in tables or databases that generate HTML pages), but most text data is unstructured. The unstructured nature of the data makes Web content mining a complex and challenging task.
Web content mining can be approached in two different ways: information retrieval and data mining in large databases. Multimedia data mining is a part of Web content mining; it promises to extract high-level information and knowledge from the vast online multimedia sources. Multimedia data mining on the Web has recently attracted the attention of many researchers. The goal is to build a unified framework for representation, problem solving and learning based on multimedia. This is a real challenge: the research field is still in its infancy and much work remains to be done.
There are many different approaches to Web content mining; this thesis considers it from two angles: mining search results and mining the content of HTML pages.
3.1.1. Mining search results
- Automatic document classification using a search engine: a search engine can build a centralized index of the heterogeneous data on the Web. First, it downloads Web pages from Web sites. Second, it extracts descriptive index information from those pages and stores it together with their URLs. Third, it applies data mining methods to classify the pages automatically, which supports a Web page classification system organized by hyperlink structure.
- Visualizing search results: a classification system contains many documents that are unrelated to one another. If the search results can be analyzed and clustered, search effectiveness improves: documents that are similar in content are placed in the same group, and dissimilar documents are placed in different groups.
3.1.2. Web text mining
Text mining is a hybrid technique. It draws on data mining, natural language processing, information retrieval, knowledge management and related fields. Text mining applies data mining techniques to collections of text in order to find the meaningful knowledge hidden in them. Its objects are not only structured data but also semi-structured and unstructured data [16]. The mining results are not only a general summary of each text document but also a classification or clustering of document collections that serves some specific purpose. The basic structure of text mining is shown in the figure below [16].

Figure 3.2. The Web text mining process
[Figure 3.2 diagram, boxes: Web data source; preprocessing; data representation; processing with data mining techniques; pattern extraction; evaluation and knowledge representation.]
3.1.2.1. Data selection
Basically, local texts are formatted and integrated into the documents to be mined, and are distributed across many Web services using information retrieval techniques.
3.1.2.2. Data preprocessing
We usually extract characteristic metadata as a basis and store the basic text features using data cleaning rules and methods [16]. Good mining results require data that is clear and accurate, with noisy and redundant data removed. First, we must understand the user's requirements and extract the relationships among the knowledge sources obtained from the resources. Second, we clean, transform and reorganize these knowledge sources. Finally, the resulting data set is a two-dimensional table. After preprocessing, the data set usually has the following characteristics [16]:
- The data is unified and consistently integrated.
- Irrelevant data, noise and empty data have been cleaned out. No data is lost and none is duplicated.
- The number of dimensions is reduced and knowledge discovery becomes more efficient through data transformation, generalization and constraint.
- Irrelevant attributes have been removed to reduce the dimensionality of the data.
3.1.2.3. Text representation
Web text mining works on collections of HTML documents, which are not natural text. The data must therefore be transformed into a representation suitable for processing. The documents can be processed and stored in a two-dimensional array whose entries reflect the features of each document. The TF-IDF model is commonly used to vectorize the data. An important problem is that this representation leads to vectors of very high dimension. Feature selection therefore becomes the key step, and it directly affects the effectiveness of text mining.
Classifying and removing words: first, select the words that can describe the features of a document. Second, scan the document collection several times and remove low-frequency words. Finally, also remove words that occur frequently but carry no meaning, such as the English words ah, eh, oh, o, the, an, and, of, or, ...
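The representation and filtering steps above can be illustrated with a small sketch. It assumes whitespace tokenization, the term frequency normalized by document length, and the inverse document frequency log(N/df); these particular formulas, the function names and the `min_df` parameter are illustrative assumptions rather than choices stated in the thesis. The stop-word set simply reuses the examples given in the text.

```python
import math
from collections import Counter

# Stop words taken from the examples in the text; a real list would be longer.
STOP_WORDS = {"ah", "eh", "oh", "o", "the", "an", "and", "of", "or"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(documents, min_df=1):
    """Return (vocabulary, matrix); matrix[d][t] is the TF-IDF weight
    of vocabulary term t in document d (a two-dimensional array)."""
    tokenized = [tokenize(d) for d in documents]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))          # document frequency of each term
    # Drop low-frequency terms (hypothetical threshold min_df).
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    n = len(documents)
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        matrix.append([(counts[t] / total) * math.log(n / df[t]) for t in vocab])
    return vocab, matrix


docs = ["the web mining of data", "web clustering", "data clustering and web"]
vocab, matrix = tf_idf(docs)
```

A term that occurs in every document (here "web") receives weight 0, which is the usual effect of the IDF factor.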
3.1.2.4. Feature extraction
Feature extraction is a method for dealing with the high-dimensional feature vectors that text mining produces.
Feature extraction based on a weighting function:
- Each feature word receives a confidence weight computed by a confidence weighting function. A feature word that occurs frequently is likely to reflect the topic of the text, so it is assigned a larger confidence value. Moreover, if it appears in a title, is a keyword, or is a phrase, it also receives a larger confidence value. Each feature word is stored for later processing. We then choose the size of the feature set (this size has to be determined experimentally).
- Feature extraction based on principal component analysis from statistical analysis. The main idea of this method is to replace the full set of feature words in the description with a small number of principal feature words in order to reduce the number of dimensions. In addition, attribute generalization can be used to reduce the vector dimension by aggregating many data values into a higher-level concept.
3.1.2.5. Text mining
After the texts have been collected and selected and their basic features extracted, they form the basis for data mining. From here we can perform summarization, classification, clustering, analysis and prediction.
3.1.2.5.1. Text summarization
Text summarization produces the main idea of a document, giving a brief description of it through a process of synthesis. The user can then grasp the main meaning of the text without reading all of it. This method is used in particular in search engines, which usually need to return an excerpt of each document. Many search engines return such summary sentences with their results; the main meaning of a text or collection of texts is obtained mainly by applying a variety of algorithms. As a result, search becomes more effective and better matches the results the user would choose.
3.1.2.5.2. Text classification
First, many documents are classified automatically, quickly and efficiently. Second, each class of texts is assigned to a suitable topic. This makes it convenient for users to search and browse Web documents.
The Naive Bayes and K-Nearest Neighbor classification methods are commonly used to mine text information. In text classification, the documents are first categorized. Second, the features are determined from the features of the training document set. Finally, the documents to be classified are checked and their similarity to the classified documents is computed by some algorithm. Documents with high mutual similarity fall into the same class. Similarity is measured by a predefined evaluation function. If two documents have little in common, the similarity is set to 0. If a document does not match the predefined class chosen for it, the assignment is considered unsuitable and the class must be chosen again. The selection has two phases: training and classification.
- Select the class features in advance, $Y = \{y_1, y_2, \dots, y_m\}$.
- The local training document set is $X = \{x_1, x_2, \dots, x_n\}$, where $v(x_j)$ is the feature vector of $x_j$.
- Each $v(y_i)$ in $Y$ is determined from the vectors $v(x_j)$ by training on the $v(x_j)$ in $X$.
- The test document set is $C = \{c_1, c_2, \dots, c_p\}$, where each $c_k$ in $C$ is a document to be classified; the task is to compute the similarity $sim(c_k, y_i)$ between $v(c_k)$ and $v(y_i)$.
- Assign document $c_k$ to the class $y_i$ with which its similarity is largest, that is, the class attaining $\max_{i=1,\dots,m} sim(c_k, y_i)$.
The process is repeated until all documents have been classified.
Figure 3.3. The K-Nearest Neighbor classification algorithm
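A minimal sketch of the procedure in Figure 3.3 follows. As listed, the steps compare each test document with a single vector $v(y_i)$ per class, so the sketch does the same. Two things are assumptions of this illustration, not statements from the thesis: each class vector is taken to be the mean of the training vectors of that class, and $sim$ is taken to be cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def train_class_vectors(training_vectors, labels):
    """Training phase: build v(y_i) as the mean of the v(x_j) in class y_i."""
    sums, counts = {}, {}
    for vec, label in zip(training_vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(doc_vector, class_vectors):
    """Classification phase: pick the class y_i maximizing sim(c_k, y_i)."""
    return max(class_vectors, key=lambda y: cosine(doc_vector, class_vectors[y]))


X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
labels = ["sports", "sports", "tech", "tech"]
class_vectors = train_class_vectors(X, labels)
```

A true K-Nearest Neighbor classifier would instead compare the test document with every training document and vote among the K most similar ones.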
3.1.2.5.3. Text clustering
In clustering, the classification topics do not need to be defined in advance, but the documents must still be divided into several clusters. Documents within the same cluster are required to be highly similar to one another, while documents in different clusters have lower similarity. As a rule, the document clusters relevant to a user's query are close to one another. Therefore, if the results displayed by a search engine are organized by cluster, the amount of material a user has to scan is greatly reduced. Furthermore, if a cluster is very large, it can be clustered again until the user is satisfied with a smaller search scope. The linkage (agglomerative) method and the partitioning method are commonly used in text clustering.
- In the given document set $W = \{w_1, w_2, \dots, w_m\}$, each document $w_i$ forms a cluster $c_i$; the cluster set is $C = \{c_1, c_2, \dots, c_m\}$.
- Take pairs of clusters $c_i$ and $c_j$ and compute their similarity $sim(c_i, c_j)$. If the similarity between $c_i$ and $c_j$ is the largest, merge $c_i$ and $c_j$ into a new cluster. This yields the new cluster set $C = \{c_1, c_2, \dots, c_{m-1}\}$.
- Repeat the step above until only one element remains.
Figure 3.4. The hierarchical clustering algorithm
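The merging loop of Figure 3.4 can be sketched as below. The figure does not say how the similarity of two multi-document clusters is computed, so this sketch assumes single linkage (the similarity of the two closest members); the `target` parameter, which stops merging early, is also an addition for illustration.

```python
def agglomerative(items, sim, target=1):
    """Bottom-up clustering: start with one cluster per item and repeatedly
    merge the most similar pair until `target` clusters remain.
    Returns (clusters, merges); merges records the tree that is built."""
    clusters = [[x] for x in items]
    merges = []
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (assumption): similarity of the closest members.
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


# Toy data: numbers stand in for documents, similarity = negative distance.
points = [1.0, 1.1, 5.0, 5.2, 9.0]
clusters, merges = agglomerative(points, lambda a, b: -abs(a - b), target=3)
```

Every merge compares all pairs of clusters, which is why the method becomes slow on large document sets.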
The whole process of the linkage method builds a tree that reflects the nested similarity relationships among the documents. The method is highly accurate, but it is very slow because similarities must be compared across all clusters. For a large document set this method is not feasible.
- First, divide the document set into initial clusters by optimizing an evaluation function according to some criterion, $R = \{R_1, R_2, \dots, R_n\}$, where $n$ must be fixed in advance.
- For each document in the document set $W = \{w_1, w_2, \dots, w_m\}$, compute its similarity $sim(w_i, R_j)$ to each initial cluster $R_j$, then choose the cluster with the largest similarity and put the document into that cluster $R_j$.
- Repeat the steps above until all documents have been placed in the defined clusters.
Figure 3.5. The partitioning clustering algorithm
This method produces stable clustering results quickly. However, the initial elements and their number must be specified in advance, and they directly affect the quality of the clustering.
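A small sketch of the partitioning procedure in Figure 3.5 is given below. The initial clusters are represented by seed vectors supplied by the caller. Recomputing each cluster representative as the mean of its members and repeating until the assignment stops changing is the k-means convention and is an assumption here; the figure only describes the assignment step.

```python
def neg_sq_dist(u, v):
    """Similarity used in the example: negative squared Euclidean distance."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def partition(vectors, seeds, sim=neg_sq_dist, max_iter=20):
    """Assign every vector to its most similar cluster representative.
    Returns (assignment, centroids); assignment[i] is the cluster of vectors[i]."""
    centroids = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            max(range(len(centroids)), key=lambda j: sim(v, centroids[j]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break                      # stable: nothing moved
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:                # keep the old representative if empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
assignment, centroids = partition(vectors, seeds=[[0, 0], [10, 10]])
```

Different seeds can lead to different final clusters, which is the sensitivity to initial elements noted above.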
3.1.2.5.4. Trend analysis and prediction
By analyzing Web documents, we can obtain the distribution of particular data over each period and use it to predict future developments.
3.1.3. Evaluating pattern quality
Web data mining can be regarded as a machine learning process. The results of machine learning are knowledge patterns, and an important part of machine learning is evaluating those patterns. The document set is usually divided into a training set and a test set. Learning and testing are then repeated on the training set and the test set. Finally, the average quality is used to evaluate the quality of the model.
3.2. Web usage mining
Understanding the characteristics of Web users is very important for Web site designers. By mining the history of users' access patterns, we can determine not only how the Web is being used but also many other characteristics, such as user behavior. The navigation paths of Web users carry valuable information about how interested the users are in the Web sites concerned.
Based on different criteria, Web users can be clustered and useful knowledge can be extracted from Web access patterns. Many applications can benefit from this knowledge. For example, dynamic hyperlinks between Web pages can be suggested after clusters of Web users with similar information needs have been discovered. By discovering relationships among users, such as their preferences and interests, we can predict more accurately what a user currently needs, what information the user will access next, and what information the user is looking for.
Assume that the similarity of interests among Web users can be discovered from the users' profiles. If a Web site is well designed, there will be a strong correlation between the similarity of navigation paths and the similarity of users' interests.
Web usage mining is the mining of Web access logs (Web logs) to discover the patterns with which users access a Web site. By analyzing and examining the regularities in the recorded Web accesses, we can identify customers in electronic commerce, improve the quality of Internet information services delivered to users, and improve the performance of Web server systems. In addition, Web sites can adapt themselves by learning from users' access patterns. Analyzing users' Web access logs can also help build better Web services customized for each individual user.
At present, pattern discovery tools and pattern analysis tools are commonly used. They analyze user actions, filter the data and mine knowledge from the data set using artificial intelligence, data mining, psychology and information theory. Once the access patterns have been found, corresponding analysis techniques are used to understand, interpret and explore them, for example online analytical processing (OLAP), pre-classification of data forms, and analysis of users' usage habits.
The general architecture of the Web usage mining process is as follows:

Figure 3.6. General architecture of Web usage mining
3.2.1. Applications of Web usage mining
- Finding potential customers in electronic commerce.
- E-government (e-Gov) and e-learning.
- Identifying potential advertisements.
- Improving the delivery quality of Internet information services to end users.
- Improving the performance of Web servers.
- Personalizing Web services by analyzing the individual characteristics of users.
- Improving Web design by analyzing browsing habits and the content patterns of the pages users visit.
- Detecting fraud and illegal intrusion in electronic commerce services and other Web services.
- Predicting users' behavior during information searches by analyzing their access sequences.
3.2.2. Techniques used in Web usage mining
Association rules: used to find Web pages that users often access together, and items that customers often choose together in electronic commerce.
Clustering: users are clustered according to their browsing patterns in order to find relationships among Web users and their behavior.
3.2.3. Issues in Web usage mining
Web usage mining involves two tasks. First, the Web log must be cleaned, defined, integrated and transformed. The result is then analyzed and mined.
Open problems:
- The physical structure of a Web site differs from the patterns in which users access it.
- It is very difficult to identify users, sessions and transactions.
The problem of identifying user sessions and Web accesses:
A user navigation session is the group of actions performed by a user from the moment the user enters a Web site until the moment the user leaves it. The actions of users on a Web site are recorded and stored in a log file. The log file contains the IP address of the client, the date and time at which the request was received, the requested object, and other information such as the request protocol and the size of the object.
3.2.3.1. Identifying user sessions
User identification: all requests with the same client IP address are considered to come from the same user.
Session identification: a new session is created when a new address is found, or when, for a given IP address, the time spent on a page exceeds the allowed threshold (for example, 30 minutes).
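The two heuristics above (one user per client IP, a new session after a gap of more than 30 minutes) can be sketched as follows. The input format, a list of `(ip, timestamp, url)` tuples, and the function name are assumptions made for this illustration.

```python
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (ip, timestamp, url) requests into sessions.
    A new session starts for an unseen IP, or when the gap since that
    IP's previous request exceeds `timeout`."""
    sessions = []
    last_seen = {}                      # ip -> (last timestamp, session index)
    for ip, ts, url in sorted(requests, key=lambda r: r[1]):
        if ip in last_seen and ts - last_seen[ip][0] <= timeout:
            idx = last_seen[ip][1]      # continue the current session
        else:
            sessions.append({"ip": ip, "pages": []})
            idx = len(sessions) - 1
        sessions[idx]["pages"].append(url)
        last_seen[ip] = (ts, idx)
    return sessions


t0 = datetime(2007, 4, 4, 14, 0)
requests = [
    ("192.0.2.1", t0, "/a"),
    ("192.0.2.1", t0 + timedelta(minutes=10), "/b"),
    ("192.0.2.2", t0 + timedelta(minutes=12), "/a"),
    ("192.0.2.1", t0 + timedelta(minutes=50), "/c"),   # 40-minute gap
]
sessions = sessionize(requests)
```

Treating an IP address as a user is only an approximation: several users behind one proxy share an address, and one user may appear under several addresses.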
3.2.3.2. Web logs and identifying user navigation sessions
Web log file: a Web log file is a collection of records of users' requests for documents on a Web site, for example:
216.239.46.60 - - [04/April/2007:14:56:50 +0200] "GET /~lpis/curriculum/C+Unix/Ergastiria/Week-7/filetype.c.txt HTTP/1.0" 304 -
216.239.46.100 - - [04/April/2007:14:57:33 +0200] "GET /~oswinds/top.html HTTP/1.0" 200 869
64.68.82.70 - - [04/April/2007:14:58:25 +0200] "GET /~lpis/systems/rdevice/r-device_examples.html HTTP/1.0" 200 16792
216.239.46.133 - - [04/April/2007:14:58:27 +0200] "GET /~lpis/publications/crc-chapter1.html HTTP/1.0" 304 -
209.237.238.161 - - [04/April/2007:14:59:11 +0200] "GET /robots.txt HTTP/1.0" 404 276
209.237.238.161 - - [04/April/2007:14:59:12 +0200] "GET /teachers/pitas1.html HTTP/1.0" 404 286
216.239.46.43 - - [04/April/2007:14:59:45 +0200] "GET /~oswinds/publication
Source: http://www.csd.auth.gr/
Figure 3.7. Sample contents of a log file
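Log records of the shape shown in Figure 3.7 can be split into fields with a regular expression. The sketch below is one possible parser; the field names (`ip`, `time`, `path`, and so on) are chosen for this illustration, and the pattern covers only the layout visible in the sample lines.

```python
import re

# One record: client, two identity fields, [timestamp], "request", status, size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # "-" means that no body was sent (for example status 304).
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


line = ('209.237.238.161 - - [04/April/2007:14:59:11 +0200] '
        '"GET /robots.txt HTTP/1.0" 404 276')
entry = parse_log_line(line)
```

A truncated record, such as the last line of the sample, does not match the pattern and is reported as `None`, so it can be skipped during preprocessing.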
3.2.3.3. Issues in processing Web logs
- The information provided may be incomplete or insufficiently detailed.
- There is no information about the content of the pages visited.
- There are too many log records caused by requests served through proxies.
- The log records are incomplete because some requests are served by proxies.
- Filtering log entries is a particular concern: entries for files with extensions such as gif, jpeg and jpg, and page requests generated by automated agents and spider programs.
- Estimating page visit time: the time spent visiting a page is a good measure of the user's interest in that page; it provides an implicit rating of the page.
- Page visit duration: the time interval between two consecutive requests for different pages.
- Backtracking: many users leave a page because they have finished their search and do not want to spend a long time navigating.

3.2.3.4. Methods for identifying sessions and Web accesses
Session identification: a user's page references are grouped into a session using heuristic methods:
The heuristic based on the IP address and a session timeout (for example, 30 minutes) is used to identify user sessions. This is the simplest method.
The transactions within a session can be obtained from a model of user behavior (which classifies each reference as a content reference or a navigation reference for each user).
A weight is assigned to each Web page based on some measure of the user's interest (for example, the time spent viewing a page or the number of visits to the page).
3.2.4. The Web usage mining process
Web usage mining has three phases [22]: preprocessing, mining, and analysis and evaluation with presentation of the data.
3.2.4.1. Data preprocessing
This phase covers user identification, identification of access activity, path completion, transaction identification, data integration and data transformation. In this phase, the Web log information is transformed into transaction patterns suitable for later processing in different domains.
The phase also includes removing files with extensions such as gif and jpg, and adding or removing data that is missing because of local caches and proxy services. It processes the information in cookies and user registration information, combined with the IP address, the browser name and temporarily stored information.
Transaction identification: identifying user sessions and transactions.
3.2.4.2. Data mining
Data mining methods from different domains, such as association rules, statistical analysis, path analysis, classification and clustering, are used to discover user patterns.
+ Path analysis [8][9][22]: most frequently visited paths are laid out according to the physical graph of the Web site. Each node is a page and each edge is a link between pages. By analyzing the paths followed during user accesses, we can learn the relationships between related paths in the users' visits.
Examples:
- 70% of the customers who accessed /company/product2 started from /company and passed through /company/new, /company/products and /company/product1.
- 80% of the customers who accessed the Web site started from /company/products.
- 65% of the customers left the site after visiting four or fewer pages.
+ Association rules [8]: the correlations among references to different files on the server are found by using association rules.
Examples:
- 40% of the customers who accessed the Web page /company/product1 also accessed /company/product2.
- 30% of the customers who accessed /company/special did so through /company/product1.
This helps in developing suitable business strategies and in building and organizing the company's Web space in the best way.
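Percentages such as "40% of the customers who accessed /company/product1 also accessed /company/product2" correspond to the confidence of an association rule. The sketch below computes support and confidence for a single rule over a set of user sessions; the session data is invented purely for illustration.

```python
def rule_stats(sessions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent,
    where each session is the set of pages visited in it."""
    n = len(sessions)
    with_a = [s for s in sessions if antecedent in s]
    with_both = [s for s in with_a if consequent in s]
    support = len(with_both) / n if n else 0.0
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence


# Hypothetical sessions, not data from the thesis.
sessions = [
    {"/company/product1", "/company/product2"},
    {"/company/product1"},
    {"/company/product1", "/company/product2", "/company/special"},
    {"/company/special"},
    {"/company/product1"},
]
support, confidence = rule_stats(sessions, "/company/product1", "/company/product2")
```

Here the rule holds in 2 of the 4 sessions containing /company/product1, so its confidence is 0.5, while its support over all 5 sessions is 0.4.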
+ Sequential patterns: patterns found across transactions ordered in time. A sequential pattern states that one set of items is followed by another item in the time-ordered set of transactions.
The visits of customers are recorded over successive time periods.
Examples:
- 30% of the customers who visited /company/products had performed a keyword search on Yahoo.
- 60% of the customers who placed an online order at /company/product1 also placed an online order at /company/product4 within 15 days.
+ Classification rules [22]: a profile describes the items belonging to a particular group in terms of their common attributes, for example personal information or access patterns. The profile can be used to classify new data items added to the database.
Example: customers from national or government locations who visit the site tend to be attracted to the page /company/product1; or 50% of the customers who placed an online order at /company/product2 belong to the 20-25 age group on the West Coast.
+ Clustering analysis: groups together customers or data items that have similar characteristics.
It helps in developing and carrying out online and offline marketing strategies, such as automatic replies to customers who belong to a certain group, and it allows a Web site to be adapted dynamically for each individual customer.
3.2.4.3. Analysis and evaluation
Pattern analysis [22]: statistics, knowledge querying and intelligent agents; feasibility analysis and data querying oriented toward human consumption.
Visualization: Web visualization is used to draw Web path diagrams and to produce directed graphs for OLAP.
Example query: SELECT association-rules(A*B*C*) FROM log.data WHERE (date >= 970101) AND (domain = "edu") AND (support = 1.0) AND (confidence = 90.0)
3.2.5. An example of Web usage mining
This example uses classification and clustering; association rules can also be used to analyze the number of users. The Web designer can then offer different services at different times according to the patterns with which users access the Web site. Good service quality increases the number of users visiting the site. The process is as follows:
- Identify the users who access the Web site, and analyze particular users to find the important ones through their access frequency, the time they stay on the site and how much they like the pages.
- Analyze particular topics and the depth of the Web content, for example a country's daily events or tour introductions, and the natural relationship between users and Web content. Find services that are attractive and convenient for users.
Depending on how effectively the Web site is accessed and on the conditions under which it is browsed, we can better plan and evaluate the content of the site.
Based on the test data, we determine the users' access level by analyzing a Web site and its service requests as they change hour by hour and day by day, as follows [16]:
Time          Visitors    Time          Visitors
00:00-00:59   936         12:00-12:59   2466
01:00-01:59   725         13:00-13:59   1432
02:00-02:59   433         14:00-14:59   1649
03:00-03:59   389         15:00-15:59   1537
04:00-04:59   149         16:00-16:59   2361
05:00-05:59   118         17:00-17:59   2053
06:00-06:59   126         18:00-18:59   2159
07:00-07:59   235         19:00-19:59   1694
08:00-08:59   599         20:00-20:59   2078
09:00-09:59   1414        21:00-21:59   2120
10:00-10:59   2424        22:00-22:59   1400
11:00-11:59   2846        23:00-23:59   1163
Table 3.1. Number of users at different times of day

Figure 3.8. Analysis of users accessing the Web (horizontal axis: time of day; vertical axis: number of users)

3.3. Web structure mining
The WWW is a global information system comprising all Web sites. Each page can be linked to many other pages. Hyperlinks carry semantics about the topic of a page. A hyperlink pointing to another Web page can be regarded as an endorsement of that page. It is therefore very useful to exploit this semantic information to obtain important information by analyzing the links between Web pages.
Mining methods are used to extract useful knowledge from the structure of the Web, to find important Web pages, and to plan the construction of Web sites that suit their users.
The goal of Web structure mining is to discover structural information about the Web. Whereas Web content mining focuses mainly on the structure inside a document, Web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, Web structure mining classifies Web pages and generates information such as the similarity and relationships between different Web sites. If a Web page is linked directly to another Web page, we want to discover the relationship between the two pages. They may be similar in content, or they may belong to the same Web server and therefore have been created by the same person. Another task of Web structure mining is to discover the natural hierarchy or network of hyperlinks within the Web sites of a particular domain. This can help generate information flows within Web sites that may be representative of many particular domains, so that query processing becomes easier and more efficient.
+ Web link analysis is used for the following purposes [1]:
- Ranking documents that match a user's query.
- Deciding which Web pages to include in the collection used for queries.
- Page categorization.
- Finding related pages.
- Finding duplicated Web sites.
+ The Web viewed as a graph [1]:
- Link graph: each node is a page, and there is a directed edge from u to v if there is a hyperlink from Web page u to Web page v.

Figure 3.9. The Web link graph
- Co-citation graph: each node is a page, and there is an undirected edge between u and v if a third page w links to both u and v.
- Assumptions: a link from page u to page v is a recommendation of page v by page u. If u and v are connected by a link, it is very likely that the two Web pages have similar content.
3.3.1. Criteria for measuring similarity
To discover a group of similar Web pages, we must express the similarity of two nodes according to some criterion.
Criterion 1: for any two Web pages $d_1$ and $d_2$, we say that $d_1$ and $d_2$ are related if there is a link from $d_1$ to $d_2$ or from $d_2$ to $d_1$.

Figure 3.10. Direct relationship between two pages
Criterion 2: co-citation: the similarity between $d_1$ and $d_2$ is measured by the number of pages that link to both $d_1$ and $d_2$.

Figure 3.11. Co-citation similarity
Bibliographic coupling (index similarity): the similarity between $d_1$ and $d_2$ is measured by the number of pages to which both $d_1$ and $d_2$ point.

Figure 3.12. Bibliographic coupling similarity
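The three criteria can be computed directly from a link graph. In the sketch below the graph is a dictionary that maps each page to the set of pages it links to; this representation, the function names and the sample graph are assumptions made for illustration.

```python
def directly_related(links, d1, d2):
    """Criterion 1: a link exists from d1 to d2 or from d2 to d1."""
    return d2 in links.get(d1, ()) or d1 in links.get(d2, ())


def cocitation(links, d1, d2):
    """Criterion 2: number of pages that link to both d1 and d2."""
    return sum(1 for targets in links.values() if d1 in targets and d2 in targets)


def coupling(links, d1, d2):
    """Bibliographic coupling: number of pages that both d1 and d2 link to."""
    return len(set(links.get(d1, ())) & set(links.get(d2, ())))


# Hypothetical link graph: page -> set of pages it links to.
links = {
    "a": {"d1", "d2"},
    "b": {"d1", "d2"},
    "c": {"d1"},
    "d1": {"x", "y"},
    "d2": {"y", "z", "d1"},
}
```

For this graph, pages a and b cite both d1 and d2, and y is the only page that both d1 and d2 point to.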

3.3.2. Mining and managing Web communities
A Web community is a group of Web pages that share topics of interest to users. The members of a Web community may not be aware of one another's existence (and may not even know that the community exists). Recognizing Web communities and understanding their development and characteristics is very important. Identifying and understanding the communities on the Web can be regarded as a form of Web mining and management.

Figure 3.13. Web communities
Characteristics of Web communities:
- Web pages within the same community are more similar to one another than to Web pages outside the community.
- Each Web community forms a cluster of Web pages.
- Explicitly defined Web communities are known to everyone, such as the resources listed by Yahoo.
- Implicitly defined Web communities are communities that emerge unexpectedly.
Web communities are attracting growing interest and have many practical applications. Research on methods for discovering communities is therefore of great practical significance. To extract hidden communities, we can analyze the Web graph. There are many methods for identifying communities, such as the topic search algorithm HITS, maximum flow and minimum cut, and the PageRank algorithm.
3.3.2.1. The PageRank algorithm
Google is based on the PageRank algorithm [brin98], which indexes the links between Web sites and interprets a link from A to B as an endorsement of B by A. Links have different values. If A has many links pointing to it and C has few, then a link from A to B is worth more than a link from C to B. The value determined in this way is called the PageRank of a page and determines its position in the search results (PageRank is used in addition to conventional text indexing to produce highly accurate search results). Links can be analyzed more accurately and efficiently than traffic volume or page views, and they have become a measure of success and of changes in the ranking of pages.

Figure 3.14. Result of the PageRank algorithm
PageRank is not based simply on the total number of inbound links. The basic approach of PageRank is that a document is considered more important the more documents link to it, but inbound links do not all count equally. A document ranks high in terms of PageRank if other highly ranked documents link to it. Thus, in the PageRank concept, the rank of a document is determined by the ranks of the documents that link to it. Their ranks, in turn, are determined by the ranks of the documents that link to them.
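The recursive definition above can be illustrated with the usual iterative formulation, in which each page repeatedly passes a share of its rank to the pages it links to. The thesis text does not give a formula, so the damping factor of 0.85, the even redistribution of rank from pages without outgoing links, and the fixed number of iterations are all assumptions of this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a graph given as page -> list of linked pages.
    Ranks are kept normalized so that they sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, ())
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outgoing links: spread its rank evenly (assumption).
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank


# A and C both link to B; B links back to A; nothing links to C.
ranks = pagerank({"A": ["B"], "C": ["B"], "B": ["A"]})
```

In this small graph B ranks highest, and A ranks above C because A is endorsed by the highly ranked page B while C has no inbound links.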
3.3.2.2. Clustering with the HITS algorithm
The HITS algorithm (Hypertext-Induced Topic Selection), proposed by Kleinberg, is a more developed algorithm for ranking documents based on the link information among a set of documents.
Definitions:
- Authority: a page that provides important, reliable information on the given topic.
- Hub: a page that contains links to authorities.
- In-degree: the number of links pointing to a node; it is used to measure authority.
- Out-degree: the number of links leaving a node; it is used to measure how much of a hub the node is.
Each hub points to many authorities, and each authority is pointed to by many hubs. Together they form a bipartite graph.

Figure 3.15. Bipartite graph of hubs and authorities
Authorities and hubs exhibit a mutually reinforcing relationship: a hub is better if it points to good authorities, and conversely an authority is better if it is pointed to by many good hubs.

Figure 3.16. The combination of hubs and authorities
Steps of the HITS method
Step 1: Determine a base set S. Take the set of documents returned by a standard search engine, called the root set R, and initialize S to R.
Step 2: Add to S all pages that are pointed to by any page in R, and all pages that point to any page in R.
For each page p in S:
compute the authority score $a_p$ (vector a);
compute the hub score $h_p$ (vector h).
For every node, initialize $a_p$ and $h_p$ to $1/n$ (n is the number of pages).
Step 3: In each iteration, compute the authority weight of every node in S by the formula:
$$a_p = \sum_{q:\, q \to p} h_q$$
Step 4: In each iteration, compute the hub weight of every node in S by the formula:
$$h_p = \sum_{q:\, p \to q} a_q$$
For example, for node 1 in Figure 3.16, h(1) = a(5) + a(6) + a(7) and a(1) = h(2) + h(3) + h(4).
Note that the hub weights are computed from the current authority weights, which were themselves computed from the previous hub weights.
Step 5: After the new weights of all nodes have been computed, the weights are normalized so that:
$$\sum_{p \in S} (a_p)^2 = 1 \quad \text{and} \quad \sum_{p \in S} (h_p)^2 = 1$$
Repeat from Step 3 until the values $h_p$ and $a_p$ no longer change.
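Steps 3 to 5 above can be sketched as follows. The base set S is assumed to have been built already and is passed in as a link dictionary; the sketch runs a fixed number of iterations instead of testing for convergence, which is a simplification. The example graph mirrors the relations quoted for node 1, h(1) = a(5) + a(6) + a(7) and a(1) = h(2) + h(3) + h(4).

```python
import math


def hits(links, iterations=50):
    """HITS on a graph given as page -> list of pages it points to.
    Returns (authority, hub) dictionaries, each with unit sum of squares."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    auth = {p: 1.0 / n for p in pages}
    hub = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Step 3: a_p = sum of h_q over all q that point to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Step 4: h_p = sum of a_q over all q that p points to,
        # using the authority weights just computed.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        # Step 5: normalize so that the squares of each vector sum to 1.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Node 1 points to 5, 6, 7 and is pointed to by 2, 3, 4.
auth, hub = hits({1: [5, 6, 7], 2: [1], 3: [1], 4: [1]})
```

Page 1 receives the highest authority score because three hubs point to it, while pages 2, 3 and 4 have no inbound links and therefore an authority score of 0.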
Example: the root set R is {1, 2, 3, 4}.

Figure 3.17. Hub-Authority graph

The computed results are as follows:

Figure 3.18. Hub and Authority weights (chart of the authority weight and the hub weight for nodes 1 to 15)
Web data mining is a new research field with great promise. Its techniques are widely applied around the world, including text mining on the Web, mining of continuous spatial and temporal data on the Web, Web mining for electronic commerce systems, and mining of the Web hyperlink structure. To date, data mining techniques still face many major challenges when applied to the Web.

3.4. Applying data clustering algorithms to searching and clustering Web documents
Today, the continual improvement of search engines, in both search functionality and user interface, has made it easier for users to find information on the Web. However, users often still have to browse through dozens or even thousands of Web pages before finding what they need. In general, users look only at the first few dozen results; they lack the patience and the time to go through all the results that a search engine returns. To address this problem, the search results can be grouped by topic, so that users can skip the groups that do not interest them and go straight to the group on the topic they care about. This helps users carry out their work more effectively. However, clustering data on the Web and choosing suitable topics that describe the content of the pages is not a simple problem. This thesis examines the use of clustering techniques to cluster Web documents drawn from a repository of data that has already been retrieved and stored.
3.4.1. The clustering-based approach
At present there are several ways to assess the importance of a Web page, such as PageRank and HITS. However, these methods rely mainly on the links between pages to determine the weight of a page.
The importance of a page can also be assessed in a different way, by using the content of the documents to determine the weights: documents that are "close together" in content have comparable importance and belong to the same group.
Suppose we are given a set S of Web pages. We find the pages in S that contain the content of the query, which gives the set R. A data clustering algorithm is used to divide R into k clusters (k is given) such that the elements within a cluster are as similar to one another as possible and elements of different clusters are dissimilar.
The elements of S-R are then placed into one of the k clusters established above. An element that is similar to the centroid of a cluster (according to some given threshold) is placed in that cluster; elements that do not satisfy the threshold are considered irrelevant to the query and are removed from the result set. Next, weights are assigned to the clusters and to the pages in the result set by the following algorithm:
INPUT: a data set D of pages organized into k clusters with k centroids
OUTPUT: the weights of the pages
BEGIN
Assign a weight $ts_m$ to each cluster $m$ with centroid $C_m$. For any two centroids $C_i$ and $C_j$, we always have $ts_i > ts_j$ if $C_i$ is more similar to the query than $C_j$.
For each page $p$ in cluster $m$, determine the page weight $pw$. For any two pages $p_i$ and $p_j$ in the cluster, we always have $pw_i > pw_j$ if $p_i$ is closer to the centroid than $p_j$.
END
Figure 3.19. Algorithm for weighting clusters and pages
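One way to realize the ordering constraints of Figure 3.19 is to use a similarity value directly as the weight. In the sketch below the cluster weight is the cosine similarity between the centroid and the query vector, and the page weight is the cosine similarity between the page and its centroid. The figure only requires that the weights respect these orderings, so the choice of cosine similarity and the data layout are assumptions of this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_clusters(query, centroids, clusters):
    """clusters[m] is a list of (page_id, vector) pairs for centroid m.
    Returns a list of (cluster_weight, m, [(page_id, page_weight), ...]),
    with clusters and pages sorted by decreasing weight."""
    ranked = []
    for m, centroid in enumerate(centroids):
        ts = cosine(query, centroid)                       # cluster weight ts_m
        pages = sorted(
            ((cosine(vec, centroid), pid) for pid, vec in clusters[m]),
            reverse=True,
        )
        ranked.append((ts, m, [(pid, pw) for pw, pid in pages]))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


query = [1.0, 0.0]
centroids = [[0.0, 1.0], [1.0, 0.1]]
clusters = [
    [("p1", [0.0, 1.0]), ("p2", [0.3, 1.0])],
    [("p3", [1.0, 0.0]), ("p4", [1.0, 0.1]), ("p5", [0.5, 0.5])],
]
ranked = rank_clusters(query, centroids, clusters)
```

The cluster whose centroid is closest to the query comes first, and within each cluster the pages nearest the centroid come first.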
With this approach, the following problems are addressed:
+ The search results are divided into clusters by topic, and users can pick the topic they need according to their specific requirements.
+ Searching and determining page weights concentrate mainly on the content of the pages rather than on the links between pages.
+ The problem of synonymous words and phrases in the user's query is handled.
+ Clustering methods from data mining can be combined with existing search methods.
At present, several data clustering algorithms are used for text clustering, such as partitioning algorithms (k-means, PAM, CLARA) and hierarchical algorithms (BIRCH, STC). In practice, when clustering Web documents by content, a document may belong to several topic groups. To handle this, clustering algorithms that follow the fuzzy approach can be used.
3.4.2. The process of searching and clustering documents
Basically, the clustering of search results proceeds in the following steps [31]:
- Search for the Web pages on Web sites that satisfy the query.
- Extract descriptive information from the pages and store it together with the corresponding URLs.
- Use data clustering techniques to cluster the Web pages automatically, so that the pages within a cluster are more similar in content to one another than to pages outside the cluster.

Figure 3.20. Steps in clustering search results on the Web
[Figure 3.20 diagram, boxes: Web data; search and data extraction; preprocessing; data representation; applying the clustering algorithm; presentation of results.]

3.4.2.1. Searching for data on the Web
The main task of this stage is to use the set of search keywords to search for and return a set consisting of the full text of each document, its title, a brief description, its URL and so on, for the matching pages.
Nhm nng cao tc x l, ta tin hnh tm kim v lu tr cc ti liu
ny trong kho d liu s dng cho qu trnh tm kim (tng t nh cc
Search Engine Yahoo, Google,). Mi phn t gm ton vn ti liu, tiu ,
on m t ni dung, URL,
3.4.2.2. Data preprocessing
This is the process of cleaning the data and converting the documents into a suitable data representation.
This stage includes the following tasks: normalizing the text, removing stop words, merging words that share the same stem, and digitizing and representing the text.
3.4.2.2.1. Text normalization
This stage converts raw text into a form that makes later processing easier, simpler, more convenient, and more accurate than working directly on the raw text, while having little effect on the results. It includes:
+ Removing HTML tags and other kinds of tags to extract the words and phrases.
+ Converting uppercase characters to lowercase.
+ Removing punctuation marks, redundant whitespace, and so on.
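The three normalization steps above can be sketched in a few lines. This is only an illustration in Python, not the thesis program (which is written in Visual Basic); the function name `normalize_text` and the choice of regular expressions are assumptions made for the example.

```python
import html
import re


def normalize_text(raw_html):
    """Strip tags, lowercase the text, and remove punctuation and extra spaces."""
    # Drop <script> and <style> elements together with their contents,
    # so that code inside them is not mistaken for words.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    # Remove the remaining HTML (or other angle-bracket) tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; before punctuation is stripped.
    text = html.unescape(text)
    # Convert uppercase characters to lowercase.
    text = text.lower()
    # Replace every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse redundant whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

For instance, `normalize_text("<p>Web  Mining, <b>Clustering</b>!</p>")` yields `"web mining clustering"`. A real crawler would more likely rely on a proper HTML parser than on regular expressions.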
3.4.2.2.2. Stop word removal
Some words carry little information for processing. Words that occur with low frequency, and words that occur with high frequency but are unimportant for processing, are all removed. According to some recent studies, removing stop words can reduce the total number of words in a text by about 20-30%.
Many words occur very frequently yet are not useful for clustering. Examples in English are a, an, the, of, and, to, on, by; examples in Vietnamese are thì, mà, là, và, hoặc. Words that occur with excessively high frequency are removed as well.
For simplicity in practical applications, we can maintain a list of stop words and use Zipf's law to remove words whose frequency is too low or too high.
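A minimal sketch of both filters described above, assuming each document is already a list of normalized tokens. The stop list is the English sample from the text; the function names and the thresholds `min_count` and `max_count` are hypothetical, since the thesis does not state the cut-off values it derives from Zipf's law.

```python
from collections import Counter

# Sample stop list taken from the English examples in the text.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "on", "by"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


def filter_by_frequency(docs, min_count, max_count):
    """Keep only tokens whose total count over all documents
    lies between min_count and max_count (inclusive)."""
    counts = Counter(t for doc in docs for t in doc)
    keep = {t for t, c in counts.items() if min_count <= c <= max_count}
    return [[t for t in doc if t in keep] for doc in docs]
```

Applying the stop list first and the frequency band second keeps the two criteria independent, so either one can be tuned without touching the other.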
3.4.2.2.3. Merging words with the same stem
In most languages many words share a common stem and carry similar meanings. To reduce the number of dimensions of the text representation, we therefore merge words with the same stem into a single word. According to some studies [5], this merging reduces the dimensionality of the text representation by about 40-50%.
For example, in English the words user, users, used, and using share a stem and are reduced to use; the words engineering, engineered, and engineer share a stem and are reduced to engineer.
Example rules for stemming English words:
- If a word ends in "ing", delete the "ing", unless only one character or only "th" would remain.
- If a word ends in "ies" but not in "eies" or "aies", replace "ies" with "y".
- If a word ends in "es", drop the "s".
- If a word ends in "s" and the preceding letter is a consonant other than "s", delete the "s".
- If a word ends in "ed" and the preceding letter is a consonant, delete the "ed", unless only one character would remain; if the preceding letter is the vowel "i", change "ied" to "y".
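The rules listed above can be turned into a small rule-based stemmer. The sketch below is an assumption about how they might be ordered and combined (the text does not fix an order), and the name `stem` is hypothetical. It implements only these five rules, so it will not, for example, reduce "user" to "use".

```python
VOWELS = set("aeiou")


def stem(word):
    """Apply the suffix rules in order; the first matching rule wins."""
    w = word.lower()
    if w.endswith("ing"):
        rest = w[:-3]
        # Keep the word if only one character or only "th" would remain.
        return rest if len(rest) > 1 and rest != "th" else w
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    if w.endswith("ied"):
        return w[:-3] + "y"
    if w.endswith("es"):
        return w[:-1]  # drop only the final "s"
    if w.endswith("s") and len(w) > 1 and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]
    # "ed" after a consonant, provided more than one character remains.
    if w.endswith("ed") and len(w) > 3 and w[-3] not in VOWELS:
        return w[:-2]
    return w
```

With these rules, "clustering" and "clustered" both map to "cluster", "queries" maps to "query", and "thing" is left unchanged. Production systems usually use a complete algorithm such as Porter's stemmer instead.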
3.4.2.3. Building the dictionary
Building the dictionary is a very important task in the vectorization of text. The dictionary contains the distinct words and phrases found in the whole data set. It consists of a table of words, each with its index in the dictionary, sorted in order.
Some papers [31] suggest that, to improve clustering quality, the handling of phrases in different contexts should be considered. Following the proposal of Zamir [19][31], a dictionary of 500 entries is appropriate.
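As a rough illustration of such a dictionary, the sketch below keeps the most frequent terms up to a size limit and assigns each one an index. The function name `build_dictionary`, the alphabetical ordering of the final table, and the tie-breaking rule are assumptions; only the limit of 500 entries comes from the text.

```python
from collections import Counter


def build_dictionary(docs, max_size=500):
    """Map each retained term to its index in the dictionary.

    docs is a list of token lists; at most max_size terms are kept.
    """
    counts = Counter(t for doc in docs for t in doc)
    # Most frequent terms first; ties are broken alphabetically so that
    # the result does not depend on the order of the input.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    terms = sorted(term for term, _ in ranked[:max_size])
    return {term: index for index, term in enumerate(terms)}
```

A dictionary lookup then replaces each word by its index, which is the representation the vectorization step expects.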

3.4.2.4. Tokenization, digitization, and document representation
Tokenization is an essential task in text representation. Tokenizing and vectorizing a document means finding its words and replacing each one with its index in the dictionary.
Here we can use one of the mathematical models TF, IDF, TF-IDF, and so on to represent the text.
We use a two-dimensional weight array W of size n × m, where n is the number of documents and m is the number of terms in the dictionary (the number of dimensions). Row j is the vector representing the j-th document in the database, and column i corresponds to the i-th term in the dictionary. $W_{ij}$ is the weight of term i with respect to document j.
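To make the layout of W concrete, here is a hedged sketch that fills the matrix with the log-scaled TF-IDF weighting adopted in this section, $(1 + \ln tf_{ij}) \cdot \ln(n / h_i)$, and 0 for absent terms. The function name `tfidf_matrix` is hypothetical, the natural logarithm is assumed, and `dictionary` is assumed to map each term to its column index.

```python
import math


def tfidf_matrix(docs, dictionary):
    """Return W as a list of rows: one row per document, one column per term."""
    n, m = len(docs), len(dictionary)
    # counts[j][i]: how often term i occurs in document j (tf).
    counts = [[0] * m for _ in range(n)]
    for j, doc in enumerate(docs):
        for token in doc:
            i = dictionary.get(token)
            if i is not None:  # ignore words outside the dictionary
                counts[j][i] += 1
    # h[i]: number of documents that contain term i.
    h = [sum(1 for j in range(n) if counts[j][i] > 0) for i in range(m)]
    weights = [[0.0] * m for _ in range(n)]
    for j in range(n):
        for i in range(m):
            tf = counts[j][i]
            if tf > 0:
                weights[j][i] = (1 + math.log(tf)) * math.log(n / h[i])
    return weights
```

A term that occurs in every document gets weight 0, because $\ln(n / h_i) = \ln 1 = 0$; such terms do not help to tell documents apart.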
This stage counts how often each term $t_i$ occurs in each document $d_j$ and how many documents contain $t_i$. From these statistics, the weights of the matrix W are computed as follows.
Weight formula under the TF-IDF model:

$$W_{ij} = \begin{cases} tf_{ij} \times idf_{ij} = [1 + \log(tf_{ij})] \times \log\left(\dfrac{n}{h_i}\right) & \text{if } t_i \in d_j \\ 0 & \text{otherwise } (t_i \notin d_j) \end{cases}$$

where:
$tf_{ij}$ is the frequency of $t_i$ in document $d_j$;
$idf_{ij}$ is the inverse document frequency of $t_i$;
$h_i$ is the number of documents in which $t_i$ occurs;
n is the total number of documents.
3.4.2.5. Document clustering
After searching, extracting the data, preprocessing, and representing the text, we apply a clustering technique to cluster the documents.
INPUT: A set of n documents and the number of clusters k.
OUTPUT: Clusters $C_i$ (i = 1, ..., k) such that the criterion function reaches its minimum value.
BEGIN
Step 1. Randomly initialize k vectors as the centroids of the k clusters.
Step 2. For each document $d_j$, determine its similarity to the centroid of each cluster using one of the common similarity measures (such as Dice, Jaccard, Cosine, Overlap, Euclidean, or Manhattan). Find the most similar centroid for each document and assign the document to that cluster.
Step 3. Update the centroids. For each cluster, recompute the centroid as the arithmetic mean of the document vectors in that cluster.
Step 4. Repeat Steps 2 and 3 until the centroids no longer change.
END.
Figure 3.21. The k-means algorithm for clustering Web documents by content
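The four steps of Figure 3.21 can be sketched as follows. This is a minimal Python illustration, not the thesis implementation: Euclidean distance is chosen among the measures the text lists, and the names `kmeans`, `max_iter`, and `seed`, together with the rule of keeping the old centroid of an empty cluster, are assumptions added for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, max_iter=100, seed=0):
    """Cluster vectors into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct documents as the initial centroids.
    centroids = [list(vectors[i]) for i in rng.sample(range(len(vectors)), k)]
    labels = [0] * len(vectors)
    for _ in range(max_iter):
        # Step 2: assign every document to its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(v, centroids[c]))
                  for v in vectors]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop as soon as the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

The cap `max_iter` guards against very slow convergence on large inputs. The result still depends on the initial centroids, which is why `seed` is exposed.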
Determining the centroid of a document cluster: consider a cluster of documents c. Its centroid C is computed from the sum vector $D = \sum_{d \in c} d$ of the documents in c:
$$C = \frac{D}{|c|}$$
where |c| is the number of documents in c.
In clustering, the centroid of a cluster is used as the representative of that cluster of documents.
Computing the similarity between two document clusters: suppose we have two clusters $c_1$, $c_2$. The similarity between them is measured by the closeness of their centroid vectors $C_1$, $C_2$:
$$Sim(c_1, c_2) = sim(C_1, C_2)$$
Here $c_1$ and $c_2$ may each consist of a single document, since a single document can be regarded as a cluster with one element.
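The centroid $C = D / |c|$ and the cluster similarity $Sim(c_1, c_2) = sim(C_1, C_2)$ can be sketched as below. Cosine is assumed for $sim$ purely as an example (the text leaves the measure open), and the function names are hypothetical.

```python
import math


def centroid(cluster):
    """Mean vector of a non-empty cluster: the sum vector D divided by |c|."""
    size = len(cluster)
    return [sum(column) / size for column in zip(*cluster)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is the zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_similarity(c1, c2):
    """Similarity of two clusters, measured between their centroids."""
    return cosine(centroid(c1), centroid(c2))
```

A single document is handled by passing a one-element cluster, for example `cluster_similarity([doc], other_cluster)`, which matches the remark that one document can be treated as a cluster.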
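In the k-means algorithm, clustering quality is evaluated through the criterion function
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} D^2(x, m_i)$$
where x ranges over the vectors representing the documents, $m_i$ is the centroid of cluster i, k is the number of clusters, and $C_i$ is the i-th cluster.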
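As a small worked example of the criterion function, the sketch below sums the squared distances of all documents to the centroid of their cluster. It assumes that D is the Euclidean distance; the name `criterion` is hypothetical.

```python
def criterion(clusters, centroids):
    """E: sum over clusters of squared Euclidean distances to the centroid.

    clusters[i] is the list of vectors in cluster i and centroids[i]
    is the centroid m_i of that cluster.
    """
    return sum(
        sum((x - m) ** 2 for x, m in zip(vector, centre))
        for cluster, centre in zip(clusters, centroids)
        for vector in cluster
    )
```

For two clusters {(0, 0), (0, 2)} and {(5, 5)} with centroids (0, 1) and (5, 5), the value is 1 + 1 + 0 = 2. A lower E means documents lie closer to their centroids.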
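- The complexity of the k-means algorithm is O((n.k.d).r), where n is the number of data objects, k is the number of clusters, d is the number of dimensions, and r is the number of iterations.
After the documents have been clustered, the result returned consists of the clusters and their corresponding centroids.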
3.4.6. Experimental results
+ The experimental data are Web pages taken from two main sources:
- Pages collected automatically from Websites on the Internet. The search is performed automatically through Yahoo; the program then uses each URL to fetch the full text of the document and stores it for later searches (the data comprise more than 4000 articles on the topics "data mining", "web mining", "Cluster algorithm", and "Sport").
- Selective search. This part was collected manually, mainly from the following Web sites:
http://www.baobongda.com.vn/
http://bongda.com.vn/
http://vietnamnet.vn
http://www.24h.com
It comprises more than 250 articles on the topic of football.
- Dictionary construction: after counting the frequency of the words in the document collection, we apply Zipf's law to discard words whose frequency is too high or too low, which yields a dictionary of 500 words.
Number of    Number of    Average time (seconds)
documents    clusters     Preprocessing and        Document
                          text representation      clustering
50           10           0.206                    0.957
50           15           0.206                    1.156
100          10           0.353                    2.518
100          15           0.353                    3.709
150          10           0.515                    4.553
150          15           0.515                    5.834
250          10           0.824                    9.756
250          15           0.824                    13.375
Table 3.2. Running time of the clustering algorithm
The running time of the algorithm depends on the size of the data and on the number of clusters. For k-means it also depends on the k initial centroids: if they are chosen well, both the quality and the running time improve considerably.
The program's user interface and some typical code fragments are presented in the appendix.
3.5. Summary of Chapter 3
This chapter presented several approaches to Web mining, namely mining the full-text content of Web documents, Web structure mining, and Web usage mining, together with some algorithms currently applied in Web mining.
It also presented the functions in the workflow of the experimental system: searching for and extracting data on the Web, preprocessing the data, normalizing the text, removing stop words, building the dictionary, tokenizing and representing the text, clustering the documents, and evaluating the experimental results.

CONCLUSION AND FUTURE WORK

The thesis has outlined the fundamentals of data mining, knowledge discovery, and related issues, as well as data clustering techniques. It has examined in depth several traditional and widely used clustering methods: partitioning, hierarchical, density-based, grid-based, and model-based clustering, and the fuzzy approach.
The thesis focuses on Web mining, a new direction of research and development in data mining that is attracting the attention of many scientists. It presents the issues and approaches in Web mining: Web content mining, Web structure mining, and Web usage mining. One technique in Web mining is the clustering of Web data. The author has also presented an approach to using clustering techniques in Web data mining, and has proposed and built an experimental program that clusters Web documents for data search, using the k-means algorithm on the TF-IDF vector representation of text.
Web mining is a fairly new, very important, and difficult field. Alongside the research results already achieved, it poses major challenges for researchers. It is a promising and complex field that remains an open problem: there is as yet no optimal algorithm or data representation model for Web data mining.
Despite the author's best efforts, shortcomings and limitations are unavoidable given the limited research time, the author's own limited expertise, and the constrained research conditions. The author would be grateful for comments and suggestions from teachers and friends so that the results of this work can be improved.

Future work:
- Continue to study, propose, and improve new techniques in data clustering, such as fuzzy clustering and parallel clustering algorithms, in order to raise the performance of data mining on large, distributed data systems.
- Study new models for representing and processing text, such as fuzzy models and rough set models, in order to improve the effectiveness of data processing and mining, especially in the Web environment.
- Apply data mining techniques to e-commerce, e-government, and related fields.








APPENDIX
The program is written on the .NET Framework 2.0 in the Visual Basic 2005 programming language; the database is stored and managed with SQL Server 2005. Below are some of the program's source code and user interface screens.
Selected processing modules of the program
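1. Normalizing a text string
Private Function Chuanhoa(ByVal S As String) As String
    Dim i As Integer
    ' Remove a "." or "," that directly follows a digit 1-9.
    ' Note: Str(i) returns the digit with a leading space, so only
    ' digits that are preceded by a space are matched.
    For i = 1 To 9
        S = S.Replace(Str(i) + ".", Str(i))
        S = S.Replace(Str(i) + ",", Str(i))
    Next
    ' Replace every character that is not a letter, digit, or space with a space.
    ' The loop runs to S.Length so that the last character is processed too.
    i = 0
    Do While i < S.Length
        If (Not Char.IsLetterOrDigit(S(i))) AndAlso (S(i) <> " "c) Then
            S = S.Remove(i, 1).Insert(i, " ")
        End If
        i = i + 1
    Loop
    ' Insert a space at every boundary between a digit and a non-digit.
    i = 0
    Do While i < S.Length - 1
        If Char.IsDigit(S(i)) <> Char.IsDigit(S(i + 1)) Then
            S = S.Insert(i + 1, " ")
            i = i + 1
        End If
        i = i + 1
    Loop
    ' VN: culture object declared elsewhere in the program.
    S = S.ToLower(VN)
    ' Collapse runs of spaces into a single space.
    i = 0
    Do While i < S.Length - 1
        If S(i) = " "c AndAlso S(i + 1) = " "c Then
            S = S.Remove(i, 1)
        Else
            i = i + 1
        End If
    Loop
    Return S.Trim()
End Function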

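2. Removing stop words
Private Function XoaTuDung(ByVal S As String) As String
    Dim i, j As Integer
    Dim Kt As Boolean
    ' Pad with spaces so that stop words at the start or end of S also match.
    S = " " + S + " "
    ' ListTD: list of stop words declared elsewhere in the program.
    For i = 0 To ListTD.Count - 1
        S = S.Replace(" " + CStr(ListTD.Item(i)) + " ", " ")
    Next i
    ' Remove the remaining one-character tokens.
    i = 0
    Do While i < S.Length
        ' Skip spaces to reach the start of the next token.
        Do While (i < S.Length - 1) AndAlso (S(i) = " "c)
            i = i + 1
        Loop
        ' Advance j to the first space after the token (or to the end of S).
        j = i + 1
        Kt = False
        Do While (j < S.Length) AndAlso (Not Kt)
            If S(j) <> " "c Then
                j = j + 1
            Else
                Kt = True
            End If
        Loop
        ' The token has length 1: delete it.
        If i = j - 1 Then
            S = S.Remove(i, 1)
        End If
        i = j
    Loop
    Return S.Trim()
End Function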

3. Building the dictionary
Private Sub XayDungTuDien(ByVal Doc As ArrayList)
    ' TuDien (words) and TuDienTS (their counts) are parallel lists
    ' declared elsewhere in the program.
    For Each S As String In Doc
        For Each ST As String In S.Split(" "c)
            Dim word As String = ST.Trim()
            If word <> "" Then
                Dim i As Integer = TuDien.IndexOf(word)
                If i < 0 Then
                    TuDien.Add(word)
                    TuDienTS.Add(1)
                Else
                    TuDienTS(i) = CInt(TuDienTS(i)) + 1
                End If
            End If
        Next
    Next
    ' Sort in descending order of word frequency over the document collection.
    ' (Blank entries can never be added above, so no separate check is needed.)
    QuikSort(0, TuDien.Count - 1, TuDien, TuDienTS)
    ' Drop the most frequent words while their count exceeds the threshold.
    Do While (TuDien.Count > 500) AndAlso (CInt(TuDienTS(0)) > Int(NumDoc * (MaxWork / 100)))
        TuDien.RemoveAt(0)
        TuDienTS.RemoveAt(0)
    Loop
    ' Truncate the dictionary to at most MaxWork words.
    Do While (TuDien.Count > MaxWork)
        TuDien.RemoveAt(MaxWork)
        TuDienTS.RemoveAt(MaxWork)
    Loop
End Sub

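4. Vectorizing the documents
Private Sub VectorVB(ByVal Collect As ArrayList)
    ' Vector(i, k) counts the occurrences of dictionary word k in document i.
    ' The element type is Byte, so a count above 255 would overflow.
    Vector = Array.CreateInstance(GetType(Byte), NumDoc, NumWord)
    Dim i As Integer = 0
    For Each S As String In Collect
        For Each ST As String In S.Split(" "c)
            If ST.Trim() <> "" Then
                Dim k As Integer = TuDien.IndexOf(ST)
                ' IndexOf returns -1 when the word is absent; index 0 is a valid word.
                If k >= 0 Then
                    Vector(i, k) = Vector(i, k) + 1
                End If
            End If
        Next
        i = i + 1
    Next
End Sub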

5. Building the weight table
Private Sub XDTrongSo(ByVal Vector As Array)
    ' thongke(i): number of documents that contain word i.
    Dim thongke(NumWord) As Integer
    For i As Integer = 0 To NumWord - 1
        thongke(i) = 0
        For j As Integer = 0 To NumDoc - 1
            If Vector(j, i) > 0 Then
                thongke(i) = thongke(i) + 1
            End If
        Next
    Next
    ' W(i, j) = (1 + ln(tf)) * ln(NumDoc / thongke(j)) for document i and word j.
    W = Array.CreateInstance(GetType(Double), NumDoc, NumWord)
    For i As Integer = 0 To NumDoc - 1
        For j As Integer = 0 To NumWord - 1
            If Vector(i, j) > 0 Then
                W(i, j) = (1 + Math.Log(Vector(i, j))) * (Math.Log(NumDoc) - Math.Log(thongke(j)))
            Else
                W(i, j) = 0.0
            End If
        Next
    Next
End Sub

6. The k-means algorithm
Private Sub PhanCumKMean()
    ' k, C (centroids), Cum (cluster of each document), rnum and W
    ' are declared elsewhere in the program.
    Dim i, j, m, n As Integer
    Dim dis, min, temp, sum As Double
    Dim check1 As Boolean
    Randomize(NumDoc)
    ' Step 1: initialize the centroids with k distinct random documents
    ' (requires k <= NumDoc, otherwise this loop never ends).
    i = 0
    Do While i < k
        Dim r As Integer = CInt(Int(NumDoc * Rnd()))
        If Not rnum.Contains(r) Then
            rnum.Add(r)
            For j = 0 To NumWord - 1
                C(i, j) = W(r, j)
            Next
            i = i + 1
        End If
    Loop
    For i = 0 To NumDoc - 1
        Cum(i) = 0
    Next
    check1 = True
    Do While check1
        ' Step 2: compute the distances and assign each document to the nearest centroid.
        For i = 0 To NumDoc - 1
            min = Double.MaxValue
            Cum(i) = 0
            For j = 0 To k - 1
                dis = 0
                For m = 0 To NumWord - 1
                    temp = W(i, m) - C(j, m)
                    dis = dis + temp * temp
                Next
                dis = Math.Sqrt(dis)
                If dis < min Then
                    min = dis
                    Cum(i) = j
                End If
            Next
        Next
        ' Step 3: update the centroid of each cluster in turn.
        check1 = False
        For i = 0 To k - 1
            For j = 0 To NumWord - 1
                n = 0
                sum = 0
                For m = 0 To NumDoc - 1
                    If Cum(m) = i Then
                        sum = sum + W(m, j)
                        n = n + 1
                    End If
                Next
                ' Skip empty clusters: 0 / 0 would give NaN and the loop would never end.
                If n > 0 Then
                    sum = sum / n
                    If C(i, j) <> sum Then
                        C(i, j) = sum
                        check1 = True
                    End If
                End If
            Next
        Next
    Loop
End Sub

Selected program screens
1. Tool for automatically searching for documents on the Internet and storing them in the database
2. Tool for selectively searching for documents on the Internet and storing them in the database


3. Data extraction, preprocessing, dictionary building, and text vectorization
4. Document clustering and presentation of the results




REFERENCES
Vietnamese-language references
[1] Cao Chinh Nghia, Some Problems in Data Clustering (in Vietnamese), Master's thesis, College of Technology, Vietnam National University, Hanoi, 2006.
[2] Hoang Hai Xanh, On Data Clustering Techniques in Data Mining (in Vietnamese), Master's thesis, Vietnam National University, Hanoi, 2005.
[3] Hoang Thi Mai, Data Mining by Data Clustering Methods (in Vietnamese), Master's thesis, Hanoi National University of Education, 2006.
English-language references
[4] Athena Vakali, Web data clustering: Current research status & trends, Aristotle University, Greece, 2004.
[5] Bing Liu, Web mining, Springer, 2007.
[6] Brij M. Masand, Myra Spiliopoulou, Jaideep Srivastava, Osmar R. Zaiane, Web Mining for Usage Patterns & Profiles, ACM, 2002.
[7] Filippo Geraci, Marco Pellegrini, Paolo Pisati, and Fabrizio Sebastiani, A scalable algorithm for high-quality clustering of Web Snippets, Italy, ACM, 2006.
[8] Giordano Adami, Paolo Avesani, Diego Sona, Clustering Documents in a Web Directory, ACM, 2003.
[9] Hiroyuki Kawano, Applications of Web mining - from Web search engine to P2P filtering, IEEE, 2004.
[10] Ho Tu Bao, Knowledge Discovery and Data Mining, 2000.
[11] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, Jinwen Ma, Learning to Cluster Web Search Results, ACM, 2004.
[12] Jitian Xiao, Yanchun Zhang, Xiaohua Jia, Tianzhu Li, Measuring Similarity of Interests for Clustering Web-Users, IEEE, 2001.
[13] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, University of Illinois at Urbana-Champaign, 1999.
[14] Khoo Khyou Bun, Topic Trend Detection and Mining in World Wide Web, PhD thesis, Japan, 2004.
[15] Liu Jian-guo, Huang Zheng-hong, Wu Wei-ping, Web Mining for Electronic Business Application, IEEE, 2003.
[16] Lizhen Liu, Junjie Chen, Hantao Song, The research of Web Mining, IEEE, 2002.
[17] Maria Rigou, Spiros Sirmakessis, and Giannis Tzimas, A Method for Personalized Clustering in Data Intensive Web Applications, 2006.
[18] Miguel Gomes da Costa Júnior, Zhiguo Gong, Web Structure Mining: An Introduction, IEEE, 2005.
[19] Oren Zamir and Oren Etzioni, Web document Clustering: A Feasibility Demonstration, University of Washington, USA, ACM, 1998.
[20] Pawan Lingras, Rough Set Clustering for Web mining, IEEE, 2002.
[21] Periklis Andritsos, Data Clustering Techniques, University of Toronto, 2002.
[22] R. Cooley, B. Mobasher, and J. Srivastava, Web mining: Information and Pattern Discovery on the World Wide Web, University of Minnesota, USA, 1998.
[23] Raghu Krishnapuram, Anupam Joshi, and Liyu Yi, A Fuzzy Relative of the K-Medoids Algorithm with Application to Web Document and Snippet Clustering, 2001.
[24] Raghu Krishnapuram, Anupam Joshi, Olfa Nasraoui, and Liyu Yi, Low-Complexity Fuzzy Relational Clustering Algorithms for Web Mining, IEEE, 2001.
[25] Raymond Kosala and Hendrik Blockeel, Web Mining Research: A Survey, ACM, 2000.
[26] Rui Wu, Wansheng Tang, Ruiqing Zhao, An Efficient Algorithm for Fuzzy Web-Mining, IEEE, 2004.
[27] T. A. Runkler, J. C. Bezdek, Web mining with relational clustering, Elsevier, 2002.
[28] Tsau Young Lin, I-Jen Chiang, A simplicial complex, a hypergraph, structure in the latent semantic space of document clustering, Elsevier, 2005.
[29] Wang Jicheng, Huang Yuan, Wu Gangshan, and Zhang Fuyan, Web Mining: Knowledge Discovery on the Web, IEEE, 1999.
[30] Wang Bin, Liu Zhijing, Web Mining Research, IEEE, 2003.
[31] Wenyi Ni, A Survey of Web Document Clustering, Southern Methodist University, 2004.
[32] Yitong Wang, Masaru Kitsuregawa, Evaluating Contents-Link Coupled Web Page Clustering for Web Search Results, ACM, 2002.
[33] Zifeng Cui, Baowen Xu, Weifeng Zhang, Junling Xu, Web Documents Clustering with Interest Links, IEEE, 2005.
