Lecturer:
Students:
Trn Tun Anh
inh Th Thanh Hng
Nguyn Trng Th
INTRODUCTION
The rapid growth of the Internet and of intranets has produced an enormous volume of hypertext data (Web data). As the content and the number of Web pages change and grow daily, even hourly, finding information becomes harder and harder for users. The need to search for information in unstructured databases has arguably developed mainly alongside the Internet itself. Through the Internet, people have become familiar with Web pages and the countless pieces of information they carry. In recent years the Internet has become one of the main channels for science, economic information, commerce, and advertising. One reason for this growth is the low cost of publishing a Web page on the Internet. Compared with other services, such as selling or advertising in a newspaper or magazine, a Web page costs far less and reaches millions of users around the world with much faster updates. The Web can be likened to an encyclopedia. The information on Web pages is diverse in both content and form. The Internet can be seen as a virtual society that covers every aspect of economic and social life, presented as text, images, sound, and so on.
However, this diversity and sheer quantity of information has given rise to the problem of information overload. People cannot find on their own the addresses of the Web pages that contain the information they need, so a utility is required that manages the content of Web pages and returns the addresses of pages whose content matches the searcher's request. These utilities manage data as unstructured objects. Familiar examples include Yahoo, Google, and AltaVista.
On the other hand, suppose we run Web pages on topics such as informatics, sports, economics and society, construction, and so on. Based on the content of the documents that customers view or download, classification tells us which content on our site customers focus on, so that we can add more documents on the topics they care about, and vice versa. On the customer side, the analysis also tells us which issues a given customer concentrates on, so that we can offer that customer additional support. Given these practical needs, Web page classification and search remain interesting problems that call for further research.
[D08 HTTT1]
Introduction
Data mining
Concept
Data mining is defined as the process of extracting valuable, hidden information from the large volumes of data stored in databases, data warehouses, and similar repositories. Besides the term data mining, several other terms with a similar meaning are in use: knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining and another common term, Knowledge Discovery in Databases (KDD), as the same thing. In practice, however, data mining is only one essential step in the process of knowledge discovery in databases. That process consists of the following steps:
Step 1: Data cleaning: remove noise and irrelevant data.
Step 2: Data integration: combine data from different sources such as databases, data warehouses, and text files.
Step 3: Data selection: the data directly relevant to the task is retrieved from the original data sources.
Step 4: Data transformation: the data is converted into a form suitable for mining by applying summary or aggregation operations.
Step 5: Data mining: the essential stage, in which intelligent methods are applied to extract data patterns.
Step 6: Pattern evaluation: assess the usefulness of the patterns that represent knowledge, based on a number of measures.
Step 7: Knowledge presentation: use presentation and visualization techniques to show the mined knowledge to the user.
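The seven steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the function names, the toy records, the field names `ds` and `cls`, the threshold of 20.0, and the minimum support of 2 are all hypothetical and do not come from the report's dataset or from any library.

```python
from collections import Counter

def clean(records):
    # Step 1: drop records with missing values (treated here as noise).
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    # Step 2: merge records from several sources into one list.
    return [r for source in sources for r in source]

def select(records, fields):
    # Step 3: keep only the task-relevant fields.
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field, threshold):
    # Step 4: aggregate a numeric field into two coarse bins.
    return [{**r, field: "high" if r[field] > threshold else "low"} for r in records]

def mine(records, field, label):
    # Step 5: count (bin, label) co-occurrences as trivial "patterns".
    return Counter((r[field], r[label]) for r in records)

def evaluate(patterns, min_support):
    # Step 6: keep patterns whose count reaches a support threshold.
    return {p: n for p, n in patterns.items() if n >= min_support}

def present(patterns):
    # Step 7: render the surviving patterns for the user.
    return [f"{value} -> {label} (support {n})"
            for (value, label), n in sorted(patterns.items())]

# Hypothetical toy records from two sources.
source_a = [{"id": 1, "ds": 30.0, "cls": "Abnormal"},
            {"id": 2, "ds": None, "cls": "Normal"}]
source_b = [{"id": 3, "ds": 2.0, "cls": "Normal"},
            {"id": 4, "ds": 45.0, "cls": "Abnormal"},
            {"id": 5, "ds": 1.0, "cls": "Normal"}]

data = integrate(clean(source_a), clean(source_b))
data = select(data, ["ds", "cls"])
data = transform(data, "ds", 20.0)
report = present(evaluate(mine(data, "ds", "cls"), min_support=2))
```

In a real project each step would be far more involved; the point here is only the order in which the stages feed each other.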
Data mining and knowledge discovery in databases draw on methods, algorithms, and techniques from many research fields, such as machine learning, pattern recognition, databases, statistics, artificial intelligence, and knowledge acquisition for expert systems, all directed at the same goal: extracting knowledge from the data held in very large databases. Compared with these other approaches, however, data mining has several clear advantages.
Advantages of data mining
Data mining has many applications, and its clearest advantages are considered below:
+ Compared with machine learning, data mining has the advantage that it can be used on databases that contain a lot of noise, incomplete data, or data that changes continuously. Machine learning methods, by contrast, are mainly applied to databases that are complete, change little, and are not too large.
+ Expert systems: this approach differs from data mining in that the examples supplied by experts are usually of much higher quality than the data in a database, and they usually cover only the important cases. In addition, the experts themselves confirm the value and usefulness of the patterns that are found.
+ Statistics is one of the theoretical foundations of data mining, but a comparison of the two shows that statistical methods still have several weaknesses that data mining has overcome:
Standard statistical methods are not well suited to structured data types
The main functions of Weka Explorer appear as tabs on the main screen, including:
Using the Preprocess tab
Using the Classify tab
Use training set: the training data itself is used for testing.
Cross-validation: the data is divided into several parts (folds) so that the evaluation is carried out several times.
(3) Result list: the list of results from each run of the algorithm; items in this list can be selected to carry out additional functions.
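The Cross-validation option described above can be illustrated with a short sketch that splits instance indices into folds. This is a simplified version under stated assumptions: the function name `k_fold_indices` is hypothetical, the folds are taken in order without shuffling or stratification, and it is not Weka's own implementation. The figure of 310 instances is the size of the dataset used later in this report.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_instances))
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 310 instances split into 10 folds: each instance is tested exactly once.
folds = list(k_fold_indices(310, 10))
test_sizes = [len(test) for _, test in folds]
```

Each fold serves once as the test set while the remaining folds form the training set, so every instance contributes to the final evaluation.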
[Figure: the input data is assigned to one of Class 1, Class 2, ..., Class n]
[Figure: the two-step classification process]
Step one (learning): Training data -> Classification algorithm -> Classifier (model)
Sample training data:
P_i | P_t | L_l_a | S_s | P_r | D_s
63.02 | 22.52 | 39.60 | 40.47 | 98.67 | -0.254
39.05 | 10.06 | 25.01 | 28.99 | 114.4 | 4.5642
68.82 | 22.21 | 50.09 | 46.61 | 105.98 | -3.53
Example rule in the classifier (model):
If D_s <= 19.85 and P_r <= 125.21 and S_s <= 40.47 and P_t > 9.97 Then class = Abnormal
Step two (classification)
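The second step, applying the learned model to records, can be sketched with the single rule shown in the figure. This is only an illustration: the function name `rule_abnormal` is hypothetical, the rule is one branch of the model rather than the whole classifier, and the assignment of the sample numbers to columns is an assumption made while reading the figure.

```python
def rule_abnormal(p_t, s_s, p_r, d_s):
    """Return "Abnormal" when the example rule fires, otherwise None."""
    if d_s <= 19.85 and p_r <= 125.21 and s_s <= 40.47 and p_t > 9.97:
        return "Abnormal"
    return None  # this one rule says nothing about the record

# Sample rows as (P_t, S_s, P_r, D_s); the column assignment is an assumption.
rows = [
    (22.52, 40.47, 98.67, -0.254),
    (10.06, 28.99, 114.4, 4.5642),
    (22.21, 46.61, 105.98, -3.53),
]
predictions = [rule_abnormal(*row) for row in rows]
```

The third row falls outside the rule because its S_s value exceeds 40.47, so another branch of the model would have to decide its class.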
Attribute name | Data type | Attribute values
Pelvic_incidence | Numeric |
Pelvic_tilt | Numeric |
Lumbar_lordosis_angle | Numeric | 14 to 125.742
Sacral_slope | Numeric |
Pelvic_radius | Numeric |
Degree_spondylolisthesis | Numeric |
Class | Nominal | Normal, Abnormal
@relation column_2C_weka
@attribute pelvic_incidence numeric
@attribute pelvic_tilt numeric
@attribute lumbar_lordosis_angle numeric
@attribute sacral_slope numeric
@attribute pelvic_radius numeric
@attribute degree_spondylolisthesis numeric
Declaration section
Data section
Declaration section:
@relation <dataset name>
@attribute <attribute 1 name> <data type>
@attribute <attribute 2 name> <data type>
String data
Example: @ATTRIBUTE name string
Date data
Example: @ATTRIBUTE discovered
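The declaration section described above is simple enough to read with a few lines of code. The sketch below is a minimal illustration, assuming attribute names without spaces or quotes; the function name `parse_arff_header` is hypothetical and this is not a full ARFF reader. Keywords are matched case-insensitively, since both `@attribute` and `@ATTRIBUTE` appear in this report.

```python
def parse_arff_header(text):
    """Return (relation_name, [(attribute_name, data_type), ...])."""
    relation = None
    attributes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, dtype = rest.strip().partition(" ")
            attributes.append((name, dtype.strip()))
        elif keyword == "@data":
            break  # the data section starts here
    return relation, attributes

header = """@relation column_2C_weka
@attribute pelvic_incidence numeric
@attribute pelvic_tilt numeric
@attribute lumbar_lordosis_angle numeric
@attribute sacral_slope numeric
@attribute pelvic_radius numeric
@attribute degree_spondylolisthesis numeric
@data
"""
relation, attributes = parse_arff_header(header)
```

Stopping at `@data` keeps the declaration section separate from the data section, mirroring the two-part layout of the file.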
Pelvic_incidence = Pi
Pelvic_tilt = Pt
Lumbar_lordosis_angle = lla
Sacral_slope = Ss (slope of the sacrum)
Pelvic_radius = Pr
Degree_spondylolisthesis = Ps (degree of spondylolisthesis)
Result analysis
The J48 (C4.5) algorithm provided by Weka is used to train on the dataset.
The decision tree produced by the algorithm is:
Cross-validation
First test: the data is divided into 10 folds.
Result | Instances | Percentage
Correctly classified | 253 | 81.6129%
Incorrectly classified | 57 | 18.3871%
Unclassified | |
Total | 310 |
Result | Instances | Percentage
Correctly classified | 255 | 82.2581%
Incorrectly classified | 55 | 17.7419%
Unclassified | |
Total | 310 |
Result | Instances | Percentage
Correctly classified | 254 | 81.9355%
Incorrectly classified | 56 | 18.0645%
Unclassified | |
Total | 310 |
Result | Instances | Percentage
Correctly classified | 255 | 82.2581%
Incorrectly classified | 55 | 17.7419%
Unclassified | |
Total | 310 |
Result | Instances | Percentage
Correctly classified | 260 | 83.871%
Incorrectly classified | 50 | 16.129%
Unclassified | |
Total | 310 |
After running the algorithm with the Cross-Validation method, the parameter Fold = 15 gives the highest classification accuracy, 83.871%, with 310 test instances.
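The percentages in the cross-validation tables follow directly from the instance counts, and the best run can be picked out mechanically. The sketch below recomputes them from the reported (correct, incorrect) pairs, listed in the order the tables appear; the names `runs` and `accuracy_percent` are chosen here for illustration only.

```python
# (correctly classified, incorrectly classified) for the five reported runs.
runs = [(253, 57), (255, 55), (254, 56), (255, 55), (260, 50)]

def accuracy_percent(correct, incorrect):
    # Accuracy is the share of correctly classified instances, in percent.
    return 100.0 * correct / (correct + incorrect)

accuracies = [accuracy_percent(c, w) for c, w in runs]
# Index (starting at 0) of the run with the highest accuracy.
best_run = max(range(len(runs)), key=lambda i: accuracies[i])
best_accuracy = round(accuracies[best_run], 4)
```

The last run, with 260 of 310 instances classified correctly, is the one the report attributes to Fold = 15.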
Percentage split: determine which split percentage gives the highest classification accuracy.
First test: with a split of 66% we obtain:
Result | Instances | Percentage
Correctly classified | 90 | 85.7143%
Incorrectly classified | 15 | 14.2857%
Unclassified | |
Total | 105 |
Result | Instances | Percentage
Correctly classified | 97 | 78.2258%
Incorrectly classified | 27 | 21.7742%
Unclassified | |
Total | 124 |
Result | Instances | Percentage
Correctly classified | 117 | 84.1727%
Incorrectly classified | 22 | 15.8273%
Unclassified | |
Total | 139 |
Result | Instances | Percentage
Correctly classified | 76 | 81.7204%
Incorrectly classified | 17 | 18.2796%
Unclassified | |
Total | 93 |
Result | Instances | Percentage
Correctly classified | 65 | 84.4156%
Incorrectly classified | 12 | 15.5844%
Unclassified | |
Total | 77 |
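After running the algorithm with the Percentage split method, a split of 66% gives the highest classification accuracy, 85.7143%. However, only 105 instances are classified in that test, fewer than the 310 used in cross-validation, so this result is not yet a convincing measure of classification performance.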
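The drop from 310 to 105 test instances can be reproduced with simple arithmetic. The sketch below assumes that the training part is the split percentage of the dataset rounded to the nearest whole instance and that the remainder is used for testing; this rounding rule is an assumption made for illustration, and the function name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_instances, train_percent):
    """Return (train_size, test_size) for a percentage split."""
    # Assumed rule: round the training share to the nearest whole instance.
    train = math.floor(n_instances * train_percent / 100 + 0.5)
    return train, n_instances - train

train_size, test_size = split_sizes(310, 66)
# Accuracy of the 66% run, from the reported 90 correct test instances.
accuracy = round(100.0 * 90 / test_size, 4)
```

Under this assumption a 66% split of 310 instances leaves 105 for testing, which matches the total in the first percentage-split table.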
Inferences drawn from the decision tree using the Cross-Validation method:
Classifier model: the details of the classification model. For some classifiers, however, the model cannot be fully represented as text.
The confusion matrix is a 2x2 matrix. The number of correctly classified instances is the sum of the main diagonal of the matrix, aa + bb.
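The diagonal sum aa + bb can be computed as in the short sketch below. The counts in `matrix` are hypothetical values chosen only to illustrate the calculation; they are not results from this report, and the function name is likewise an assumption.

```python
def correct_from_confusion(matrix):
    """Return (correct, total) for a square confusion matrix."""
    # Rows are actual classes, columns are predicted classes.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total

# Hypothetical 2x2 matrix: [[aa, ab], [ba, bb]].
matrix = [[40, 10],
          [5, 45]]
correct, total = correct_from_confusion(matrix)
accuracy = 100.0 * correct / total
```

The off-diagonal cells, ab and ba, hold the misclassified instances, so they account for the difference between the total and the diagonal sum.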