MAJOR ASSIGNMENT REPORT
NATURAL LANGUAGE PROCESSING
Topic: Text classification - website classification
20081124
Nguyn Hu Hnh
20080903
Nguyn c Yn
20083244
Don nh Vit
20083124
Table of contents
I. Overview
1. The text classification problem
2. Applications
I. Overview
1. The text classification problem
Text classification is the process of assigning an arbitrary document to one
or more predefined classes. The process has two steps. In the first step, a
classification model is built from empirical knowledge. Here, that empirical
knowledge is a training dataset supplied by people, consisting of a set of
documents and their corresponding classes. This step is also called the
training process, or estimating the classification model. In the second step,
the model built in the first step is used to classify documents that have not
yet been classified. The first step is a supervised learning task, for which
many existing machine learning techniques can be used, such as Naïve Bayes,
k-nearest neighbors (kNN), and decision trees. The goal of text classification
is to build a model that can label an arbitrary document with the highest
possible accuracy.
2. Applications
The most important application of text classification is content
categorization and filtering. In content filtering, a document is assigned to
one of two groups: useful or not useful. All documents in the useful group are
then kept and the rest are discarded. Concrete applications include spam
filtering and filtering reactionary web pages. Another application is building
a post-search classifier, which is very useful because it helps locate the
information being searched for faster and more easily.
In short, given all of these practical uses, it can be stated once again that
in an era where the Internet is an indispensable part of life, text
classification remains a problem that deserves attention, so that increasingly
useful tools can be developed and built. Given this pressing need, ...
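Bayes' theorem
$$P(c_i \mid z) = \frac{P(z \mid c_i)\,P(c_i)}{P(z)}$$
where:
- P(c_i) is the prior probability of class c_i, P(z | c_i) is the likelihood of document z given class c_i, P(z) is the probability of z, and P(c_i | z) is the posterior probability of class c_i given z.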
For example, suppose we have the following dataset:
1.2.
In the Naïve Bayes classification method, the attributes are conditionally
independent given the class. Therefore:
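Written out, the conditional independence assumption gives the factorization below. The notation is an assumption made here for illustration: z is an instance described by attribute values z_1, ..., z_n, and c_i is a class.

```latex
P(z_1, z_2, \ldots, z_n \mid c_i) = \prod_{j=1}^{n} P(z_j \mid c_i)
\qquad\Longrightarrow\qquad
P(c_i \mid z) \;\propto\; P(c_i) \prod_{j=1}^{n} P(z_j \mid c_i)
```

The second expression follows from Bayes' theorem, since the denominator P(z) is the same for every class and can be dropped when classes are only being compared.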
1.3.
C:
For each class in C, compute the likelihood value:
o Determine the most probable class of z:
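The classification step can be sketched as follows. This is a minimal illustration, not the program described in this report: the function name, the probability tables, and the `unseen` fallback value are all hypothetical, and log-probabilities are used only to avoid numerical underflow on long documents. The labels `spt` and `edu` are borrowed from the report's class table.

```python
import math

def most_probable_class(z, priors, cond_probs, unseen=1e-9):
    """Score every class for the token list z and return the best one.

    scores[c] = log P(c) + sum_j log P(z_j | c), which ranks classes in
    the same order as P(c) * prod_j P(z_j | c).
    `unseen` is an assumed fallback probability for tokens that a class
    has no estimate for.
    """
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for token in z:
            score += math.log(cond_probs[c].get(token, unseen))
        scores[c] = score
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical probabilities, chosen by hand for illustration only.
priors = {"spt": 0.5, "edu": 0.5}
cond_probs = {
    "spt": {"match": 0.4, "team": 0.3, "school": 0.05},
    "edu": {"match": 0.05, "team": 0.1, "school": 0.5},
}

best, scores = most_probable_class(["match", "team"], priors, cond_probs)
print(best)
```

With these hand-picked numbers the unnormalized score of `spt` is 0.5 * 0.4 * 0.3 and that of `edu` is 0.5 * 0.05 * 0.1, so `spt` is selected.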
1.4.
o For each class c_i:
Compute the prior probability of class c_i:
where:
- Training dataset D_train: within the scope of this assignment, and because
time was limited, we chose as D_train the body text of articles on
vnexpress.net (skipping the step that extracts this content from a web page)
and assigned a label (class) to each of them.
- Word segmentation: in this step we include the vnTagger program by Lê Hồng
Phương (Vietnam National University, Hanoi) in our own program to segment the
words of the documents in the training dataset D_train.
- Stop-word removal: every natural language has words that occur frequently
but carry no meaning for classification. These words are called stop-words.
We remove them from the set of words produced by the previous step in order
to build a set of keywords. The stop-words are listed in the following
table:
Còn
hay
hoặc
Không
không những
không chỉ
còn
nếu
thì
Nên
tuy
Nhưng
bù lại
Giá
bởi
tại
do
Song
dầu
mặc dầu
dẫu
dẫu cho
chẳng lẽ
làm như
thế mà
by m
có điều
hơn nữa
huống hồ
huống gì
huống nữa
Ngay
cũng
chính
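The stop-word removal step can be sketched as below. This is a minimal sketch under stated assumptions: `STOP_WORDS` holds only a small subset of the list, written with Vietnamese diacritics; the function name is hypothetical; and the input is assumed to be a list of already segmented tokens in which the syllables of a multi-syllable word are joined by underscores. The sketch does not call vnTagger.

```python
# A small illustrative subset of the stop-word list (assumption: the
# entries are compared in lowercase, with Vietnamese diacritics).
STOP_WORDS = {"còn", "hay", "hoặc", "không", "nếu", "thì", "nên", "tuy", "nhưng", "do"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop-word."""
    return [t for t in tokens if t.lower() not in stop_words]

# Hypothetical output of the word segmentation step for one sentence.
tokens = ["Nếu", "đội_tuyển", "thắng", "thì", "người_hâm_mộ", "vui"]
keywords = remove_stop_words(tokens)
print(keywords)
```

Using a set makes each membership check constant time on average, which matters once the whole of D_train is filtered token by token.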
Name                      Label
Economy                   nss
Education                 edu
Culture, entertainment    ent
Health                    hel
Politics, society         plt
Science                   sci
Sports                    spt
Technology                tec
Table: news classes
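With class labels such as these, estimating the Naïve Bayes parameters from D_train can be sketched as follows. This is an illustration under assumptions, not the program described in this report: the function name and the toy documents are hypothetical, and Laplace (add-alpha) smoothing is an added assumption so that a word never seen in a class does not receive probability zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(d_train, alpha=1.0):
    """Estimate P(c) and P(w | c) from a list of (tokens, label) pairs.

    P(c)     = (documents labeled c) / (all documents)
    P(w | c) = (count of w in class c + alpha)
               / (total words in class c + alpha * vocabulary size)
    """
    class_doc_counts = Counter(label for _, label in d_train)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in d_train:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    n_docs = len(d_train)
    priors = {c: n / n_docs for c, n in class_doc_counts.items()}

    cond_probs = {}
    for c in class_doc_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocabulary)
        cond_probs[c] = {w: (word_counts[c][w] + alpha) / denom for w in vocabulary}
    return priors, cond_probs, vocabulary

# Hypothetical, already tokenized and stop-word-filtered training documents.
d_train = [
    (["match", "team", "goal"], "spt"),
    (["team", "coach"], "spt"),
    (["school", "exam"], "edu"),
]
priors, cond_probs, vocabulary = train_naive_bayes(d_train)
print(priors["spt"], cond_probs["spt"]["team"])
```

In this toy set two of the three documents are labeled `spt`, so its prior is 2/3, and "team" occurs twice among the five `spt` words over a six-word vocabulary, giving (2 + 1) / (5 + 6).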
Conclusion
The accuracy of the program depends heavily on the size of the training
dataset D_train and on the word segmentation program. Because time was
limited, we have so far tested the program only on a small D_train, and we
worked directly on the main content of a web page, skipping the step that
extracts that content from a web page address.
In the near future we will try to develop and complete our program further.
We look forward to receiving your feedback.
Thank you!