
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

School of Information and Communication Technology

COURSE PROJECT REPORT
NATURAL LANGUAGE PROCESSING
Topic: Text classification - website classification

Students:

Dinh Quang Huy      20081124
Nguyen Huu Hanh     20080903
Nguyen Duc Yen      20083244
Doan Dinh Viet      20083124

Supervisor:

Dr. Le Thanh Huong

Hanoi, April 2012



Table of contents

I. Overview
1. The text classification problem
2. Applications

II. Approach
1. The Naive Bayes method
1.1. Bayes' theorem
1.2. Naive Bayes classification
1.3. Naive Bayes classification: the algorithm
1.4. Text classification with Naive Bayes
2. Application to web page classification

III. Demo program
1. User interface
2. Source code structure of the main classes

IV. Conclusion

V. References

I. Overview

1. The text classification problem
Text classification is the process of assigning an arbitrary document to one
or more predefined classes. It consists of two steps. In the first step, a
classification model is built from prior knowledge. Here, that knowledge is a
training dataset supplied by people, made up of a set of documents and their
corresponding classes. This step is also called the training process, or
estimating the classification model. In the second step, the model built in
the first step is used to classify new documents that have not yet been
classified. The first step is a supervised learning task, for which many
machine learning techniques are available, such as Naive Bayes, k-nearest
neighbors (kNN), and decision trees. The goal of the classification problem is
to build a model that can label any document with the highest possible
accuracy.
2. Applications
The largest application of text classification is content classification and
filtering. In content filtering, a document is assigned to one of two groups:
useful or not useful. All documents in the useful group are then kept and the
rest are discarded. Concrete examples include spam filtering and filtering out
reactionary web pages. Another application is a classifier applied to search
results, which is very useful because it narrows down the information being
sought more quickly and easily.
In summary, given these practical uses, and at a time when the Internet is
considered an indispensable part of life, text classification remains a
problem worth attention so that increasingly useful tools can be built. In
response to this need, we chose the topic "Text classification - web page
classification" in order to study and develop such an application.

II. Approach

As mentioned in Part I, many machine learning techniques are currently applied
to the classification problem, typically Naive Bayes, decision trees, and
maximum entropy. Among these, we chose the Naive Bayes classifier.
1. The Naive Bayes method
- A supervised learning method for classification
- Based on a probabilistic model (function)
- Classification is based on the probabilities of the candidate hypotheses
- Commonly used in text classification problems
- Based on Bayes' theorem
1.1. Bayes' theorem

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

where:
- h: a hypothesis (a class).
- D: the data set.
- P(h): the prior probability that hypothesis (class) h is true.
- P(D): the prior probability of observing the data set D.
- P(D|h): the probability of observing the data set D given that hypothesis h
is true.
- P(h|D): the probability that hypothesis h is true given that the data set D
is observed.

For example, take a data set in which each day is described by attributes such
as Outlook and Wind, together with whether a person plays tennis on that day.

In this example, suppose that:

- Data set D: the set of days on which the attribute Outlook has the value
Sunny and the attribute Wind has the value Strong.
- Hypothesis (class) h: the person plays tennis.
- Prior probability P(h): the probability that the person plays tennis
(regardless of the attributes Outlook and Wind).
- Prior probability P(D): the probability that a day has Outlook = Sunny and
Wind = Strong.
- P(D|h): the probability that a day has Outlook = Sunny and Wind = Strong,
given that the person plays tennis.
- P(h|D): the probability that the person plays tennis, given that Outlook =
Sunny and Wind = Strong.
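
The quantities above can be checked numerically. The following is a minimal sketch, assuming a small made-up list of days (the `days` records and the helper `prob` are illustrative and are not the data set used in this report). It estimates each probability as a relative frequency and confirms that Bayes' theorem gives the same posterior as counting directly.

```python
from fractions import Fraction

# Hypothetical observations (NOT the report's table): (outlook, wind, plays_tennis)
days = [
    ("Sunny", "Strong", False),
    ("Sunny", "Weak", False),
    ("Overcast", "Weak", True),
    ("Rain", "Weak", True),
    ("Rain", "Strong", False),
    ("Sunny", "Strong", True),
    ("Overcast", "Strong", True),
    ("Sunny", "Weak", True),
]

def prob(event, given=lambda d: True):
    """Relative frequency of `event` among the days that satisfy `given`."""
    pool = [d for d in days if given(d)]
    return Fraction(sum(1 for d in pool if event(d)), len(pool))

def is_D(day):
    """Evidence D: Outlook is Sunny and Wind is Strong."""
    return day[0] == "Sunny" and day[1] == "Strong"

def is_h(day):
    """Hypothesis h: the person plays tennis."""
    return day[2]

p_h = prob(is_h)                # prior P(h)
p_D = prob(is_D)                # prior P(D)
p_D_given_h = prob(is_D, is_h)  # P(D|h)

posterior_bayes = p_D_given_h * p_h / p_D  # Bayes' theorem
posterior_direct = prob(is_h, is_D)        # P(h|D) counted directly

print(p_h, p_D, p_D_given_h, posterior_bayes, posterior_direct)
```

Exact fractions are used so that the two ways of computing P(h|D) can be compared without rounding error.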

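The Naive Bayes classification method is based on this conditional
probability.

Given a set H of candidate hypotheses (classes), the learner looks for the
hypothesis that is most probable given the observed data D. This hypothesis is
called the maximum a posteriori (MAP) hypothesis. Because P(D) is the same for
every h, it can be dropped from the maximization:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$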
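
As an illustration of the MAP rule, here is a minimal sketch, assuming the priors and likelihoods are already known. The function name `map_hypothesis` and the numeric values are hypothetical and chosen only for the example.

```python
def map_hypothesis(priors, likelihoods):
    """Return the hypothesis h that maximizes P(D|h) * P(h).

    P(D) is identical for every hypothesis, so it is left out of the score.
    """
    return max(priors, key=lambda h: likelihoods[h] * priors[h])

# Hypothetical values, for illustration only.
priors = {"play": 0.6, "no_play": 0.4}       # P(h)
likelihoods = {"play": 0.2, "no_play": 0.5}  # P(D|h)

best = map_hypothesis(priors, likelihoods)
print(best)
```

With these numbers the scores are 0.12 for `play` and 0.20 for `no_play`, so the less likely hypothesis a priori still wins because it explains the data better.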

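1.2. Naive Bayes classification

- Representation of the classification problem:
o A training set D_train, in which each training example x is represented as
an n-dimensional vector: (x1, x2, ..., xn)
o A fixed set of class labels: C = {c1, c2, ..., cm}
o For a new example z, which class should z be assigned to?
- Goal: determine the most suitable class for z.

$$c_{MAP} = \arg\max_{c_i \in C} P(c_i \mid z) = \arg\max_{c_i \in C} P(c_i \mid z_1, z_2, \dots, z_n) = \arg\max_{c_i \in C} \frac{P(z_1, z_2, \dots, z_n \mid c_i)\, P(c_i)}{P(z_1, z_2, \dots, z_n)}$$

Because the probability P(z1, z2, ..., zn) is the same for all classes, we
need to find:

$$c_{MAP} = \arg\max_{c_i \in C} P(z_1, z_2, \dots, z_n \mid c_i)\, P(c_i)$$

In Naive Bayes classification, the attributes are assumed to be conditionally
independent given the class. Therefore:

$$P(z_1, z_2, \dots, z_n \mid c_i) = \prod_{j=1}^{n} P(z_j \mid c_i)$$

The Naive Bayes classifier then finds the most probable class for the new
example z:

$$c_{NB} = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(z_j \mid c_i)$$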

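1.3. Naive Bayes classification: the algorithm

- Learning phase: using a training set, for each possible class c_i in C:
o Compute the prior probability P(ci)
o For each attribute value xj, compute the probability of that attribute value
given class ci: P(xj | ci)
- Classification phase, for each new example z:
o For each class c_i in C, compute the likelihood:

$$P(c_i) \prod_{j=1}^{n} P(z_j \mid c_i)$$

o Select the most probable class for z:

$$c^* = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(z_j \mid c_i)$$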
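
The two phases can be sketched in a few lines of Python. This is a minimal illustration, not the program described in this report: the function names `train`, `likelihood`, and `classify` and the toy `examples` are assumptions, probabilities are estimated as plain relative frequencies, and no smoothing is applied (so an attribute value never seen with a class gives that class a likelihood of zero).

```python
from collections import Counter, defaultdict

def train(examples):
    """Learning phase. `examples` is a list of (attribute_tuple, label)."""
    class_counts = Counter(label for _, label in examples)
    priors = {c: n / len(examples) for c, n in class_counts.items()}  # P(ci)
    # value_counts[c][j][v]: number of class-c examples whose j-th attribute is v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for x, c in examples:
        for j, v in enumerate(x):
            value_counts[c][j][v] += 1
    cond = {  # cond[c][j][v] = P(xj = v | ci = c)
        c: {j: {v: n / class_counts[c] for v, n in counter.items()}
            for j, counter in per_attr.items()}
        for c, per_attr in value_counts.items()
    }
    return priors, cond

def likelihood(z, c, priors, cond):
    """P(c) multiplied by the product of P(zj | c) over all attributes of z."""
    score = priors[c]
    for j, v in enumerate(z):
        score *= cond[c][j].get(v, 0.0)  # unseen value: probability 0
    return score

def classify(z, priors, cond):
    """Classification phase: return the class with the largest likelihood."""
    return max(priors, key=lambda c: likelihood(z, c, priors, cond))

# Hypothetical training examples: (Outlook, Wind) -> plays tennis?
examples = [
    (("Sunny", "Strong"), "no"), (("Sunny", "Weak"), "no"),
    (("Overcast", "Weak"), "yes"), (("Rain", "Weak"), "yes"),
    (("Rain", "Strong"), "no"), (("Overcast", "Strong"), "yes"),
    (("Sunny", "Weak"), "yes"), (("Rain", "Weak"), "yes"),
]
priors, cond = train(examples)
print(classify(("Sunny", "Strong"), priors, cond))
```

For the example (Sunny, Strong), the likelihood of "yes" is 0.625 x 0.2 x 0.2 = 0.025 and that of "no" is 0.375 x (2/3) x (2/3), about 0.167, so the example is assigned to "no".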

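1.4. Text classification with Naive Bayes

- Representation of the text classification problem:
o A training set D_train, in which each training example is a document
representation paired with a class label: D = {(dk, ci)}
o A fixed set of class labels: C = {ci}
- Learning phase:
o From the documents in D_train, extract the set of keywords T = {tj}.
o Let D_ci (a subset of D_train) be the set of documents in D_train whose
class label is ci.
o For each class ci:
Compute the prior probability of class ci:

$$P(c_i) = \frac{|D_{c_i}|}{|D_{train}|}$$

For each keyword tj, compute the probability that tj occurs given class ci,
using the add-one smoothed estimate:

$$P(t_j \mid c_i) = \frac{\sum_{d_k \in D_{c_i}} n(d_k, t_j) + 1}{\sum_{d_k \in D_{c_i}} \sum_{t_m \in T} n(d_k, t_m) + |T|}$$

where n(dk, tj) is the number of occurrences of keyword tj in document dk.

- Classification phase for a new document d:
o From document d, extract the set T_d of keywords that are defined in T
(T_d is a subset of T).
o Assume that the probability of keyword tj occurring given class ci is
independent of the position of the keyword in the document.
o For each class ci, compute the likelihood of document d with respect to ci:

$$P(c_i) \prod_{t_j \in T_d} P(t_j \mid c_i)$$

o Document d is assigned to the class c* with the largest likelihood:

$$c^* = \arg\max_{c_i \in C} P(c_i) \prod_{t_j \in T_d} P(t_j \mid c_i)$$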
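
The learning and classification phases for text can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's program: keyword probabilities use add-one (Laplace) smoothing over the keyword set T, the product runs over the set T_d of distinct keywords found in the document, and logarithms are summed instead of multiplying probabilities to avoid numeric underflow (this does not change which class wins). The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Learning phase. `labeled_docs` is a list of (tokens, label)."""
    vocab = {t for tokens, _ in labeled_docs for t in tokens}  # keyword set T
    doc_counts = Counter(label for _, label in labeled_docs)   # |D_ci|
    term_counts = {c: Counter() for c in doc_counts}           # sum of n(dk, tj)
    for tokens, c in labeled_docs:
        term_counts[c].update(tokens)
    priors = {c: n / len(labeled_docs) for c, n in doc_counts.items()}
    cond = {}
    for c, counts in term_counts.items():
        denom = sum(counts.values()) + len(vocab)
        cond[c] = {t: (counts[t] + 1) / denom for t in vocab}  # P(tj | ci)
    return priors, cond, vocab

def log_likelihood(tokens, c, priors, cond, vocab):
    """log P(c) plus the sum of log P(t | c) over the keywords in T_d."""
    present = set(tokens) & vocab  # T_d: keywords of T that occur in d
    return math.log(priors[c]) + sum(math.log(cond[c][t]) for t in present)

def classify(tokens, priors, cond, vocab):
    """Classification phase: the class with the largest (log-)likelihood."""
    return max(priors, key=lambda c: log_likelihood(tokens, c, priors, cond, vocab))

# Hypothetical, already-tokenized training documents.
docs = [
    (["match", "goal", "team", "goal"], "spt"),
    (["team", "coach", "match"], "spt"),
    (["school", "exam", "student"], "edu"),
    (["student", "teacher", "exam", "school"], "edu"),
]
priors, cond, vocab = train(docs)
print(classify(["goal", "team", "stadium"], priors, cond, vocab))
```

The token "stadium" is not in T, so it is ignored, which matches the definition of T_d as the keywords of d that belong to T.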

2. Application to web page classification

Our approach to the problem is as follows:

[Figure: Workflow for solving the problem]

where:
- Training data set D_train: within the scope of this course project, and
because of limited time, we use as D_train the body text of articles on
vnexpress.net (skipping the step of extracting the body text from a web page)
and assign a label (class) to each of them.
- Word segmentation: in this step, we include the vnTagger program by Le Hong
Phuong (Vietnam National University, Hanoi) in our program to segment words in
the documents of the training set D_train.
- Stop-word removal: natural languages always contain words that occur
frequently but carry no meaning for classification. These words are called
stop-words. We remove them from the set of words obtained in the previous step
in order to build a set of keywords. The stop-words are listed in the
following table:

còn          hay          hoặc         không        không những  không chỉ
còn          nếu          thì          nên          tuy          nhưng
bù lại       giá          bởi          tại          do           song
dầu          mặc dầu      dù           dù cho       chẳng lẽ     làm như
thế mà       vậy mà       có điều      hơn nữa      huống hồ     huống gì
huống nữa    ngay         cũng         chính

Table: List of stop-words

- Keyword set: the set of words obtained from segmentation after the
stop-words have been removed.

- Storing in the database: the keywords above are saved to a database.
The five steps above are preprocessing steps that are performed in advance, so
the program does not have to repeat them each time it runs. Once the training
set D_train and the keyword set T are available, we apply the Naive Bayes text
classification algorithm to a new input document to determine its class and
report the result.
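
The preprocessing steps can be sketched as a small pipeline. This is only an illustration under several assumptions: `tokenize` is a whitespace-splitting stand-in for the word-segmentation step (it does not call vnTagger), `STOP_WORDS` is a short excerpt of the stop-word list, and the SQLite table `keyword` with its columns is a hypothetical schema, since the report does not describe the database used.

```python
import sqlite3

# Short excerpt of the stop-word list, for illustration only.
STOP_WORDS = {"hay", "do", "tuy", "song", "ngay"}

def tokenize(segmented_text):
    """Stand-in for word segmentation: assumes the text was already segmented
    by an external tool, with multi-syllable words joined by underscores."""
    return [tok.lower() for tok in segmented_text.split()]

def extract_keywords(tokens):
    """Drop stop-words; what remains is treated as the document's keywords."""
    return [t for t in tokens if t not in STOP_WORDS]

def store_keywords(conn, doc_id, label, keywords):
    """Save the keywords of one labeled document (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keyword (doc_id INTEGER, label TEXT, term TEXT)"
    )
    conn.executemany(
        "INSERT INTO keyword VALUES (?, ?, ?)",
        [(doc_id, label, t) for t in keywords],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database for the example
tokens = tokenize("Tuy mua song tran_dau hay")  # made-up segmented text
keywords = extract_keywords(tokens)
store_keywords(conn, 1, "spt", keywords)

rows = conn.execute(
    "SELECT term FROM keyword WHERE label = ? ORDER BY rowid", ("spt",)
).fetchall()
print([r[0] for r in rows])
```

Because the keywords are persisted, a later run can read them back with a query instead of repeating segmentation and stop-word removal.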
After surveying a number of news sites, we settled on the following list of
news classes:

No.  Name                       Label  Description
1    Economy                    nss    Content related to markets, business, ...
2    Education                  edu    Content related to education.
3    Culture and entertainment  ent    Content related to the arts, music, and cinema.
4    Health                     hel    Content related to health.
5    Politics and society       plt    Content related to political and social affairs, ...
6    Science                    sci    Content related to science.
7    Sports                     spt    Content related to sports.
8    Technology                 tec    Content related to technology.

Table: News classes
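
In code, the class list can be kept as a simple lookup from label to display name. The sketch below is an assumption about how such a mapping might look (the names `CATEGORIES` and `describe` are hypothetical). The labels are the ones listed in the table.

```python
# Label -> display name, following the table of news classes.
CATEGORIES = {
    "nss": "Economy",
    "edu": "Education",
    "ent": "Culture and entertainment",
    "hel": "Health",
    "plt": "Politics and society",
    "sci": "Science",
    "spt": "Sports",
    "tec": "Technology",
}

def describe(label):
    """Return the display name for a predicted label."""
    return CATEGORIES.get(label, "Unknown category")

print(describe("spt"))
```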


III. Demo program

We built the website classification program as a web-based application. This
section covers its user interface and the structure of its main source code.
1. User interface

[Figure: Home page]

[Figure: Result display]

2. Source code structure of the main classes

IV. Conclusion
The accuracy of the program depends heavily on the size of the training set
D_train and on the word-segmentation program. Because of limited time, we have
only tested the program on a small D_train, and we work directly on the main
content of a web page, skipping the step of extracting that content from a web
page address.
In the future, we will try to develop and improve our program further. We
would be grateful for our instructor's feedback.
Thank you!

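V. References

[1] Lecture notes for the Artificial Intelligence course, Dr. Nguyen Nhat
Quang, School of Information and Communication Technology, Hanoi University of
Science and Technology.

[2] Lecture notes for the Natural Language Processing course, Dr. Le Thanh
Huong, School of Information and Communication Technology, Hanoi University of
Science and Technology.

[3] vnTagger version 4.0, by Le Hong Phuong, University of Science, Vietnam
National University, Hanoi.

[4] Website: http://vi.wikipedia.org
And several other reference websites.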
