
Contents

I. Overview of classification
1. The concept of text classification (document categorization/classification)
2. Building a text classification system
3. Text-processing stages in text classification
Step 1: Data preprocessing
Step 2: Tokenization
Step 3: Determining term weights
Step 4: Applying the text classification algorithm
II. Algorithms for text classification
1. Decision tree
2. K-Nearest Neighbor (KNN)
3. Naive Bayes (NB)
4. Support Vector Machine (SVM)
5. Support Vector Machines Nearest Neighbor (SVM-NN)
III. Multi-class classification
1. One-against-One (OAO) strategy
2. One-against-Rest (OAR) strategy
IV. Experimental runs of the algorithms
V. Future directions

I. Overview of classification

1. The concept of text classification (document categorization/classification)
Text classification is the task of analyzing the content of a document and then deciding (or predicting) which of a set of predefined document groups it belongs to. A classified document may belong to one group, to several groups, or to none of the groups defined in advance.
There are two approaches to the text classification problem:
Text classification based on the expert-system approach:
In this approach, automatic text classification is controlled manually by knowledge experts, and the expert system is able to make the classification decisions. The expert system consists of a set of manually defined logical rules, one per category, of the form:
If (DNF formula) then (category).
A DNF (Disjunctive Normal Form) formula is a disjunction of conjunctive clauses. A document is classified under the category if it satisfies the formula, that is, if it satisfies at least one clause of the formula.
The obstacle in this approach is the knowledge-acquisition bottleneck of expert systems. The rules must be defined by hand by a knowledge engineer with the help of an expert in the domain the documents cover. If the set of categories is updated, the two experts must intervene again; and if the classifier is moved entirely to a different domain, an expert in that domain must step in and the work has to start over from the original unsorted document collection.
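The rule form "If (DNF formula) then (category)" can be sketched in a few lines of Python. This is only an illustration: the categories, the terms in each clause, and the function name `classify_by_rules` are all assumptions made for the example, not rules taken from a real expert system.

```python
# Hypothetical hand-written rules: each category maps to a DNF formula,
# i.e. a list of clauses; a clause is a set of terms that must all occur.
RULES = {
    "weather": [{"rain", "humidity"}, {"wind", "temperature"}],
    "sports": [{"match", "goal"}, {"tournament"}],
}

def classify_by_rules(text, rules=RULES):
    """Return every category whose DNF formula the document satisfies."""
    words = set(text.lower().split())
    return [
        category
        for category, clauses in rules.items()
        # DNF: at least one clause holds; a clause holds if all its terms occur.
        if any(clause <= words for clause in clauses)
    ]
```

A document can match several categories or none, which mirrors the definition of text classification given earlier. Every change to the category set means editing `RULES` by hand, which is exactly the bottleneck described above.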

Text classification based on the machine-learning approach:
In this approach, a general inductive process (also called the learning process) automatically builds a classifier for a category c_i by observing the characteristics of a set of documents that a domain expert has manually assigned either to c_i or outside c_i. From these, the inductive process gleans the characteristics a new (unseen) document should have in order to be classified under c_i. In machine-learning terms, classification is a supervised learning activity: the learning process is supervised by knowledge of the categories and of the training examples that belong to them.
(Supervised learning is a machine-learning technique for building a function from training data. The training data consist of pairs of an input object (usually a vector) and a desired output. The output of the function can be a continuous value (called regression) or a predicted class label for the input object (called classification). The task of a supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (that is, pairs of inputs and corresponding outputs). To achieve this, the learner must generalize from the available data to unseen situations in a "reasonable" way.)
Machine-learning algorithms in common use today include decision trees, K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes (NB), and Neural Network (NNet).
In addition, some algorithms have been developed from the ones above; for example, SVM has improved variants such as Fuzzy Support Vector Machines. There are also methods that combine several algorithms, such as Support Vector Machines Nearest Neighbor (SVM-NN). Such combinations exploit the strengths and compensate for the weaknesses of the individual algorithms, as the section on the algorithms shows.
2. Building a text classification system
a. Three stages of building a classification system

Three distinct stages can be identified in the design of a text classification system: document representation, classifier construction, and classifier evaluation.

+ Document representation: converting documents from their raw form (.doc, .HTML, .PDF, ...) into a form suitable for the extraction algorithm of the next stage.
+ Classifier construction: from the algorithms chosen for the system, we build its classifier. This can be understood as building the model that serves as the reference for the classification process.
+ Classifier evaluation: from the classifier's results on test data, we evaluate the classifier by the proportion of documents it classifies correctly and incorrectly.
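The evaluation stage can be illustrated with a minimal accuracy computation. The label lists below are made-up values for five test documents, and the function name `accuracy` is an assumption for this sketch.

```python
def accuracy(predicted, actual):
    """Fraction of documents whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    if not actual:
        raise ValueError("cannot evaluate an empty test set")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical labels for five test documents.
actual = ["weather", "sports", "weather", "sports", "weather"]
predicted = ["weather", "sports", "sports", "sports", "weather"]
print(accuracy(predicted, actual))  # 4 of 5 correct
```

The incorrect share is simply one minus this value.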

Figure 1: Text classification system


b. Issues a classification system must address
Accuracy: measured as the percentage of correctly classified documents out of all documents submitted for classification. The higher this rate, the better the system is judged to be.
Speed: a system that classifies quickly but with low accuracy, or slowly but with high accuracy, is not considered a good system, so both speed and accuracy must be ensured.
Understandability: a classification system that is easy to understand gives users more confidence in it, and helps them avoid misreading the result of a rule the system produces.
Learning time: the system should learn a classification rule quickly, or quickly adjust a learned rule so that it matches reality.

3. Text-processing stages in text classification

The steps of the text classification process are as follows.

Step 1: Data preprocessing
The purpose of this step is to clean the data reasonably well so that the later steps can process it better. The work here is limited to converting the input documents into plain character strings (text), with the following requirements:
- Input: the document files to analyze (PDF, TXT, DOC, HTML, HTM)
- Output: a plain character string (text only) in a predefined font and format.

Procedure:
- If the input is a plain text file (txt), all of its content is taken.
- If the input is a rich text file (rtf), the content is extracted as text by an RTF control in the program. The control takes the .rtf file name (including the path) as input and produces ordinary text as output.
- If the input is an MS Word file (doc), Microsoft.Office.Core is used for the conversion. With this tool, converting a Microsoft Word file to text takes a single function call.
- If the input is a PDF file, the PDFbox control is used to read it and discard the attributes the program does not need, such as images, audio, and formatting, keeping only the text.
- If the input is an (htm) or (html) file, cleaning consists of removing the formatting tags, hyperlinks, and image links.

Further data cleaning includes removing:
- Runs of more than one space
- Line breaks
- Blank lines
- Odd characters
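The HTML case and the final cleaning pass can be sketched with Python's standard `re` and `html` modules. The function names and the set of punctuation characters that is kept are assumptions for illustration. Regular expressions are a rough way to strip markup, so a real HTML parser may be preferable for messy pages.

```python
import html
import re

def strip_html(markup):
    """Drop script/style blocks and tags, then decode entities such as &amp;."""
    markup = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", markup)
    markup = re.sub(r"(?s)<[^>]+>", " ", markup)
    return html.unescape(markup)

def clean_text(text):
    """Remove odd characters and collapse spaces, line breaks and blank lines."""
    # Keep letters, digits, underscore, whitespace and basic punctuation (an assumption).
    text = re.sub(r"[^\w\s.,;:!?'\"()-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Calling `clean_text(strip_html(page))` yields a single line of plain text, which is the output this step requires.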

Step 2: Tokenization
The purpose of this step is to split the plain text into words while making sure each word keeps its meaning within the document that contains it. After sentence splitting, the following tasks are considered:

+ Filtering (Filtration)
Filtering is the process of deciding which words should be used to represent the documents, so that they can be used for:
- describing the content of the document;
- distinguishing the document from the other documents in the collection.
At this stage we remove stopwords (the list of words that do not affect the content of the document). For English, these words can be filtered using the list provided at http://armandbrahaj.blog.al/2009/04/14/list-of-english-stopwords/
+ Stemming
Stemming is the process of reducing words to their stem or root form. Thus the words "computer", "computing", and "compute" are reduced to "compute", and "walks", "walking", and "walker" are reduced to "walk". For English, the most widely used stemmer is Martin Porter's Stemming Algorithm.
So, in this step:

Input: the plain-text character string

Output: a vector containing the words extracted from the document.
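The filtering and stemming tasks can be sketched as below. The stopword list is a tiny made-up sample, and `toy_stem` is a deliberately naive suffix stripper written for this example. It is not Porter's algorithm, whose rules are far more detailed.

```python
import re

# A tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

# Suffixes tried in order by the toy stemmer below; this is not Porter's algorithm.
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    """Tokenize, drop stopwords, then stem what remains."""
    return [toy_stem(t) for t in tokenize(text) if t not in STOPWORDS]
```

For the input "The walker walks and is walking", this sketch returns three copies of "walk", which is the vector of words this step hands to the next one.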

Step 3: Determining term weights

The term weights in a document are determined according to the algorithm used in the classification system.
The weights are based on the following quantities:
- Term Frequency (tf): the number of times the term occurs in the document.
- Document Frequency (df): the number of documents in the collection in which the term occurs.
Depending on the algorithm, we choose the quantity that characterizes the weight of the terms in the document.
So, in this step:
- Input: the vector of words
- Output: the represented document, that is, a vector containing the words with their assigned labels (weights)

Figure: Step 1 is marked (a), (b); Step 2 (c); stemming (d); weight determination (e)
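The two quantities can be computed as in the following sketch. Combining them as tf * log(N / df) is one common TF-IDF variant, chosen here as an assumption because the text leaves the exact weighting open. The function names are also illustrative.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """tf: number of times each term occurs in one document."""
    return Counter(tokens)

def document_frequencies(corpus):
    """df: number of documents in the corpus that contain each term."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return df

def tf_idf(tokens, df, n_docs):
    """Weight each term by tf * log(N / df), one common TF-IDF variant."""
    tf = term_frequencies(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df[t] > 0}
```

With this weighting, a term that occurs in every document gets weight 0, since it does not help tell documents apart.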


Step 4: Applying the required text classification algorithm
This is the main step of the program.

Input: the vector of words and the reference data of the document groups

Output: the group the document belongs to
II. Algorithms for text classification

1. Decision tree

This is a method for approximately learning target functions with discrete values. A decision tree can also be converted into an equivalent representation as a knowledge base of If-Then rules.
* Idea of the algorithm
A decision tree classifier is a tree in which each internal node is labelled with a feature, each branch carries a weight value of that feature in the document to be classified, and each leaf is a class label. A document d_j is classified by recursively testing the weights of the features that occur in d_j. The recursion ends when a leaf is reached, and the label of d_j is the label of that leaf. Binary text classification usually fits naturally with binary trees.
The decision tree is organized as follows: internal nodes are labelled with terms, edge labels correspond to the weight of the term in the sample documents, and leaf labels correspond to class labels. Given a document d_j, we compare the labels of the edges leaving an internal node (which corresponds to some term) with the weight of that term in d_j, to decide which node is visited next. This process starts at the root of the tree and is repeated until the visited node is a leaf. When it ends, the label of that leaf is the class assigned to the document.
Example
We have a data table of 10 documents, each described by a binary vector over 7 terms: weather, humidity, rainfall, wind, climate, boat, temperature.
The last column of the table is the label assigned to each document for the topic "weather": a value of 1 in this column means that document d_i belongs to the topic weather, and a value of 0 means that d_i does not.
Table: Representing documents as binary vectors
Document | weather | rainfall | humidity | wind | climate | boat | temperature | class weather
d1
d2
d3
d4
d5
d6
d7
d8
d9
d10

The decision tree built from the table above is:

Figure: Building the decision tree for the training sample set

From this decision tree we obtain a knowledge base of If-Then rules as follows:
If (weather=1) and (rainfall=1) and (humidity=1) Then class weather=1
If (weather=1) and (rainfall=0) and (humidity=1) Then class weather=0
If (weather=1) and (wind=0) and (humidity=0) Then class weather=0
If (weather=1) and (wind=1) and (humidity=0) Then class weather=1
If (weather=0) and (climate=0) Then class weather=0
If (weather=0) and (climate=1) and (temperature=0) Then class weather=0
If (weather=0) and (climate=1) and (temperature=1) Then class weather=1


Consider a document d represented by the following binary vector:
d = (weather, rainfall, humidity, wind, climate, boat, temperature)
= (1, 1, 1, 0, 0, 1, 0)
The search for the answer on the decision tree proceeds as follows:

Figure: Searching for the answer on the decision tree

Class weather=1; in other words, document d belongs to the class of documents about the topic weather (class weather).
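The seven If-Then rules and the lookup for d can be written as a small tree traversal. The nested-tuple encoding of the tree is an assumption made for this sketch. The split order (weather first, then humidity or climate) is inferred from the rules, since the figure itself is not reproduced here.

```python
# Internal nodes are (term, subtree_if_0, subtree_if_1); leaves are class labels.
TREE = (
    "weather",
    ("climate", 0, ("temperature", 0, 1)),   # weather = 0
    ("humidity",                             # weather = 1
        ("wind", 0, 1),                      # humidity = 0
        ("rainfall", 0, 1)),                 # humidity = 1
)

TERMS = ("weather", "rainfall", "humidity", "wind", "climate", "boat", "temperature")

def classify(document, node=TREE):
    """Walk from the root to a leaf, following the branch that matches each term's value."""
    while isinstance(node, tuple):
        term, if_zero, if_one = node
        node = if_one if document[term] == 1 else if_zero
    return node

d = dict(zip(TERMS, (1, 1, 1, 0, 0, 1, 0)))
print(classify(d))  # prints 1
```

The walk visits weather (1), then humidity (1), then rainfall (1), and ends on the leaf labelled 1, in agreement with the first rule. The term "boat" is never tested by the tree.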
2. K-Nearest Neighbor (KNN)
This is a well-known traditional method based on a statistical approach and studied in pattern recognition.
* Idea of the algorithm
The main idea of the K-nearest-neighbor algorithm (K-NN) is to assess how well a document d fits each topic group, based on the k sample documents in the training set that are most similar to d.

When a new document has to be classified, the algorithm computes how close every document in the training set is to it (Euclidean distance, cosine, ...) in order to find the k nearest documents (called the k neighbors), and then uses these closeness scores as weights for all topics. For this to work, the score must grow as documents get closer, as cosine similarity does. The weight of a topic is the sum of the scores of the neighbors that share that topic; a topic that does not appear among the k neighbors has weight 0. The topics are then sorted by decreasing weight, and the topics with high weight are chosen as the topics of the document being classified.
Two issues need attention when classifying text with K-nearest neighbors: defining the notion of closeness and the formula that measures it; and finding the document group that best fits the document (in other words, finding the appropriate topic to assign to it).
Closeness here means the similarity between documents. There are many ways to measure the similarity of two documents, among which the weighted cosine formula is considered effective. Let T = {t1, t2, ..., tn} be the set of terms and W = {wt1, wt2, ..., wtn} the weight vector, where wti is the weight of term ti. Consider two documents X = {x1, x2, ..., xn} and Y = {y1, y2, ..., yn}, where xi and yi are the frequencies of term ti in X and Y respectively. The similarity between X and Y is then computed as follows:
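\[
\mathrm{Sim}(X, Y) = \mathrm{cosine}(X, Y, W) = \frac{\sum_{t \in T} (x_t \times w_t) \times (y_t \times w_t)}{\sqrt{\sum_{t \in T} (x_t \times w_t)^2} \times \sqrt{\sum_{t \in T} (y_t \times w_t)^2}}
\]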

In the vectors X and Y, the components xi and yi are normalized by the frequency of term ti in documents X and Y. The vector W is either set by hand or computed from the inverse document frequency (IDF), in which case the documents are represented as TFxIDF frequency vectors.
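The weighted cosine similarity and the neighbor-voting step can be sketched as follows. The function names, the default k, and returning only the single best topic are assumptions made for this example.

```python
import math
from collections import defaultdict

def weighted_cosine(x, y, w):
    """Cosine similarity of x and y after scaling each component by its term weight."""
    dot = sum((xi * wi) * (yi * wi) for xi, yi, wi in zip(x, y, w))
    norm_x = math.sqrt(sum((xi * wi) ** 2 for xi, wi in zip(x, w)))
    norm_y = math.sqrt(sum((yi * wi) ** 2 for yi, wi in zip(y, w)))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def knn_classify(query, training, weights, k=3):
    """Sum the similarities of the k most similar documents per topic; return the best topic."""
    scored = sorted(
        ((weighted_cosine(query, vec, weights), topic) for vec, topic in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    topic_score = defaultdict(float)
    for similarity, topic in scored[:k]:
        topic_score[topic] += similarity
    return max(topic_score, key=topic_score.get)
```

Here `training` is a list of (term-frequency vector, topic) pairs. Topics absent from the k neighbors never enter `topic_score`, which corresponds to a weight of 0.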
3. Naive Bayes (NB)
NB is a probability-based classification method that is widely used in machine learning.

* Idea of the algorithm

The basic idea of the Naive Bayes approach is to use the conditional probabilities between words and topics to predict the topic probabilities of the document being classified. The key point of the method is its assumption that the occurrences of all words in a document are independent of one another, given the topic. Under this assumption NB does not use the dependence of several words on a topic, nor combinations of words, to judge the topic, and as a result NB computes faster than other methods.
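A minimal multinomial Naive Bayes sketch shows how the independence assumption turns the topic score into a sum of per-word log probabilities. The add-one (Laplace) smoothing and the function names are assumptions of this example, not details given in the text.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents):
    """documents: list of (tokens, topic). Returns priors, per-topic word counts, vocabulary."""
    topic_docs = Counter(topic for _, topic in documents)
    word_counts = defaultdict(Counter)
    for tokens, topic in documents:
        word_counts[topic].update(tokens)
    vocabulary = {t for tokens, _ in documents for t in tokens}
    priors = {topic: n / len(documents) for topic, n in topic_docs.items()}
    return priors, word_counts, vocabulary

def predict_nb(tokens, model):
    """Pick the topic maximizing log P(topic) + sum of log P(word | topic)."""
    priors, word_counts, vocabulary = model
    best_topic, best_score = None, -math.inf
    for topic, prior in priors.items():
        total = sum(word_counts[topic].values())
        score = math.log(prior)
        for t in tokens:
            if t in vocabulary:
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the product.
                score += math.log((word_counts[topic][t] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic
```

Training is a single counting pass over the documents and prediction is one pass over the words per topic, which is why the method is fast.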
4. Support Vector Machine (SVM)
SVM is a classification method that originates from statistical learning theory.
* Idea of the algorithm
Its idea is to map the data (linearly or non-linearly) into a space of feature vectors in which an optimal hyperplane is found that separates the data of two different classes.
Given a training set represented in a vector space in which each document is a point, the method finds the best decision hyperplane h that divides the points of this space into two separate classes, class + and class -. The quality of this hyperplane is determined by the distance (called the margin) from the nearest data point of each class to the hyperplane. The larger the margin, the better the decision hyperplane and the more accurate the classification. The goal of the SVM algorithm is to find the maximum margin.
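The margin criterion can be made concrete by measuring it for two candidate hyperplanes on a toy data set. This sketch only evaluates given hyperplanes of the form w.x + b = 0 and does not perform the SVM optimization. The points, labels, and function name are assumptions for illustration.

```python
import math

def margin(w, b, points, labels):
    """Smallest signed distance y * (w.x + b) / ||w|| over the training points.

    A negative result means the hyperplane misclassifies at least one point.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical 2-D training set: label +1 on the right, -1 on the left.
points = [(2.0, 1.0), (3.0, 0.0), (-2.0, 0.0), (-3.0, 1.0)]
labels = [+1, +1, -1, -1]

centered = margin((1.0, 0.0), 0.0, points, labels)   # hyperplane x = 0
shifted = margin((1.0, 0.0), -1.0, points, labels)   # hyperplane x = 1
```

Both hyperplanes separate the two classes, but the centered one leaves a margin of 2.0 against 1.0 for the shifted one, so it is the one an SVM would prefer.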

Figure: Support vector machine


5. Support Vector Machines Nearest Neighbor (SVM-NN)
Support Vector Machines Nearest Neighbor (SVM-NN) (Blanzieri & Melgani 2006) is a more recent improvement of the SVM classification method. SVM-NN is a machine-learning text classification technique that combines the K-nearest-neighbor (K-NN) approach with SVM-based decision rules.
Idea of the SVM-NN algorithm
The SVM-NN classifier combines the ideas of the SVM classifier and the K-NN classifier.
It works as follows:
- Given a sample to classify, the algorithm finds the k nearest samples among the samples of the training data set.
- An SVM classifier is trained on these samples.
- The trained SVM classifier is then used to classify the unknown sample.
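The three steps above can be sketched in plain Python. The local SVM here is a simple linear model fitted by sub-gradient descent on the hinge loss, which is a stand-in chosen for brevity and not the formulation of Blanzieri & Melgani. The hyperparameters, the squared Euclidean distance, and the shortcut taken when all neighbors share one label are also assumptions. Labels are expected to be +1 or -1.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, lam=0.01):
    """Linear SVM fitted by sub-gradient descent on the regularized hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:   # inside the margin: move towards this sample
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:               # outside the margin: only apply regularization
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def svm_nn_classify(query, points, labels, k=4):
    """Train an SVM on the k nearest training samples, then classify the query with it."""
    by_distance = sorted(
        zip(points, labels),
        key=lambda pair: sum((a - q) ** 2 for a, q in zip(pair[0], query)),
    )
    neighbours = by_distance[:k]
    neighbour_labels = {y for _, y in neighbours}
    if len(neighbour_labels) == 1:      # all neighbours agree: no SVM needed
        return neighbour_labels.pop()
    w, b = train_linear_svm([x for x, _ in neighbours], [y for _, y in neighbours])
    score = sum(wi * qi for wi, qi in zip(w, query)) + b
    return 1 if score >= 0 else -1
```

Because a new SVM is trained for every query, the decision boundary adapts to the local neighborhood, at the cost of more work at classification time.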
III. Multi-class classification

The algorithms above are usually applied to two-class classification, that is, to deciding whether or not a document belongs to a given class. Applying them to multi-class problems requires combining them with additional classification strategies.
The idea of multi-class classification is to reduce the problem to two-class classification by building several two-class classifiers. The common multi-class strategies are One-against-One (OAO) and One-against-Rest (OAR).

Figure: Example of classification using the OAR and OAO strategies

In the figure, the OAR strategy (left panel), unlike OAO (right panel), has to build a hyperplane that separates the class marked "o" from all the other classes.

1. One-against-One (OAO) strategy

In this strategy we use n(n-1)/2 two-class classifiers, built by pairing the classes two at a time, which is why the strategy is also called pairwise. Majority voting is used to combine these classifiers and determine the final classification result. The number of classifiers never exceeds n(n-1)/2.

2. One-against-Rest (OAR) strategy

In this strategy we use one two-class classifier per class, so the n-class problem is converted into n two-class problems.
The drawback of the OAR strategy is that we must build a hyperplane that separates one class from all the remaining classes, which is complex and may be inaccurate.
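Both strategies can be sketched on top of any two-class classifier. To keep the example self-contained, the binary classifier used here is a nearest-centroid rule, which is a stand-in assumption (an SVM would normally take its place). The function names and the data layout are also illustrative.

```python
from collections import Counter
from itertools import combinations

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def train_oao(data):
    """data: {class: [vectors]}. One two-class model per unordered pair of classes."""
    centroids = {c: centroid(vs) for c, vs in data.items()}
    # Stand-in binary classifier (an assumption): the pair's two centroids.
    return [(a, b, centroids[a], centroids[b]) for a, b in combinations(sorted(data), 2)]

def predict_oao(models, x):
    """Each pairwise model votes for one class; the majority wins."""
    votes = Counter(
        a if sq_dist(x, ca) <= sq_dist(x, cb) else b for a, b, ca, cb in models
    )
    return votes.most_common(1)[0][0]

def predict_oar(data, x):
    """One-against-Rest: one 'class vs everything else' model per class (n in total)."""
    scores = {}
    for c, vectors in data.items():
        rest = [v for other, vs in data.items() if other != c for v in vs]
        # Positive score: x is closer to class c than to the pooled rest.
        scores[c] = sq_dist(x, centroid(rest)) - sq_dist(x, centroid(vectors))
    return max(scores, key=scores.get)
```

For n = 3 classes, `train_oao` builds 3(3-1)/2 = 3 pairwise models, while `predict_oar` evaluates 3 class-versus-rest models.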
IV. Experimental runs of the algorithms

V. Future directions
