
Phân loại văn bản tiếng Việt với bộ phân loại vectơ hỗ trợ SVM

Classification of Vietnamese Documents Using Support Vector Machine


Nguyen Linh Giang, Nguyen Manh Hien

Abstract: This paper studies the classification of Vietnamese documents using the Support Vector Machine (SVM). SVM is a learning method that automatically tunes the capacity of the learning machine by maximizing the margin between positive and negative examples in order to optimize generalization performance. This property gives SVM considerable potential for text categorization. We report the results of an experiment on Vietnamese text categorization with SVM.

Keywords: text classification, Support Vector Machine

I. INTRODUCTION

Automatic classification is one of the classic problems in text data processing. It plays an important role whenever large volumes of data must be handled. Many studies worldwide have reported promising results in this direction. Research and applications for Vietnamese text, however, remain limited, largely because of the particular characteristics of Vietnamese vocabulary and sentence structure. In data mining, text classification methods have been built on decision methods such as Bayesian decision, decision trees, k-nearest neighbors, neural networks, and so on. These methods give acceptable results and are used in practice.

In recent years, classification with the Support Vector Machine (SVM) has attracted attention and has been widely used in recognition and classification. SVM is a family of kernel-based methods that minimize the estimated risk. The method originates from the statistical learning theory developed by Vapnik and Chervonenkis [11, 12] and has considerable potential for both theoretical development and practical application. Experiments show that SVM classifies well in text classification and in many other applications (such as handwriting recognition, face detection in images, and regression estimation). Compared with other classification methods, its performance is equivalent or significantly better [1, 2, 3, 4, 10].

Vietnamese text classification has drawn the interest of many research groups in the country in recent years, and several studies have reported promising results. The approaches studied include graph theory [14], rough set theory [13], statistical methods [15], and unsupervised learning with indexing [16, 17]. In general, these approaches all give acceptable results. Reaching practical deployments, however, still requires further research in this direction.

One of the difficulties in applying text classification algorithms to Vietnamese is building the vocabulary of a document, which depends on splitting a sentence into words correctly. To address this, we use a dictionary of Vietnamese terms containing about 11,000 words and phrases. Documents are represented as vectors and classified with SVM.

In this paper, we first present the foundations of SVM and the algorithms for solving the quadratic programming problem that arises from it. We then turn to text classification in the vector representation, with emphasis on text preprocessing, feature selection and document representation, and we analyze why SVM suits the text classification problem. The final part reports experimental results of applying SVM to Vietnamese text classification. These experiments are intended to verify the classification ability of SVM on Vietnamese text and to determine SVM parameters suitable for the given classes.

II. SUPPORT VECTOR MACHINE (SVM) CLASSIFIER

The basic property that determines the quality of a classifier is its generalization performance, that is, its ability to classify new data using the knowledge accumulated during training. A training algorithm is considered good if the classifier it produces generalizes well. Generalization performance depends on two quantities: the training error and the capacity of the learning machine. The training error is the misclassification rate on the training set. The capacity of the learning machine is measured by the Vapnik-Chervonenkis dimension (VC dimension), an important concept for a family of separating functions (that is, classifiers): it is the largest number of points that the family can separate completely in the object space. A good classifier has the lowest capacity (it is the simplest) while keeping the training error small. SVM is built on this idea.

Consider the simplest case, classification into two classes, with the sample set

$$\{(x_i, y_i) \mid i = 1, 2, \ldots, N,\; x_i \in \mathbb{R}^m\}$$

Here the samples are object vectors divided into positive and negative samples. Positive samples are the samples $x_i$ that belong to the domain of interest and are labeled $y_i = +1$. Negative samples are the samples $x_i$ that do not belong to the domain of interest and are labeled $y_i = -1$.

Figure 1. A hyperplane separating the positive samples from the negative samples.

In this case, the SVM classifier is the hyperplane that separates the positive samples from the negative samples with the maximum gap. The gap, also called the margin, is determined by the distance between the positive and the negative samples nearest to the hyperplane (Figure 1). This hyperplane is called the optimal margin hyperplane.

Hyperplanes in the object space have the equation $w^T x + b = 0$, where $w$ is the weight vector and $b$ is the offset. Changing $w$ and $b$ changes the orientation of the hyperplane and its distance from the origin. The SVM classifier is defined as

$$f(x) = \operatorname{sign}(w^T x + b) \tag{1}$$
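As an illustration of equation (1), the decision rule can be written in a few lines of Python. This is a minimal sketch, not the authors' implementation; the function name `svm_decision` and the example hyperplane are hypothetical.

```python
def svm_decision(w, b, x):
    """Return +1 or -1 according to f(x) = sign(w^T x + b), equation (1)."""
    # Dot product w^T x plus the offset b.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # sign(z) is +1 for z >= 0 and -1 otherwise.
    return 1 if z >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 1 = 0 in the plane.
w, b = [1.0, 1.0], -1.0
print(svm_decision(w, b, [2.0, 2.0]))   # point on the positive side
print(svm_decision(w, b, [0.0, 0.0]))   # point on the negative side
```

Scaling `w` and `b` by the same positive constant leaves the predicted label unchanged, which is why the constraints that follow fix the scale through the supporting hyperplanes.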

In (1), $\operatorname{sign}(z) = +1$ if $z \ge 0$ and $\operatorname{sign}(z) = -1$ if $z < 0$. If $f(x) = +1$, then $x$ belongs to the positive class (the domain of interest); conversely, if $f(x) = -1$, then $x$ belongs to the negative class (the other domains).

The SVM learning machine is a family of hyperplanes that depend on the parameters $w$ and $b$. The goal of SVM is to estimate $w$ and $b$ so as to maximize the margin between the positive and negative classes. Different margin values give different families of hyperplanes, and the larger the margin, the lower the capacity of the learning machine. Maximizing the margin therefore amounts to finding a learning machine with the smallest capacity. Classification is optimal when the classification error is minimal.

If the training set is linearly separable, the following constraints hold:

$$w^T x_i + b \ge +1 \quad \text{if } y_i = +1 \tag{2}$$

$$w^T x_i + b \le -1 \quad \text{if } y_i = -1 \tag{3}$$

The two hyperplanes with equations $w^T x + b = \pm 1$ are called the supporting hyperplanes (the dashed lines in Figure 1). To construct the optimal margin hyperplane, we must solve the following quadratic programming problem. Maximize

$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j \tag{4}$$

subject to the constraints

$$\alpha_i \ge 0 \tag{5}$$

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \tag{6}$$

where the Lagrange multipliers $\alpha_i$, $i = 1, 2, \ldots, N$, are the variables to be optimized. The vector $w$ is computed from the solution of this quadratic programming problem as

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \tag{7}$$

To determine the offset $b$, we choose a sample $x_i$ with $\alpha_i > 0$ and use the Karush-Kuhn-Tucker (KKT) condition

$$\alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0 \tag{8}$$

The samples $x_i$ with $\alpha_i > 0$ are those closest to the decision hyperplane (they satisfy equality in (2) and (3)) and are called support vectors. The support vectors are the most important elements of the training set: given the support vectors alone, we can still construct the same optimal margin hyperplane as with the full training set.

If the training set is not linearly separable, the problem can be handled in two ways. The first uses a soft margin hyperplane, which allows some training samples to lie on the wrong side of the separating hyperplane, or on the correct side but inside the region between the separating hyperplane and the corresponding supporting hyperplane. In this case the Lagrange multipliers of the quadratic programming problem acquire a positive upper bound $C$, a parameter chosen by the user. This parameter corresponds to the penalty for misclassified samples.

The second way uses a nonlinear mapping $\Phi$ to map the input data points into a new space of higher dimension. In this space the data points become linearly separable, or can be separated with fewer errors than in the original space. A linear decision surface in the new space corresponds to a nonlinear decision surface in the original space.
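To make equations (4) to (8) concrete, the sketch below solves a two-point toy problem by brute force. The data, the scan over a single multiplier and all names are assumptions chosen for illustration; practical solvers use quadratic programming rather than a grid scan.

```python
# Two hypothetical training points: one positive, one negative.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_objective(alpha):
    """Objective (4): sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

# With one sample per class, constraint (6) forces alpha_1 = alpha_2 = a,
# so it is enough to scan a >= 0 (constraint (5)) on a fine grid.
candidates = [k / 1000 for k in range(0, 1001)]
a = max(candidates, key=lambda v: dual_objective([v, v]))
alpha = [a, a]

# Equation (7): w = sum_i alpha_i y_i x_i.
w = [sum(alpha[i] * y[i] * X[i][d] for i in range(2)) for d in range(2)]
# Equation (8) with alpha_1 > 0 gives y_1 (w^T x_1 + b) = 1, hence b = y_1 - w^T x_1.
b = y[0] - dot(w, X[0])
print(alpha, w, b)
```

For this data the objective reduces to 2a - 4a^2, so the scan settles on a = 0.25, which gives w = (0.5, 0.5) and b = 0. Both points are support vectors, and the margin 2/||w|| equals the distance between them.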

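With a nonlinear mapping $\Phi$ and the upper bound $C$, the original quadratic programming problem becomes: maximize

$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \tag{9}$$

subject to the constraints

$$0 \le \alpha_i \le C \tag{10}$$

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \tag{11}$$

where $k$ is a kernel function satisfying

$$k(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) \tag{12}$$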
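Equation (12) can be checked numerically in a simple case. The sketch below assumes the degree-2 polynomial kernel $(x^T z + 1)^2$ on two-dimensional inputs, for which an explicit feature map is known; the function names `poly2_kernel` and `phi` are hypothetical.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x^T z + 1)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1] + 1.0) ** 2

def phi(x):
    """Explicit feature map into R^6 whose dot product reproduces poly2_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = poly2_kernel(x, z)                            # kernel evaluated in the input space
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))    # dot product in the feature space
print(lhs, rhs)  # the two values agree up to rounding
```

The kernel needs one dot product in two dimensions, while the explicit map works in six; the gap grows quickly with the input dimension and the polynomial degree.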

With a kernel function, we do not need to know the mapping $\Phi$ explicitly. Moreover, by choosing a suitable kernel we can build many different classifiers. For example, the polynomial kernel $k(x_i, x_j) = (x_i^T x_j + 1)^p$ leads to a polynomial classifier, the Gaussian kernel $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ leads to an RBF (Radial Basis Function) classifier, and the sigmoid kernel $k(x_i, x_j) = \tanh(\kappa\, x_i^T x_j + \theta)$, where tanh is the hyperbolic tangent and $\kappa$, $\theta$ are scalar parameters, leads to a two-layer sigmoid neural network (one hidden layer of neurons and one output neuron). An advantage of SVM training over other training methods, however, is that most of the parameters of the learning machine are determined automatically during training.

SVM training

Training an SVM means solving the SVM quadratic programming problem. Numerical methods for this problem require storing a matrix whose size is the square of the number of training samples. In practical problems this is infeasible, because training sets are usually very large (up to tens of thousands of samples). Many algorithms have been developed to overcome this difficulty. They decompose the training set into groups of data, so that the large quadratic programming problem is decomposed into smaller ones, and then check the KKT conditions to identify the optimal solution. Some training algorithms rely on the following property [6]: if the training data of the sub-problem solved at each step contains at least one sample that violates the KKT conditions, then solving that sub-problem increases the objective function.
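The KKT test that decomposition methods rely on can be sketched as follows. The conditions used are the standard ones for the soft-margin problem (9) to (11); the function name, the tolerance `tol` and the sample values are assumptions made for illustration, not details taken from [6] or [7].

```python
def violates_kkt(alpha, y, fx, C, tol=1e-3):
    """Check one training sample against the soft-margin KKT conditions.

    alpha: Lagrange multiplier of the sample, with 0 <= alpha <= C
    y:     label, +1 or -1
    fx:    current classifier output for the sample, before taking the sign
    """
    margin = y * fx
    if alpha <= 0.0:                 # alpha = 0 requires y f(x) >= 1
        return margin < 1.0 - tol
    if alpha >= C:                   # alpha = C requires y f(x) <= 1
        return margin > 1.0 + tol
    return abs(margin - 1.0) > tol   # 0 < alpha < C requires y f(x) = 1


# Hypothetical samples given as (alpha, y, f(x)).
samples = [(0.0, 1, 2.5), (0.0, -1, 0.3), (0.4, 1, 1.0), (1.0, -1, -0.2)]
violators = [i for i, (a, yi, f) in enumerate(samples)
             if violates_kkt(a, yi, f, C=1.0)]
print(violators)
```

Only the second sample is flagged here: its multiplier is zero although it lies on the wrong side of the hyperplane. A decomposition method would bring such a sample into the next sub-problem.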

Consequently, a sequence of quadratic programming sub-problems, each containing at least one sample that violates the KKT conditions, is guaranteed to converge to an optimal solution. We can therefore maintain a sufficiently large working set of fixed size and, at each training step, remove and add the same number of samples.

We focus on the Sequential Minimal Optimization (SMO) training algorithm [7]. SMO uses the smallest possible working set, consisting of two Lagrange multipliers. The smallest quadratic programming sub-problem must involve two multipliers because the multipliers have to satisfy the equality constraint (11). SMO also provides heuristics for choosing which two multipliers to optimize at each step. Although it solves more sub-problems than other methods, each sub-problem is solved very quickly, so the overall quadratic programming problem is also solved quickly.

III. TEXT CLASSIFICATION AND SVM

Text classification is the process of assigning documents of unknown topic to known document classes (corresponding to different topics or domains). Each domain is defined by a number of sample documents from that domain. To carry out classification, training methods are used to build a classifier from the sample documents, and the classifier is then used to predict the class of new documents (of unknown topic).

During classification, documents are represented as vectors whose components (dimensions) are the weights of terms. Word order and other grammatical aspects are ignored here. Some common term-weighting methods are:

1. Term frequency (TF): The weight of a term is the number of times the term occurs in the document. This weighting treats a term as important for a document if it occurs many times in that document.

2. TF-IDF: The weight of a term is the product of its term frequency TF and its inverse document frequency, which is given by

$$\mathrm{IDF} = \log(N / \mathrm{DF}) + 1 \tag{13}$$

where $N$ is the size of the training document set and DF is the document frequency, that is, the number of documents in which the term occurs. The TF-IDF weight thus combines the document frequency DF with the TF weight. The fewer documents a term occurs in (small DF), the better it discriminates between documents.

The terms used to represent documents are also commonly called features. To improve classification speed and accuracy, terms that carry no meaning for classification are removed during text preprocessing. Typically these are terms that occur too rarely or too often. Removing them, however, may not reduce the number of features substantially. With a large number of features, the classifier learns the training set exactly but in many cases predicts new documents poorly. Avoiding this requires a sufficiently large set of sample documents for training, yet collecting a sample set large enough for the number of features is usually hard in practice. For classification to be effective in practice, the number of features must therefore be reduced.

There are many effective feature selection methods. Here we use the mutual information (MI) method, which uses the mutual information between each term and each document class to select the best terms. The mutual information between term $t$ and class $c$ is computed as

$$\mathrm{MI}(t, c) = \sum_{t \in \{0,1\}} \sum_{c \in \{0,1\}} P(t, c) \log \frac{P(t, c)}{P(t)\, P(c)} \tag{14}$$

where $P(t, c)$ is the probability that term $t$ occurs together with class $c$, $P(t)$ is the probability that term $t$ occurs and $P(c)$ is the probability that class $c$ occurs. The global MI measure for term $t$ (computed over the whole training set) is

$$\mathrm{MI}_{avg}(t) = \sum_{i} P(c_i)\, \mathrm{MI}(t, c_i) \tag{15}$$

Feature selection can discard many important terms, which loses information and can reduce classification accuracy considerably. In practice, according to the experiments of Joachims [4], very few features are irrelevant and most carry some information, so a good classifier should be trained with as many features as possible. This makes SVM well suited to text classification, because the SVM algorithm can adjust its classification capacity automatically to ensure good generalization, even in a high-dimensional data space (a very large number of features) with a limited number of sample documents.

In experiments on English text classification, SVM gave fairly promising results [4]. One reason is that text data is usually linearly separable, and SVM determines the optimal hyperplane separating the data. In our experiments on Vietnamese text classification, we also observed that Vietnamese text data is generally separable. When the data is separable, the SVM algorithm only needs to focus on maximizing the margin, which can lead to good generalization performance.

Another notable point when training SVM for text classification is that many different classifiers can be built by choosing suitable kernel functions, as described in Section II. Unlike other methods, however, the model of the learning machine (the optimal parameters $w$, $b$) is learned automatically during SVM training.
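The mutual information scores of equations (14) and (15) can be estimated from document counts. The sketch below assumes that the probabilities are estimated as relative document frequencies and that the logarithm is natural; the paper states neither, and the function names are hypothetical.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI(t, c) of equation (14) from a 2x2 table of document counts.

    n11: documents in class c that contain term t
    n10: documents outside c that contain t
    n01: documents in c that do not contain t
    n00: documents outside c that do not contain t
    """
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, marginal count for t, marginal count for c).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, t_marg, c_marg in cells:
        if joint > 0:  # an empty cell contributes nothing to the sum
            mi += (joint / n) * math.log(joint * n / (t_marg * c_marg))
    return mi

def average_mi(tables, class_sizes):
    """MI_avg(t) of equation (15): sum over classes of P(c_i) * MI(t, c_i)."""
    total = sum(class_sizes)
    return sum((size / total) * mutual_information(*table)
               for table, size in zip(tables, class_sizes))
```

A term that is spread evenly over the classes scores 0, for example `mutual_information(10, 10, 10, 10)`, while a term that occurs in exactly the documents of one of two equally sized classes reaches the maximum of log 2.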

These analyses show that SVM has many properties that suit text classification. Indeed, experiments on English text classification show that SVM achieves high classification accuracy and outperforms other text classification methods. In Section IV of this paper, we report experimental results of applying SVM to Vietnamese text classification.

IV. EXPERIMENTAL RESULTS

We carried out an experiment applying SVM to Vietnamese text classification. The sample set consists of 4162 documents taken from http://vnexpress.net (Table 1). It is split into two parts: 50% is used as the training set and 50% as the test set. The selection of documents for testing the algorithm rests on two assumptions. First, the documents are classified into separate groups. In reality, documents on Vnexpress.net are not classified precisely: the classes overlap, so a document in one class may have features of another class. Second, the distribution of documents within one group does not affect the distribution of documents in another group. This assumption allows the multi-class classification problem to be converted into two-class classification problems.

The SVM classifier is trained on the training set and its generalization performance (accuracy) is evaluated on the test set. The test set takes no part in training, which allows an objective evaluation of generalization performance.
Table 1. Sample documents used in the Vietnamese text classification experiment.

Document category   Training   Test
Music               119        119
Cuisine             109        109
Real estate         119        119
Family              85         86
Education           165        166
Painting            111        112
Archaeology         45         45
Science             119        118
Business            193        194
Law                 155        154
Film                117        117
Health              109        108
Psychology          47         46
World               85         85
Sports              257        256
Fashion             107        106
Computing           140        140

To preprocess the documents, we use a Vietnamese dictionary of 11,210 words. A dictionary is needed because Vietnamese differs from English at the lexical level. English words are separated by spaces and punctuation, so word boundaries in an English sentence can be found from these delimiters alone. In Vietnamese, by contrast, finding word boundaries is quite difficult without understanding the meaning of each word in context and the meaning of the sentence. For example, two words may each be independent and meaningful on their own, yet when they stand next to each other they form a compound that is also an independent word, with a meaning that varies by context. Word boundaries in a Vietnamese sentence therefore cannot be found from delimiters such as ordinary spaces alone. To simplify the problem, we use a Vietnamese dictionary to support word segmentation.

The first preprocessing step counts the number of occurrences of each word in each document. Because Vietnamese words can contain one another (for example "áo" and "áo sơ mi"), matching has to prefer the longer words.
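The preference for longer dictionary words can be sketched as a greedy longest-match scan over syllables. This is an assumed reading of the procedure, not the authors' code; the three-entry dictionary, the `max_len` limit and the function name are hypothetical.

```python
def segment(syllables, dictionary, max_len=4):
    """Greedy longest-match segmentation of a list of syllables."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if candidate in dictionary:
                words.append(candidate)
                i += n
                break
        else:
            i += 1  # no dictionary entry starts here: skip this syllable
    return words


dictionary = {"áo", "áo sơ mi", "trắng"}   # hypothetical toy dictionary
print(segment("áo sơ mi trắng".split(), dictionary))
```

With this dictionary the scan matches the three-syllable entry before the one-syllable entry it contains. Syllables that belong to no dictionary entry are dropped, which is one possible policy among several.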

Word length is measured in syllables, and longer words are extracted first. Words that never occur (in the training set) are discarded, which leaves 7721 words. To experiment with different numbers of features, the 100 most frequent words and the words occurring fewer than 3 times were removed, giving 5709 words; the mutual information method was then used to select 5000, 4000, 3000, 2000 and 1000 words in turn.

For each number of features, the documents are represented as sparse vectors with TF-IDF term weights. Each sparse vector consists of two arrays: an integer array holding the indices of the nonzero values, and a floating-point array holding the corresponding nonzero values. Sparse vectors are used because the number of words occurring in each document is very small compared with the total number of words; this saves memory and considerably speeds up computation. The vectors are also scaled so that their components lie in the interval [0, 1], which prevents components with large values from dominating those with small values and avoids numerical difficulties with large values.

To classify with SVM, we use LIBSVM 2.71 together with its grid.py tool, which selects optimal parameters for SVM with the Gaussian kernel. It does so by dividing the training set into $v$ equal parts; each part in turn is tested with a classifier trained on the remaining $v - 1$ parts. The accuracy for each combination of parameter values ($C$ and $\gamma$) is the fraction of documents in the training set that are predicted correctly in this way. Note that the test set is not involved at all at this stage. Once the optimal $C$ and $\gamma$ have been chosen, the SVM classifier is trained on the whole training set and its accuracy is evaluated by classifying the test set.
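The sparse TF-IDF representation described above can be sketched as follows. The sketch assumes the IDF of equation (13) with a natural logarithm and scales each vector by its largest component so that the values fall in [0, 1]; the paper states neither the log base nor the exact scaling rule, and all names are hypothetical.

```python
import math
from collections import Counter

def idf_table(docs):
    """IDF = log(N / DF) + 1 for every term, as in equation (13)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def sparse_tfidf(doc, vocab_index, idf):
    """Return (indices, values): the two parallel arrays of a sparse vector."""
    tf = Counter(term for term in doc if term in vocab_index)
    pairs = sorted((vocab_index[t], tf[t] * idf[t]) for t in tf)
    if not pairs:
        return [], []
    peak = max(value for _, value in pairs)   # assumed scaling: divide by the maximum
    return [i for i, _ in pairs], [value / peak for _, value in pairs]


docs = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]   # toy tokenized documents
idf = idf_table(docs)
vocab_index = {term: i for i, term in enumerate(sorted(idf))}
print(sparse_tfidf(docs[0], vocab_index, idf))
```

Only the terms present in a document are stored, so the memory cost follows the document length rather than the vocabulary size.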

LIBSVM handles multi-class classification (17 classes in this paper) in a one-against-one fashion: a classifier is trained for every pair of classes, giving $k(k - 1)/2$ classifiers in total, where $k$ is the number of classes. For classes $i$ and $j$, an unknown document $x$ is classified by the classifier trained on these two classes. If $x$ is assigned to class $i$, the score of class $i$ is increased by 1; otherwise the score of class $j$ is increased by 1. We predict that $x$ belongs to the class with the highest score. If two classes tie on this score, we simply choose the one with the smaller index.

Returning to the experiment, the optimal parameters were searched among 110 combinations of $(C, \gamma)$ (with $C = 2^{-5}, 2^{-3}, \ldots, 2^{15}$ and $\gamma = 2^{3}, 2^{1}, \ldots, 2^{-15}$). The results of the parameter selection are given in Table 2. Table 2 shows that the best setting is 7721 features, $C = 2^{15}$ and $\gamma = 2^{-13}$. In this experiment, therefore, feature selection did not bring the desired result: it reduced accuracy. With these parameters, the SVM classifier was trained on the whole training set and its accuracy was then evaluated on the test set, with the results shown in Table 3.
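The one-against-one voting rule, including the tie-break on the smaller class index, can be sketched as follows. The pairwise classifiers here are hypothetical stand-ins supplied as plain functions; this is not the LIBSVM API.

```python
from itertools import combinations

def one_against_one_predict(x, classifiers, k):
    """Vote among pairwise classifiers; ties go to the smaller class index.

    classifiers[(i, j)] with i < j is a function returning a positive value
    when x is assigned to class i and a non-positive value for class j.
    """
    votes = [0] * k
    for i, j in combinations(range(k), 2):
        if classifiers[(i, j)](x) > 0:
            votes[i] += 1
        else:
            votes[j] += 1
    # list.index returns the first maximum, that is, the smallest class index.
    return votes.index(max(votes))


k = 17
print(k * (k - 1) // 2)   # number of pairwise classifiers for 17 classes: 136

# Stand-in classifiers on a scalar x: prefer the class whose index is nearer to x.
toy = {(i, j): (lambda x, i=i, j=j: 1 if abs(x - i) <= abs(x - j) else -1)
       for i, j in combinations(range(3), 2)}
print(one_against_one_predict(1.2, toy, 3))
```

Prediction requires evaluating all k(k - 1)/2 classifiers, but each of them is trained on the documents of only two classes.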
Table 2. Optimal parameters for each number of features.

Features   Best (C, γ)       Accuracy (%)
7721       (2^15, 2^-13)     82.90
5709       (2^13, 2^-11)     82.04
5000       (2^11, 2^-11)     80.40
4000       (2^5, 2^-5)       78.58
3000       (2^5, 2^-5)       78.34
2000       (2^7, 2^-5)       73.87
1000       (2^3, 2^-3)       71.57
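The parameter grid searched in the experiment can be enumerated directly from the exponents given in the text. The sketch below only builds the grid and picks the best pair from a scoring function; `cv_accuracy` is a hypothetical placeholder for whatever cross-validation routine is used, and the code is not the grid.py implementation.

```python
from itertools import product

# Exponents from the text: C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^3, 2^1, ..., 2^-15.
c_values = [2.0 ** e for e in range(-5, 16, 2)]
gamma_values = [2.0 ** e for e in range(3, -16, -2)]

grid = list(product(c_values, gamma_values))
print(len(c_values), len(gamma_values), len(grid))


def best_parameters(grid, cv_accuracy):
    """Return the (C, gamma) pair with the highest cross-validation accuracy.

    cv_accuracy(C, gamma) is supplied by the caller (hypothetical here); it is
    expected to run v-fold cross-validation on the training set only.
    """
    return max(grid, key=lambda pair: cv_accuracy(*pair))
```

The grid has 11 values of C and 10 values of γ, which gives the 110 combinations mentioned in the text.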

Table 3. Classification accuracy for each class and for the whole test set.

Document category   Accuracy (%)
Music               72.27
Cuisine             93.58
Real estate         94.12
Family              72.09
Education           79.52
Painting            82.14
Archaeology         51.11
Science             65.25
Business            83.51
Law                 94.81
Film                66.67
Health              78.70
Psychology          39.13
World               71.76
Sports              98.05
Fashion             76.42
Computing           79.29
All                 80.72

In Table 3, the accuracy over all document classes, 80.72%, is the ratio of the number of correctly predicted documents to the total number of documents in the test set.

Figure 2 illustrates the case in which the SVM classifier is trained on the two classes Family and Education. Figure 2a shows the distribution of the training data points, and Figure 2b shows the distribution of the test data points. There are no misplaced points in Figure 2a, but there are a few in Figure 2b. In this case the SVM learned the training set exactly (it is linearly separable) but made a few mistakes when predicting unknown documents (the test documents).

The accuracy obtained in this experiment on Vietnamese text classification with SVM is not yet high (about 80.72%). This may be due to the text preprocessing and to the fact that the training and test data were not classified accurately. The data was collected from Vnexpress.net and is not classified according to a standard. A document in the Real estate category, for example, may equally well belong to the Business category. The sample document classes are therefore not fully linearly separable in practice but have an ambiguous region, which affects the training of the classifier quite strongly.
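The overall figure in Table 3 can be cross-checked against the per-class accuracies and the test-set sizes of Table 1. The per-class counts of correctly classified documents are not reported in the paper; the sketch below reconstructs them by rounding, which is an assumption.

```python
# Test-set sizes (Table 1) and per-class accuracies in percent (Table 3), in table order.
test_sizes = [119, 109, 119, 86, 166, 112, 45, 118, 194,
              154, 117, 108, 46, 85, 256, 106, 140]
accuracy = [72.27, 93.58, 94.12, 72.09, 79.52, 82.14, 51.11, 65.25, 83.51,
            94.81, 66.67, 78.70, 39.13, 71.76, 98.05, 76.42, 79.29]

# Reconstructed number of correctly classified documents per class (by rounding).
correct = [round(a * n / 100) for a, n in zip(accuracy, test_sizes)]

# Overall accuracy: correctly predicted documents over all test documents.
overall = 100 * sum(correct) / sum(test_sizes)
print(sum(correct), sum(test_sizes), round(overall, 2))
```

The reconstruction gives 1679 correct predictions out of 2080 test documents, that is, 80.72%, in agreement with the last row of Table 3. The overall figure is weighted by class size, so large classes such as Sports influence it more than small ones such as Psychology.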

Figure 2. Values of f(x) = w^T x + b at the documents x in the document set: (a) training set, restricted to the two classes Family and Education; (b) test set, restricted to the two classes Family and Education. Legend: Family documents, Education documents.

Nevertheless, in practical applications such as Web page classification or the classification of large volumes of text, this result can be acceptable. Two issues remain for further research: building a standard experimental data set, which is a major task that requires considerable effort; and testing the classifier with different kernel functions in order to choose the best kernel for a given test data set.

V. CONCLUSION

In this paper, we examined the effectiveness of the SVM classification method. The classifier can automatically adjust its parameters to optimize classification performance, even in high-dimensional feature spaces, and it proves suitable for text classification. In the experiment on Vietnamese text classification, the classification accuracy of 80.72% is acceptable under practical conditions. We are continuing to improve the text preprocessing stage, to build standard training samples and to tune the SVM algorithm in order to raise classification accuracy further.

REFERENCES
[1] B. BOSER, I. GUYON, V. VAPNIK, A training algorithm for optimal margin classifiers, Proceedings of the Fifth Annual Workshop on Computational Learning Theory (ACM), pp 144-152, 1992.
[2] C. BURGES, A tutorial on Support Vector Machines for pattern recognition, Proceedings of Int Conference on Data Mining and Knowledge Discovery, Vol 2, No 2, pp 121-167, 1998.
[3] S. DUMAIS, J. PLATT, D. HECKERMAN, M. SAHAMI, Inductive learning algorithms and representations for text categorization, Proceedings of Conference on Information and Knowledge Management (CIKM), pp 148-155, 1998.
[4] T. JOACHIMS, Text categorization with Support Vector Machines: Learning with many relevant features, Technical Report 23, LS VIII, University of Dortmund, 1997.
[5] S. HAYKIN, Neural networks: A comprehensive foundation, Prentice Hall, 1998.
[6] E. OSUNA, R. FREUND, F. GIROSI, An improved training algorithm for Support Vector Machines, Neural Networks for Signal Processing VII, Proceedings of the 1997 IEEE Workshop, pp 276-285, New York, IEEE, 1997.
[7] J. PLATT, Sequential minimal optimization: A fast algorithm for training Support Vector Machines, Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[8] C.J. VAN RIJSBERGEN, Information Retrieval, Butterworths, London, 1979.
[9] Y. YANG, X. LIU, A re-examination of text categorization methods, Proceedings of the 22nd Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp 42-49, 1999.
[10] Y. YANG, J. PEDERSEN, A comparative study on feature selection in text categorization, Proceedings of the 14th International Conference on Machine Learning (ICML), pp 412-420, Morgan Kaufmann, 1997.
[11] V. VAPNIK, The nature of statistical learning theory, Springer-Verlag, 2000.
[12] V. N. VAPNIK, A. YA. CHERVONENKIS, Teoria Raspoznavaniya Obrazov, Nauka, 1974.
[13] NGUYEN NGOC BINH, Using rough set theory and other techniques to classify and cluster Vietnamese text (in Vietnamese), Proceedings of the ICT.rda04 workshop, Hanoi, 2004.
[14] BICH DIEP, Text classification based on a graph model (in Vietnamese), Master's thesis, University of New South Wales, Australia, 2004.
[15] NGUYEN LINH GIANG, NGUYEN DUY HAI, A statistical model of Vietnamese morphemes and its applications (in Vietnamese), Special issue on research and development in Information and Communication Technology, Journal of Posts and Telecommunications, No 1, July 1999, pp 61-67, 1999.
[16] HUYNH QUYET THANG, DINH THI PHUONG THU, An unsupervised learning approach within supervised learning for Vietnamese text classification, and a proposed improvement to the formula for the relevance between two documents in the vector model (in Vietnamese), Proceedings of the ICT.rda04 workshop, pp 251-261, Hanoi, 2005.
[17] DINH THI PHUONG THU, HOANG VINH SON, HUYNH QUYET THANG, A scheme for building a sample set for Vietnamese text classification: principles, algorithms, experiments and evaluation of results (in Vietnamese), paper submitted to the Journal of Science and Technology, 2005.

Received: 8 June 2005

ABOUT THE AUTHORS

NGUYEN LINH GIANG
Born in 1968 in Hanoi. Graduated from university in 1991 and received a PhD in the former Soviet Union in 1995, specializing in mathematical support for computers. Currently at the Faculty of Information Technology, Hanoi University of Technology. Research interests: optimal control, Vietnamese text processing, network security, multimedia.
Email: giangnl@it-hut.edu.vn

NGUYEN MANH HIEN
Born in 1981. Graduated in 2004 from Hanoi University of Technology, specializing in communications and networks. Currently working at the Faculty of Information Technology, Thuy Loi University. Research interests: machine learning, Vietnamese data mining.
Email: nmhien@gmail.com
