Tran Thi Oanh
HANOI, 2006
Acknowledgments

First of all, I would like to express my sincere thanks and deep gratitude to Dr. Ha Quang Thuy (College of Technology) and PhD candidate Le Anh Cuong (Japan Advanced Institute of Science and Technology), who guided me attentively throughout the preparation of this thesis. I would like to thank the teachers who taught me during the past four years and gave me the valuable knowledge on which I can build my own path. I thank the members of the data-mining seminar group, Nguyen Viet Cuong, Dang Thanh Hai, and Nguyen Cam Tu, for their enthusiastic advice during my research and thesis work. I thank my classmates in K47CC and K47CA for their support and encouragement throughout my studies. Finally, I would like to express my boundless gratitude to my parents and my siblings, who always stood beside me at the most difficult times and helped me overcome hardship in study as well as in life.

Student
Tran Thi Oanh
ABSTRACT

Several text-classification learning algorithms achieve very good results when they are trained on a large set of labeled examples. In practice, however, this condition is hard to meet, because training examples are usually labeled by hand, which costs a great deal of time and effort, while unlabeled data are abundant. Learning algorithms that need only a little labeled data and can exploit the rich supply of unlabeled data have therefore attracted the attention of many researchers; this setting is known as semi-supervised learning. In this thesis we survey the two most typical semi-supervised learning algorithms, self-training and co-training, and propose several refinement techniques for them. We also apply these ideas to the text classification problem, with very encouraging results.
3.4. The experimental data set
3.5. Experimental procedure
3.5.1. Building the features
3.5.2. Setting the model parameters
3.6. Results of the classifiers
3.7. Some remarks on the results obtained
List of figures

Figure 1. Maximum-margin hyperplane (the TSVM algorithm)
Figure 2. Weighted graph over the labeled and unlabeled examples (the Spectral Graph Partition algorithm)
Figure 3. Visual illustration of self-training
Figure 4. Outline of the self-training algorithm
Figure 5. Visual illustration of the co-training setting
Figure 6. Outline of the co-training setting for a two-class problem
Figure 7. Outline of the SAE procedure for maintaining the class distribution
Figure 8. The co-training algorithm with the proposed refinement techniques
Figure 9. Two views of a web page
Figure 10. F1 of the supervised Naive Bayes classifier based on page content
Figure 11. F1 of the original and the improved semi-supervised self-training classifiers
List of tables

Table 1. Comparison of the self-training and co-training settings
Table 2. Description of the classes
Table 3. Machine configuration
Table 4. Supporting software tools
Table 5. Data-processing tools
Table 6. Classes implementing semi-supervised learning
Table 7. List of n-grams
Table 8. Measures of the supervised Naive Bayes classifier based on content
Table 9. Measures of self-training (original / MAX refinement / MEDIAN refinement) based on content
INTRODUCTION
Several text-classification learning algorithms achieve very good results when trained on a large set of learning examples (labeled data). In practice, however, obtaining a large example set is very hard, because the examples usually have to be labeled by hand, which costs a great deal of time and effort, while unlabeled data are abundant. For text-classification problems, and for web page classification in particular, this situation is especially common. Algorithms that need only a little labeled data and can exploit the rich supply of unlabeled data have therefore attracted the attention of many researchers around the world; this kind of learning is called semi-supervised learning. In January 2006, Xiaojin Zhu gave an overview of these algorithms [23].

Semi-supervised learning is learning from both labeled and unlabeled data. The method uses a large amount of unlabeled data together with a small amount of initially labeled data (often called the seed set) to build a classifier. Because information is added from the unlabeled data, the resulting classifier is potentially better than one built from the labeled data alone. There are many semi-supervised learning algorithms, typically EM [20], TSVM (transductive support vector machine) [13], and SGT (spectral graph transduction) [12]. In this thesis we concentrate on the two most widely used, self-training and co-training. The goal of the thesis is to survey and analyze these two algorithms carefully, propose several refinement techniques for them, and apply them to web page classification.

The thesis is organized into four main chapters, with the following content:

Chapter 1 gives an overview of text classification and semi-supervised learning. Before introducing semi-supervised text classification, it presents the essentials of supervised text classification with Naive Bayes as the typical classification algorithm, then introduces semi-supervised learning and contrasts it with supervised learning.

Chapter 2 presents the self-training and co-training algorithms. The first part of the chapter introduces the two semi-supervised algorithms and assesses them.
Building on this assessment, the thesis proposes several refinement techniques and a model for implementing self-training and co-training on top of the Naive Bayes algorithm.
The web page classification experiments are presented in Chapter 3. The experiments with the Naive Bayes methods are described in detail, together with some evaluating remarks on the experimental results. The Conclusion summarizes the results achieved in the thesis and outlines some directions for future research.
Chapter 1. Overview of text classification and semi-supervised learning

However, one can list the stop words for Vietnamese, even though the list may be incomplete (see, for instance, the Appendix). Stemming, i.e. reducing words to their roots and recording the forms derived from each root to improve retrieval, is applied to inflected natural languages such as English.

Evaluation criteria: text classification is considered subjective in the sense that both a human and an automatic classifier can make mistakes. The ambiguity of natural language and the complexity of the classification problem are the most typical causes of classification errors. The effectiveness of a classifier is therefore usually evaluated by comparing its decisions with human decisions on a test set of documents whose class labels were assigned in advance. Three typical measures are used to evaluate a classification algorithm: precision, recall, and the F1 measure, computed by formulas (1.1), (1.2), and (1.3) respectively.
precision = true_positive / (true_positive + false_positive) x 100%    (1.1)

recall = true_positive / (true_positive + false_negative) x 100%    (1.2)

F1(recall, precision) = (2 x recall x precision) / (recall + precision)    (1.3)

In these formulas, positive/negative refers to the examples, while true/false refers to the classifier's output. Concretely, true_positive is the number of positive examples that the classifier correctly assigns to the class, false_positive is the number of negative examples that the classifier wrongly assigns to the class, and false_negative is the number of positive examples that the classifier wrongly rejects from the class. The text classification problem has many practical applications, typically content-filtering applications on the Internet.
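As a concrete check of formulas (1.1)-(1.3), here is a small Python sketch (ours, not part of the thesis's toolchain) computing the three measures from raw counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from classification counts.

    tp: positive examples correctly assigned to the class
    fp: negative examples wrongly assigned to the class
    fn: positive examples wrongly rejected from the class
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 85 true positives, 5 false positives, 15 false negatives
p, r, f1 = precision_recall_f1(85, 5, 15)
```

With these counts the sketch gives precision of about 0.944, recall 0.85, and F1 of about 0.895.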
A document d_i is generated by:

First selecting a mixture component according to the mixture weights (class priors), P(c_j | θ).

Then having that mixture component generate the document according to its own parameters, with distribution P(d_i | c_j; θ).

The likelihood of a document is the total probability over all mixture components:

P(d_i | θ) = Σ_{j=1}^{|C|} P(c_j | θ) P(d_i | c_j; θ)    (1.4)
Each document has a class label. Assuming a one-to-one correspondence between class labels and mixture components, we use c_j to denote both the j-th mixture component and the j-th class. In the multinomial model we assume that the length of a document is independent of its class. The Naive Bayes assumption states that the probability of a word occurring in a document is independent of its context and of its position in the document. Each document d_i is therefore drawn from a multinomial distribution over words, with as many independent trials as the document is long. Defining N_it as the number of occurrences of word w_t in document d_i, the probability of document d_i is

P(d_i | c_j; θ) = P(|d_i|) |d_i|! Π_{t=1}^{|V|} P(w_t | c_j; θ)^{N_it} / N_it!    (1.5)
From this formula we can derive Bayes-optimal estimates of the parameters from the training data. Here, the estimate of the probability of word w_t in documents of class c_j is computed by formula (1.6):

P(w_t | c_j; θ) = (1 + Σ_{i=1}^{|D|} N_it P(c_j | d_i)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} N_is P(c_j | d_i))    (1.6)
The prior probability of each class is computed simply over the classes rather than over the words, by formula (1.7):

P(c_j | θ) = Σ_{i=1}^{|D|} P(c_j | d_i) / |D|    (1.7)
Having estimated the parameters from the training data by equations (1.5), (1.6), and (1.7), we classify test documents by choosing the class with the highest posterior probability under Bayes' rule:

P(c_j | d_i; θ) = P(c_j | θ) P(d_i | c_j; θ) / P(d_i | θ)    (1.8)

Since P(d_i | θ) is the same for every class of a given document, and by the Naive Bayes assumption, formula (1.8) is equivalent to formula (1.9):

P(c_j | d_i; θ) ∝ P(c_j | θ) Π_{t=1}^{|V|} P(w_t | c_j; θ)^{N_it}    (1.9)
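The training rules (1.6)-(1.7) and the classification rule (1.9) can be sketched in Python. This is a minimal illustration of our own, not the thesis's implementation: it uses the hard-count special case where P(c_j | d_i) is 0 or 1 on labeled documents, and toy data with hypothetical class names:

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with Laplace smoothing, following (1.6)-(1.7).
    docs: list of token lists; labels: the class of each document."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}      # (1.7)
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    total = {c: sum(counts[c].values()) for c in classes}

    def p_word(w, c):                                                # (1.6)
        return (1 + counts[c][w]) / (len(vocab) + total[c])

    def classify(doc):                                               # (1.9), in logs
        scores = {c: math.log(prior[c]) + sum(math.log(p_word(w, c)) for w in doc)
                  for c in classes}
        return max(scores, key=scores.get)

    return classify

docs = [["goal", "match"], ["match", "player"], ["court", "law"], ["law", "case"]]
labels = ["sports", "sports", "legal", "legal"]
vocab = {w for d in docs for w in d}
classify = train_nb(docs, labels, vocab)
pred = classify(["goal", "player"])
```

Working in log space avoids underflow when the product in (1.9) runs over many words.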
training data (documents with their class labels). Such training data are rare and expensive, because they are usually produced by people, a time- and labor-consuming process. Consider, for example, the problem of learning which UseNet news articles a user is interested in, so that the system can filter and rank articles and present only the ones the user is most likely to care about, a problem attracting much attention today. According to [20], Lang found that after a user had read and labeled about 1,000 articles, a classifier trained on them reached an accuracy of only about 50% when predicting the 10% of articles on which it was most confident. Most users of a real system, however, lack the patience to label thousands of articles, especially just to reach that level of accuracy. The problem is therefore to build an algorithm that classifies accurately from only a small amount of training data, say a few dozen labeled articles instead of thousands.

The need for large amounts of training data, and the difficulty of obtaining them, raise an important question: is there some other source of information that can reduce the need for labeled data in text classification? This is exactly what motivated the development of semi-supervised learning methods. Looking at how data actually exist, we see that in practice data usually come in an intermediate form: not all of them are labeled, and not all of them are unlabeled. Semi-supervised learning is a method that uses information from both of these sources.

Motivation for semi-supervised learning: there has been a great deal of research on its effectiveness. Experimental as well as theoretical results show that a Maximum Likelihood approach can improve classification accuracy when unlabeled data are added [20]. Other studies show, however, that whether unlabeled data improve accuracy depends on whether the structure of the problem fits the model's assumptions. Recently, Cozman [11] ran experiments on artificial data to investigate the value of unlabeled data. He showed that
classification accuracy can actually decrease as more and more unlabeled data are added. The cause of this decrease is a mismatch between the model's assumptions and the actual data distribution. According to [6], a precondition for semi-supervised learning to help is that the distribution of the examples must be relevant to the classification problem. Formally, the knowledge gained from the unlabeled data p(x) must carry information useful for the inference of p(y | x). Olivier Chapelle [6] proposed a smoothness assumption, namely that the label function is smoother in high-density regions than in low-density regions. It is stated as follows:

Semi-supervised smoothness assumption: if two points x1, x2 in a high-density region are close, then their corresponding outputs y1, y2 should be close as well. By transitivity, if two points are linked by a path through a high-density region, their outputs should also be close.

For text classification, we can picture this as follows: unlabeled data provide information about the joint probability distribution of the keywords. Consider, for example, classifying web pages into two classes, course home pages and non-course pages, with course home page as the target concept. A course home page is then a positive example and every other page a negative example. Suppose that, using only the initial labeled data, we determine that documents containing the word "homework" tend to belong to the positive class. If we use this observation to label the unlabeled data, we may further find that the word "lecture" occurs frequently in the unlabeled documents predicted to be positive. The co-occurrence of the words "homework" and "lecture" over a large set of unlabeled training data can then provide useful information for building a more accurate classifier, one that treats both "homework" and "lecture" as indicators of positive examples.

To understand the nature of semi-supervised learning, we first need to understand supervised and unsupervised learning.
The observations in a sample are usually assumed to be independent and identically distributed (i.i.d.) to simplify the mathematics underlying many statistical methods. In many real applications this is not actually true.

Unsupervised learning: given a sample consisting only of objects, find interesting structures in the data and group similar objects. Mathematically: let X = (x1, x2, ..., xn) be a set of n examples (or points), with xi ∈ X for every i ∈ [n] := {1, 2, ..., n}. We usually assume that the points are drawn independently and identically distributed (i.i.d.) from a common distribution on X. The goal of unsupervised learning is to find interesting structure in the data set X.

Supervised learning: given a sample of object-label pairs (xi, yi), find the predictive relationship between the objects and the labels. The goal is to learn a mapping from x to y, given a training set of pairs (xi, yi), in which the yi are the labels or targets of the examples
and a column vector of labels. As noted, a standard requirement is that the pairs (xi, yi) follow the i.i.d. assumption over X x Y. The task is then well defined: we can evaluate a learned mapping through its predictive performance on a test set. If the labels are continuous, the task is called regression.

There are two families of supervised algorithms, generative models and discriminative models.

Generative models: this approach builds a class-conditional density p(x | y), possibly by some unsupervised learning procedure. The predictive density can then be inferred using Bayes' theorem:
p(y | x) = p(x | y) p(y) / ∫_y p(x | y) p(y) dy    (1.10)
Discriminative models: instead of estimating how the xi are generated, this approach directly estimates p(y | x). Some discriminative methods even limit themselves to modeling whether p(y | x) is greater or smaller than 0.5, SVM being one example. In practice this approach is often judged more effective than the generative one.

Semi-supervised learning can thus be viewed as:
- Supervised learning plus additional unlabeled data.
- Unsupervised learning plus additional labeled data.

Semi-supervised learning uses the information contained in both the unlabeled data and the labeled training set. The main task of a semi-supervised learning algorithm is to extend the initial set of labeled data. Its effectiveness depends on the quality of the examples added at each iteration, judged by two criteria:
- The added examples must be labeled correctly.
- The added examples must bring useful information to the classifier (or to the training data).
Semi-supervised learning is learning over both labeled and unlabeled data. From a large amount of unlabeled data and a small amount of initially labeled data (often called the seed set), it builds a classifier, possibly even a better one. In the process, the method exploits the rich information in the unlabeled data while requiring only a very small amount of labeled data. The general aim is still to obtain results as good as learning on a large fully labeled data set. A natural question is whether semi-supervised methods are actually useful: compared with supervised learning on the labeled data alone, can we hope for more accurate predictions by also taking the unlabeled points into account? The answer is yes, under assumptions appropriate to each model [6].
latent variables [5, 20]. The two notions, incomplete data and latent variables, are related: when latent variables exist the data are incomplete, because we cannot observe the values of those variables; conversely, when the data are incomplete we can posit latent variables for the missing values.

The EM algorithm iterates two steps, the E-step and the M-step. Initially it assigns random values to all model parameters, then repeats:

E-step (Expectation step): compute the expected likelihood of the data given the current parameter settings and the incomplete data.

M-step (Maximization step): re-estimate all the parameters using all the data, yielding a new parameter set.

The process continues until the likelihood converges, for example when it reaches a local maximum. EM is a hill-climbing approach, so it is only guaranteed to reach a local maximum. When several maxima exist, whether the global maximum is reached depends on the starting point of the climb. If we start on the right hill, we can find the global maximum, but finding the right hill is usually hard. Two strategies address this problem. First, try many different starting values and pick the solution whose likelihood converges to the highest value. Second, use a simpler model to choose the starting value for a more complex model: the simpler model helps locate the region containing the global maximum, and we then start from a value inside that region when searching with the more complex model. The EM algorithm is very simple, at least conceptually, and it works well when the data are strongly clustered.
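To make the E-step/M-step loop concrete, here is a minimal Python sketch of our own (not from the thesis) fitting a two-component, one-dimensional Gaussian mixture with unit variances and equal priors, re-estimating only the two means:

```python
import math

def em_gmm_1d(xs, mu, iters=50):
    """EM for a two-component 1-D Gaussian mixture, unit variance, equal priors."""
    m1, m2 = mu
    for _ in range(iters):
        # E-step: expected (soft) component memberships under current parameters
        resp = []
        for x in xs:
            p1 = math.exp(-0.5 * (x - m1) ** 2)
            p2 = math.exp(-0.5 * (x - m2) ** 2)
            resp.append(p1 / (p1 + p2))
        # M-step: re-estimate the means from the soft assignments
        w = sum(resp)
        m1 = sum(r * x for r, x in zip(resp, xs)) / w
        m2 = sum((1 - r) * x for r, x in zip(resp, xs)) / (len(xs) - w)
    return m1, m2

# Two well-separated clusters; a good start converges to roughly 0.1 and 3.9
m1, m2 = em_gmm_1d([0.0, 0.2, 3.8, 4.0], mu=(1.0, 3.0))
```

Starting the same data from a bad initial pair of means (both inside the same cluster) illustrates the local-maximum behavior discussed above.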
find the function f. Then f is used to predict the label y_{n+1} of a new unlabeled example x_{n+1}. The issues with this inductive approach are:
- Collecting labeled data is hard, while unlabeled examples are easy to get.
- The examples to be classified may be known in advance.
- In that case we do not actually care about the classification function f.

This motivates learning in the transductive style.

Transductive learning was introduced by Vapnik in 1998. A learner is called transductive if it works only on the given labeled and unlabeled data and cannot handle data it has not seen. Given a set of labeled examples {(xi, yi): i = 1, 2, ..., n} and a set of unlabeled data x'_1, x'_2, ..., x'_m, the goal is to find the labels y'_1, y'_2, ..., y'_m. Transductive learning does not need to build the function f; its output is a vector of class labels, determined by transferring information from the labeled data to the unlabeled data. Graph-based methods are usually transductive.
Figure 1. Maximum-margin hyperplane: the dashed line is the boundary found by the inductive SVM, the solid line the boundary found by the transductive SVM.
TSVM is an extension of the standard SVM. In standard SVM only labeled data are used, and the goal is the maximum-margin hyperplane over the training examples. In TSVM the unlabeled data points are used as well: the goal is to assign labels to the unlabeled points so that the resulting linear boundary has the largest margin over both the labeled and the unlabeled data (see Figure 1).
Figure 2. Weighted graph over the labeled and unlabeled examples

The goal is to find a minimum cut (v+, v-) of the graph (as in Figure 2), then assign the positive label to all unlabeled examples in the subgraph containing v+ and the negative label to all unlabeled examples in the subgraph containing v-. This method yields a polynomial-time algorithm that finds its true global optimum.
Figure 3. Visual illustration of self-training: a classifier is trained by a learning algorithm, and the examples it labels with high confidence are selected.
Self-training is a very widely used semi-supervised learning technique. An initial classifier is trained with a small amount of labeled data. This classifier is then used to label the unlabeled data. The data points labeled with high confidence (above some threshold), together with their predicted labels, are added to the training set. The classifier is then trained
again on the new, enlarged training set, and the procedure repeats. At each iteration, the learner moves a few of its most confidently labeled examples, together with their predicted classes, into the training data. The name self-training comes from the fact that the algorithm uses its own predictions to teach itself. The algorithm is outlined in Figure 4.

  Let L be the set of labeled data and U the set of unlabeled data.
  Repeat:
    Train the classifier h on the training data L.
    Use h to classify the data in U.
    Find the subset U' of U with the highest confidence scores.
    L + U' -> L
    U - U' -> U

Figure 4: Outline of the self-training algorithm

Self-training has been applied in several natural language processing tasks: Riloff, Wiebe and Wilson (2003) [10] used self-training to decide whether a noun is subjective. Self-training has also been applied in parsing and machine translation.
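The loop of Figure 4 can be sketched in Python. This is our illustration, not the thesis's implementation: the base learner here is a toy one-dimensional nearest-centroid classifier, with the normalized distance margin between the two centroids as the confidence score:

```python
def self_train(train, unlabeled, fit, threshold=0.9, max_iters=10):
    """Figure-4 loop: train on L, label U, move confident predictions into L.
    fit(L) returns a function h mapping an example to (label, confidence)."""
    L, U = list(train), list(unlabeled)
    for _ in range(max_iters):
        if not U:
            break
        h = fit(L)
        confident = [(x, h(x)[0]) for x in U if h(x)[1] >= threshold]
        if not confident:          # nothing passes the threshold: stop
            break
        L += confident
        U = [x for x in U if h(x)[1] < threshold]
    return L

def fit_centroid(L):
    """Toy base learner: 1-D nearest centroid; confidence = distance margin."""
    by = {}
    for x, y in L:
        by.setdefault(y, []).append(x)
    cent = {y: sum(v) / len(v) for y, v in by.items()}
    def h(x):
        d = sorted((abs(x - c), y) for y, c in cent.items())
        tot = d[0][0] + d[1][0]
        return d[0][1], (d[1][0] / tot if tot else 1.0)
    return h

grown = self_train([(0.0, "a"), (10.0, "b")], [1.0, 2.0, 8.5, 9.0], fit_centroid)
```

On this toy data only the two points nearest the seeds clear the threshold; the borderline points stay unlabeled, which is exactly the noise-control behavior the confidence threshold is meant to provide.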
- Each classifier then classifies the unlabeled data and selects the unlabeled examples, together with their predicted labels, that it is most confident about, in order to teach the other classifier.
- Each classifier is then retrained with the additional training examples given by the other, and the process repeats.

The crux of co-training is that the two classifiers must agree in their predictions on the large unlabeled data set as well as on the labeled data. The idea of exploiting feature redundancy has been used in several studies. Yarowsky [8] used this style of bootstrapping for word sense disambiguation, e.g. deciding whether the word "plant" in a given context means a living organism or a factory, by building one sense classifier from the local context of the word and another based on the senses of its other occurrences in the same document. Riloff and Jones [9] classified noun phrases denoting geographic locations by considering the noun phrase itself and the linguistic context in which it appears. Collins and Singer [16] classified named entities using the spelling of the name and the context in which it occurs. The co-training scheme has been used in many areas, such as statistical parsing and noun phrase identification. Figure 5 below gives a visual illustration of the co-training setting.
Blum and Mitchell [4] formalized the two assumptions of the co-training model and proved its correctness in the standard PAC supervised-learning framework. Consider an example space X = X1 x X2, where X1 and X2 correspond to two different views of the same example. Every example x can thus be written as a pair (x1, x2). We assume that each view is sufficient on its own for correct classification. Concretely, let D be a distribution over X, and let C1 and C2 be concept classes defined over X1 and X2 respectively. We assume that all labels of examples with nonzero probability under D are consistent with some target function f1 ∈ C1, and also consistent with some target function f2 ∈ C2. In other words, if f denotes the combined target concept over the whole example, then for any example x = (x1, x2) with label l we have f(x) = f1(x1) = f2(x2) = l; that is, D assigns probability zero to any example (x1, x2) with f1(x1) ≠ f2(x2).

First assumption: compatibility. For a given distribution D over X, we say that a target function f = (f1, f2) ∈ C1 x C2 is compatible with D if D assigns probability zero to the set of examples (x1, x2) with f1(x1) ≠ f2(x2). More generally, the degree of compatibility of f = (f1, f2) with D can be defined as a number 0 ≤ p ≤ 1 by

p = 1 - Pr_D[(x1, x2) : f1(x1) ≠ f2(x2)].

Second assumption: conditional independence. We say that the target functions f1, f2 and the distribution D satisfy the conditional independence assumption if, for any fixed example (x̂1, x̂2) ∈ X with nonzero probability,

Pr_{(x1,x2)~D}[x1 = x̂1 | x2 = x̂2] = Pr_{(x1,x2)~D}[x1 = x̂1 | f2(x2) = f2(x̂2)]

and similarly,

Pr_{(x1,x2)~D}[x2 = x̂2 | x1 = x̂1] = Pr_{(x1,x2)~D}[x2 = x̂2 | f1(x1) = f1(x̂1)].

They showed that, given the conditional independence assumption on D, if the target class is learnable from random classification noise in the standard PAC model, then any initial weak predictor can be boosted to arbitrarily high accuracy using only unlabeled examples by the co-training algorithm. They also proved the correctness of the co-training scheme with the following theorem:

Theorem (A. Blum and T. Mitchell). If C2 is learnable in the PAC model with classification noise, and if the conditional independence assumption is satisfied, then (C1, C2) is learnable in the co-training model from unlabeled data only, given an initial weakly useful predictor h(x1).
Blum and Mitchell ran co-training experiments on web page classification following the scheme of Figure 6, showing that using unlabeled data yields a significant improvement in practice. In this setting, using the pool U' gives better results because it forces the two classifiers to select examples that are more representative of the distribution D that generated U.
  Given:
    L, a set of labeled training examples.
    U, a set of unlabeled examples.
  Create a pool U' of u examples chosen at random from U.
  Repeat for k iterations:
    Use L to train a classifier h1 on the x1 portion of x.
    Use L to train a classifier h2 on the x2 portion of x.
    Let h1 label p positive and n negative examples from U'.
    Let h2 label p positive and n negative examples from U'.
    Add these self-labeled examples to L.
    Randomly choose 2p + 2n examples from U to replenish U'.
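A minimal Python sketch of this loop, simplified by us for determinism (the pool is taken from the front of U rather than at random, and the base learner is a toy per-view nearest-centroid classifier of our own):

```python
def co_train(L, U, fit, k=5, p=1, n=1, u=4):
    """Blum-Mitchell-style loop of Figure 6. fit(examples, view) trains on one
    view and returns a function x -> (label, confidence), labels in {0, 1}."""
    U = list(U)
    pool, U = U[:u], U[u:]                 # the pool U' of u unlabeled examples
    for _ in range(k):
        if not pool:
            break
        h1, h2 = fit(L, 0), fit(L, 1)
        for h in (h1, h2):
            ranked = sorted(pool, key=lambda x: -h(x)[1])
            picks = ([x for x in ranked if h(x)[0] == 1][:p] +   # p positives
                     [x for x in ranked if h(x)[0] == 0][:n])    # n negatives
            for x in picks:
                L = L + [(x, h(x)[0])]
                pool.remove(x)
        pool += U[:2 * (p + n)]            # replenish the pool from U
        U = U[2 * (p + n):]
    return L

def fit_view(L, view):
    """Toy per-view base learner: nearest centroid on one coordinate."""
    by = {}
    for x, y in L:
        by.setdefault(y, []).append(x[view])
    cent = {y: sum(v) / len(v) for y, v in by.items()}
    def h(x):
        d = sorted((abs(x[view] - c), y) for y, c in cent.items())
        tot = d[0][0] + d[1][0]
        return d[0][1], (d[1][0] / tot if tot else 1.0)
    return h

L0 = [((0.0, 0.1), 0), ((9.0, 8.0), 1)]
U0 = [(1.0, 0.5), (8.0, 9.0), (2.0, 1.0), (7.0, 7.5)]
L1 = co_train(L0, U0, fit_view, k=2)
```

Each example is a (view1, view2) pair; notice that h2, trained only on the second coordinate, labels the pool points that h1 left behind, which is the cross-teaching effect the two-view setting is designed to produce.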
Table 1. Comparison of the self-training and co-training settings

Views:
- Self-training: one view.
- Co-training: two independent views.

When to use:
- Self-training: gives good results if its assumptions are satisfied.
- Co-training: because it learns on two views of the data, the two classifiers can supply each other with more useful information.

Drawbacks:
- It is hard to choose the confidence threshold for the predictions (so as to reduce noise in them).
- Some examples may never get labeled, so the number of iterations must be bounded to avoid an infinite loop.
Co-training and self-training are two semi-supervised learning algorithms whose main task is to extend the initial set of labeled examples. The effectiveness of either algorithm depends on the quality of the labeled examples added at each iteration, measured by two criteria:
the correctness of the labels of the added examples, and the useful information the examples bring to the classifier. Regarding the first criterion, the more information a classifier contains, the more reliable its predictions are. Co-training uses two different views of a data example under the assumption that each view is sufficient to predict the label of new examples. This assumption is unrealistic, however, because often even the full feature set of an unlabeled example is not enough to label it correctly; in practical applications, self-training is therefore usually more reliable on this criterion. Regarding the second criterion, the information a newly labeled example brings usually consists of new features. Because co-training trains on two different views, it is more helpful at supplying new information from one view to the other. Selecting newly labeled examples with high confidence is an extremely important issue: if the first criterion is not satisfied and examples are labeled wrongly, the new information they bring not only fails to help but can even reduce the effectiveness of the algorithm.
how to select the best classifier among the intermediate (generated) classifiers, or whether there is another way to make the final classifier the best one. Starting from these demands, we present several refinement techniques to raise the effectiveness of the algorithms: techniques that maintain the class distribution, and techniques that combine the intermediate classifiers. They are described in detail below.
  Input:
    S_N: the set of newly labeled examples;
    S_OL: the original labeled set.
  Output:
    the newly labeled examples retained from S_N.
  Procedure:
    For each label l ∈ L, let s_l (s'_l) be the set of examples in S_OL (S_N) labeled l.
    While (true)
      Compute the new class distribution D_NC from the combined set S = S_N ∪ S_OL,
      and let d_ol and d_nl be the original and new proportions of class l.
      If (there exists a class l ∈ L with d_nl - d_ol > ε) then
        randomly remove r + 1 examples from s'_l, where r satisfies
        (|s_l| + |s'_l| - r) / (|S| - r) = d_ol + ε
      Else break;
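Our reading of this distribution-maintaining step can be sketched in Python as follows. It is a simplification of the boxed procedure: one over-represented newly labeled example is dropped at a time until every class is within ε of its original proportion:

```python
import random

def keep_class_distribution(new_by_class, orig_by_class, eps=0.05, seed=0):
    """While some class is over-represented in S_N ∪ S_OL by more than eps
    compared with the original distribution, randomly drop newly labeled
    examples of that class."""
    rng = random.Random(seed)
    new = {c: list(v) for c, v in new_by_class.items()}
    n_orig = sum(len(v) for v in orig_by_class.values())
    d_orig = {c: len(v) / n_orig for c, v in orig_by_class.items()}
    while True:
        total = n_orig + sum(len(v) for v in new.values())
        over = [c for c in new
                if (len(orig_by_class[c]) + len(new[c])) / total - d_orig[c] > eps]
        if not over:
            return new
        c = over[0]
        new[c].pop(rng.randrange(len(new[c])))   # drop one example of class c

# Original set: 10 "a" + 10 "b"; newly labeled: 6 "a", 0 "b" (skewed toward "a")
out = keep_class_distribution({"a": list(range(6)), "b": []},
                              {"a": list(range(10)), "b": list(range(10))})
```

With ε = 0.05, the procedure keeps only 2 of the 6 new "a" examples, since 12/22 is the largest proportion within ε = 0.05 of the original 0.5.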
2.4.2. Combining classifiers

One way to address the third issue is to combine the intermediate classifiers into a single classifier. The question is how to exploit the strengths of each. We study several strategies that have been used for word sense disambiguation (WSD), as presented in [12]. Let D = {D1, ..., DR} be a set of R classifiers, where

D_i(w) = [d_{i,1}(w), ..., d_{i,c}(w)]

and d_{i,j}(w) is the degree of support with which classifier D_i assigns w to class w_j. Combining the classifiers means finding the class label of w from the outputs of the R classifiers. The output of the final, combined classifier is

D(w) = [μ_1(w), ..., μ_c(w)]

and we assign w to class w_s if μ_s(w) ≥ μ_t(w) for all t = 1, ..., c.
Here we use the following combination strategies:

Median Rule: k = argmax_j (1/R) Σ_{i=1}^{R} d_{i,j}(w)

Max Rule: k = argmax_j max_{i=1..R} d_{i,j}(w)

Min Rule: k = argmax_j min_{i=1..R} d_{i,j}(w)

Majority Voting: k = argmax_j Σ_{i=1}^{R} Δ_{i,j}(w), where Δ_{i,j}(w) = 1 if classifier D_i assigns w to class w_j, and 0 otherwise.

These rules provide different ways to combine the individual classifiers: the median rule averages over all the classifiers; the max rule sets the support for each class to the highest support any single classifier gives it; the min rule sets the support for each class to the support that every classifier agrees to give it; and majority voting chooses the class supported by the largest number of classifiers.
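The four rules can be sketched directly on an R x c support matrix. This is our Python illustration of the formulas above, with a made-up support matrix:

```python
def combine(supports, rule):
    """Combine R classifiers' supports for c classes.
    supports[i][j] = d_ij(w); returns the index of the winning class."""
    R, c = len(supports), len(supports[0])
    if rule == "median":               # average support per class
        score = [sum(col) / R for col in zip(*supports)]
    elif rule == "max":                # best single-classifier support per class
        score = [max(col) for col in zip(*supports)]
    elif rule == "min":                # support every classifier agrees to give
        score = [min(col) for col in zip(*supports)]
    elif rule == "vote":               # one vote per classifier's argmax
        score = [0] * c
        for row in supports:
            score[row.index(max(row))] += 1
    return score.index(max(score))

# Three classifiers, two classes: two weakly favor class 0, one strongly class 1
S = [[0.6, 0.4], [0.55, 0.45], [0.1, 0.9]]
```

Note how the rules can disagree on the same matrix: majority voting follows the two weak supporters of class 0, while the averaging and extremal rules are swayed by the one strongly confident classifier.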
  Input:
    L: the set of labeled training examples;
    U: the set of unlabeled examples;
    H: the base supervised learning algorithm;
    θ: the confidence threshold for selecting a new example;
    C: a strategy for combining the classifiers.
  Algorithm:
  Repeat for k iterations:
    1. Create the pool U' by drawing u examples at random from U; U = U - U'.
    2. Use L on view 1 and the algorithm H to train the classifier h1.
    3. Use L on view 2 and the algorithm H to train the classifier h2.
    4. Use h1 to label the examples in U' and select the newly labeled examples
       whose confidence exceeds the threshold θ.
    5. Use h2 to label the examples in U' and select the newly labeled examples
       whose confidence exceeds the threshold θ.
    Let S_NL be the set of newly labeled examples thus obtained; apply the
    class-distribution maintenance procedure to S_NL and add the retained
    examples to L.

Figure 8: The new co-training algorithm with the class-distribution maintenance procedure and classifier combination.
Chapter 3. Experiments on text classification
Figure 9: Two views of a web page (the page content and the anchor text of links pointing to it).
3.2. The document classes

The thesis's web content classification system is built on the news taxonomy of the online newspaper VnExpress (http://vnexpress.net) of the FPT communications company. We selected the following classes from the VnExpress taxonomy: Computing, Vehicles, Health, Sports, Law, and Culture, because these classes have highly specialized characteristic features. Table 2 describes the content related to each class.

Table 2. Description of the classes

No. | Class name  | VnExpress section   | Related content
1   | Technology  | Computing           | Information and communication technology
2   | Law         | Law                 | Criminal cases, incidents, new legal documents, ...
3   | Vehicles    | Cars and motorbikes | Mainly introductions of new cars and motorbikes
4   | Health      | Health              | Health, gender, beauty care, ...
5   | Sports      | Sports              | Football, tennis, ...; players, matches, ...
6   | Culture     | Culture             | Music, fashion, cinema, fine arts, ...
Table 3: Machine configuration

Component | Specification
CPU       | Pentium IV, 2.26 GHz
RAM       | 384 MB
OS        | Linux Fedora 2.6.11
Table 4: Supporting software tools

No. | Tool           | Author                   | Source / Description
1   | html2text.php  | Jose Solorzano           | http://jexpert.us; noise-filtering tool, applied per web page template to every .html file.
2   | text2telex.php | Nguyen Viet Cuong, K46CA | Converts Vietnamese Unicode-encoded text to the telex encoding for every file produced by html2text.
In addition, during data preparation we wrote several tools running on Linux and Windows, using the GNU GCC compiler suite and the PHP interpreter, as listed in Table 5.
Table 5: Data-processing tools

No. | Tool                 | Description
1   | reject_stop_word.php | Removes the stop words of a document after it has been converted to telex.
2   | format_feature.php   | Counts, for each document, how many times each word occurs.
3   | text2telex.php       | Converts Vietnamese Unicode-encoded text to the telex encoding for every file produced by html2text.
4   | get_AnchorText.php   | Extracts the anchor texts of a web page.
Table 6: Classes implementing semi-supervised learning

No. | Description
1   | Performs arithmetic on numbers of arbitrary length.
2   | Stores the keywords of each class, as counts of how many times word i occurs in class j.
3   | Utility functions used by the other classes.
4   | Extracts the anchor texts of a web page.
5   | Randomly splits the documents of each class into the test, train, and unlabeled sets.
6   | Creates a pool U' from a set of unlabeled documents.
7   | Processes the pool U' just created: assigns class labels, moves the confidently labeled examples into the training set, and runs the SAE procedure on the classes whose distribution deviation exceeds the parameter ε.
8   | Labels the test set from the keyword information accumulated after a number of iterations.
9   | The main program running the bootstrapping algorithms.
10  | Implements the refinement procedures and the classifier combination.
Next, we build the n-grams. For example, for the context-information phrase "du bao cong nghe thong tin Viet Nam nam 2005" ("information technology forecast for Vietnam in 2005"), the list of n-grams is:

Table 7: List of n-grams

Unigrams: du, bao, cong, nghe, thong, tin, Viet, Nam, nam, 2005
Bigrams: du_bao, bao_cong, cong_nghe, nghe_thong, thong_tin, tin_Viet, Viet_Nam, Nam_nam, nam_2005
Trigrams: du_bao_cong, bao_cong_nghe, cong_nghe_thong, nghe_thong_tin, thong_tin_Viet, tin_Viet_Nam, Viet_Nam_nam, Nam_nam_2005

With the n-grams generated as above (see Table 7), we build context-information predicates of the following form, e.g. a predicate stating that document d_i contains the phrase w_t exactly n times: [<d_i> contains <w_t>: n times]. Because self-training and co-training are iterative procedures, every feature gained from a new document is valuable, so we decided to keep all the features for classification rather than discard any of them.
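The n-gram construction of Table 7 can be sketched in a few lines of Python (tokens shown without diacritics, as in the thesis's telex pipeline; this sketch is ours, not the thesis's PHP tooling):

```python
def ngrams(tokens, n):
    """All n-grams of a token list, joined with '_' as in Table 7."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "du bao cong nghe thong tin Viet Nam nam 2005".split()
unigrams = ngrams(tokens, 1)   # 10 unigrams
bigrams = ngrams(tokens, 2)    # 9 bigrams, e.g. "du_bao"
trigrams = ngrams(tokens, 3)   # 8 trigrams, e.g. "Nam_nam_2005"
```

A phrase of m tokens yields m - n + 1 n-grams, matching the counts in Table 7.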
found that the anchor text classifier ...), so for predicting each class we define a combined classifier:

P(c_j | x) = P(c_j | x1) x P(c_j | x2)    (11)
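Rule (11) can be illustrated in Python. The renormalization to a proper distribution is our addition, since (11) itself leaves the product unnormalized, and the class names and probabilities below are made up:

```python
def combine_views(p_view1, p_view2):
    """Combine per-class posteriors from two views (e.g. page content and
    anchor text) by the product rule (11), renormalized to sum to 1."""
    prod = {c: p_view1[c] * p_view2[c] for c in p_view1}
    z = sum(prod.values())
    return {c: v / z for c, v in prod.items()}

p = combine_views({"sports": 0.6, "legal": 0.4},
                  {"sports": 0.3, "legal": 0.7})
```

The product rule lets either view veto a class it finds very unlikely, which is why a confident anchor-text view can override a lukewarm content view.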
3.6. Results of the classifiers

Supervised Naive Bayes classifier based on the content of a document: Table 8 gives the results of this classifier in terms of precision, recall, and the F1 measure.

Table 8: Measures of the supervised Naive Bayes classifier based on content

Class       | Precision | Recall | F1
cong_nghe   | 0.944     | 0.85   | 0.895
phap_luat   | 0.714     | 1      | 0.833
phuong_tien | 0.857     | 0.9    | 0.878
suc_khoe    | 0.778     | 0.7    | 0.737
the_thao    | 1         | 0.65   | 0.788
van_hoa     | 0.727     | 0.8    | 0.762
Average     | 0.837     | 0.817  | 0.815

From Table 8 we see that the measures of the supervised Naive Bayes classifier are quite high; the best F1 reaches 89.5%. We can therefore rely on the predictions of this classifier to run the self-training iterations. Figure 10 plots the F1 measure for each class.
Figure 10: F1 of the supervised Naive Bayes classifier based on content, by class.

Original and improved semi-supervised self-training classifiers based on the content of a document: the measures are listed in Table 9.
Table 9: Measures of self-training (original / MAX refinement / MEDIAN refinement) based on content.

Class       | Original Precision | Original Recall | Original F1 | MAX F1 | MEDIAN F1
cong_nghe   | 0.9                | 0.9             | 0.9         | 0.878  | 0.878
phap_luat   | 0.667              | 1               | 0.8         | 0.833  | 0.816
phuong_tien | 0.818              | 0.9             | 0.857       | 0.857  | 0.857
suc_khoe    | 1                  | 0.7             | 0.824       | 0.824  | 0.824
the_thao    | 0.929              | 0.65            | 0.765       | 0.8    | 0.765
van_hoa     | 0.85               | 0.85            | 0.85        | 0.85   | 0.85
Average     | 0.86               | 0.83            | 0.833       | 0.840  | 0.832

From this table of measures, Figure 11 plots the F1 of the semi-supervised self-training classifiers (original / MAX / MEDIAN) for each class.
39
100 90 80 70 60 50 40 30 20 10 0
85 85 85 80 76.5 76.5
o F1
ng _n gh e
lu at
c_ kh o
g_ tie
th a
ph uo n
ph a
co
3.7. Some remarks on the results obtained

From the results above we draw the following remarks:
- Self-training raises accuracy compared with ordinary supervised learning: the average F1 of supervised learning is 81.5%, while the average F1 of semi-supervised learning is 83.3% for the original self-training, 84% for self-training with the MAX refinement rule, and 83.2% for self-training with the MEDIAN refinement rule.
- We therefore find that applying the refinement rules proposed in this thesis genuinely pays off for this text classification problem.
English
[3]. Andrew McCallum, Kamal Nigam, "A Comparison of Event Models for Naive Bayes Text Classification", Working Notes of the 1998 AAAI/ICML Workshop on Learning for Text Categorization, 1998.
[4]. Avrim Blum, Tom Mitchell, "Combining Labeled and Unlabeled Data with Co-training", Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98), 1998.
[5]. A. P. Dempster, N. M. Laird, D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[6]. O. Chapelle, A. Zien, B. Scholkopf (Eds.), Semi-Supervised Learning, MIT Press, 2006.
[7]. F. Cozman, I. Cohen, M. Cirelo, "Semi-supervised Learning of Mixture Models", ICML-03, 20th International Conference on Machine Learning, 2003.
[8]. David Yarowsky, "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods", Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, 1995.
[9]. E. Riloff, R. Jones, "Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping", Proceedings of the 16th National Conference on Artificial Intelligence, 1999.
References

[10]. Ellen Riloff, Janyce Wiebe, Theresa Wilson, "Learning Subjective Nouns Using Extraction Pattern Bootstrapping", Proceedings of the Conference on Natural Language Learning (CoNLL-2003), 2003.
[21]. Rosie Jones, Andrew McCallum, Kamal Nigam, Ellen Riloff, "Bootstrapping for Text Learning Tasks", IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, 1999.
[22]. Susana Eyheramendy, David D. Lewis, David Madigan, "On the Naive Bayes Model for Text Classification", Artificial Intelligence & Statistics, 2003.
[23]. Xiaojin Zhu, "Semi-Supervised Learning Literature Survey", Computer Sciences TR 1530, University of Wisconsin - Madison, February 22, 2006.
[24]. http://en.wikipedia.org/wiki/