
Hanoi University of Science and Technology

School of Information and Communication Technology

Course project: Natural Language Processing

Topic: A study of word segmentation for Vietnamese text using a genetic algorithm approach.

Supervisor: Assoc. Prof. Dr. Le Thanh Huong

Students:
1. Nguyn Th Thy, HTTT-K53, student ID 20082599
2. Lng Th Hoi Thu, HTTT-K53, student ID 20082588
3. Nguyn nh Hng, HTTT-K53, student ID 20081338
4. Nguyn Phc Th, HTTT-K53, student ID 2008256

Hanoi, 18 April 2012

Table of contents

I- Overview
   1- Problem statement
   2- Vietnamese word segmentation
II- Current methods for Vietnamese word segmentation
   1- The Vietnamese word segmentation problem
      1.1- Comparison between Vietnamese and English
      1.2- Remarks
   2- Approaches to Vietnamese word segmentation
      2.1- Word-based approaches
      2.2- Character-based approaches
   3- Some current Vietnamese word segmentation methods
      3.1- Longest Matching
      3.2- Transformation-based Learning (TBL)
      3.3- Weighted Finite State Transducer (WFST)
      3.4- Word segmentation based on Internet statistics and a genetic algorithm (Internet and Genetics Algorithm-based Text Categorization for Documents in Vietnamese, IGATEC)
      3.5- Machine learning with Hidden Markov Models (HMM)
   4- Conclusion
III- Genetic algorithms
   1- Overview of genetic algorithms
   2- Theoretical basis
IV- Word segmentation based on Internet statistics using a genetic algorithm approach
   1- Research on Internet-based statistics
   2- Word segmentation based on Internet statistics using a genetic algorithm approach (IGATEC)
   3- Conclusion
V- A study of the open-source VnTokenizer for Vietnamese word segmentation
   1- Introduction to the program
   2- Running the program
   3- Source code structure

I- Overview

1- Problem statement

Natural language processing (NLP) is a branch of artificial intelligence that focuses on applications involving human language.

Within artificial intelligence, NLP is one of the hardest areas because it requires understanding the meaning of language, the most complete tool of thought and communication. In essence, NLP converts sound into meaning, with the goal of understanding language.

The analysis steps of NLP are:
- Morphological analysis (Morphology): how words are formed, including prefixes and suffixes.
- Syntactic analysis (Syntax): the grammatical relations between words and phrases.
- Semantic analysis (Semantics): the meaning of words, phrases, and expressions.
- Discourse: the relations between ideas or between sentences.
- Pragmatics: the purpose of an utterance and how language is used in communication.
- World knowledge: knowledge about the world and implicit knowledge.

In morphological analysis, each word is analyzed and non-letter characters (such as punctuation marks) are separated from the words. In English and many other languages, words are separated by spaces. In Vietnamese, however, a space separates syllables ("tiếng"), not words. As with Chinese, Korean, and Japanese, word segmentation in Vietnamese is therefore far from trivial.

2- Vietnamese word segmentation

In English and other non-isolating languages, a word is a meaningful group of characters delimited by whitespace, so word segmentation is very simple. For isolating languages such as Vietnamese, Chinese, and Thai, it is a hard problem, because of the following main properties of isolating languages:
- Words appear in their base form; the form and meaning of a word are independent of syntax.
- Words are built from syllables.
- Words include simple words and complex words (reduplicative words and compound words).

The IGATEC method (Internet and Genetics Algorithm based Text Categorization for Documents in Vietnamese), which segments Vietnamese words using Internet statistics and a genetic algorithm, was proposed by H. Nguyen in 2005 as a new approach to word segmentation for text categorization that needs neither a dictionary nor a training corpus. In this project we therefore study the IGATEC method and give a demo using the open-source VnTokenizer.

II- Current word segmentation methods

1- The Vietnamese word segmentation problem

1.1- Comparison between English and Vietnamese

Main characteristics of Vietnamese and English:

Vietnamese
- An isolating language, also described as non-morphological, non-inflecting, and monosyllabic.
- Words do not change form; grammatical meaning lies outside the word.
- Main grammatical devices: word order and function words.
- Word boundaries are not given automatically by whitespace.
- There is a special word class, the classifier (also called classifier noun), that accompanies nouns.
- Reduplication and spoonerism-style wordplay ("nói lái") occur in Vietnamese.

English
- A non-isolating language of the inflecting (fusional) type.
- Words change form; grammatical meaning lies inside the word.
- Main grammatical device: affixes.
- Morphemes combine tightly and are hard to separate; words are identified by whitespace or punctuation.
- Forming words by attaching affixes to a root is very common.

1.2- Remarks

Because Vietnamese is a non-morphological language, classifying words (noun, verb, adjective, and so on) and determining word meaning is very difficult, even with a dictionary. Text preprocessing (word, paragraph, and sentence segmentation) becomes more complex because function words, auxiliary words, and reduplicative words must be handled.

Because the main grammatical device is word order, methods that rely on the probability of word occurrence may be less accurate than expected.

Word boundaries are not given by whitespace, which makes morphological analysis (word segmentation) of Vietnamese difficult. Identifying word boundaries is important and is a prerequisite for later processing steps such as spell checking, part-of-speech tagging, and word frequency statistics.

Because English and Vietnamese differ in these ways, algorithms designed for English cannot be applied to Vietnamese unchanged.

2- Approaches to Vietnamese word segmentation

Based on segmentation techniques for Chinese and on the similarities between Vietnamese and Chinese, the approaches to Vietnamese word segmentation can be organized as in the following diagram:

Vietnamese segmentation
   Word-based
      Statistic
      Dictionary
         Full word / Phrase
            Shortest Match
            Longest Match
            Overlap Match
         Component
      Hybrid
   Character-based
      Unigram
      N-gram

2.1- Word-based approaches

Word-based approaches aim to extract complete words from a sentence. They fall into three groups: statistics-based, dictionary-based, and hybrid (a combination of several methods).

Statistics-based approach: relies on information such as the frequency of words in an initial training set.

Dictionary-based approach: the idea is that the phrases extracted from the text must match words in a dictionary. Depending on the kind of dictionary used for matching, there are two variants: full word/phrase and component. Full word/phrase requires a complete dictionary, while component uses a dictionary of word components. Depending on how matches are chosen, the full word/phrase variant is divided into three types: longest match, shortest match, and overlap match. In overlap matching, a string generated from the text may overlap another string if it is in the dictionary. Longest match is currently regarded as the most important and most effective method among the dictionary-based approaches.

Hybrid approach: combines different approaches in order to inherit the strengths of several techniques and improve results. It usually combines the statistics-based and dictionary-based approaches. However, the hybrid approach costs more processing time and disk space and is more expensive overall.

2.2- Character-based approaches

In Vietnamese, the smallest morpheme is the syllable ("tiếng"), which is formed from several letters of the alphabet. This approach simply extracts a fixed number of syllables from the text, either one at a time (unigram) or several at a time (n-gram). It has produced useful results in several published studies. Le An Ha [2003] built a 10 MB raw corpus and used dynamic programming to maximize the probability of the phrases. H. Nguyen [2005], instead of using a raw corpus, treated the Internet as a huge corpus, collected statistics from it, and used a genetic algorithm to find the best segmentation. There are also studies by other authors.

Comparing the results of Le An Ha and H. Nguyen, the work of H. Nguyen gives better segmentation results but takes longer to run.

The main advantages of the n-gram approach are that it is simple and easy to apply, and that indexing and handling many queries are cheap. According to several published studies, n-gram segmentation, and bigram segmentation in particular, is considered a suitable choice.

3- Some current Vietnamese word segmentation methods

3.1- Longest Matching

Longest Matching is a greedy algorithm. It scans the syllables from left to right; the longest run of leading syllables that appears in the dictionary is split off as one word. The algorithm stops when all syllables have been processed. It is only correct when there is no ambiguity of the kind where the first syllables of the next word can combine with the previous word to form another dictionary word.

Algorithm:
V is the list of syllables not yet processed. T is the dictionary.
While V ≠ ∅ do
Begin
   Wmax = first syllable of V; // the longest word so far
   Foreach (v in the candidate words formed by the leading syllables of V)
      If (length(v) > length(Wmax) and v ∈ T) then Wmax = v;
   Remove the syllables of Wmax from the head of V;
End.

Example: "Tôi là sinh viên trường đại học Bách Khoa Hà Nội" ("I am a student at Hanoi University of Science and Technology")

Step | Longest possible word | Remaining syllables
1    | Tôi                   | là sinh viên trường đại học Bách Khoa Hà Nội
2    | là                    | sinh viên trường đại học Bách Khoa Hà Nội
3    | sinh viên             | trường đại học Bách Khoa Hà Nội
4    | trường                | đại học Bách Khoa Hà Nội
5    | đại học               | Bách Khoa Hà Nội
6    | Bách Khoa             | Hà Nội
7    | Hà Nội                | (none)
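The greedy procedure of section 3.1 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the toy dictionary, the `max_len` limit of 4 syllables, and the fallback that treats an unknown single syllable as a word are all assumptions made for the example.

```python
def longest_matching(syllables, dictionary, max_len=4):
    """Greedy left-to-right longest-match segmentation."""
    words = []
    i = 0
    while i < len(syllables):
        # Assumed fallback: a single syllable is always accepted as a word.
        best = 1
        # Try longer candidates and keep the longest one found in the dictionary.
        for n in range(2, min(max_len, len(syllables) - i) + 1):
            candidate = " ".join(syllables[i:i + n]).lower()
            if candidate in dictionary:
                best = n
        words.append(" ".join(syllables[i:i + best]))
        i += best
    return words


# Toy dictionary, for illustration only.
DICTIONARY = {"tôi", "là", "sinh viên", "trường", "đại học", "bách khoa", "hà nội"}

sentence = "Tôi là sinh viên trường đại học Bách Khoa Hà Nội"
result = longest_matching(sentence.split(), DICTIONARY)
print(" | ".join(result))
```

With this toy dictionary the function reproduces the seven steps of the table. Given a dictionary that contains both "quan tài" and "tài giỏi", the same function also reproduces the ambiguity problem of greedy matching, because it commits to "quan tài" before it sees "giỏi".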

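Advantages:
- Segmentation is fast and simple and needs only a dictionary.
- Accuracy is relatively high.

Limitations:
- Accuracy depends entirely on the completeness and correctness of the dictionary.
- The method fails when a preceding group of syllables interacts with the words that follow. Example: "một ông quan tài giỏi" ("a talented mandarin") is segmented as "một || ông || quan tài || giỏi", where "quan tài" means "coffin", instead of "một || ông || quan || tài giỏi".

3.2- Transformation-based Learning (TBL)

This approach is based on an annotated corpus. To train the computer to recognize Vietnamese word boundaries, we let it learn from a corpus of tens of thousands of Vietnamese sentences in which the word boundaries are marked correctly.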

After training, the machine has determined the parameters (the probabilities) needed by the word recognition model.

Advantages:
- The method can derive the regularities of the language automatically.
- It keeps the strengths of the rule-based approach while avoiding the cost of having experts write rules by hand.
- Rules are tested on the spot to evaluate their accuracy and effectiveness (on the training corpus).
- It can resolve some ambiguities left by statistical language models.

Limitations:
- The method needs a linguistically annotated corpus to learn the rules automatically. Building a Vietnamese corpus that satisfies all the required criteria is very hard and costs a great deal of time and effort.
- The system needs a fairly long training time to derive a reasonably complete set of rules.
- The implementation is complex.

3.3- Weighted Finite State Transducer (WFST)

The weighted finite-state transducer model for segmentation was proposed in 1996. The basic idea is to apply a WFST whose weights are the occurrence probabilities of each word in a corpus. The WFST is used to traverse the sentence under consideration, and the traversal with the highest probability (equivalently, the lowest total cost) is chosen as the segmentation.

The method was also used in the published work of Dinh Dien [2001], which combines a WFST with a neural network to resolve ambiguity during segmentation. That system has a WFST layer, which segments words and handles issues specific to Vietnamese such as reduplicative words and proper names, and a neural network layer, which resolves remaining semantic ambiguity after segmentation (if any).

Processing steps of the WFST approach

WFST layer: three steps.

Building the weighted dictionary: in the WFST model, word segmentation is treated as a stochastic transduction. The dictionary D is described as a weighted finite-state transducer graph. Let:
- H be the set of Vietnamese orthographic words (also called syllables);
- P be the set of parts of speech (POS) of words.
Each arc of D is either:
- from an element of H to an element of H, or
- from the end-of-word symbol to an element of P.

The labels in D carry an estimated cost given by the formula:

Cost = - log(f/N), where f is the frequency of the word and N is the size of the sample.

For new words that have not been seen before, the author applies the Good-Turing conditional probability estimate (Baayen) to compute the weight.
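As a small illustration of how such costs are used, the sketch below computes Cost = -log(f/N) and then picks the segmentation with the lowest total cost by dynamic programming. It is only a sketch under stated assumptions: it works on a plain cost table rather than a real transducer, the 4-syllable limit is arbitrary, and the cost values are the ones quoted in the worked example of this section.

```python
import math


def cost(frequency, sample_size):
    """Estimated cost of a word: -log(f / N)."""
    return -math.log(frequency / sample_size)


def cheapest_segmentation(syllables, costs, max_len=4):
    """Return (total_cost, words) for the lowest-cost segmentation."""
    n = len(syllables)
    # best[k] holds the cheapest way to segment the first k syllables.
    best = [(math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = " ".join(syllables[start:end])
            if word in costs and best[start][0] < math.inf:
                total = best[start][0] + costs[word]
                if total < best[end][0]:
                    best[end] = (total, best[start][1] + [word])
    return best[n]


# Costs taken from the worked example in this section.
COSTS = {
    "tốc độ": 8.68, "truyền": 12.31, "truyền thông": 12.31, "thông tin": 7.24,
    "tin": 7.33, "sẽ": 6.09, "tăng": 7.43, "cao": 6.95,
}

total, words = cheapest_segmentation("tốc độ truyền thông tin sẽ tăng cao".split(), COSTS)
print(" # ".join(words), round(total, 2))
```

A negative logarithm turns a product of probabilities into a sum of costs, which is why the most probable segmentation is the one with the smallest total.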

Generating candidate segmentations: to limit the combinatorial explosion when generating the possible word sequences from the syllables of a sentence, the author proposes using the dictionary as a filter. When a candidate segmentation is found to be unsuitable (not in the dictionary, not a reduplicative word, not a proper noun), all branches that start from that segmentation are discarded.

Choosing the best segmentation: given the list of possible segmentations of the sentence, the author chooses the one with the smallest weight, as in the following example.

Input: "Tốc độ truyền thông tin sẽ tăng cao"

Dictionary:
   tốc độ        8.68
   truyền        12.31
   truyền thông  12.31
   thông tin     7.24
   tin           7.33
   sẽ            6.09
   tăng          7.43
   cao           6.95

Id(D)*D* = "Tốc độ # truyền thông # tin # sẽ # tăng # cao", cost 48.79 (8.68 + 12.31 + 7.33 + 6.09 + 7.43 + 6.95 = 48.79)
Id(D)*D* = "Tốc độ # truyền # thông tin # sẽ # tăng # cao", cost 48.70 (8.68 + 12.31 + 7.24 + 6.09 + 7.43 + 6.95 = 48.70)

The best segmentation is therefore "Tốc độ # truyền # thông tin # sẽ # tăng # cao".

Neural network layer: the neural network model proposed by the author scores three part-of-speech sequences: NNV, NVN, and VNN (N: Noun, V: Verb). The model is trained on exactly those sentences whose segmentation is still ambiguous after the first layer.

Advantages:
- Accuracy above 97% [Dinh Dien et al., 2001].
- The model returns each segmentation together with a confidence value (a probability).
- Thanks to the neural network layer, the model can resolve cases where the WFST layer returns several candidates with equal scores.
- The method achieves fairly high accuracy because the author aimed at very accurate segmentation as a foundation for machine translation.

Limitations:
- As with TBL, building the corpus is very laborious, although it is genuinely needed for the later goal of machine translation.

3.4- Word segmentation based on Internet statistics and a genetic algorithm (Internet and Genetics Algorithm-based Text Categorization for Documents in Vietnamese, IGATEC)

The IGATEC method, which segments Vietnamese words using Internet statistics and a genetic algorithm, was proposed by H. Nguyen in 2005 as a new approach to word segmentation for text categorization that needs neither a dictionary nor a training corpus. In this approach the author combines a genetic algorithm with statistical data taken from the Internet.
[Figure: IGATEC system diagram, with the components "Online Extractor" and "segmentation"]

The system consists of two parts.

a. Online Extractor: this component retrieves the occurrence frequencies of the words in the text by querying a well-known search engine such as Google or Yahoo. The author then uses the formulas below to compute the mutual information, which is the basis of the fitness value in the GA engine.

Probability that words occur on the Internet:

$$p(w) = \frac{\mathrm{count}(w)}{MAX}, \qquad p(w_1 \,\&\, w_2) = \frac{\mathrm{count}(w_1 \,\&\, w_2)}{MAX}$$

where MAX = 4 * 10^9, count(w) is the number of documents on the Internet found to contain the word w, and count(w1 & w2) is the number of documents that contain both w1 and w2.

Dependence probability of one word on another:

$$p(w_1 \mid w_2) = \frac{p(w_1 \,\&\, w_2)}{p(w_1)}$$

Mutual information of a compound word made of n syllables (cw = w1 w2 ... wn):

$$MI(cw) = \frac{p(w_1 \,\&\, w_2 \,\&\, \dots \,\&\, w_n)}{\sum_{j=1}^{n} p(w_j) - p(w_1 \,\&\, w_2 \,\&\, \dots \,\&\, w_n)}$$
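The probability and mutual information formulas can be tried out with a few lines of Python. This is a sketch only: the hit counts are hypothetical numbers chosen for illustration, no search engine is queried, and returning 0.0 when the denominator is not positive is an assumption of this example rather than part of the method.

```python
MAX = 4 * 10**9  # normalizing constant given in the text


def probability(count):
    """p(w) = count(w) / MAX."""
    return count / MAX


def mutual_information(syllable_counts, joint_count):
    """MI(cw) = p(w1&...&wn) / (sum_j p(wj) - p(w1&...&wn))."""
    joint = probability(joint_count)
    denominator = sum(probability(c) for c in syllable_counts) - joint
    # Assumption: a non-positive denominator is treated as "no evidence".
    return joint / denominator if denominator > 0 else 0.0


# Hypothetical document counts for "sinh", "viên" and the pair "sinh viên".
mi = mutual_information([50_000_000, 40_000_000], 20_000_000)
print(round(mi, 4))
```

The value grows toward 1/(n-1) as the syllables occur together more often relative to how often they occur at all, which is what lets the GA engine use it as evidence that the syllables form one word.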

b. GA Engine for Text Segmentation: each individual in the population is represented by a string of bits 0 and 1, in which each bit stands for one syllable of the text and each run of identical bits stands for one segment. The individuals are initialized randomly, with each segment limited to about 5 syllables. The GA engine then applies mutation and crossover to increase the fitness of the individuals and reach the best possible segmentation.

Advantages:
- No training set or dictionary is needed.
- The method is relatively simple.
- No training time is required.

Limitations:
- Compared with earlier methods, IGATEC is less accurate than LRMM and WFST, but the accuracy is still acceptable for segmentation in support of text categorization.
- The initial run is fairly slow because information must be fetched from the Internet and bandwidth in Vietnam is still limited.
- The method has not been tested on large data sets.

3.5- Machine learning with Hidden Markov Models (HMM)

Word segmentation based on an HMM and a dictionary: with a hidden Markov model, word segmentation can be cast as a probabilistic model in the form of an optimization problem. The Viterbi dynamic programming algorithm then solves that optimization problem.

- First we give the formula for the probability of a partition (a segmentation). This formula also measures how good a partition is: the higher the probability, the more likely the partition is correct. The problem thus becomes maximizing an objective function, the probability of the partition.
- We denote the partition under consideration by W = W1 W2 W3 ... Wm.
- Objective function:

$$P(W) = P(W_1) \prod_{i=1}^{m-1} P(W_{i+1} \mid W_i)$$

where P(Wi) is the probability of Wi and P(Wi+1 | Wi) is the probability of moving from Wi to Wi+1.

These probabilities are estimated from collected data, namely Vietnamese texts. For general-purpose segmentation, the texts must not be biased toward any one domain. If the program targets a specific domain, data collection should of course focus on that domain.

The main idea of the algorithm is dynamic programming. The main loop runs from S1 to Sn, and at each position i only the following values are kept. By the Markov property, the transition probabilities depend only on the word immediately before the current word. At position i we therefore only consider positions j <= i such that the syllables Sj ... Si form a word in the dictionary. Each such j corresponds to a word that may be the last word of the prefix S1 ... Si.

Example: consider i = 2 in the sentence "học sinh học sinh học". For i = 2 there are two corresponding values of j:
- j = 1: the last word is "học sinh";
- j = 2: the last word is "sinh".

Clearly, among the partitions of the prefix that end with the same word (the same value of j), only the best one needs to be kept. In the example, at position i = 2 the dynamic programming table only needs to store:
- for i = 2 and j = 1, the maximum probability when the last word is "học sinh";
- for i = 2 and j = 2, the maximum probability when the last word is "sinh".

The probabilities at position i are computed from the values already computed at positions j - 1. The transition probabilities and the word probabilities are all known.

Evaluation of the method
- Vietnamese word segmentation based on a hidden Markov model and a dictionary is a probabilistic method. In practice it is more effective than LongestWins (the longest word wins) and Maximal Matching (also known as fewest words). The reason is that it rests on a probabilistic model that fits real language use, exploiting statistical and probabilistic information to support segmentation.
- On the other hand, the method still cannot resolve semantic ambiguity because it does not take context into account.
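The dynamic programming table described in section 3.5 can be sketched as below. The sketch keeps, for every pair (j, i), the best log-probability of a partition of the first i syllables whose last word spans syllables j+1 to i (0-based slice `syllables[j:i]`). All probabilities are invented for illustration, and the small `floor` value used for unseen transitions is an assumption of this example.

```python
import math


def viterbi_segment(syllables, word_prob, trans_prob, max_len=4, floor=1e-9):
    """Most probable partition under P(W) = P(W1) * prod P(W_{i+1} | W_i)."""
    n = len(syllables)
    # table[(j, i)] = (best log-probability, previous cell) for partitions of
    # syllables[0:i] whose last word is syllables[j:i].
    table = {}
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = " ".join(syllables[j:i])
            if word not in word_prob:
                continue
            if j == 0:
                table[(j, i)] = (math.log(word_prob[word]), None)
                continue
            best = None
            for (k, end), (score, _) in table.items():
                if end != j:
                    continue
                previous = " ".join(syllables[k:end])
                p = trans_prob.get((previous, word), floor)
                candidate = score + math.log(p)
                if best is None or candidate > best[0]:
                    best = (candidate, (k, end))
            if best is not None:
                table[(j, i)] = best
    finals = [(score, cell) for cell, (score, _) in table.items() if cell[1] == n]
    if not finals:
        return []
    _, cell = max(finals)
    words = []
    while cell is not None:  # follow the back-pointers to recover the words
        words.append(" ".join(syllables[cell[0]:cell[1]]))
        cell = table[cell][1]
    return words[::-1]


# Invented probabilities, for illustration only.
WORD_PROB = {"học": 0.02, "sinh": 0.005, "học sinh": 0.01, "sinh học": 0.004}
TRANS_PROB = {("học sinh", "học"): 0.2, ("học", "sinh học"): 0.3}

segmentation = viterbi_segment("học sinh học sinh học".split(), WORD_PROB, TRANS_PROB)
print(" | ".join(segmentation))
```

Working with log-probabilities avoids numerical underflow on long sentences and turns the product in the objective function into a sum.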
4- Conclusion

Having reviewed several approaches to Vietnamese word segmentation, we see that the published studies agree that word-based methods give fairly accurate results. This accuracy comes from a large training set with correctly marked word boundaries, which makes it possible to learn good segmentation rules for other texts. However, it is also clear that the performance of these methods depends entirely on the training corpus. To overcome the dependence on a dictionary, we propose using the approach of H. Nguyen (presented in detail in a later section) for word segmentation.

The character-based approach is easy to implement and relatively fast, but it is less accurate than the word-based approach. It is generally suitable for applications that do not need perfect segmentation accuracy, such as spam mail filtering and firewalls. If its accuracy can be improved, this approach is entirely feasible and could replace word-based segmentation, because it does not require building a corpus, a task that demands much effort, time, and support from experts in different fields.

III- Genetic algorithms

1- Overview of genetic algorithms

A genetic algorithm is a computer science technique for finding suitable solutions to combinatorial optimization problems. Genetic algorithms are a branch of evolutionary algorithms and apply principles of evolution such as inheritance, mutation, natural selection, and crossover.

A genetic algorithm is typically implemented as a computer simulation in which a population of abstract representations (called chromosomes) of candidate solutions (called individuals) to an optimization problem evolves toward better solutions. Solutions are usually represented in binary as strings of 0s and 1s, although other encodings are possible. Evolution starts from a population of completely random individuals and proceeds over generations. In each generation, the fitness of the whole population is evaluated, several individuals are selected from the current population according to their fitness, and they are modified (by mutation or recombination) to form a new population. The new population is then used in the next iteration of the algorithm.

2- Theoretical basis

A genetic algorithm has four basic operations: crossover, mutation, reproduction, and natural selection.

Crossover

Crossover combines one or more gene segments from two parent chromosomes to form a new chromosome that carries traits of both parents.
Crossover can be described as follows:
- Randomly select two or more individuals from the population. Assume the chromosome strings of both parents have length m.
- Find the crossover point by generating a random number between 1 and m-1. The crossover point splits each parent chromosome string into two parts: m11 and m12 for the first parent, m21 and m22 for the second. The two child chromosome strings are then m11+m22 and m21+m12.
- Put the two child chromosome strings into the population so that they take part in further evolution.

Mutation

Evolution is called mutation when one or more traits of a child are not inherited from the two parent chromosome strings. Mutation occurs with a much lower probability than crossover. Mutation can be described as follows:
- Randomly select a number k with 1 ≤ k ≤ m.
- Change the value of the k-th gene.
- Put the child chromosome into the population so that it takes part in further evolution.

Reproduction and selection

Reproduction is the process by which individuals are copied according to their fitness. Fitness is a function that assigns a real value to each individual in the population. Reproduction can be simulated as follows:
- Compute the fitness of each individual in the population and build the table of cumulative fitness values (in the order of the individuals), which gives the total fitness. Assume the population has n individuals. Let Fi be the fitness of individual i, Fti the cumulative sum up to individual i, and Fm the total fitness.
- Generate a random number F in the interval from 0 to Fm.
- Select the first individual k that satisfies Ftk ≥ F and put it into the population of the new generation.

Selection is the process of discarding poor individuals and keeping good ones. It is described as follows:
+ Sort the population by decreasing fitness.
+ Remove the individuals at the end of the list, keeping only the n best individuals.
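The four operations described above can be sketched on bit-string chromosomes as follows. This is a minimal illustration under assumptions: chromosomes are Python strings of "0" and "1", the fitness function is passed in by the caller, and the function names are chosen for this example.

```python
import random
from itertools import accumulate


def crossover(father, mother, rng):
    """One-point crossover: children are m11+m22 and m21+m12."""
    point = rng.randint(1, len(father) - 1)  # random point in 1..m-1
    return father[:point] + mother[point:], mother[:point] + father[point:]


def mutate(chromosome, rng):
    """Flip the value of one randomly chosen gene."""
    k = rng.randrange(len(chromosome))
    flipped = "1" if chromosome[k] == "0" else "0"
    return chromosome[:k] + flipped + chromosome[k + 1:]


def reproduce(population, fitness, rng):
    """Roulette-wheel pick: first individual whose cumulative fitness >= F."""
    cumulative = list(accumulate(fitness(ind) for ind in population))
    f = rng.uniform(0, cumulative[-1])
    for individual, running_total in zip(population, cumulative):
        if f <= running_total:
            return individual
    return population[-1]


def select(population, fitness, n):
    """Keep the n fittest individuals (sorted by decreasing fitness)."""
    return sorted(population, key=fitness, reverse=True)[:n]


rng = random.Random(0)
count_ones = lambda chromosome: chromosome.count("1")  # toy fitness
child_a, child_b = crossover("0000", "1111", rng)
print(child_a, child_b, mutate("0000", rng), select(["000", "011", "111"], count_ones, 2))
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing parameter settings.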

General structure of a genetic algorithm

Begin
   t = 0;
   Initialize P(t);
   Compute the fitness of the individuals in P(t);
   While (the stopping condition is not met) loop
      t = t + 1;
      Select P(t);
      Crossover P(t);
      Mutate P(t);
   End loop
End.

IV- Word segmentation based on Internet statistics using a genetic algorithm approach (IGATEC)

1- Research on Internet-based statistics

The Internet is a practically endless data store, so extracting information from it cannot be done by hand; it requires a search engine, and Google is the first choice in terms of quality and speed. This is confirmed by the growing number of studies on Internet statistics that rely on the Google search engine. According to the observation of Rudi and Paul (2005), the relative frequency of words on the Internet is fairly stable, which allows accurate and stable computations that depend little on the growth of the number of web pages over time. At present, studies following this new approach are mostly carried out for English. For Vietnamese, IGATEC (Internet and Genetics Algorithm-based Text Categorization for Documents in Vietnamese) can be considered the first work to apply this method, and it has achieved noteworthy results.

2- Word segmentation based on Internet statistics using a genetic algorithm approach (IGATEC)

As presented above, an IGATEC segmentation system is divided into two parts.

Online Extractor: this component retrieves the occurrence frequencies of the words in the text by querying a well-known search engine such as Google or Yahoo. The author then uses formulas to compute the mutual information, which is the basis of the fitness value in the GA engine.

GA Engine for Text Segmentation: each individual in the population is represented by a string of bits 0 and 1, in which each bit stands for one syllable of the text and each run of identical bits stands for one segment. The individuals are initialized randomly, with each segment limited to about 5 syllables. The GA engine then applies mutation and crossover to increase the fitness of the individuals and reach the best possible segmentation.

2.1- Tool for extracting information from Google

We choose Google as the search engine because of its advantages in speed, accuracy, and popularity over other search engines. The extraction tool retrieves:
- The number of documents that contain a word (document frequency) on the web, used to compute the MI formula and predict whether a word exists.
- The number of documents that contain a word together with a keyword representing a topic, used to compute how strongly the word is related to the topics to be classified.

Formulas for probability and mutual information

2.1.1- Probability formulas

The formulas for the probability that a word occurs on the Internet build on the Internet statistics research of Rudi and Paul (2005).
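Let count(w) be the number of web pages that contain the word w, and count(w1 & w2) the number of web pages that contain both w1 and w2. Then:

$$p(w) = \frac{\mathrm{count}(w)}{MAX}, \qquad p(w_1 \,\&\, w_2) = \frac{\mathrm{count}(w_1 \,\&\, w_2)}{MAX}$$

where MAX = 4 * 10^9.

2.1.2- Mutual information (MI) formulas

Mutual information (MI) expresses the mutual dependence of the syllables in a compound word made of n syllables (cw = w1 w2 ... wn). For a one-syllable word we set MI = p(w) by convention. For words of two or more syllables we can use the formula of H. Nguyen (2005):

$$MI(cw) = \frac{p(w_1 \,\&\, w_2 \,\&\, \dots \,\&\, w_n)}{\sum_{j=1}^{n} p(w_j) - p(w_1 \,\&\, w_2 \,\&\, \dots \,\&\, w_n)}$$

Alternatively, an improved MI formula can be used. Suppose we have:
- cw = p(w1 & w2 & ... & wn-1)
- For even n: lw = p(w1 & w2 & ... & wn/2), rw = p(wn/2+1 & wn/2+2 & ... & wn).
- For odd n: lw = p(w1 & w2 & ... & wn-1), rw = p(w2 & w3 & ... & wn).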

MI(cw)=

2.2- Segmentation tool using a genetic algorithm (Genetic Algorithm)

Our goal is to find the most reasonable segmentations of a text, but the search space is very large because of the combinatorial explosion of possible word sequences. Genetic algorithms are known for searching large spaces efficiently and returning good, near-optimal solutions. A GA evolves a population over a number of generations to obtain well-adapted individuals through crossover, mutation, reproduction, and selection. The quality of each individual is measured by a fitness value, computed for each individual and for the population.

2.2.1- Survey of word lengths in a dictionary

A genetic algorithm needs many parameters, such as the population size, the number of generations, the crossover rate, and the mutation rate, and the quality of these choices determines the result of the algorithm. Because the parameters matter so much, we first carry out a small survey of how many words there are of each length in a commonly used dictionary at http://dict.vietfun.com, as a basis for setting the parameters later.

Word length (syllables) | Number of words | Share (%)
1                       | 8933            | 12.2
2                       | 48995           | 67.1
3                       | 5727            | 7.9
4                       | 7040            | 9.7
>= 5                    | 2301            | 3.1

2.2.2- Initializing the population

a. Representing an individual

Suppose the input text t consists of n syllables: t = s1 s2 ... sn. The goal of the GA is to find the segmentation into words with the highest fitness: t = w1 w2 ... wm with wk = si ... sj (1 <= k <= m, 1 <= i, j <= n). Each individual in the population is represented by a string of bits 0 and 1, in which each bit stands for one syllable of the text and each run of identical bits stands for one segment. The author gives the following example:

Syllable | Tôi | là | sinh | viên | trường | đại | học | Bách | Khoa | Hà | Nội
Bit      | 0   | 1  | 0    | 0    | 1      | 0   | 0   | 1    | 1    | 0  | 0
Word     | w1  | w2 | w3   | w3   | w4     | w5  | w5  | w6   | w6   | w7 | w7
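To make the encoding concrete, the following sketch decodes a bit string into words by grouping runs of identical bits. The function name `decode` is chosen for this example; the bit string is the one from the author's example.

```python
from itertools import groupby


def decode(syllables, bits):
    """Turn a bit string into words: each run of equal bits is one word."""
    words, position = [], 0
    for _, run in groupby(bits):
        length = len(list(run))
        words.append(" ".join(syllables[position:position + length]))
        position += length
    return words


syllables = "Tôi là sinh viên trường đại học Bách Khoa Hà Nội".split()
bits = "01001001100"
print(decode(syllables, bits))
```

One consequence of this encoding is that a word boundary exists exactly where two neighbouring bits differ, so the complemented bit string describes the same segmentation.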

b. Initializing the parameters

Before the GA can run, the parameters must be given values: the number of generations, the population size, the crossover rate, and so on. After much trial and error, the author settled on the following values:

Parameter                             | Value
Number of generations                 | 100
Population size                       | 50
Crossover rate                        | 95%
Mutation rate                         | 5%
Top N individuals selected            | 100
Share of 1-syllable words (mono-gram) | 10%
Share of 2-syllable words (bi-gram)   | 70%
Share of 3-syllable words (tri-gram)  | 10%
Share of 4-syllable words (quad-gram) | 10%

c. Initializing the individuals

The purpose of the genetic algorithm is to evolve the individuals over generations until the fitness converges. Randomly initialized individuals have low fitness and must evolve over many generations to reach the required convergence, and more generations mean more time and higher computational cost. A better solution is to initialize some individuals close to the convergence point, which reduces the number of generations needed. There are two ways to initialize individuals: random initialization and initialization based on Maximum Matching (Forward / Backward).

Random initialization:

First, every compound word wk that is generated has a length of at most 4 syllables. Next, the individuals are initialized randomly so that the numbers of words of each length follow the length shares given above, which provides a good starting point for the GA.

Example: suppose the input sentence is "Những con khủng long trong phim hoạt hình rất đáng yêu" ("The dinosaurs in the cartoon are very cute").

Initialization by Maximum Matching (Forward / Backward):

This method is fairly accurate, so it is well suited to initializing the first individuals: it produces individuals that are close to correct and so reduces the number of generations. It is also quite simple and runs in linear time, so its computation time is lower than that of other methods. After initialization, the population evolves through crossover, mutation, and reproduction.

2.2.3- Evolving the individuals

a. Crossover

The author performs crossover at a random point in the 0/1 bit string of the individuals. Given a pair of parents, the children are created by combining the first part of the father with the last part of the mother, and vice versa.
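The random initialization described in part c can be sketched as follows. The weights are the word-length shares from the parameter table (10%, 70%, 10%, 10%); how the author actually samples individuals is not specified, so drawing word lengths one at a time and truncating the last word are assumptions of this sketch.

```python
import random

# Word-length shares from the parameter table: 1 to 4 syllables.
LENGTH_WEIGHTS = {1: 0.10, 2: 0.70, 3: 0.10, 4: 0.10}


def random_individual(n_syllables, rng):
    """Build a bit string whose runs (words) have 1 to 4 syllables."""
    bits, bit = [], "0"
    while len(bits) < n_syllables:
        length = rng.choices(list(LENGTH_WEIGHTS),
                             weights=list(LENGTH_WEIGHTS.values()))[0]
        # Assumption: the last word is cut short if it would overrun the text.
        length = min(length, n_syllables - len(bits))
        bits.extend(bit * length)
        bit = "1" if bit == "0" else "0"  # alternate bits to mark a new word
    return "".join(bits)


rng = random.Random(42)
sentence = "Những con khủng long trong phim hoạt hình rất đáng yêu".split()
population = [random_individual(len(sentence), rng) for _ in range(5)]
print(population)
```

Because the bit value alternates after every word, no run can exceed 4 bits, which enforces the length limit directly in the encoding.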

b. Mutation

Mutation is performed by swapping two adjacent bits at a random position. This is a natural choice: if a syllable does not fit with the preceding syllable, it can instead combine with the word that follows (the next bit). In this way meaningful words are created, whereas swapping two bits at two arbitrary positions in the sentence would tend to create meaningless words.
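A minimal sketch of this adjacent-swap mutation on a bit-string individual is given below. The function name and the use of a seeded `random.Random` instance are choices made for this example.

```python
import random


def swap_mutation(bits, rng):
    """Swap two adjacent bits at a random position."""
    if len(bits) < 2:
        return bits
    i = rng.randrange(len(bits) - 1)
    chars = list(bits)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(7)
print(swap_mutation("01001001100", rng))
```

If the two bits are equal the individual is unchanged; if they differ, the word boundary between them moves by one syllable, which matches the intuition of re-attaching a syllable to the neighbouring word.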

c. Reproduction

After crossover and mutation, the parent individuals are combined with the newly created children in preparation for the selection step. After combining them, we select individuals from the population so as to obtain many good segmentations.

d. Selection

Selection is a very important step in the evolution process because it picks the good individuals. This choice determines how the next generation evolves and affects the convergence of the genetic algorithm. After reproduction, the population is sorted by decreasing fitness, and selection keeps the N individuals with the highest fitness to form a new population for the next round of evolution. The fitness values used for selection are:

$$fit(id) = fit(w_1 w_2 \dots w_m) = \sum_{k=1}^{m} MI(w_k)$$

$$fit(pop) = \sum_{i=1}^{N} fit(id_i)$$

where id = w1 w2 ... wm is an individual in the population.
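The two fitness formulas and the top-N selection can be sketched as follows. Individuals are assumed here to be already decoded into lists of words, and the MI values come from a small made-up table rather than from search-engine counts.

```python
def fitness(words, mi):
    """fit(id) = sum of MI(w_k) over the words of one individual."""
    return sum(mi(word) for word in words)


def population_fitness(population, mi):
    """fit(pop) = sum of fit(id_i) over all individuals."""
    return sum(fitness(individual, mi) for individual in population)


def select_top(population, mi, n):
    """Sort by decreasing fitness and keep the n best individuals."""
    return sorted(population, key=lambda ind: fitness(ind, mi), reverse=True)[:n]


# Made-up MI values, for illustration only.
MI_TABLE = {"sinh viên": 0.30, "sinh": 0.01, "viên": 0.02, "là": 0.05}


def mi(word):
    return MI_TABLE.get(word, 0.0)


population = [["là", "sinh", "viên"], ["là", "sinh viên"]]
best = select_top(population, mi, 1)
print(best, round(population_fitness(population, mi), 2))
```

Keeping the MI lookup behind a function makes it easy to swap the table for cached search-engine statistics without touching the selection code.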

e. Convergence

The GA tries to increase the fitness of each individual, which also means increasing the quality of the segmented words. In each generation the fitness of the population therefore increases gradually toward a convergence threshold T. After some evolution, the difference between the fitness values of two individuals in the population decreases and approaches 0, or the fitness reaches the convergence threshold T that we choose.

3- Conclusion

The method proposed by H. Nguyen has the advantage of needing no training set or dictionary, so no training time is required. The segmentation method itself is not complex either. Compared with the LRMM and WFST methods presented above, IGATEC is less accurate, but the accuracy is entirely acceptable. The initial run is somewhat slow because information must be fetched from the Internet.

V- A study of the open-source VnTokenizer for Vietnamese word segmentation

1- Introduction to the program

VnTokenizer is a program for automatic Vietnamese word segmentation. It was researched, developed, and implemented by a group of young lecturers at the Faculty of Mathematics, Mechanics and Informatics, University of Science, Vietnam National University, Hanoi. Version 4.0.1 continues the development of earlier versions, improving and upgrading the features and making the program easier to reuse and extend for long-term Vietnamese processing purposes. Automatic paragraph and word segmentation is an indispensable preprocessing step in almost every area of automatic natural language processing.

2- Running the program

The program is written in Java using the J2SDK 1.6 development kit. To run it, the computer needs JRE version 1.6 or later, which can be downloaded from the Sun Microsystems Java web site, http://java.sun.com, and installed. Note that running the program requires only the JRE, not the JDK.

The program is distributed in two forms, executable and source, as the two archives vnTokenizer-bin.zip and vnTokenizer-src.zip. Ordinary users should extract vnTokenizer-bin.zip into a directory. To run the program, the user needs the file vnTokenizer.jar (which contains the executable code) and the resource directory (which contains the data and configuration files of the program). For convenience, two utility scripts, tokenizer.sh and tokenizer.bat, are provided for the two common operating systems, Unix/Linux and MS Windows.

2.1- Data

In a single run, vnTokenizer can segment one file, or several files located in the same directory at once.

1) Segmenting one file:
+) Input: one plain-text Vietnamese file (for example, the README.txt file of the program).
+) Output: one text file containing the segmentation result, written in a simple format or in XML, depending on the user's choice.

2) Segmenting several files in a directory:
+) Input: a directory containing the plain-text files to be segmented (the input directory) and an empty directory (the output directory) to hold the results.
+) By default, the program scans the whole input directory and picks up all files with the extension ".txt". The user can change this default extension to any other, for example ".seg", with the -e command-line option.
+) Output: the set of result files in the output directory. Each file has the same name as the corresponding input file, so input/abc.txt produces output/abc.txt.

2.2- Running the program

Segmenting one file:

   vnTokenizer.sh -i <input-file> -o <output-file> [<options>]

The two options -i and -o are mandatory. In addition, the user can supply the following optional flags:
+) -xo : write the result in XML instead of the default plain-text format.
+) -nu : do not use the underscore character (no underscore) when writing the result. With this option, the syllables of a word are joined by a space instead of an underscore.
+) -sd : use the sentence detection module before segmenting words. With this option, vnTokenizer first splits the input text into sentences and then segments each sentence in turn. By default the sentence detection module is not used, and vnTokenizer segments the whole text at once.

These options can be combined to produce the desired result.

Examples:
a) vnTokenizer.sh -i samples/test0.txt -o samples/test0.tok.txt
   Segments the file samples/test0.txt and writes the result to samples/test0.tok.txt.
b) vnTokenizer.sh -i samples/test0.txt -o samples/test0.tok.xml -xo
   Same as a), but the result file samples/test0.tok.xml is in XML format.
c) vnTokenizer.sh -i samples/test0.txt -o samples/test0.tok.txt -sd
   Same as a), but uses the sentence detection module before segmenting words.

Segmenting a directory:

In addition to the options above, one more optional flag is available when segmenting a directory:
+) -e : specifies the extension of the files to be segmented.

Examples:
a) vnTokenizer.sh -i samples/input -o samples/output
   Segments all files samples/input/*.txt and writes the results to the directory samples/output.
b) vnTokenizer.sh -i samples/input -o samples/output -e .xyz
   Segments all files samples/input/*.xyz and writes the results to the directory samples/output.
