

Science and Technology

A PROPOSAL FOR VIETNAMESE SPEECH RECOGNITION ON MOBILE PHONES
(Nguyen Van Khiem, Le Quan Ha, Hoang Tien Long, Nguyen Huu Tinh, Nguyen Ngoc Tham, Do Hong Thy)*

ABSTRACT
For the past 20 years, speech recognition has remained a major effort toward giving computers intelligence, and this sustained effort has produced applications in telephone management. The work began with recognition of the spoken digits 0 to 9 (digit recognition), followed by isolated word recognition. Since the 1990s the field has moved into large-vocabulary speech recognition, where robustness becomes essential: the system must not break down when it meets a recognition error or a software error, and when it encounters an unexpected recognition situation it must recover easily and continue recognizing. The arrival of speech recognition on mobile phones and embedded devices has opened a new line of research on human-computer interaction applications. Most work in this area, however, has so far been limited by proprietary software, or has recognized only sentences with simple, restricted grammars. In this study we give an overview of PocketSphinx, an open-source system for large-vocabulary continuous speech recognition on handheld devices. We have embedded a large Vietnamese vocabulary of 7660 words and obtained an accuracy of 98.13% (word error rate 1.87%).

* Faculty of Information Technology, Industrial University of Ho Chi Minh City


INTRODUCTION
Speech applications on embedded devices and mobile phones usually require continuous, real-time recognition. Many current voice applications, such as navigation control in global positioning systems, music selection on a music player, or natural-language applications such as speech-to-speech translation devices [see also A. Waibel, A. Badran, A. W. Black, R. Frederking, D. Gates, A. Lavie, L. Levin, K. Lenzo, L. Mayfield Tomokiyo, J. Reichert, T. Schultz, D. Wallace, M. Woszczyna, and J. Zhang 2003], all have to be fast, accurate, and flexible.

Deploying such applications on embedded devices is difficult. The greatest difficulty is the requirement for continuous speech recognition with a medium to large vocabulary. There are also hardware obstacles: the CPU of an embedded device does not support floating point, RAM is scarce, and storage and bandwidth are very limited. For these reasons, earlier work on speech recognition [see H. Franco, J. Zheng, J. Butzberger, F. Cesari, M. Frandsen, J. Arnold, V. R. R. Gadde, A. Stolcke, and V. Abrash 2002], [T. W. Kohler, C. Fugen, S. Stuker, and A. Waibel 2005] was limited to recognizing sentences with simple grammatical structure.

Beyond the hardware limits, building the recognition system is itself an obstacle. It requires toolkits, but these toolkits are usually proprietary, very expensive, and supplied without source code. In addition, operating systems on embedded devices often lack the developer features available on desktop computers.

THE POCKETSPHINX SYSTEM

The SPHINX recognizers are a very good platform for developing speech recognition, and researchers use them in areas such as dialogue systems and computer-assisted learning systems. Among the CMU SPHINX recognizers, PocketSphinx is the one optimized for speech recognition on embedded devices and mobile phones.

OPTIMIZATION
The hardware of embedded devices and mobile phones differs from that of a PC in several ways that matter here:
- Memory access is slow.
- Data must be organized to suit the CPU hardware.
- Code that does not suit the target system must be changed.
The following optimizations are therefore needed.

A. Memory optimization

+ Memory-mapped file I/O: On embedded devices with very little RAM, the acoustic model data should be marked read-only so that it can be read directly from ROM. On embedded operating systems the ROM is structured as a file system, so it can be accessed directly with memory-mapped file functions such as mmap() on UNIX or MapViewOfFile() on Windows.
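The read-only mapping described above can be illustrated with a minimal sketch. It uses Python's standard mmap module rather than the C calls named in the text, and the file layout (a little-endian count followed by 32-bit integers) and the function names are assumptions made for the example, not the PocketSphinx model format.

```python
import mmap
import struct

def write_toy_model(path, values):
    """Write a hypothetical model file: a uint32 count, then int32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack("<%di" % len(values), *values))

def map_model(path):
    """Map the file read-only; the OS pages data in on demand instead of copying it."""
    handle = open(path, "rb")
    mapped = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
    return handle, mapped

def read_value(mapped, index):
    """Read one int32 straight out of the mapping, without loading the whole file."""
    (count,) = struct.unpack_from("<I", mapped, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    # The 4-byte header keeps every value on a 4-byte boundary.
    (value,) = struct.unpack_from("<i", mapped, 4 + 4 * index)
    return value
```

Because the mapping is read-only, the same pages can be shared and never need to be written back, which is the property the text relies on when the model lives in ROM.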


+ Byte ordering: PocketSphinx needs a data format different from that of SPHINX so that model files can be mapped into memory. The HMM trainer, SPHINXTRAIN, therefore has to be modified to write its output files in the byte order of the target system, so that the files can be mapped into memory as they are.

+ Data alignment: Modern CPUs expect aligned data access. For example, a 32-bit data field must be placed at an address that is a multiple of 4 bytes. Because the data fields in the model files have different lengths, padding has to be added at the end of each field. As a result, the current version can still read model files from earlier versions, but the files it produces are not backward compatible.

+ Triphone-senone mapping: Speed and compactness are the goals of PocketSphinx. Experiments showed that this mapping is best stored as a tree structure, which gave the best memory efficiency. Memory use dropped and startup became faster.

B. Low-level optimization

+ Fixed-point arithmetic: The StrongARM processor has no floating-point instructions. Floating-point computation is therefore emulated in software, using routines supplied by the compiler or the runtime library. This emulation reproduces the functions of a floating-point unit, but it makes computation-heavy steps such as acoustic feature extraction and Gaussian evaluation very slow. Instead, a fractional number can be represented as an integer numerator over a fixed denominator, usually a power of two for best efficiency.
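The fixed-point representation just described can be sketched as follows. The choice of 15 fractional bits (a denominator of 2^15) and all function names are assumptions for illustration; the paper does not state which format PocketSphinx uses. Python integers do not overflow, so the sketch shows the scaling and rounding only, not the word-size limits of a real 32-bit implementation.

```python
FRAC_BITS = 15           # assumed: value = stored_integer / 2**15
ONE = 1 << FRAC_BITS     # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a float to the nearest fixed-point integer."""
    return int(round(x * ONE))

def to_float(q):
    """Convert a fixed-point integer back to a float (for inspection only)."""
    return q / ONE

def fixed_add(a, b):
    """Addition needs no rescaling: both operands share the same denominator."""
    return a + b

def fixed_mul(a, b):
    """The raw product has 2 * FRAC_BITS fractional bits, so shift back down.

    Adding half of ONE before the shift rounds to nearest instead of truncating.
    """
    return (a * b + (ONE >> 1)) >> FRAC_BITS
```

For example, multiplying the fixed-point forms of 0.5 and 0.25 gives the fixed-point form of 0.125 using only an integer multiply, an add, and a shift, which are the operations an integer-only CPU executes quickly.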

The use of fixed-point arithmetic inevitably involves some rounding error, which occurs after every operation. The algorithms chosen must therefore not only reduce the amount of computation and increase speed but also preserve accuracy. For example, an FFT algorithm specialized for real-valued input is used [H. V. Sorensen, D. L. Jones, M. T. Heideman, and C. S. Burrus vol. 35, no. 6, pp. 849-863, 1987]. Even so, running the FFT in fixed point increased the word error rate noticeably, in some cases by up to 20%.

+ Optimization of data and control structures: The ARM architecture is heavily optimized for computation on integer and Boolean data. Most data-processing instructions include a "shift count" field, which shifts an operand by a given number of bits within the same instruction while leaving the original value unchanged. ARM is a 32-bit architecture with 16 general-purpose registers. It is important to keep data in registers while operating on it, to access memory 32 bits at a time, and to avoid memory access altogether where possible. In general, a good optimizing compiler can make efficient use of the register file.

In PocketSphinx, the list of active senones was originally implemented as a byte array. A large byte array, however, has to be loaded into the processor cache, and accessing that many bytes slows the CPU down. The list was therefore changed to a bit vector, and the loop that scans the bit vector operates on one 32-bit word at a time.
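The bit-vector scan can be sketched in a few lines. The helper names and the list-of-words layout are assumptions for the example; the real implementation is in C and operates on machine words, but the idea of skipping a whole 32-bit word when none of its senones is active is the same.

```python
WORD_BITS = 32

def make_bitvec(n_senones):
    """Allocate enough 32-bit words to hold one flag bit per senone."""
    return [0] * ((n_senones + WORD_BITS - 1) // WORD_BITS)

def set_active(bitvec, senone_id):
    """Mark one senone as active by setting its bit."""
    bitvec[senone_id // WORD_BITS] |= 1 << (senone_id % WORD_BITS)

def active_senones(bitvec):
    """Yield active senone ids, examining one 32-bit word at a time.

    A word equal to zero is skipped in a single test, which is where the
    saving over a byte-per-senone array comes from.
    """
    for w, word in enumerate(bitvec):
        base = w * WORD_BITS
        while word:
            low = word & -word                 # isolate the lowest set bit
            yield base + low.bit_length() - 1  # its position within the word
            word ^= low                        # clear it and continue
```

A vector for 100 senones occupies four words (16 bytes) instead of 100 bytes, and a frame with few active senones is scanned with only a handful of word tests.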

ALGORITHMIC OPTIMIZATION

We found that most of the computation is spent in four areas: acoustic feature computation (MFCC), Gaussian computation (codebook), Gaussian mixture model computation (senone), and HMM evaluation (Viterbi search). The approximate share of time spent in each area is shown in Table 1.

Table 1. Percentage of computation time

Component    Desktop    Embedded
Codebook     27.43%     24.59%
HMM          24.68%     22.11%
MFCC         14.39%     11.51%
Senone        7.67%     11.71%

In algorithmic optimization, Gaussian mixture model (GMM) computation has received the most attention, and earlier work [A. Chan et al. 2004] provides a very good framework for approximate GMM computation. In this framework, GMM evaluation is divided into four layers of computation:

Frame layer: computes all GMMs for an input frame.
GMM layer: computes a single GMM.
Gaussian layer: computes a single Gaussian.
Component layer: computes the components of the feature vector.

This scheme allows speed-up techniques to be classified precisely by the layer on which they operate, and it shows how different techniques can be combined. However, the framework was developed mainly for systems that use continuous-density HMMs (CDHMM). Applying its ideas to semi-continuous HMMs (SCHMM) involves a few differences worth noting. In a semi-continuous acoustic model, a single "codebook" of Gaussian densities is shared among all mixture models. The number of mixture Gaussians is typically 128 to 2048, far more than the 16 to 32 used for CDHMM. SCHMM-based systems also usually represent the feature vector as several independent streams. The four-layer model still applies, but the layers are structured differently. Because the codebook is shared by all mixtures, the entire codebook has to be computed at every frame, which limits how much computation GMM-layer techniques can save.

We apply the following technique at each layer:

Frame layer: We apply frame downsampling [M. Woszczyna 1998]. Although this loses some accuracy, it is the only way to speed up computation at the Gaussian layer.
GMM layer: We apply context-independent GMM-based GMM selection [A. Lee, T. Kawahara, and K. Shikano 2001, vol. 1, pp. 69-72], [A. Chan, M. Ravishankar, and A. Rudnicky 2005].
Gaussian layer: We considered several options, such as Sub-VQ-based Gaussian Selection [M. Ravishankar, R. Bisiani, and E. Thayer 1997, pp. 151-154], but none of them improved speed much. We therefore decided to use tree-based Gaussian selection.
Component layer: PocketSphinx already includes partial Gaussian computation [see B. Pellom, R. Sarikaya, and J. H. L. Hansen vol. 8, no. 8, pp. 221-224, July 2001]. Information from the tree is then used to make this computation more efficient.

In the frame layer, we first applied frame downsampling in the simplest way, by skipping all codebook and GMM computation in every other frame. We later changed this to recompute the top N Gaussians from the previous frame and use them to compute the senones of the current frame. The result was a slightly faster run (0.6%) and a word error rate about 10% lower.

In the Gaussian layer, we apply a modified version of the BBI algorithm, as described in [B. Pellom, R. Sarikaya, and J. H. L. Hansen vol. 8, no. 8, pp. 221-224, July 2001]. The algorithm places the Gaussians in a kd-tree, which makes it possible to find quickly the subset of Gaussians in the feature space that is relevant to a given feature vector. For each acoustic feature stream in the codebook, we build a separate tree of fixed depth (typically 8 or 10) with a preset Gaussian box threshold. Although the trees are built offline, the depth of the search in the tree can be controlled as a decoding parameter at run time. The memory requirements for the trees are modest. We also explored limiting the maximum number of Gaussians searched in each leaf node. To make this practical, we sorted the lists of Gaussians in the leaf nodes.
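The tree-based Gaussian selection described above can be sketched roughly as follows. This is a simplified illustration, not the PocketSphinx code: it assumes each Gaussian is summarized by a mean and a per-dimension box half-width (standing in for the Gaussian box threshold), splits at the median mean while cycling through dimensions, stores a shortlist at every node so the search can stop at any depth, and sorts each shortlist by distance from the mean to the node's region. All names are hypothetical.

```python
def region_distance(mean, lo, hi):
    """Squared distance from a Gaussian mean to the axis-aligned region [lo, hi]."""
    return sum(max(l - m, 0.0, m - h) ** 2 for m, l, h in zip(mean, lo, hi))

def build_tree(means, halfwidths, ids, lo, hi, depth, max_depth):
    """Build a kd-tree node holding every Gaussian whose box overlaps its region."""
    # Closest Gaussians first, so truncating the list keeps the most relevant ones.
    ids = sorted(ids, key=lambda g: (region_distance(means[g], lo, hi), g))
    node = {"ids": ids, "dim": None}
    if depth >= max_depth or len(ids) <= 1:
        return node
    dim = depth % len(lo)
    coords = sorted(means[g][dim] for g in ids)
    plane = coords[len(coords) // 2]          # median split
    # A Gaussian goes to each side that its box [mean - hw, mean + hw] touches.
    left_ids = [g for g in ids if means[g][dim] - halfwidths[g][dim] < plane]
    right_ids = [g for g in ids if means[g][dim] + halfwidths[g][dim] >= plane]
    left_hi = list(hi)
    left_hi[dim] = plane
    right_lo = list(lo)
    right_lo[dim] = plane
    node["dim"], node["plane"] = dim, plane
    node["left"] = build_tree(means, halfwidths, left_ids, lo, left_hi,
                              depth + 1, max_depth)
    node["right"] = build_tree(means, halfwidths, right_ids, right_lo, hi,
                               depth + 1, max_depth)
    return node

def shortlist(tree, x, search_depth, max_gaussians):
    """Descend at most search_depth levels, then return a truncated shortlist."""
    node, depth = tree, 0
    while node["dim"] is not None and depth < search_depth:
        side = "left" if x[node["dim"]] < node["plane"] else "right"
        node = node[side]
        depth += 1
    return node["ids"][:max_gaussians]
```

A tree is built once per feature stream with lo and hi set to minus and plus infinity in every dimension; at decode time only the Gaussians returned by shortlist are evaluated exactly. The two run-time knobs, search_depth and max_gaussians, correspond to the controllable search depth and the per-leaf limit mentioned in the text.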

VIETNAMESE DATA AND TRAINING


A. Lexicon
The lexicon is a dictionary that expresses each word as a sequence of pronunciation units (phonemes). It is an important component of a speech recognition system. We built a Vietnamese lexicon based on standard international phonetic transcription. The lexicon contains more than 12 thousand words and uses 41 phonemes covering both the Southern and Northern regions.
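As a small illustration of the word-to-phoneme mapping a lexicon provides, the sketch below parses a plain-text pronunciation dictionary with one entry per line, the word first and its phonemes after it. That line format, the function names, and any sample entries used with it are assumptions for the example; they are not the paper's lexicon or its 41-phoneme set.

```python
def parse_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phoneme symbols).

    A word that appears on several lines keeps all of its pronunciations,
    which is one simple way to cover regional variants.
    """
    lexicon = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        word, *phones = line.split()
        lexicon.setdefault(word, []).append(phones)
    return lexicon

def phone_inventory(lexicon):
    """Return the sorted set of phoneme symbols used anywhere in the lexicon."""
    return sorted({p for prons in lexicon.values() for pron in prons for p in pron})
```

Checking the size of phone_inventory against the intended phoneme set is a quick sanity test that no entry uses a symbol the acoustic model was not trained on.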

B. Data
Training data is indispensable in speech recognition and directly determines the recognition results. It has two parts: text data and audio data. The audio data consists of audio files recording the sentences in the text data.

C. Text data
The text data set depends on the purpose of the research and on the speech recognition application. It is usually chosen according to the topic of the application.

D. Audio data
The audio data depends on the text data set. It consists of all the audio files recording the sentences in the text data set: if the text data set for digit recognition has 200 sentences, the audio data set has 200 audio files. We recorded the data as files with the .raw extension. A .raw file stores only the audio samples, with no header, which keeps it small and suits the recording of large amounts of data. A standard audio file is free of noise and interference, and its words are read clearly. The audio data must be recorded clearly, with each word pronounced distinctly.

The speakers who record the training data also play a very important role. Our speakers were between 18 and 51 years old, spread evenly across ages and balanced between male and female voices. A large number of speakers, evenly distributed by age and balanced by gender, makes the system richer, more flexible, and more adaptable. For example, after training on 1000 speakers, the system readily adapts to the voice of a 1001st speaker and recognizes it accurately.

E. Noise and interference in audio data
Noise and interference strongly affect both training and recognition. Noise has many sources, such as traffic, construction sites, and people talking, while interference comes mainly from the microphone.

VIETNAMESE TRAINING

The training toolkit we use is SphinxTrain, a training tool developed for English by Carnegie Mellon University; here its input is our Vietnamese data. Following its predefined structure, the training process produces the HMM model files for Vietnamese, and these HMM files were then loaded into PocketSphinx.

Figure 1. The Vietnamese speech recognition application on a mobile phone, recognizing the sentence "TH CHA NH DN I U".

CONCLUSION AND FUTURE WORK

In the future, we will apply this system to a task with a higher-order language model and a larger vocabulary. One candidate for further optimization is the Viterbi search algorithm, which we have not discussed in depth here. We have also deployed POCKETSPHINX to recognize Vietnamese speech on the widely used Pocket PC platform, on the Windows CE operating system, and on Linux.

RESULTS ACHIEVED

- Built a Vietnamese lexicon of more than 12 thousand words.
- Built a language model for Vietnamese from data of more than 20,000 words.
- Trained an acoustic model for Vietnamese.
- Embedded Vietnamese into PocketSphinx.
- With a vocabulary of 7660 Vietnamese words, PocketSphinx reached a recognition accuracy of 98.13% (word error rate 1.87%) on 150 large-vocabulary test sentences.
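The paper does not say how the word error rate was computed. The sketch below assumes the usual definition: the minimum number of word substitutions, deletions, and insertions needed to turn the recognized sentence into the reference, summed over all test sentences and divided by the total number of reference words. Under that assumption, accuracy is 100% minus the word error rate, which matches the reported pair of 98.13% and 1.87%. The function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))      # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1], len(ref)

def word_error_rate(pairs):
    """Pool errors and reference words over (reference, hypothesis) pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words
```

Pooling the counts over the whole test set, rather than averaging per-sentence rates, keeps long sentences from being under-weighted.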

REFERENCES

[1]. Waibel, A., Badran, A., Black, A. W., Frederking, R., Gates, D., Lavie, A., Levin, L., Lenzo, K., Mayfield Tomokiyo, L., Reichert, J., Schultz, T., Wallace, D., Woszczyna, M. and Zhang, J. 2003. Speechalator: Two-way speech-to-speech translation in your hand. In Proceedings of NAACL-HLT.
[2]. Kohler, T. W., Fugen, C., Stuker, S. and Waibel, A. 2005. Rapid porting of ASR-systems to mobile devices. In Proceedings of Interspeech.
[3]. Franco, H., Zheng, J., Butzberger, J., Cesari, F., Frandsen, M., Arnold, J., Gadde, V. R. R., Stolcke, A. and Abrash, V. 2002. Dynaspeak: SRI's scalable speech recognizer for embedded and mobile systems. In Proceedings of HLT.
[4]. Chan, A., Sherwani, J., Ravishankar, M. and Rudnicky, A. 2004. Four-layer categorization scheme of fast GMM computation techniques in large vocabulary continuous speech recognition systems. In Proceedings of ICSLP.
[5]. Chan, A., Ravishankar, M. and Rudnicky, A. 2005. On improvements of CI-based GMM selection. In Proceedings of Interspeech.
[6]. Lee, A., Kawahara, T. and Shikano, K. 2001. Gaussian mixture selection using context-independent HMM. In Proceedings of ICASSP, vol. 1, pp. 69-72.
[7]. Waibel, A., Badran, A., Black, A. W., Frederking, R., Gates, D., Lavie, A., Levin, L., Lenzo, K., Tomokiyo, L. M., Reichert, J., Schultz, T., Wallace, D., Woszczyna, M. and Zhang, J. 2003. Speechalator: Two-way speech-to-speech translation in your hand. In Proceedings of NAACL-HLT.
[8]. Acero, A. and Stern, R. M. 1990. Environmental Robustness in Automatic Speech Recognition. Proc. of ICASSP, pp. 849-852.
[9]. Aubert, X. and Dugast, C. 1995. Improved Acoustic-Phonetic Modelling in Philips Dictation System by Handling Liaisons and Multiple Pronunciations. Proc. of EuroSpeech'95, vol. 2, pp. 767-770.
[10]. Pellom, B., Sarikaya, R. and Hansen, J. H. L. 2001. Fast likelihood computation techniques in nearest-neighbor based search for continuous speech recognition. IEEE Signal Processing Letters, vol. 8, no. 8, pp. 221-224.
[11]. Bahl, L. R. and Bakis, R. 1989. Large Vocabulary Natural Language Continuous Speech Recognition. Proc. of ICASSP'89, pp. 465-467.
[12]. Bahl, L. R., Brown, P. F., de Souza, P. V. and Mercer, R. L. 1986. Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition. Proc. of ICASSP'86, pp. 49-52.
[13]. Bahl, L. R., de Souza, P. C., Gopalakrishnan, P. S., Nahamoo, D., Picheny, M. A. and Watson, T. J. 1994. Robust Methods for Using Context-Dependent Features and Models in a Continuous Speech Recognizer. Proc. of ICASSP'94, pp. I533-I536.
[14]. Bahl, L. R. and Jelinek, F. 1975. Decoding for Channels with Insertions, Deletions and Substitutions, with Applications to Speech Recognition. IEEE Trans. Information Theory, IT-21, pp. 404-411.
[15]. Kawahara, L. T. and Shikano, K. 2001. Gaussian mixture selection using context-independent HMM. In Proceedings of ICASSP, vol. 1, pp. 69-72.
[16]. Gold, B. and Morgan, N. 2000. Speech and Audio Signal Processing. John Wiley & Sons, Inc., New York.
[17]. Bourlard, H. 1995. Towards Increasing Speech Recognition Error Rates. Proc. of EuroSpeech'95, vol. 2, pp. 883-894.
[18]. Lee, K-F., Hon, H-W. and Reddy, R. 1990. An Overview of the SPHINX Speech Recognition System. IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 38, pp. 35-45.
[19]. Levinson, S. E. 1986. Continuously Variable Duration Hidden Markov Models for Automatic Speech Recognition. Computer Speech and Language, 1(1), pp. 29-45.
[20]. Trask, R. L. 1996. A Dictionary of Phonetics and Phonology. Routledge.
[21]. Young, S. J., Oh, Y. H. and Shin, G. C. 1997. Improved Lexicon Modeling for Continuous Speech Recognition. Proc. of ICASSP'97, pp. 1827-1830.
