A PROPOSAL FOR VIETNAMESE SPEECH RECOGNITION ON MOBILE PHONES
(Nguyen Van Khiem, Le Quan Ha, Hoang Tien Long, Nguyen Huu Tinh, Nguyen Ngoc Tham, Do Hong Thy)*
ABSTRACT
Over the past 20 years, speech recognition has remained a major effort toward giving computers intelligence, and this sustained effort has produced applications in telephone management. The work began with recognizing spoken digits from 0 to 9 (digit recognition), followed by isolated word recognition. Since the 1990s the field has moved into large-vocabulary speech recognition, where robustness becomes essential: the system must not break down when it meets a recognition error or a software error, and when an unexpected recognition situation arises it must recover easily and carry on with continuous recognition. The arrival of speech recognition on mobile phones and embedded devices has opened a new line of research into human-computer interaction applications. Most work in this area, however, has so far been limited by proprietary software, or restricted to recognizing sentences with simple, constrained grammars. In this study we give an overview of PocketSphinx, an open-source large-vocabulary continuous speech recognition system for handheld devices. We have embedded a large Vietnamese vocabulary of 7,660 words and obtained an accuracy of 98.13%, that is, a word error rate of 1.87%.
INTRODUCTION
Speech applications on embedded devices and mobile phones usually require continuous, real-time recognition. Many current voice applications, such as navigation control for global positioning systems, music selection on music players, or natural-language applications such as speech-to-speech translation devices [see A. Waibel, A. Badran, A. W. Black, R. Frederking, D. Gates, A. Lavie, L. Levin, K. Lenzo, L. Mayfield Tomokiyo, J. Reichert, T. Schultz, D. Wallace, M. Woszczyna, and J. Zhang 2003], all demand speed, accuracy and flexibility. Deploying such applications on embedded devices is very difficult, and the greatest difficulty is the requirement for continuous speech recognition over medium to large vocabularies. There are also hardware obstacles: the CPU of an embedded device does not support floating point, RAM is scarce, and storage capacity and bandwidth are very limited. For these reasons, earlier work on speech recognition [see H. Franco, J. Zheng, J. Butzberger, F. Cesari, M. Frandsen, J. Arnold, V. R. R. Gadde, A. Stolcke, and V. Abrash 2002], [T. W. Kohler, C. Fugen, S. Stuker, and A. Waibel 2005] was limited to recognizing sentences with simple grammars.
Beyond the hardware limits, we also face obstacles in building the recognition system itself. Building it requires toolkits, but these toolkits are usually licensed at a very high price and come without source code. In addition, operating systems on embedded devices often lack developer features that desktop systems provide.
THE POCKETSPHINX SYSTEM
The SPHINX recognizers are a very good foundation for developing speech recognition, and researchers use them in areas such as dialogue systems and computer-assisted learning systems. Among the CMU SPHINX recognizers, PocketSphinx is the one that has been optimized for speech recognition on embedded devices and mobile phones.
OPTIMIZATION
The hardware of embedded devices and mobile phones differs from that of a PC in many ways, so the following points need attention:
- Memory access is slow.
- Data must be organized to suit the CPU hardware.
- Code that does not suit the system must be changed.
The optimizations below are therefore needed.
A. Memory optimization
+ Memory-mapped file I/O: On embedded devices with very little RAM, the acoustic model data should be marked read-only so that it can be read directly from ROM. On embedded operating systems, ROM is structured as a file system, so it can be accessed directly through memory-mapped file functions such as mmap() on UNIX or MapViewOfFile() on Windows.
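The memory-mapping idea can be tried on a desktop with a few lines of Python. The sketch below is only an illustration: the file layout (a little-endian 32-bit count followed by 32-bit floats) and the function names are assumptions made for this example, not the PocketSphinx model format.

```python
import mmap
import os
import struct
import tempfile


def write_toy_model(path, values):
    """Write a hypothetical model file: a uint32 count, then float32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))              # little-endian count
        f.write(struct.pack("<%df" % len(values), *values))  # little-endian data


def read_mapped_model(path):
    """Map the file read-only and decode values straight from the mapping."""
    with open(path, "rb") as f:
        # ACCESS_READ maps the pages read-only; nothing is copied up front.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (count,) = struct.unpack_from("<I", mm, 0)
            return list(struct.unpack_from("<%df" % count, mm, 4))


def demo():
    """Round-trip three values through a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_toy_model(path, [0.5, -1.25, 3.0])
        return read_mapped_model(path)
    finally:
        os.remove(path)
```

Because the mapping is read-only, the operating system can page the data in on demand and share it between processes, which is the property the text relies on for ROM-resident models.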
+ Byte ordering: PocketSphinx needs a data format different from the SPHINX one so that the data can be mapped into memory. The HMM trainer, SPHINXTRAIN, therefore has to be modified to write files that suit the target system, so that they can be mapped into memory in the correct byte order.
+ Data alignment: Today's CPUs all impose data alignment requirements. For example, a 32-bit data field must be placed at an address that is a multiple of 4 bytes. Because the data fields in the model files have different lengths, padding has to be added after them. As a result, while the current version can read model files from earlier versions, the files it produces are not backward compatible.
+ Triphone-senone mapping: Speed and compactness are the goals of PocketSphinx. Experiments show that the PocketSphinx model data is best stored in a tree structure, which is the best solution for memory efficiency. As a result, memory use dropped and startup became faster.
B. Low-level optimization
+ Using fixed point: The StrongARM processor does not support floating-point operations. Floating-point computation is therefore emulated in software, using routines supplied by the compiler or by runtime libraries. These routines imitate a floating-point unit, but they make heavy numerical work very slow, for example acoustic feature extraction and Gaussian computation. Instead, a fractional number can be represented as a numerator over a denominator, where the denominator is usually a power of two (for best efficiency).
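A minimal sketch of the fraction idea, assuming a denominator of 2^16 (a Q16.16 layout chosen only for illustration; the source does not state which format PocketSphinx uses). Multiplication becomes an integer multiply followed by a shift, and small rounding errors appear as soon as a value is not an exact multiple of 1/65536.

```python
FRAC_BITS = 16            # assumed: the denominator is 2**16
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def to_fixed(x):
    """Convert a float to the nearest fixed-point integer."""
    return int(round(x * ONE))


def to_float(q):
    """Convert a fixed-point integer back to a float (for inspection only)."""
    return q / ONE


def fixed_add(a, b):
    """Addition needs no correction: both operands share the denominator."""
    return a + b


def fixed_mul(a, b):
    """The raw product has denominator 2**32; the shift brings it back to 2**16."""
    return (a * b) >> FRAC_BITS


exact = fixed_mul(to_fixed(1.5), to_fixed(2.25))    # representable exactly
approx = fixed_mul(to_fixed(0.1), to_fixed(0.1))    # carries rounding error
rounding_error = abs(to_float(approx) - 0.01)
```

Only integer multiply, add and shift are used, which are the operations an integer-only CPU executes natively.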
Using fixed point inevitably introduces rounding errors, which occur after every operation. The algorithms chosen must not only reduce the amount of computation and increase speed, but also preserve accuracy. For example, the FFT used is one designed for real-valued input [H. V. Sorensen, D. L. Jones, M. T. Heideman, and C. S. Burrus vol. 35, no. 6, pp. 849-863, 1987]. However, running the FFT in fixed point raised the word error rate considerably, in some cases up to 20%.
+ Optimizing data and control structures: The ARM architecture is heavily optimized for computation on integer and Boolean data. Most data-processing instructions include a "shift count" field, which shifts an operand by a given amount without modifying the original value. ARM is a 32-bit architecture with 16 general-purpose registers. Keeping data in registers matters a great deal for arithmetic, because it is faster than going to memory, which is accessed 32 bits at a time; direct memory access should therefore be avoided where possible. In general, a good optimizing compiler can make effective use of the register file. In PocketSphinx, the list of active senones was originally implemented as a byte array. However, a large byte array has to be loaded into the processor cache, and accessing a large number of individual bytes slows the CPU down. The list was therefore changed to a bit vector, which is scanned in a loop that operates on one 32-bit word at a time.
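The bit-vector scan can be sketched as follows. This is an illustration of the general technique, not PocketSphinx code: the 32-bit word size follows the text, while the function names and the list-of-integers storage are assumptions for this example.

```python
WORD_BITS = 32  # one machine word on a 32-bit ARM core


def make_bitvec(n_senones):
    """Allocate enough 32-bit words to hold one flag per senone."""
    return [0] * ((n_senones + WORD_BITS - 1) // WORD_BITS)


def set_active(bitvec, senone_id):
    """Mark a senone as active by setting its bit."""
    bitvec[senone_id // WORD_BITS] |= 1 << (senone_id % WORD_BITS)


def active_senones(bitvec):
    """Scan one word at a time; an all-zero word is skipped with one test."""
    found = []
    for index, word in enumerate(bitvec):
        if word == 0:
            continue
        base = index * WORD_BITS
        while word:
            lowest = word & -word                  # isolate the lowest set bit
            found.append(base + lowest.bit_length() - 1)
            word ^= lowest                         # clear that bit
    return found


flags = make_bitvec(100)
for senone in (3, 31, 32, 99):
    set_active(flags, senone)
```

A byte array would need 100 loads to test 100 senones; here four words cover the same range, and empty words cost a single comparison.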
[...] acoustic feature computation (MFCC), Gaussian computation (codebook), Gaussian mixture model computation, and HMM evaluation (Viterbi search). The approximate share of time spent in each of these four areas is shown in Table 1.
Table 1. Percentage of computation time
In algorithm optimization, Gaussian mixture model (GMM) computation has received the most attention, and earlier work [A. Chan et al. 2004] gives us a very good framework for approximate GMM computation. In this framework, GMM computation is divided into four layers:
Frame layer: compute all GMMs for an input frame.
GMM layer: compute a single GMM.
Gaussian layer: compute a single Gaussian.
Component layer: compute the components related to the feature vector.
This scheme makes it possible to classify speed-up techniques precisely by the layer they operate on, and to determine how different techniques can be combined. However, the framework was applied mainly to systems that use continuous-density HMMs (CDHMM). Applying its ideas to semi-continuous HMMs (SCHMM) involves a few differences worth noting:
In a semi-continuous acoustic model, a single "codebook" of Gaussian densities is shared among all mixture models. The number of mixture Gaussians is typically 128 to 2048, much larger than the 16-32 used for CDHMM.
SCHMM-based systems usually describe the feature vectors with several independent streams.
The four-layer model still applies, but each layer is structured differently. Because the codebook is shared among all mixture models, the entire codebook has to be computed at every frame. This constraint determines how far GMM-layer techniques can reduce computation. We apply the following techniques at each layer:
Frame layer: We apply frame downsampling [M. Woszczyna 1998]. Although it costs some accuracy, it is the only way to gain speed above the Gaussian layer.
GMM layer: We apply context-independent GMM-based selection [A. Lee, T. Kawahara, and K. Shikano 2001, vol. 1, pp. 69-72], [A. Chan, M. Ravishankar, and A. Rudnicky 2005].
Gaussian layer: We considered several options such as Sub-VQ-based Gaussian Selection [M. Ravishankar, R. Bisiani, and E. Thayer 1997, pp. 151-154], but none of them improved speed much. We therefore decided to use tree-based Gaussian selection.
Component layer: PocketSphinx already has a built-in component-level Gaussian computation [see B. Pellom, R. Sarikaya, and J. H. L. Hansen vol. 8, no. 8, pp. 221-224, July 2001]. Information from the tree is then used to make this computation more efficient.
In the frame layer, we first applied frame downsampling in a simple way, by skipping all codebook and GMM computation on every other frame. We later changed this to recompute the top N Gaussians from the previous frame and use them to compute the senones of the current frame. Result: execution became slightly faster (0.6%) and the word error rate dropped by about 10%.
In the Gaussian layer, we apply a modified version of the BBI algorithm, as described in [B. Pellom, R. Sarikaya, and J. H. L. Hansen vol. 8, no. 8, pp. 221-224, July 2001]. This algorithm places the Gaussians in a kd-tree, which allows a subset of Gaussians to be found quickly in feature space for a given feature vector. For each acoustic feature stream in the codebook we build a separate tree of fixed depth (typically depth 8 or 10) with a preset Gaussian threshold box. Although the trees are built offline, the depth of the tree search can be controlled as a decoding parameter at run time, which keeps the memory required for the trees modest. We also explored limiting the maximum number of Gaussians searched in each leaf node. To make this feasible, we sorted the lists of Gaussians in the leaf nodes.
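The tree-based Gaussian selection described above can be sketched roughly as follows. This is a simplified illustration, not the modified BBI algorithm itself. Several details are assumptions made for the example: the threshold box is taken as r standard deviations around each mean, nodes are split at the median mean, every node keeps its own shortlist so that the search depth can be cut at run time, and shortlists are sorted by the distance from the Gaussian mean to the node region.

```python
import math
import statistics


def threshold_box(mean, var, r):
    """Axis-aligned box of +/- r standard deviations around a diagonal Gaussian."""
    return [(m - r * math.sqrt(v), m + r * math.sqrt(v)) for m, v in zip(mean, var)]


def build_tree(ids, means, boxes, region, depth, max_depth):
    """Build one node; a Gaussian belongs to a node if its box overlaps the region."""
    def dist_to_region(i):
        # Squared distance from the mean to the region (0 when the mean is inside).
        return sum(max(lo - m, 0.0, m - hi) ** 2
                   for m, (lo, hi) in zip(means[i], region))

    node = {"ids": sorted(ids, key=dist_to_region)}
    if depth == max_depth or len(ids) <= 1:
        return node
    dim = depth % len(region)
    split = statistics.median(means[i][dim] for i in ids)
    lo, hi = region[dim]
    left_region = region[:dim] + [(lo, split)] + region[dim + 1:]
    right_region = region[:dim] + [(split, hi)] + region[dim + 1:]
    left_ids = [i for i in ids if boxes[i][dim][0] < split]
    right_ids = [i for i in ids if boxes[i][dim][1] >= split]
    node.update(dim=dim, split=split,
                left=build_tree(left_ids, means, boxes, left_region, depth + 1, max_depth),
                right=build_tree(right_ids, means, boxes, right_region, depth + 1, max_depth))
    return node


def build(means, variances, r=2.0, max_depth=8):
    """Offline step: build the tree over the whole codebook."""
    boxes = [threshold_box(m, v, r) for m, v in zip(means, variances)]
    region = [(-math.inf, math.inf)] * len(means[0])
    return build_tree(list(range(len(means))), means, boxes, region, 0, max_depth)


def shortlist(node, x, max_depth, max_gauss):
    """Run-time step: descend at most max_depth levels, keep at most max_gauss ids."""
    depth = 0
    while "split" in node and depth < max_depth:
        node = node["left"] if x[node["dim"]] < node["split"] else node["right"]
        depth += 1
    return node["ids"][:max_gauss]


def log_density(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def best_gaussian(tree, x, means, variances, max_depth=8, max_gauss=4):
    """Evaluate only the shortlisted Gaussians and return the best-scoring id."""
    candidates = shortlist(tree, x, max_depth, max_gauss)
    return max(candidates, key=lambda i: log_density(x, means[i], variances[i]))
```

With four well-separated Gaussians, a full-depth search evaluates a single candidate, while a shallower search returns a longer shortlist. A real decoder would also need a fallback for feature vectors that land in a region with an empty shortlist, which this sketch does not handle.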
B. Data
Training data is an indispensable part of speech recognition and directly determines the recognition results. It has two parts: text data and audio data. The audio data consists of recordings of the sentences in the text data.
C. Text data
The text data set depends on the purpose of the research and on the speech recognition application being built. It is usually chosen according to the topic of the application.
D. Audio data
The audio data depends on the text data set. It contains one recording for every sentence in the text data set: if the text data set for digit recognition has 200 sentences, the audio data set has 200 audio files. We record the data into files with the .raw extension. A .raw file stores bare audio samples without a header, so the files are simple to produce and handle when recording large amounts of data. A good audio file has no background noise or interference, and the words must be read clearly. The audio data set must be recorded clearly, with each word pronounced distinctly. The speakers who record the training data also play a very important role. The speakers are between 18 and 51 years old, spread evenly across that age range and balanced between male and female voices. A large number of speakers, spread evenly by age and balanced by gender, makes the system richer, more flexible and more adaptable. For example, after training on 1000 speakers, the system adapts easily to the voice of a 1001st speaker and gives accurate recognition results.
[...] microphone.
Figure 1. The Vietnamese speech recognition application on a mobile phone, recognizing the sentence "TH CHA NH DN I U".
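The one-recording-per-sentence rule is easy to check before training. The sketch below is a hypothetical helper, not part of any Sphinx tool: it assumes each sentence has an identifier and that its recording is named after that identifier with the .raw extension.

```python
import os


def missing_recordings(sentence_ids, audio_files, ext=".raw"):
    """Return the sentence ids that have no matching recording."""
    recorded = {os.path.splitext(name)[0]
                for name in audio_files if name.endswith(ext)}
    return [sid for sid in sentence_ids if sid not in recorded]


# Hypothetical ids and file names, for illustration only.
gaps = missing_recordings(
    ["s001", "s002", "s003"],
    ["s001.raw", "s003.raw", "notes.txt"],
)
```

In practice `audio_files` would come from `os.listdir()` on the recording directory, and an empty result means the text and audio sets line up.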
[...] English trainer developed by Carnegie Mellon University. Here our input is Vietnamese data. With the predefined structure, the training process produces the HMM model files for Vietnamese, and these HMM files are then put into PocketSphinx.
[...] discuss in depth in this section. We have also deployed POCKETSPHINX to recognize Vietnamese speech on the widely used Pocket PC platform, under the Windows CE and Linux operating systems.
- Built a Vietnamese lexicon of more than 12 thousand words.
- Built a language model for Vietnamese from more than 20,000 words of data.
- Trained an acoustic model for Vietnamese.
- Embedded Vietnamese into PocketSphinx.
- With a vocabulary of 7,660 Vietnamese words, PocketSphinx reached a recognition accuracy of 98.13% (word error rate 1.87%) on 150 large-vocabulary test sentences.
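The accuracy and word error rate quoted above add up to 100%, which matches the usual definition of word error rate as the word-level edit distance divided by the number of reference words. The sketch below shows that computation. It is a generic illustration and may differ from the scoring tool actually used, and the sample sentences are invented (written without diacritics).

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distance from an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ref_word != hyp_word))) # substitution or match
        prev = cur
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """Word error rate in percent over (reference, hypothesis) pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words


sample = [("mot hai ba bon", "mot hai bon"), ("nam sau", "nam sau")]
wer = corpus_wer(sample)
accuracy = 100.0 - wer
```

Here one word out of six reference words is missed, so the error rate is one sixth and the accuracy is the remainder.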
continuous speech recognition systems. In Proceedings of ICSLP.
[5]. Chan, A., Ravishankar, M. and Rudnicky, A. 2005. On improvements of CI-based GMM selection. In Proceedings of Interspeech.
[6]. Lee, A., Kawahara, T. and Shikano, K. 2001. Gaussian mixture selection using context-independent HMM. In Proceedings of ICASSP, vol. 1, pp. 69-72.
[7]. Waibel, A., Badran, A., Black, A. W., Frederking, R., Gates, D., Lavie, A., Levin, L., Lenzo, K., Tomokiyo, L. M., Reichert, J., Schultz, T., Wallace, D., Woszczyna, M. and Zhang, J. 2003. Speechalator: Two-way speech-to-speech translation in your hand. In Proceedings of NAACL-HLT.
[8]. Acero, A. and Stern, R. M. 1990. Environmental Robustness in Automatic Speech Recognition. Proc. of ICASSP, pp. 849-852.
[9]. Aubert, X. and Dugast, C. 1995. Improved Acoustic-Phonetic Modelling in Philips Dictation System by Handling Liaisons and Multiple Pronunciations. Proc. of EuroSpeech95, vol. 2, pp. 767-770.
[10]. Pellom, B., Sarikaya, R. and Hansen, J. H. L. 2001. Fast likelihood computation techniques in nearest-neighbor based search for continuous speech recognition. IEEE Signal Processing Letters, vol. 8, no. 8, pp. 221-224.
[11]. Bahl, L. R. and Bakis, R. 1989. Large Vocabulary Natural Language Continuous Speech Recognition. Proc. of ICASSP89, pp. 465-467.
[12]. Bahl, L. R., Brown, P. F., de Souza, P. V. and Mercer, R. L. 1986. Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition. Proc. of ICASSP86, pp. 49-52.
[13]. Bahl, L. R., de Souza, P. C., Gopalakrishnan, P. S., Nahamoo, D., Picheny, M. A. and Watson, T. J. 1994. Robust Methods for Using Context-Dependent Features and Models in a Continuous Speech Recognizer. Proc. of ICASSP94, pp. I533-I536.
[14]. Bahl, L. R. and Jelinek, F. 1975. Decoding for Channels with Insertions, Deletions and Substitutions, with Applications to Speech Recognition. IEEE Trans. Information Theory, IT-21, pp. 404-411.
[15]. Kawahara, L. T. and Shikano, K. 2001. Gaussian mixture selection using context-independent HMM. In Proceedings of ICASSP, vol. 1, pp. 69-72.
[16]. Gold, B. and Morgan, N. 2000. Speech and Audio Signal Processing. John Wiley & Sons, Inc., New York.
[17]. Bourlard, H. 1995. Towards Increasing Speech Recognition Error Rates. Proc. of EuroSpeech95, vol. 2, pp. 883-894.
[18]. Lee, K-F., Hon, H-W. and Reddy, R. 1990. An Overview of the SPHINX Speech Recognition System. IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 38, pp. 35-45.
[19]. Levinson, S. E. 1986. Continuously Variable Duration Hidden Markov Models for Automatic Speech Recognition. Computer Speech and Language, 1(1), pp. 29-45.
[20]. Trask, R. L. 1996. A Dictionary of Phonetics and Phonology. Routledge.
[21]. Young, S. J., Oh, Y. H. and Shin, G. C. 1997. Improved Lexicon Modeling for Continuous Speech Recognition. Proc. of ICASSP97, pp. 1827-1830.