
INTERNSHIP REPORT

A Study of Hadoop, MapReduce, and Applied Problems

Supervisor: Từ Minh Phương
Student: Vũ Minh Ngọc

Contents

Part I. General introduction
    1.1. What is Hadoop?
    1.2. What is MapReduce?
Part II. Installing Hadoop
    1. Installing an Ubuntu 10.10 (32-bit) virtual machine on VMware
    1. Installing VMware Tools for Ubuntu
    2. Installing OpenSSH on Ubuntu
    3. Installing Java
    4. Adding the hadoop user to the hadoop group
    5. Configuring SSH
    6. Disabling IPv6
    7. Downloading and installing Hadoop
        a. Download Hadoop 0.20.2 and save it to /usr/local/
        b. Configuration
        c. Formatting the name node
        d. Running Hadoop on a single-node cluster
    8. Running a MapReduce example
    9. Installing and using Hadoop in Eclipse
Part III. Hadoop components
    1. Some terminology
    2. Hadoop daemons
        2.1. NameNode
        2.2. DataNode
        2.3. Secondary NameNode
        2.4. JobTracker
        2.5. TaskTracker
Part IV. Basic MapReduce programming
    1. Overview of a MapReduce program
    2. Data types supported by Hadoop
        2.1. Mapper
        2.2. Reducer
        2.3. Partitioner: redirecting Mapper output
Part V. A brief look at bioinformatics algorithms
    5.1. The BLAST algorithm
    5.2. The Landau-Vishkin algorithm
        5.2.1. Some concepts
        5.2.2. Approximate string matching
        5.2.3. Dynamic programming solution
Part VI. A brief look at BlastReduce
    6.1. Summary
    6.2. Read Mapping
    6.3. The BlastReduce algorithm
        6.3.1. MerReduce: computing identical mers
        6.3.2. SeedReduce: combining consistent mers
        6.3.3. ExtendReduce: extending the seeds

Preface
Dear teachers, after a period of graduation internship, this report presents what I have accomplished during that time. The main topic of the internship was using Hadoop and the MapReduce framework to solve the bioinformatics problem BLAST. In my experience, Hadoop is a new technology that is not easy to grasp, and working out how the BLAST algorithm can be processed in parallel on Hadoop is also fairly difficult. With the help of my supervisor, Từ Minh Phương, and of colleagues at VCCorp, I have nevertheless come to understand part of the problem. Although this report is still rough, it lays the groundwork for the parts that follow. I will try to improve it and complete the topic in my final-year project. Once again, I thank my teachers for their guidance throughout my studies and during this internship.

Part I. General introduction

1.1. What is Hadoop?
Purpose: Businesses want to exploit huge volumes of data to make business decisions. Hadoop helps companies process terabytes, and even petabytes, of complex data relatively efficiently and at lower cost.

Businesses are working to extract valuable information from the large volumes of unstructured data generated by web logs, clickstream tools, and social media products. This is the main factor driving the growing interest in the open source Hadoop technology.

Hadoop is an Apache data-management software project whose core is modeled on Google's MapReduce framework. It is designed to support applications that use large amounts of structured and unstructured data.

Unlike traditional database management systems, Hadoop is designed to work with many kinds of data and data sources. Hadoop's HDFS technology allows a large workload to be divided into smaller data blocks that are replicated and distributed across the hardware of a cluster for faster processing. This technology is widely used by some of the world's largest websites, such as Facebook, eBay, Amazon, Baidu, and Yahoo. Observers emphasize that Yahoo is one of the largest contributors to Hadoop.

1.2. What is MapReduce?
MapReduce is a programming model, first reported in the paper by Jeffrey Dean and Sanjay Ghemawat at the OSDI 2004 conference. MapReduce is only an idea, an abstraction; putting it into practice requires a concrete implementation. Google has an implementation of MapReduce in C++. Apache has Hadoop, another, open source implementation written in Java (at the very least, users work with Hadoop through a Java interface).

A large data set is organized as a collection of very many (key, value) pairs. To process it, the programmer writes two functions, map and reduce. The map function takes one pair (k1, v1) as input and outputs a list of pairs (k2, v2). Note that the input and output keys and values may be of different data types. The map function can therefore be written formally as:

    map(k1, v1) -> list(k2, v2)

MR applies the user-written map function to every (key, value) pair in the input data, running many instances of map in parallel on the machines of the cluster. After this phase we have a large set of pairs of type (k2, v2), called the intermediate (key, value) pairs. MR then groups these pairs by key, so that intermediate pairs with the same k2 fall into the same intermediate group.

In the second phase, MR applies the user-written reduce function to each intermediate group. Formally:

    reduce(k2, list(v2)) -> list(v3)

Here k2 is the common key of the intermediate group, list(v2) is the set of values in the group, and list(v3) is the list of values returned by reduce, of data type v3. Because reduce is applied to many independent intermediate groups, these calls can again run in parallel.

The most basic MR example is counting words (in English text). This is clearly a basic and important task that a search engine must perform. With only a few dozen files it is easy, but remember that we may have millions or even billions of files spread over a cluster of thousands of machines. We program MR by writing two basic functions, shown here in pseudo-code:

    void map(String name, String document):
        // name: document name
        // document: document contents
        for each word w in document:
            EmitIntermediate(w, "1");

    void reduce(String word, Iterator partialCounts):
        // word: a word
        // partialCounts: a list of aggregated partial counts
        int result = 0;
        for each pc in partialCounts:
            result += ParseInt(pc);
        Emit(AsString(result));

With just these two primitives, the programmer has a great deal of flexibility to analyze and process huge data sets. MR has been used for many different tasks, for example distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation, and large-scale graph computation.
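The word-count pseudo-code can be tried out without a cluster. The following is a minimal single-process sketch in plain Java, not Hadoop code: the class name WordCountSketch and its methods are hypothetical, and the grouping step that MR performs between the two phases is imitated with an in-memory sorted map.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the map -> group -> reduce flow (no Hadoop involved).
class WordCountSketch {

    // map(name, document) -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String name, String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : document.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // reduce(word, partialCounts) -> total count for that word
    static int reduce(String word, List<Integer> partialCounts) {
        int result = 0;
        for (int pc : partialCounts) {
            result += pc;
        }
        return result;
    }

    // Applies map to every document, groups the intermediate pairs by key,
    // then applies reduce once per group.
    static Map<String, Integer> run(Map<String, String> documents) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, Integer> kv : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("a.txt", "to be or not to be");
        docs.put("b.txt", "to do");
        System.out.println(run(docs));
    }
}
```

In a real cluster the calls to map and reduce would be spread over many machines; only the two user-written functions carry over from this sketch.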


Part II. Installing Hadoop

1. Installing an Ubuntu 10.10 (32-bit) virtual machine on VMware
Use VMware Workstation 7.0.0 build-203739 (32-bit).
Operating system: Ubuntu Desktop Edition 10.10 (32-bit).
Create the default user with the name hadoop.

1. Installing VMware Tools for Ubuntu

a. Activate the root account
    $ sudo passwd root
    - Enter the password of the hadoop account.
    - Then enter the new password for the root account twice.
b. Install the tools for Ubuntu
    - Log in again as root.
    - Choose the option to install VMware Tools. [Screenshot: VMware Tools installation menu]
    - In the Ubuntu virtual machine, extract the file VMwareTools-8.1.3-203739.tar.gz and run the file vmware-install.pl.
    - Press Enter to accept the default options shown in square brackets.

2. Installing OpenSSH on Ubuntu


$ sudo apt-get install openssh-server openssh-client

3. Installing Java
Hadoop requires Java 1.5.x. However, version 1.6.x is recommended for use with Hadoop. Java is installed as follows:
a. Add the Canonical Partner repository to your apt sources
    $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
b. Update the source list
    $ sudo apt-get update
c. Install sun-java6-jdk
    $ sudo apt-get install sun-java6-jdk
d. Verify the installation
    user@ubuntu:~# java -version
    java version "1.6.0_20"
    Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
    Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

4. Adding the hadoop user to the hadoop group

5. Configuring SSH
Hadoop requires SSH access to manage its nodes, that is, remote machines plus your local machine if you want to run Hadoop on it. For this single-node Hadoop setup, we configure SSH access to localhost for the hadoop user created in the previous section.
a. Log in as the hadoop user.
b. Generate an SSH key with the command below. Enter nothing at the three prompts; just press Enter each time.

    hadoop@hadoop:~$ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
    Created directory '/home/hadoop/.ssh'.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/hadoop/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
    The key fingerprint is:
    19:f5:c2:b2:19:25:83:25:8f:ec:45:f7:4a:c3:59:25 hadoop@hadoop
    The key's randomart image is:
    +--[ RSA 2048]----+
    |  .o= + E..      |
    |  ..= O = .      |
    |   o * B o       |
    |    . . O +      |
    |     . S .       |
    |                 |
    |                 |
    |                 |
    |                 |
    +-----------------+

c. Enable SSH access to your local machine with the new key:
    hadoop@hadoop:~$ cd ~/.ssh
    hadoop@hadoop:~/.ssh$ cat id_rsa.pub >> authorized_keys

d. Test the SSH setup by connecting to your local machine as the hadoop user. This step is also needed to store the fingerprint of your machine in the known_hosts file. If you have any special SSH configuration, such as a non-standard SSH port, you can define it in $HOME/.ssh/config.


    hadoop@hadoop:~/.ssh$ ssh localhost
    The authenticity of host 'localhost (::1)' can't be established.
    RSA key fingerprint is 0a:3d:86:06:28:82:7f:3a:35:0b:83:d5:35:ee:b8:b1.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
    Linux hadoop 2.6.35-22-generic #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC 2010 i686 GNU/Linux
    Ubuntu 10.10

    Welcome to Ubuntu!
     * Documentation:  https://help.ubuntu.com/

    The programs included with the Ubuntu system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.

    Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
    applicable law.

6. Disabling IPv6
One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the networking-related Hadoop configuration options causes Hadoop to bind to the IPv6 addresses of the Ubuntu machine.
a. To disable IPv6 on Ubuntu 10.10, open /etc/sysctl.conf in an editor and add the following lines to the end of the file:
    #disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1

b. Reboot the machine for the change to take effect.
c. To check the setting, use the following command:
    $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A result of 0 means that IPv6 is still enabled; 1 means that it has been disabled.


7. Downloading and installing Hadoop
a. Download Hadoop 0.20.2 and save it to /usr/local/
    $ cd /usr/local
    $ sudo tar xzf hadoop-0.20.2.tar.gz
    $ sudo mv hadoop-0.20.2 hadoop
    $ sudo chown -R hadoop:hadoop hadoop

b. Configuration
i. conf/hadoop-env.sh
Set JAVA_HOME. Change

    # The java implementation to use. Required.
    # export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to:

    # The java implementation to use. Required.
    export JAVA_HOME=/usr/lib/jvm/java-6-sun

ii. conf/core-site.xml

    <!-- In: conf/core-site.xml -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system. A URI whose
      scheme and authority determine the FileSystem implementation. The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class. The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>

iii. conf/mapred-site.xml

    <!-- In: conf/mapred-site.xml -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.
      </description>
    </property>

iv. conf/hdfs-site.xml

    <!-- In: conf/hdfs-site.xml -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      </description>
    </property>

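c. Formatting the name node
The first step in starting up your new Hadoop installation is to format the Hadoop file system, which is implemented on top of your local file system. You need to do this the first time you run Hadoop. Run the following command:
    hadoop@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

[Screenshot: output of the format command]

d. Running Hadoop on a single-node cluster
Use the command:
    $ /usr/local/hadoop/bin/start-all.sh

[Screenshot: output of start-all.sh]

A handy tool for checking whether the Hadoop processes are running is jps:

[Screenshot: jps output]

You can also use netstat to check whether Hadoop is listening on the configured ports:

[Screenshot: netstat output]

e. Stopping Hadoop on a single-node cluster
Use the command:
    $ /usr/local/hadoop/bin/stop-all.sh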

8. Running a MapReduce example
We run the WordCount example that ships with the Hadoop examples. It reads text files and counts how often each word occurs. The input and output files are plain text; each line of the output contains a word and its number of occurrences, separated by a TAB.
a. Download the input data
Download three books from Project Gutenberg:
    - The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
    - The Notebooks of Leonardo Da Vinci
    - Ulysses by James Joyce
Choose the Plain Text UTF-8 version of each file, then copy the files to a temporary directory, here /tmp/gutenberg, and check the result:

[Screenshot: listing of /tmp/gutenberg]

Restart the Hadoop cluster:
    hadoop@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

b. Copy the data into HDFS

    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2010-05-08 17:40 /user/hadoop/gutenberg
    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
    Found 3 items
    -rw-r--r--   3 hadoop supergroup     674566 2011-03-10 11:38 /user/hadoop/gutenberg/pg20417.txt
    -rw-r--r--   3 hadoop supergroup    1573112 2011-03-10 11:38 /user/hadoop/gutenberg/pg4300.txt
    -rw-r--r--   3 hadoop supergroup    1423801 2011-03-10 11:38 /user/hadoop/gutenberg/pg5000.txt
    hadoop@ubuntu:/usr/local/hadoop$

c. Run the MapReduce job
Use the following command:
    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-<version>-examples.jar wordcount gutenberg gutenberg-output
In this command, replace <version> with the version you are using. You can check the name of this *.jar file in the Hadoop installation directory. The command reads all files in the gutenberg directory in HDFS, processes them, and stores the result in gutenberg-output.

[Screenshot: output of the job]

Check that the result was stored successfully:

[Screenshot: listing of gutenberg-output in HDFS]

If you want to change Hadoop settings, for example to increase the number of reduce tasks, you can use the -D option as follows:
    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount -D mapred.reduce.tasks=16 gutenberg gutenberg-output

d. Retrieve the result from HDFS
To inspect the files, you can copy them from HDFS to the local file system. Alternatively, you can use the following command:
    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-r-00000

In this example, the copy can be done as follows:

[Screenshot: copying the result from HDFS to the local file system]

9. Installing and using Hadoop in Eclipse

a. Download and install the plug-in
Download:
    - Eclipse SDK, version 3.5.2
    - The Hadoop plug-in for Eclipse: hadoop-0.20.1-eclipse-plugin.jar
Copy hadoop-0.20.1-eclipse-plugin.jar into the plugins directory of Eclipse.
b. Set up a MapReduce location
Start Eclipse and click the button circled in the figure:

[Screenshot: Eclipse toolbar with the button circled]

Then choose Other => MapReduce => OK.

Right-click in the empty area of the Map/Reduce Locations tab, choose New Hadoop location, and fill in the parameters as shown in the figure:

[Screenshot: New Hadoop location dialog]

Start the Hadoop cluster as described above, and check DFS as shown in the figure:

[Screenshot: DFS view in Eclipse]

Part III. Hadoop components

1. Some terminology
A MapReduce job is a unit of work that the client wants performed. It consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two main kinds: map tasks and reduce tasks.

Two kinds of node control the execution of a job: one jobtracker and a number of tasktrackers. The jobtracker coordinates all jobs on the system by scheduling tasks to run on tasktrackers. Tasktrackers run the tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.

Hadoop divides the input of a MapReduce job into fixed-size pieces called input splits, or simply splits. Hadoop creates one map task per split, and that task runs the user-defined map function on each record in the split.

Having many splits means that the time needed to process one split is small compared with the time needed to process the whole input. If the splits are processed in parallel, load balancing is therefore better when the splits are small, because a fast machine can process proportionally more splits during the job than a slow one. Even when the machines are identical, failed processes or other jobs running at the same time make load balancing desirable, and the quality of the load balancing improves as the splits become finer.

On the other hand, if the splits are too small, the overhead of managing the splits and of creating map tasks begins to dominate the total execution time of the job. For most jobs, a good split size is the size of one HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files) or specified when each file is created.

Hadoop does its best to run each map task on a node where its input data already resides in HDFS. This is called the data locality optimization. It is now clear why the optimal split size equals the block size: it is the largest input size that is guaranteed to be stored on a single node.
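The trade-off between split size and block size can be made concrete with a little arithmetic. The sketch below is plain Java under simplifying assumptions: the class SplitSketch and its methods are hypothetical, splits are assumed to be cut at fixed byte offsets, and Hadoop's real split computation (which has additional size settings) is not reproduced.

```java
// Back-of-the-envelope arithmetic for input splits (not Hadoop's actual split logic).
class SplitSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, the default block size cited in the text

    // Number of splits, and hence map tasks, for a file cut into pieces of at most splitSize bytes.
    static long numSplits(long fileSize, long splitSize) {
        if (fileSize < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and splitSize > 0");
        }
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    // Number of blocks that the byte range [offset, offset + length) overlaps.
    // A value above 1 means part of the split may have to cross the network.
    static long blocksTouched(long offset, long length, long blockSize) {
        if (length <= 0) {
            return 0;
        }
        long first = offset / blockSize;
        long last = (offset + length - 1) / blockSize;
        return last - first + 1;
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        System.out.println("map tasks for 1 GB at 64 MB splits: " + numSplits(oneGb, BLOCK_SIZE));
        System.out.println("blocks touched by a 128 MB split:   "
                + blocksTouched(0, 2 * BLOCK_SIZE, BLOCK_SIZE));
    }
}
```

With splits equal to the block size and aligned to block boundaries, every split touches exactly one block, which is the situation the data locality optimization relies on.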
If a split spanned two blocks, it is unlikely that any single HDFS node would store both of them, so part of the split would have to be transferred over the network to the node running the map task. That is clearly less efficient than running the whole map task on local data.

Map tasks write their output to the local disk, not to HDFS. Why? Map output is intermediate output: it is processed by the reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. Storing it in HDFS, with replication, is therefore unnecessary. If the node running a map task fails before its output has been consumed by a reduce task, Hadoop automatically reruns the map task on another node to recreate the map output.

Running Hadoop means running a set of daemons, or resident programs, on the different servers in your network. Each daemon has a specific role; some exist on only one server, others on many servers. The daemons are:
    - NameNode
    - DataNode
    - Secondary NameNode
    - JobTracker
    - TaskTracker

2. Hadoop daemons

2.1. NameNode
The most important Hadoop daemon is the NameNode. Hadoop uses a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop File System, or HDFS. The NameNode is the master of HDFS and directs the slave DataNode daemons to perform the low-level I/O tasks. The NameNode is the bookkeeper of HDFS: it keeps track of how your files are broken into blocks, which nodes store those blocks, and the overall health of the distributed file system.

The work of the NameNode is memory and I/O intensive. For this reason the server hosting the NameNode usually stores no user data and performs no computation for MapReduce applications, in order to keep the load on the machine low. In other words, the NameNode server does not double as a DataNode or a TaskTracker.

Unfortunately, the importance of the NameNode has a negative side: it is a single point of failure of your Hadoop cluster. For any other daemon, if the host machine fails for software or hardware reasons, the Hadoop cluster can continue to operate smoothly or can be restarted quickly. This is not true of the NameNode.

2.2. DataNode
Each slave machine in your cluster hosts a DataNode daemon, which performs the routine work of the distributed file system: reading and writing HDFS blocks to actual files on the local file system. When you want to read or write an HDFS file, the file is broken into blocks and the NameNode tells your client which DataNode each block resides on. Your client then communicates directly with the DataNode daemons to process the local files corresponding to the blocks. In addition, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.

Figure 2.1 illustrates the roles of the NameNode and the DataNodes. It shows two data files, /user/chuck/data1 and /user/james/data2. The file data1 occupies three blocks, denoted 1, 2, and 3, and the file data2 consists of blocks 4 and 5. The contents of the files are distributed across the DataNodes. In this illustration each block has three replicas. For example, block 1 (belonging to data1) is replicated on three of the DataNodes. This ensures that if one DataNode crashes or becomes unreachable over the network, you can still read the files.

DataNodes report to the NameNode continually. After initialization, each DataNode informs the NameNode of the blocks it currently stores. Once this mapping is complete, the DataNodes keep polling the NameNode to report local changes and to receive instructions to create, move, or delete blocks on the local disk.

2.3. Secondary NameNode
The Secondary NameNode (SNN) is an assistant daemon that monitors the state of the HDFS cluster. Like the NameNode, each cluster has one SNN, and it usually resides on its own machine. No DataNode or TaskTracker daemon runs on the same server. The SNN differs from the NameNode in that it does not receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.

As mentioned above, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help to minimize downtime and data loss. Nevertheless, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.

2.4. JobTracker
The JobTracker daemon is the liaison between your application and Hadoop. Once you submit your code to the cluster, the JobTracker determines the execution plan by deciding which files to process, assigns nodes to the different tasks, and monitors all tasks while they are running. If a task fails, the JobTracker automatically reruns it, possibly on a different node, up to a predefined limit of retries.


There is only one JobTracker per Hadoop cluster. It usually runs on a server acting as a master node of the cluster.

2.5. TaskTracker
As with the storage daemons, the computation daemons follow a master/slave architecture: the JobTracker is the master that oversees the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node. Figure 2.2 illustrates this interaction.

Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns to it. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.

One responsibility of the TaskTracker is to communicate constantly with the JobTracker. If the JobTracker does not receive a heartbeat from a TaskTracker within a specified amount of time, it assumes that the TaskTracker has crashed and resubmits the corresponding tasks to other nodes in the cluster.
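The heartbeat rule described above amounts to simple bookkeeping. The following is a minimal sketch in plain Java, assuming a fixed timeout and caller-supplied timestamps; the class HeartbeatSketch and its methods are hypothetical and are not taken from the Hadoop code base.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of heartbeat tracking: a tracker is presumed crashed once its
// last heartbeat is older than the timeout.
class HeartbeatSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Records that the given tracker reported in at time nowMillis.
    void heartbeat(String tracker, long nowMillis) {
        lastSeen.put(tracker, nowMillis);
    }

    // Returns, in sorted order, the trackers whose last heartbeat is older than the timeout.
    // Their tasks would be the ones to resubmit to other nodes.
    List<String> expired(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        Collections.sort(dead);
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatSketch jobTracker = new HeartbeatSketch(1000);
        jobTracker.heartbeat("tasktracker-1", 0);
        jobTracker.heartbeat("tasktracker-2", 0);
        jobTracker.heartbeat("tasktracker-1", 900); // only tracker 1 keeps reporting
        System.out.println("presumed crashed: " + jobTracker.expired(1500));
    }
}
```

Passing the clock value in as a parameter, rather than reading the system clock inside the class, keeps the sketch deterministic and easy to test.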

Figure 2.2. Interaction between the JobTracker and the TaskTrackers. After a client calls the JobTracker to start a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.


Figure 2.3. Topology of a typical Hadoop cluster. It is a master/slave architecture in which the NameNode and JobTracker are masters and the DataNodes and TaskTrackers are slaves.

This topology has one master node running the NameNode and JobTracker daemons, and a separate node running the SNN in case the master node fails. In small clusters, the SNN can reside on one of the slave nodes. In large clusters, on the other hand, the NameNode and the JobTracker are placed on two separate machines. Each slave machine hosts a DataNode and a TaskTracker, so that tasks run on the same node that stores their data.

We will set up a complete Hadoop cluster of this form by first setting up the master node and the control channels between the nodes. If you already have a Hadoop cluster available, you can skip the part on setting up Secure Shell (SSH) channels between nodes. You also have a couple of options for running Hadoop on a single machine, in standalone or pseudo-distributed mode. These are useful for development. Configuring Hadoop to run in these two modes or on a standard cluster (fully distributed mode) is covered in section 2.3.


Part IV. Basic MapReduce programming

1. Overview of a MapReduce program
As we have seen, a MapReduce program processes data by manipulating (key/value) pairs according to the general form:

    map:    (K1, V1)       -> list(K2, V2)
    reduce: (K2, list(V2)) -> list(K3, V3)

In this part we look in more detail at each stage of a typical MapReduce program. Figure 3.1 shows a high-level diagram of the whole process, and we go on to dissect each part:

[Figure 3.1: high-level diagram of the MapReduce data flow]

2. Data types supported by Hadoop

The MapReduce framework has a defined way of serializing (key/value) pairs so that they can be moved across the network, and only classes that support this kind of serialization can serve as keys and values in the framework. More precisely, classes that implement the Writable interface can be values, and classes that implement the WritableComparable<T> interface can be either keys or values. Note that WritableComparable<T> is a combination of the Writable and java.lang.Comparable<T> interfaces. Keys must be comparable because they are sorted at the reduce stage, whereas values are simply passed through.

Hadoop ships with a number of predefined classes that implement WritableComparable, including wrapper classes for all the basic data types, as listed in Table 3.1.

You can also define your own data type by implementing Writable (or WritableComparable<T>). In Listing 3.2 below, the class represents an edge in a network, such as a flight route between two cities:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class Edge implements WritableComparable<Edge> {
        private String departureNode; // departure node
        private String arrivalNode;   // arrival node

        public String getDepartureNode() {
            return departureNode;
        }

        // Reads the two fields in the same order in which write() emits them.
        @Override
        public void readFields(DataInput in) throws IOException {
            departureNode = in.readUTF();
            arrivalNode = in.readUTF();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(departureNode);
            out.writeUTF(arrivalNode);
        }

        // Orders edges by departure node first, then by arrival node.
        @Override
        public int compareTo(Edge o) {
            return (departureNode.compareTo(o.departureNode) != 0)
                ? departureNode.compareTo(o.departureNode)
                : arrivalNode.compareTo(o.arrivalNode);
        }
    }

The Edge class implements the readFields() and write() methods of the Writable interface. They work with the Java DataInput and DataOutput interfaces to serialize the contents of the class. The class also implements the compareTo() method of the Comparable interface, which returns a negative value, zero, or a positive value. With the data type interfaces defined, we can move on to the first stage of the data flow shown in Figure 3.1: the mapper.
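The serialization contract of Edge can be checked without Hadoop on the classpath, because write() and readFields() only rely on java.io.DataOutput and java.io.DataInput. The sketch below uses a hypothetical stand-in class, EdgeSketch, that copies the three methods but does not implement Hadoop's WritableComparable; the roundTrip helper is likewise an assumption added for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for the Edge class: same write()/readFields()/compareTo() logic, no Hadoop dependency.
class EdgeSketch implements Comparable<EdgeSketch> {
    String departureNode = "";
    String arrivalNode = "";

    EdgeSketch() { }

    EdgeSketch(String departureNode, String arrivalNode) {
        this.departureNode = departureNode;
        this.arrivalNode = arrivalNode;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(departureNode);
        out.writeUTF(arrivalNode);
    }

    // Must read the fields in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        departureNode = in.readUTF();
        arrivalNode = in.readUTF();
    }

    @Override
    public int compareTo(EdgeSketch o) {
        int byDeparture = departureNode.compareTo(o.departureNode);
        return byDeparture != 0 ? byDeparture : arrivalNode.compareTo(o.arrivalNode);
    }

    // Serializes an edge to bytes and reads it back into a fresh object.
    static EdgeSketch roundTrip(EdgeSketch edge) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            edge.write(new DataOutputStream(bytes));
            EdgeSketch copy = new EdgeSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EdgeSketch copy = roundTrip(new EdgeSketch("San Francisco", "Dallas"));
        System.out.println(copy.departureNode + " -> " + copy.arrivalNode);
    }
}
```

The round trip only works because both methods handle the fields in the same order; swapping the two lines in readFields() would silently exchange departure and arrival.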

2.1. Mapper

To serve as a Mapper, a class implements the Mapper interface and extends the MapReduceBase class. MapReduceBase acts as the base class for both mappers and reducers. It includes two methods that effectively act as the constructor and destructor of the class:
    - void configure(JobConf job): in this method you can extract the parameters set either by the XML configuration files or in the main class of your application. It is called before any data processing begins.
    - void close(): as the last action before the map task terminates, this method should wrap up any loose ends, such as database connections and open files.

The Mapper interface is responsible for the data processing step. It uses Java generics in the form Mapper<K1,V1,K2,V2>, where the key classes and value classes implement the WritableComparable and Writable interfaces respectively. Its single method processes one (key/value) pair:

    void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) throws IOException

This method generates a (possibly empty) list of (K2, V2) pairs from one input pair (K1, V1). The OutputCollector receives the output of the mapping process, and the Reporter provides the option to record extra information about the mapper, such as the progress of the task.

Hadoop provides a few useful Mapper implementations. Some of them are listed in Table 3.2.

Table 3.2. Some Mapper implementations predefined by Hadoop
    - IdentityMapper<K,V>: implements Mapper<K,V,K,V> and maps inputs directly to outputs
    - InverseMapper<K,V>: implements Mapper<K,V,V,K> and reverses the (key/value) pair
    - RegexMapper<K>: implements Mapper<K,Text,Text,LongWritable> and generates a (match, 1) pair for every regular expression match
    - TokenCountMapper<K>: implements Mapper<K,Text,Text,LongWritable> and generates a (token, 1) pair for each token when the input value is tokenized
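The shape of the map() method can be illustrated with small stand-in types. The sketch below is plain Java and does not use the Hadoop classes: Collector and SimpleMapper are hypothetical interfaces that mimic OutputCollector and Mapper (without the Reporter argument), and TokenCount and Inverse only imitate the behavior that Table 3.2 describes for TokenCountMapper and InverseMapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class MapperSketch {
    // Hypothetical stand-in for OutputCollector<K, V>: receives each emitted pair.
    interface Collector<K, V> {
        void collect(K key, V value);
    }

    // Hypothetical stand-in for Mapper<K1, V1, K2, V2>, without the Reporter argument.
    interface SimpleMapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, Collector<K2, V2> output);
    }

    // Emits one (token, 1) pair per whitespace-separated token of the input value.
    static class TokenCount<K> implements SimpleMapper<K, String, String, Long> {
        @Override
        public void map(K key, String value, Collector<String, Long> output) {
            StringTokenizer tokens = new StringTokenizer(value);
            while (tokens.hasMoreTokens()) {
                output.collect(tokens.nextToken(), 1L);
            }
        }
    }

    // Emits the input pair with key and value swapped.
    static class Inverse<K, V> implements SimpleMapper<K, V, V, K> {
        @Override
        public void map(K key, V value, Collector<V, K> output) {
            output.collect(value, key);
        }
    }

    // Runs TokenCount on one line and returns the emitted pairs as "token=count" strings.
    static List<String> tokenPairs(String line) {
        List<String> pairs = new ArrayList<>();
        new TokenCount<Long>().map(0L, line, (k, v) -> pairs.add(k + "=" + v));
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(tokenPairs("to be or not to be"));
    }
}
```

Note that a mapper emits pairs through the collector instead of returning a list, which is why one call to map() can produce zero, one, or many output pairs.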

2.2. Reducer
As with any Mapper implementation, a reducer must first extend the MapReduceBase class to allow for configuration and cleanup. In addition, it must implement the Reducer interface, which has the following single method:
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) throws IOException

When a reducer task receives the output of the various mappers, it sorts the incoming data by the key of the (key/value) pairs and groups together all values that share a key. The reduce() function is then called. It generates a (possibly empty) list of (K3, V3) pairs by iterating over the values associated with the given key. The OutputCollector receives the output of the reduce process and writes it to an output file. The Reporter provides the option to record extra information about the reducer as the task progresses. Table 3.3 lists a few basic Reducer implementations provided by Hadoop.

Table 3.3. Some Reducer implementations predefined by Hadoop
- IdentityReducer<K,V>: implements Reducer<K, V, K, V> and maps inputs directly to outputs
- LongSumReducer<K>: implements Reducer<K, LongWritable, K, LongWritable> and computes the sum of all values corresponding to the given key

There is one important step between the map and reduce steps: directing the results of the mappers to the reducers. This is the responsibility of the partitioner.
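The sort, group and reduce sequence described above can be imitated in a few lines of plain Java. This is only a sketch under stated assumptions: SumReduceSketch and run() are hypothetical names, a TreeMap stands in for the framework's sort-and-group step, and the reduce() method copies the behaviour the text attributes to LongSumReducer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java imitation of grouping by key followed by a summing reducer.
final class SumReduceSketch {
    // Mirrors reduce(K2 key, Iterator<V2> values, ...): sums the values of one key.
    static long reduce(String key, Iterator<Long> values) {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Stand-in for the framework: sort by key, group values, call reduce() once per key.
    static Map<String, Long> run(List<Map.Entry<String, Long>> mapOutput) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue().iterator()));
        }
        return result;
    }
}
```

Feeding it the pairs ("be", 1), ("to", 1), ("be", 1) gives one output per distinct key, with the value 2 for "be".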

2.3. Partitioner: redirecting output from the Mapper


With multiple reducers, we need some way to determine which reducer a (key/value) pair output by a mapper should be sent to. The default behavior is to hash the key to determine the reducer. Hadoop enforces this strategy through the HashPartitioner class. This works well in many cases, but not in all. Return to the Edge example of section 3.2.1. Suppose you use the Edge class to analyze flight data and determine the number of passengers departing from each airport. For example, take the following data:
(San Francisco, Los Angeles) Chuck Lam
(San Francisco, Dallas) James Warren
...

If you use HashPartitioner, the two rows can be sent to different reducers. The count for that departure point would then be computed twice, and both results would be wrong. How can we customize the partitioner for this application? In this situation, we want all edges with the same departure point to be sent to the same reducer. This is easily done by hashing the departureNode member of Edge:
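import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class EdgePartitioner implements Partitioner<Edge, Writable> {
    @Override
    public int getPartition(Edge key, Writable value, int numPartitions) {
        // Mask the sign bit: hashCode() may be negative, and a negative
        // remainder is not a valid partition number.
        return (key.getDepartureNode().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void configure(JobConf conf) { }
}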
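One detail is worth checking when hashing a key into a partition number: String.hashCode() may be negative, and Java's % operator keeps the sign of its left operand. The sketch below, a plain-Java stand-in with the hypothetical name DeparturePartitionSketch, clears the sign bit before taking the remainder so the result always lies in [0, numPartitions).

```java
// Hypothetical plain-Java stand-in for a partitioner keyed on the departure node.
final class DeparturePartitionSketch {
    // Clearing the sign bit keeps the result in the range 0 .. numPartitions - 1.
    static int getPartition(String departureNode, int numPartitions) {
        return (departureNode.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because the partition depends only on the departure node, both (San Francisco, Los Angeles) and (San Francisco, Dallas) land on the same reducer, whatever the number of partitions.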


Part V. Overview of bioinformatics algorithms


5.1. The BLAST algorithm
The idea of BLAST rests on the probabilistic observation that aligned sequences (alignments) usually share many highly similar subsequences. These subsequences are extended during the search to increase the similarity. The BLAST algorithm has two parts: a search part and a statistical evaluation of the results found. The search algorithm consists of the following 3 steps:

Step 1: BLAST searches for short, highly similar subsequences of fixed length W (gaps are not allowed) between the query sequence and the sequences in the database. BLAST calls a subsequence of length W a word. Typical values of W are 3 for proteins and 11 for DNA. These subsequences are scored with a substitution matrix, BLOSUM or PAM. A subsequence pair whose score is greater than a threshold value T counts as found, and BLAST calls it a Hit. For example, given the sequences AGTTAH and ACFTAQ and a word length W = 3, BLAST identifies the subsequences TAH and TAQ, whose score under the PAM matrix is 3 + 2 + 3 = 8, and calls them a Hit.

Step 2: BLAST continues by searching for pairs of Hits, based on the Hits found in step 1. BLAST restricts these pairs with a given value d, the distance between the Hits. Pairs of Hits whose distance is greater than d are ignored. The value of d depends on the length W from step 1; for example, for W = 2 the suggested value is d = 16.

Step 3: Finally, BLAST extends the Hit pairs found in both directions and scores them as it goes. The extension ends when the score of a Hit pair can no longer be increased. Note that the original version of BLAST did not allow gaps during extension, while newer versions do. Extended Hit pairs whose score is higher than a threshold value S are called high-scoring pairs (HSPs).
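Step 1 above can be illustrated with a brute-force sketch. Everything in it is an assumption made for illustration: the class name WordHitSketch is hypothetical, and the score (+2 for an identical position, -1 otherwise) is a toy stand-in for a PAM or BLOSUM lookup, so its numbers differ from the PAM score of 8 quoted in the text. Real BLAST implementations also index words rather than comparing every pair.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical brute-force illustration of word hits; not how BLAST is implemented.
final class WordHitSketch {
    // Toy score: +2 per identical position, -1 otherwise (stand-in for PAM/BLOSUM).
    static int score(String a, String b) {
        int total = 0;
        for (int i = 0; i < a.length(); i++) {
            total += a.charAt(i) == b.charAt(i) ? 2 : -1;
        }
        return total;
    }

    // Returns {queryOffset, subjectOffset, score} for every word pair scoring above t.
    static List<int[]> hits(String query, String subject, int w, int t) {
        List<int[]> found = new ArrayList<>();
        for (int i = 0; i + w <= query.length(); i++) {
            for (int j = 0; j + w <= subject.length(); j++) {
                int s = score(query.substring(i, i + w), subject.substring(j, j + w));
                if (s > t) {
                    found.add(new int[] {i, j, s});
                }
            }
        }
        return found;
    }
}
```

With the sequences AGTTAH and ACFTAQ, W = 3 and T = 2, the sketch reports two hits under the toy score: TTA against FTA at offsets (2, 2), and TAH against TAQ at offsets (3, 3), the pair used in the example above.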


For example, with the sequences AGTTAHTQ and ACFTAQAC, the Hit formed by TAH and TAQ is extended as follows:
AGTTAHTQ
xxx||||x
ACFTAQAC

BLAST sorts the HSPs found by decreasing score, prints them, and runs the statistical evaluation on them. In the statistical evaluation, BLAST derives from the score of an HSP a value called the Bit-Score. This value does not depend on the substitution matrix and is used to judge the quality of the alignments: the higher the value, the more similar the aligned sequences. BLAST also computes an expectation value, the E-Score (Expect-Score), which depends on the Bit-Score. The E-Score expresses the probability that an alignment arose by chance: the lower the value, the more likely the alignment reflects a real relationship rather than randomness.

5.2. The Landau-Vishkin algorithm


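5.2.1. Some concepts
Given the strings $T = t_1 \dots t_n$ and $P = p_1 \dots p_m$, with $|T| = n$ and $|P| = m$ ($m \le n$), over an alphabet $\Sigma$, we have the following definitions:

- $\varepsilon$ is the empty string.
- $P$ is a substring of $T$ when $m \le n$ and $p_1 \dots p_m = t_i \dots t_{i+m-1}$ for some $i \ge 1$ with $i + m - 1 \le n$. If $m < n$ we say that $P$ is a proper substring of $T$.
- $P$ is a prefix of $T$ if $m \le n$ and $p_i = t_i$ for $1 \le i \le m$. If $m < n$ we say that $P$ is a proper prefix of $T$.
- $P$ is a suffix of $T$ if $p_1 \dots p_m = t_i \dots t_{i+m-1}$ with $i + m - 1 = n$. If $i > 1$ we say that $P$ is a proper suffix of $T$. We also say that $P$ is the $i$-th suffix of $T$ ($P$ is the suffix of $T$ that starts at position $i$).
- The longest common prefix (LCP) of $T$ and $P$ is the longest string $L = l_1 \dots l_k$ such that $0 \le k \le m$ and $l_1 \dots l_k = p_1 \dots p_k = t_1 \dots t_k$. If $k = 0$ then $L = \varepsilon$. Note that $LCP_{P,T}$ denotes the LCP of $P$ and $T$. If $P$ and $T$ are clear from the context, we simply write $LCP$.
- The longest common extension (LCE) of $T$ and $P$ at position $(i, j)$ is the length of the LCP of the $i$-th suffix of $T$ and the $j$-th suffix of $P$. If $P$ and $T$ are clear from the context, we simply write $LCE(i, j)$.

In this part we call the string $P$ the pattern and the string $T$ the text.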
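The longest common prefix and the longest common extension defined above can be computed by direct character comparison. The sketch below is a naive illustration with a hypothetical class name (LceSketch); it uses 0-based positions, as Java strings do, and takes time proportional to the length of the answer. Faster LCE computations rely on preprocessing, which this sketch does not attempt.

```java
// Hypothetical naive implementation of LCP and LCE by direct comparison.
final class LceSketch {
    // Length of the longest common prefix of x and y.
    static int lcp(String x, String y) {
        int limit = Math.min(x.length(), y.length());
        int length = 0;
        while (length < limit && x.charAt(length) == y.charAt(length)) {
            length++;
        }
        return length;
    }

    // LCE at (i, j): length of the LCP of the suffix of x starting at i
    // and the suffix of y starting at j (0-based positions).
    static int lce(String x, String y, int i, int j) {
        return lcp(x.substring(i), y.substring(j));
    }
}
```

For instance, lcp("banana", "bandana") is 3 ("ban"), and lce("banana", "bandana", 3, 4) is also 3, because both suffixes are "ana".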


5.2.2. Approximate String Matching
Definition 1: Edit distance
The edit distance between two strings $P$ and $T$ is the minimum number of operations needed to transform $P$ into $T$, where the operations are defined as follows:

- Substitution: a character $p_i$ of $P$ is replaced by a character $t_j$ of $T$.
- Insertion: a character $p_i$ of $P$ is inserted at position $j$ of $T$.
- Deletion: a character $p_i$ is deleted from $P$.

A sequence of operations needed to transform $P$ into $T$ is called an edit transcript of $P$ into $T$.

An alignment of $P$ and $T$ is a representation of the operations applied to $P$ and $T$. It is usually written by placing one string above the other and filling in dashes (-) at the positions of $P$ and $T$ where a gap was inserted, so that every character or gap in one string faces exactly one character or gap in the other.

Definition 2: Approximate string matching with $k$ differences
String matching with $k$ differences between a pattern $P$ and a text $T$ is the problem of finding every pair of positions $(i, j)$ in $T$ such that the edit distance between $P$ and $t_i \dots t_j$ is at most $k$.

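5.2.3. Dynamic programming solution
We can find the edit distance $D(i, j)$ between the two strings $p_1 \dots p_i$ and $t_1 \dots t_j$ from three distances: $D(i-1, j)$ between $p_1 \dots p_{i-1}$ and $t_1 \dots t_j$, $D(i, j-1)$ between $p_1 \dots p_i$ and $t_1 \dots t_{j-1}$, and $D(i-1, j-1)$ between $p_1 \dots p_{i-1}$ and $t_1 \dots t_{j-1}$, by solving the recurrence:

$$D(i, j) = \min \{\, D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + d_{i,j} \,\}, \qquad d_{i,j} = \begin{cases} 0 & \text{if } p_i = t_j \\ 1 & \text{otherwise} \end{cases}$$

This recurrence can be evaluated by simple dynamic programming over a table $D$.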
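A direct implementation of the dynamic programming table might look as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, and the boundary conditions are the usual ones, which the text does not spell out. distance() uses D(i, 0) = i and D(0, j) = j (whole-string edit distance), while matchEnds() sets D(0, j) = 0 so that a match may start anywhere in the text, which corresponds to matching with k differences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the edit-distance table and its k-differences variant.
final class EditDistanceSketch {
    // Edit distance between the whole of p and the whole of t.
    static int distance(String p, String t) {
        int m = p.length();
        int n = t.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][0] = i;   // delete i characters
        for (int j = 0; j <= n; j++) d[0][j] = j;   // insert j characters
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int cost = p.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[m][n];
    }

    // 1-based end positions j in t where p matches some t_i..t_j with at most k differences.
    static List<Integer> matchEnds(String p, String t, int k) {
        int m = p.length();
        int[] col = new int[m + 1];
        for (int i = 0; i <= m; i++) col[i] = i;    // column j = 0
        List<Integer> ends = new ArrayList<>();
        for (int j = 1; j <= t.length(); j++) {
            int diag = col[0];                       // D(i-1, j-1)
            col[0] = 0;                              // a match may start at any text position
            for (int i = 1; i <= m; i++) {
                int left = col[i];                   // D(i, j-1), saved before overwriting
                int cost = p.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                col[i] = Math.min(diag + cost, Math.min(left + 1, col[i - 1] + 1));
                diag = left;
            }
            if (col[m] <= k) ends.add(j);
        }
        return ends;
    }
}
```

Both methods take time proportional to the product of the two string lengths, which is the cost that the diagonal-based approach of the next section tries to avoid for small k.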


5.2.4. Basics of the Landau-Vishkin algorithm
Landau and Vishkin presented an algorithm for the problem of approximate string matching with $k$ differences. The algorithm is divided into two phases: a preprocessing phase and an iteration phase. In the preprocessing phase, the pattern and the text are preprocessed so that the LCE can be computed.

In the iteration phase, the algorithm iterates over each diagonal of the dynamic programming table and finds all matches of $P$ with at most $k$ differences.


Part VI. Overview of BlastReduce


6.1. Summary
Next-generation DNA sequencing machines generate sequence data at an unprecedented rate, and traditional single-processor alignment algorithms struggle to keep up with them. BlastReduce is a new parallel read mapping algorithm optimized for aligning sequence data from those machines to reference genomes, for use in a variety of biological analyses, including SNP discovery, genotyping, and personal genomics. It is modeled after the BLAST sequence alignment algorithm, but uses the Hadoop implementation of MapReduce to run in parallel on many compute nodes. To evaluate its performance, BlastReduce was used to map next-generation sequence data to a reference bacterial genome in a range of configurations. The results show that the running time of BlastReduce scales linearly with the number of sequences processed, and that it speeds up as the number of processors increases. In a modest configuration with 24 processors, BlastReduce is 250 times faster than BLAST running on a single core, and reduces the running time from several days to a few minutes at the same level of sensitivity.

6.2. Read Mapping


After DNA sequencing, the newly created reads are usually aligned, or mapped, to a reference genome to find the regions where each read approximately occurs. A read mapping algorithm reports all alignments whose score is within a threshold, usually expressed as the maximum acceptable number of differences between the read and the reference genome (in general, roughly 1%-10% of the read length). The alignment algorithm can allow only mismatches as differences (the k-mismatch problem), or it can also consider gapped alignments, in which characters are inserted or deleted (the k-difference problem). The Smith-Waterman sequence alignment algorithm computes gapped alignments using dynamic programming. It considers all possible alignments of a pair of sequences, in time proportional to the product of their lengths. A variant of Smith-Waterman, called banded alignment, uses essentially the same dynamic programming but restricts the search to alignments with a small number of differences.

For a single pair of sequences, computing a Smith-Waterman alignment is usually fast, but the computation becomes infeasible as the number of sequences grows. Instead, researchers use the seed-and-extend technique to speed up the search for highly similar alignments. The key observation is that highly similar alignments must contain significant exact matches. By the pigeonhole principle, for a 20bp read to align with one difference, there must be at least one exact 10bp match somewhere in the alignment. In general, a full-length alignment of an m bp read with e mismatches must contain at least one exact match of m/(e+1) bp.

Several sequence alignment tools, including the popular BLAST and MUMmer, use this technique to align quickly. In the seed phase, these tools look for substrings shared by the two sequences. For example, BLAST builds a hash table of fixed-length substrings of the reference sequence, called k-mers, to find seeds, and MUMmer builds a suffix tree of the reference sequence to find maximal seeds of variable length. Then, in the extension phase, the tools compute the more expensive inexact banded Smith-Waterman alignment, restricted to the relatively short substrings near the shared seeds. This technique can greatly reduce the time needed to align sequences at a given level of sensitivity. However, as the sensitivity is raised by allowing more differences, the seed length shrinks and the number of seeds that match by chance grows, which increases the total computation time.

The Landau-Vishkin k-difference alignment algorithm is an alternative dynamic programming algorithm that determines whether two sequences align with at most k differences. Unlike the Smith-Waterman algorithm, which considers all possible alignments, the Landau-Vishkin algorithm considers only the most similar alignments, up to a fixed number of differences, by computing how many characters of each sequence can be aligned with i = 0 to k differences. The number of characters that can be aligned using i differences is computed from the result for (i-1) by computing the exact extension that is possible after introducing a mismatch, an insertion, or a deletion at the end of the (i-1) alignment. The algorithm ends when i = k+1, which shows that no k-difference alignment exists for the sequences, or when the end of the sequence is reached. For small k the algorithm is much faster than the full Smith-Waterman algorithm, because only a small number of potential alignments are considered.
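The diagonal-wise idea behind the k-difference algorithm can be sketched as follows. This is a simplified illustration, not the BlastReduce code: the class name KDifferenceSketch is hypothetical, the "exact extension" is computed by naive character comparison instead of a preprocessed LCE structure, and the method only answers whether two whole strings align with at most k differences. For each number of differences e, it records how far each diagonal d (text offset minus pattern offset) can be reached.

```java
import java.util.Arrays;

// Hypothetical sketch of a diagonal-wise k-difference check with naive extension.
final class KDifferenceSketch {
    // Number of equal characters starting at a[i] and b[j].
    static int lce(String a, String b, int i, int j) {
        int len = 0;
        while (i + len < a.length() && j + len < b.length()
                && a.charAt(i + len) == b.charAt(j + len)) {
            len++;
        }
        return len;
    }

    // True if a can be turned into b with at most k substitutions, insertions or deletions.
    static boolean withinK(String a, String b, int k) {
        int m = a.length();
        int n = b.length();
        if (Math.abs(n - m) > k) {
            return false;                        // the length gap alone needs more than k edits
        }
        final int none = Integer.MIN_VALUE / 2;  // marks a diagonal that is not reached yet
        final int off = k + 1;                   // array index of diagonal 0
        int[] prev = new int[2 * k + 3];         // prev[off + d]: furthest row of a on diagonal d
        int[] cur = new int[2 * k + 3];
        Arrays.fill(prev, none);
        prev[off] = lce(a, b, 0, 0);             // 0 differences: slide along diagonal 0
        if (m == n && prev[off] == m) {
            return true;
        }
        for (int e = 1; e <= k; e++) {
            Arrays.fill(cur, none);
            for (int d = Math.max(-e, -m); d <= Math.min(e, n); d++) {
                int row = Math.max(prev[off + d] + 1,          // one more mismatch
                          Math.max(prev[off + d - 1],          // extra character in b
                                   prev[off + d + 1] + 1));    // extra character in a
                row = Math.min(row, Math.min(m, n - d));       // stay inside both strings
                cur[off + d] = row + lce(a, b, row, row + d);  // extend over exact matches
            }
            if (cur[off + (n - m)] == m) {
                return true;                     // both strings are fully consumed
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return false;
    }
}
```

Only 2e + 1 diagonals are touched for e differences, which is why this approach pays off when k is small compared with the sequence lengths.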

6.3. The BlastReduce algorithm


BlastReduce is a parallel read mapping algorithm written in Java with Hadoop. It is modeled on the BLAST algorithm and is optimized for mapping short reads from next-generation sequencing machines to a reference genome. Like BLAST, it is a seed-and-extend algorithm that uses fixed-length words as seeds. Unlike BLAST, BlastReduce uses the Landau-Vishkin algorithm to extend the seeds and quickly find alignments with at most k differences. This extension algorithm is better suited to short reads with a small number of differences (typically k=1 or k=2 for 25-50bp reads). The seed size (s) is computed automatically from the read length and the maximum number of differences (k) given by the user.

The input to the application is a multi-fasta file containing one or more reference sequences. The input files are first converted to compressed Hadoop SequenceFiles, which are suitable for processing with Hadoop. SequenceFiles do not support sequences longer than 65,565 characters, so long sequences are split into chunks. The sequences are stored in the SequenceFile as key-value pairs (id, SeqInfo), where SeqInfo is the tuple (sequence, start_offset, tag) and start_offset is the offset of the chunk within the full sequence. The chunks overlap by s-1 bp so that every seed is represented exactly once, and reference sequences are marked with tag=1. After the conversion, the SequenceFile is copied into HDFS so that the read mapping algorithm can be run.
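Two of the quantities above lend themselves to a short worked example: the seed size and the chunk layout. The sketch below rests on assumptions that the text does not confirm for BlastReduce itself: seedLength() uses the pigeonhole bound m/(e+1) quoted in section 6.2, and chunkStarts() is one straightforward way to lay out chunks that overlap by s-1 characters. ChunkSketch and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers for the seed size and for overlapping chunk offsets.
final class ChunkSketch {
    // Pigeonhole bound: a read with at most k differences contains an exact
    // match of at least readLength / (k + 1) characters (integer division).
    static int seedLength(int readLength, int k) {
        return readLength / (k + 1);
    }

    // Start offsets of chunks of at most chunkSize characters that overlap by s - 1.
    static List<Integer> chunkStarts(int sequenceLength, int chunkSize, int s) {
        int step = chunkSize - (s - 1);
        if (step <= 0) {
            throw new IllegalArgumentException("chunkSize must be larger than s - 1");
        }
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; ; start += step) {
            starts.add(start);
            if (start + chunkSize >= sequenceLength) {
                break;                  // this chunk reaches the end of the sequence
            }
        }
        return starts;
    }
}
```

For a sequence of length 10, chunks of 4 characters and s = 3, the chunks start at offsets 0, 2, 4 and 6. Each chunk holds two of the eight possible 3-mers, so every 3-mer falls into exactly one chunk, which is the property the s-1 overlap is meant to guarantee.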


The read mapping algorithm requires 3 MapReduce cycles, as described in Figure 1. The first two cycles, MerReduce and SeedReduce, find all maximal exact matches of length at least s. The last cycle, ExtendReduce, extends the seeds with the Landau-Vishkin algorithm into alignments with at most k differences.

6.3.1. MerReduce: computing the shared mers
This MapReduce cycle finds the mers of length s that are shared between the reads and the reference sequences. The map function processes all chunks independently, and mers that occur only in the reads or only in the reference sequences are discarded automatically. ExtendReduce needs the flanking sequences to extend the exact seeds into alignments, but HDFS is not efficient for random access. Therefore the flanking sequences (up to <read length> - s + k bp) are included with the mers of the reads and of the reference sequences, so that they are available when needed.

Map: For each mer of the input sequence, the map function outputs (mer, MerPos), where MerPos is the tuple (id, position, tag, left_flank, right_flank). If the sequence is a read (tag = 0), it also outputs the MerPos records of the reverse complement sequence. The map function outputs s(M+N) mers in total, where M is the total length of the reads and N is the total length of the reference sequences. After all map functions have finished, Hadoop internally sorts the key-value pairs and groups all pairs with the same mer into a single list of MerPos records.

Reduce: The reduce function outputs position information about the mers that are shared by at least one reference sequence and one read. It requires two passes over the list of MerPos records of each mer. It first scans the list for the MerPos records that come from the reference sequences. It then scans the list a second time and outputs a key-value pair (read_id, SharedMer) for every mer that occurs both in a read and in a reference sequence. A SharedMer is the tuple (read_position, ref_id, ref_position, read_left_flank, read_right_flank, ref_left_flank, ref_right_flank).

Figure 1. Overview of the BlastReduce algorithm using 3 MapReduce cycles. The temporary files used internally by MapReduce are Shared.
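The map and reduce steps of this cycle can be imitated in memory. The sketch below is a simplification built on assumptions: MerReduceSketch is a hypothetical name, a TreeMap stands in for Hadoop's sort-and-group step, the flanking sequences and the reverse complement of the reads are left out, and each output is written as the string "readId:readPos:refId:refPos" instead of a SharedMer record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory imitation of the MerReduce cycle (no flanks, no reverse complement).
final class MerReduceSketch {
    // Simplified MerPos record: tag 1 marks a reference sequence, tag 0 a read.
    static final class MerPos {
        final String id;
        final int position;
        final int tag;

        MerPos(String id, int position, int tag) {
            this.id = id;
            this.position = position;
            this.tag = tag;
        }
    }

    // Map: emit (mer, MerPos) for every mer of length s; the map argument plays the shuffle.
    static void map(String id, String sequence, int tag, int s,
                    Map<String, List<MerPos>> shuffle) {
        for (int p = 0; p + s <= sequence.length(); p++) {
            shuffle.computeIfAbsent(sequence.substring(p, p + s), mer -> new ArrayList<>())
                   .add(new MerPos(id, p, tag));
        }
    }

    // Reduce: first pass collects reference records, second pass pairs each read with them.
    static List<String> reduce(List<MerPos> records) {
        List<MerPos> references = new ArrayList<>();
        for (MerPos record : records) {
            if (record.tag == 1) references.add(record);
        }
        List<String> shared = new ArrayList<>();
        for (MerPos record : records) {
            if (record.tag != 0) continue;
            for (MerPos ref : references) {
                shared.add(record.id + ":" + record.position + ":" + ref.id + ":" + ref.position);
            }
        }
        return shared;
    }
}
```

With the reference ACGTAC, the read CGTA and s = 3, only the mers CGT and GTA occur in both sequences, so the reduce step reports exactly two shared mers; mers present on one side only produce no output.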

6.3.2. SeedReduce: coalescing consistent mers
This MapReduce cycle reduces the number of seeds by merging consistent shared mers into larger seeds. Two shared mers are consistent if they are offset by 1bp in both the read and the reference sequence. Two consistent mers can be merged safely because they map to the same alignment.

Map: The map function outputs the same (read_id, SharedMer) pairs that it receives as input. After the map function has finished, all SharedMer records of a given read are grouped together internally for the reduce phase.

Reduce: Each list of SharedMer records is first sorted by read position, and consistent mers are coalesced into seeds. The final seeds are the maximal exact matches with a length of at least s bp. The output is the pairs (read_id, SharedSeed), where SharedSeed is the tuple (read_position, seed_length, ref_id, target_position, read_left_flank, read_right_flank, ref_left_flank, ref_right_flank).
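The coalescing step of the reduce function can be sketched as follows. The sketch makes several assumptions for illustration: SeedReduceSketch is a hypothetical name, a shared mer is reduced to the pair {readPos, refPos} for a single reference, flanks are ignored, and a map keyed by the diagonal (refPos - readPos) keeps track of the seed currently being grown, so that mers from different diagonals may interleave in read-position order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of coalescing consistent shared mers into seeds.
final class SeedReduceSketch {
    // Input: shared mers as {readPos, refPos}. Output: seeds as {readPos, length, refPos}.
    static List<int[]> coalesce(List<int[]> mers, int s) {
        List<int[]> sorted = new ArrayList<>(mers);
        sorted.sort(Comparator.comparingInt((int[] mer) -> mer[0])
                              .thenComparingInt(mer -> mer[1]));
        Map<Integer, int[]> open = new HashMap<>();   // diagonal -> seed being grown
        List<int[]> seeds = new ArrayList<>();
        for (int[] mer : sorted) {
            int diagonal = mer[1] - mer[0];
            int[] seed = open.get(diagonal);
            // The next consistent mer starts 1bp after the last mer of the seed.
            if (seed != null && seed[0] + seed[1] - s + 1 == mer[0]) {
                seed[1]++;                            // grow the seed by 1bp
            } else {
                seed = new int[] {mer[0], s, mer[1]};
                open.put(diagonal, seed);
                seeds.add(seed);
            }
        }
        return seeds;
    }
}
```

Given the shared mers (0, 10), (1, 11), (2, 12) and (5, 40) with s = 3, the first three merge into one seed of length 5 starting at read position 0, and the last stays a separate seed of length 3.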

6.3.3. ExtendReduce: extending the seeds
This MapReduce cycle extends the exact alignment seeds into inexact alignments using the Landau-Vishkin k-difference algorithm.

Map: For each SharedSeed, the map function tries to extend the shared seed into an end-to-end alignment with at most k differences. If such an alignment exists, the output is the pair (read_id, AlignmentInfo), where AlignmentInfo is the tuple (ref_id, ref_align_start, ref_align_end, num_differences). After all map functions have finished, Hadoop groups all AlignmentInfo records of the same read for the reduce function.

Reduce: The reduce function filters out duplicate alignments, which arise because several seeds may lie within the same alignment. For each read, it first sorts the AlignmentInfo records by the ref_align_start field, and then outputs only the (read_id, AlignmentInfo) pairs that differ in ref_align_start.

The output of ExtendReduce is a file containing every alignment of every read with at most k differences. This file can be copied from HDFS to a regular file system, or the HDFS file can be processed with the bundled reporting tools.
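The duplicate filter of the reduce function amounts to a sort followed by a single pass. Here is a minimal sketch, assuming a hypothetical class name (ExtendReduceSketch) and alignment records reduced to int arrays of the form {refAlignStart, refAlignEnd, numDifferences} for one read and one reference.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of filtering duplicate alignments of one read.
final class ExtendReduceSketch {
    // Records are {refAlignStart, refAlignEnd, numDifferences}.
    static List<int[]> dedupe(List<int[]> alignments) {
        List<int[]> sorted = new ArrayList<>(alignments);
        sorted.sort(Comparator.comparingInt((int[] a) -> a[0]));   // by ref_align_start
        List<int[]> unique = new ArrayList<>();
        for (int[] alignment : sorted) {
            // Keep a record only if its start differs from the previous kept record.
            if (unique.isEmpty() || unique.get(unique.size() - 1)[0] != alignment[0]) {
                unique.add(alignment);
            }
        }
        return unique;
    }
}
```

If two seeds of the same read both extend to the alignment starting at reference position 100, the filter keeps that alignment once, next to any alignment that starts elsewhere.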

