Supervisor: Từ Minh Phương
Student: Vũ Minh Ngọc
Table of Contents

Part I. General introduction
    1.1. What is Hadoop?
    1.2. What is MapReduce?
Part II. Installing Hadoop
    1. Install an Ubuntu 10.10 (32-bit) virtual machine on VMware
    1. Install VMware Tools for Ubuntu
    2. Install OpenSSH for Ubuntu
    3. Install Java
    4. Add the user hadoop to the group hadoop
    5. Configure SSH
    6. Disable IPv6
    7. Download and install Hadoop
        a. Download Hadoop 0.20.2 and save it to /usr/local/
        b. Configuration
        c. Format the name node
        d. Run Hadoop on a single-node cluster
    8. Run a MapReduce example
    9. Install and use Hadoop in Eclipse
Part III. Components of Hadoop
    1. Some terminology
    2. The Hadoop daemons
        2.1. NameNode
        2.2. DataNode
        2.3. Secondary NameNode
        2.4. JobTracker
        2.5. TaskTracker
Part IV. Basic MapReduce programming
    1. Overview of a MapReduce program
    2. Data types supported by Hadoop
        2.1. Mapper
        2.2. Reducer
Part V. Overview of bioinformatics algorithms
    5.1. The BLAST algorithm
    5.2. The Landau-Vishkin algorithm
        5.2.1. Some concepts
        5.2.2. Approximate string matching
        5.2.3. Dynamic programming solution
Part VI. Overview of BlastReduce
    6.1. Summary
    6.2. Read mapping
    6.3. The BlastReduce algorithm
        6.3.1. MerReduce: computing shared mers
        6.3.2. SeedReduce: merging consistent mers
        6.3.3. ExtendReduce: extending the seeds
Preface
Dear teachers, after my graduation internship, this report presents what I have accomplished during that time. The main topic of the internship was "Using Hadoop and the MapReduce framework to solve the bioinformatics problem BLAST". In my experience, Hadoop is a new technology that is not easy to grasp, and working out how the BLAST algorithm can be processed in parallel on Hadoop is also quite difficult. With the help of my supervisor, Từ Minh Phương, and of colleagues at VCCorp, I have come to understand part of the problem. Although this report is still sketchy, it lays the groundwork for the next stages. I will try to improve it and complete the topic in my final thesis. Once again, I thank my teachers for their direction and guidance throughout my studies and during this internship.
1.2. What is MapReduce?
MapReduce is a programming model, first described in the paper by Jeffrey Dean and Sanjay Ghemawat at the OSDI 2004 conference. MapReduce is only an idea, an abstraction; realizing it requires a concrete implementation. Google has an implementation of MapReduce in C++. Apache has Hadoop, another, open-source implementation written in Java (at the very least, users work with Hadoop through a Java interface).

A large data set is organized as a collection of very many (key, value) pairs. To process it, the programmer writes two functions, map and reduce. The map function takes one pair (k1, v1) as input and outputs a list of pairs (k2, v2). Note that the input and output keys and values may be of different data types. Formally, map can be written as:

    map(k1, v1) -> list(k2, v2)

MR applies the user-written map function to every (key, value) pair in the input data, running many instances of map in parallel on the machines of the cluster. After this phase we have a large collection of pairs of type (k2, v2), called intermediate (key, value) pairs. MR then groups these pairs by key, so that all intermediate pairs with the same k2 fall into the same intermediate group.
In the second phase, MR applies the user-written reduce function to each intermediate group. Formally:

    reduce(k2, list(v2)) -> list(v3)

Here k2 is the key shared by the intermediate group, list(v2) is the set of values in the group, and list(v3) is the list of values, of type v3, returned by reduce. Because reduce is applied to many mutually independent intermediate groups, these calls can again run in parallel.

The most basic MR example is counting words (in English text). This is clearly a basic and important task for a search engine. With only a few dozen files it is easy, but bear in mind that there may be millions or even billions of files spread over a cluster of thousands of machines. We program MR by writing two basic functions, given here in pseudo-code:
    void map(String name, String document):
        // name: document name
        // document: document contents
        for each word w in document:
            EmitIntermediate(w, "1");

    void reduce(String word, Iterator partialCounts):
        // word: a word
        // partialCounts: a list of aggregated partial counts
        int result = 0;
        for each pc in partialCounts:
            result += ParseInt(pc);
        Emit(AsString(result));
With just these two primitives, the programmer has a great deal of flexibility for analysing and processing huge data sets. MR has been used for many different tasks, for example distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation, and large-scale graph computation.
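The word-count pseudo-code can be tried out without a cluster. The following is a minimal sketch in plain Java that imitates the three steps (map, grouping by key, reduce) in memory. The class name WordCountSketch and the helper run() are assumptions made for this illustration; they are not part of Hadoop, and no parallelism is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the map, group-by-key, and reduce steps (no Hadoop involved). */
class WordCountSketch {

    /** map(name, document) -> list of (word, "1"), as in the pseudo-code. */
    static List<String[]> map(String name, String document) {
        List<String[]> pairs = new ArrayList<>();
        for (String w : document.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                pairs.add(new String[] {w, "1"});
            }
        }
        return pairs;
    }

    /** reduce(word, partialCounts) -> total count for that word. */
    static int reduce(String word, List<String> partialCounts) {
        int result = 0;
        for (String pc : partialCounts) {
            result += Integer.parseInt(pc);
        }
        return result;
    }

    /** Runs map on every document, groups the intermediate pairs by key, then reduces each group. */
    static Map<String, Integer> run(Map<String, String> documents) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String[] pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "to be or not to be");
        docs.put("doc2", "to do");
        System.out.println(run(docs)); // {be=2, do=1, not=1, or=1, to=3}
    }
}
```

In a real MR run, the calls to map and the calls to reduce would be spread over many machines; only the grouping step requires data to move between them.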
3. Install Java
Hadoop requires Java 1.5.x; however, version 1.6.x is recommended for use with Hadoop. The steps below install Java:
a. Add the Canonical Partner repository to your apt sources
    $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
b. Update the source list
    $ sudo apt-get update
c. Install sun-java6-jdk
    $ sudo apt-get install sun-java6-jdk
d. Verify the installation
    user@ubuntu:~# java -version
    java version "1.6.0_20"
    Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
    Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
5. Configure SSH
Hadoop requires SSH access to manage its nodes, that is, remote machines plus your local machine if you want Hadoop to run on it. For a single-node Hadoop setup, we configure SSH access to localhost for the user hadoop created in the previous section.
a. Log in with the hadoop account
b. Run the following command (enter nothing at the three prompts, just press Enter)
    hadoop@hadoop:~$ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
    Created directory '/home/hadoop/.ssh'.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/hadoop/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
    The key fingerprint is:
    19:f5:c2:b2:19:25:83:25:8f:ec:45:f7:4a:c3:59:25 hadoop@hadoop
    The key's randomart image is:
    +--[ RSA 2048]----+
    | .o= + E.. |
    | ..= O = . |
    | o * B o |
    | . . O + |
    | . S . |
    | |
    | |
    | |
    | |
    +-----------------+
c. Allow SSH access to your local machine with the new key:
    hadoop@hadoop:~$ cd ~/.ssh
    hadoop@hadoop:~/.ssh$ cat id_rsa.pub >> authorized_keys
d. Test the SSH setup by connecting to your local machine as the user hadoop. This step is also needed to store the fingerprint of your machine in the known_hosts file. If you have any special SSH configuration, such as a non-standard SSH port, you can define it in $HOME/.ssh/config
    hadoop@hadoop:~/.ssh$ ssh localhost
    The authenticity of host 'localhost (::1)' can't be established.
    RSA key fingerprint is 0a:3d:86:06:28:82:7f:3a:35:0b:83:d5:35:ee:b8:b1.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
    Linux hadoop 2.6.35-22-generic #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC 2010 i686 GNU/Linux
    Ubuntu 10.10
    Welcome to Ubuntu!
     * Documentation: https://help.ubuntu.com/
    The programs included with the Ubuntu system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law.
6. Disable IPv6
One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the network-related Hadoop configuration options causes Hadoop to bind to the IPv6 addresses of the Ubuntu machine.
a. To disable IPv6 on Ubuntu 10.10, open /etc/sysctl.conf in an editor and add the following lines at the end of the file:
    #disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
b. Reboot the machine for the change to take effect.
c. To check the setting, use the following commands (a value of 1 means IPv6 is disabled):
    $ cd /
    $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
7. Download and install Hadoop
a. Download Hadoop 0.20.2 and save it to the directory /usr/local/
    $ cd /usr/local
    $ sudo tar xzf hadoop-0.20.2.tar.gz
    $ sudo mv hadoop-0.20.2 hadoop
    $ sudo chown -R hadoop:hadoop hadoop
b. Configuration
i. conf/hadoop-env.sh
Set JAVA_HOME so that the entry reads:
    # The java implementation to use. Required.
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
ii. conf/core-site.xml
    <!-- In: conf/core-site.xml -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system. A URI whose
      scheme and authority determine the FileSystem implementation. The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class. The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>
iii. conf/mapred-site.xml
    <!-- In: conf/mapred-site.xml -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.
      </description>
    </property>
iv. conf/hdfs-site.xml
    <!-- In: conf/hdfs-site.xml -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      </description>
    </property>
c. Format the name node
The first step in starting your newly installed Hadoop is to format the Hadoop file system, which is implemented on top of your local file system. You need to do this the first time you run Hadoop. Run the following command:
    hadoop@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
Result:
A handy tool for checking whether the Hadoop processes are running is jps:
You can also use netstat to check whether Hadoop is listening on the configured ports:
e.
8. Run a MapReduce example
We run the WordCount example that ships with Hadoop's examples. It reads text files and counts how often each word occurs. Input and output are plain text; each line of the output file contains a word and its count, separated by a TAB.
a. Download the input data
Download three books from Project Gutenberg:
    The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
    The Notebooks of Leonardo Da Vinci
    Ulysses by James Joyce
Choose the Plain Text UTF-8 version of each file, copy the files into the temporary directory /tmp/gutenberg, and check them as follows:
c. Run the MapReduce job
Use the following command:
    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-<version>-examples.jar wordcount gutenberg gutenberg-output
In this command, replace <version> with the version you are using; you can check the Hadoop installation directory for the matching *.jar file. The command reads all files in the HDFS directory gutenberg, processes them, and stores the result in gutenberg-output. The output looks like this:
If you want to change Hadoop settings, such as increasing the number of reduce tasks, you can use the "-D" option:
    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount -D mapred.reduce.tasks=16 gutenberg gutenberg-output
d. Retrieve the result from HDFS
To inspect the files, you can copy them from HDFS to the local file system. Alternatively, you can use the following command:
    hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-r-00000
Right-click the empty area of the Map/Reduce Locations tab, choose New Hadoop location, and fill in the parameters as shown in the figure below:
1. Some terminology
A MapReduce job is a unit of work that the client wants performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two kinds: map tasks and reduce tasks.

Two kinds of node control job execution: one jobtracker and a number of tasktrackers. The jobtracker coordinates all jobs on the system by scheduling tasks to run on the tasktrackers. The tasktrackers run the tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.

Hadoop divides the input of a MapReduce job into fixed-size pieces called input splits, or simply splits. Hadoop creates one map task for each split, and that task runs the user-defined map function for each record in the split.

Having many splits means that the time needed to process one split is small compared with the time needed to process the whole input. If the splits are processed in parallel, load balancing is better when the splits are small, because a fast machine can then process proportionally more splits over the course of the job than a slow one. Even when the machines are identical, failed processes or other jobs running at the same time make load balancing desirable, and the quality of the load balancing improves as the splits become finer-grained.

On the other hand, if the splits are too small, the overhead of managing the splits and of creating map tasks starts to dominate the total job execution time. For most jobs, the best split size is usually the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files) or specified when each file is created.

Hadoop does its best to run each map task on a node where its input data resides in HDFS. This is called the data locality optimization. It should now be clear why the optimal split size equals the block size: it is the largest input size that is guaranteed to be stored on a single node.
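To make the split-size trade-off concrete, the sketch below counts how many splits, and therefore how many map tasks, a file produces for a given split size. It is a simplified illustration: the class and method names are hypothetical, the 1 GB file size is an arbitrary example, and the real split computation in Hadoop depends on the input format and has details that are ignored here.

```java
/** Counts the input splits (and hence map tasks) for a file of a given size. */
class SplitCountSketch {

    /** Number of splits = ceil(fileBytes / splitBytes); an empty file yields zero splits here. */
    static long splitCount(long fileBytes, long splitBytes) {
        if (fileBytes < 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("fileBytes must be >= 0 and splitBytes > 0");
        }
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long file = 1024L * mb; // hypothetical 1 GB input file

        // One split per 64 MB block: 16 map tasks.
        System.out.println(splitCount(file, 64 * mb)); // 16

        // 1 MB splits: 1024 map tasks, so task-creation overhead weighs far more.
        System.out.println(splitCount(file, 1 * mb));  // 1024
    }
}
```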
If a split spanned two blocks, it would be unlikely that any single HDFS node stored both of them, so part of the split would have to be transferred over the network to the node running the map task, which is clearly less efficient than running the whole map task on local data.

Map tasks write their output to the local disk, not to HDFS. Why? Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. Storing it in HDFS, with replication, is therefore unnecessary. If the node running a map task fails before its map output has been consumed by a reduce task, Hadoop automatically reruns the map task on another node to recreate the map output.

Running Hadoop means running a set of daemons, or resident programs, on different servers in your network. Each daemon has a specific role; some exist on only one server, others on several. The daemons are:
    NameNode
    DataNode
    Secondary NameNode
    JobTracker
    TaskTracker
2. The Hadoop daemons
2.1. NameNode
The NameNode is the most important of the Hadoop daemons. Hadoop uses a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop File System, or HDFS. The NameNode is the master of HDFS and directs the slave DataNode daemons, which perform the low-level I/O tasks. The NameNode is the bookkeeper of HDFS: it keeps track of how your files are broken into blocks, which nodes store those blocks, and the overall health of the distributed file system.

The work of the NameNode is memory- and I/O-intensive. For this reason, the server hosting the NameNode usually stores no user data and performs no computation for MapReduce applications, to keep the load on the machine low. In other words, the NameNode server does not double as a DataNode or a TaskTracker.

Unfortunately, the importance of the NameNode has a negative side: it is a single point of failure of your Hadoop cluster. For any other daemon, if its host node fails for software or hardware reasons, the Hadoop cluster can continue to run smoothly, or you can restart the daemon quickly. This is not true for the NameNode.

2.2. DataNode
Every slave machine in your cluster hosts a DataNode daemon, which does the actual work of the distributed file system: reading and writing HDFS blocks to real files on the local file system. When you want to read or write an HDFS file, the file is broken into blocks and the NameNode tells your client which DataNode each block resides on. Your client then communicates directly with the DataNode daemons to process the local files corresponding to the blocks. In addition, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.

Figure 2.1 illustrates the roles of the NameNode and the DataNodes. It shows two data files, one at /user/chuck/data1 and the other at /user/james/data2. The file data1 takes up three blocks, denoted 1, 2, and 3, and the file data2 consists of blocks 4 and 5. The contents of the files are distributed among the DataNodes. In this illustration each block has three replicas; for example, block 1 (used by data1) is replicated on three of the DataNodes. This ensures that if one DataNode crashes or becomes unreachable over the network, you can still read the files.

The DataNodes report to the NameNode constantly. Upon initialization, each DataNode informs the NameNode of the blocks it currently stores. Once this mapping is complete, the DataNodes keep polling the NameNode to report local changes and to receive instructions to create, move, or delete blocks on the local disk.

2.3. Secondary NameNode
The Secondary NameNode (SNN) is an assistant daemon that monitors the state of the HDFS cluster. Like the NameNode, each cluster has one SNN, and it usually resides on its own machine; no DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that it does not receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.

As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help to minimize downtime and data loss. Nevertheless, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.

2.4. JobTracker
The JobTracker daemon is the liaison between your application and Hadoop. Once you submit your code to the cluster, the JobTracker determines the execution plan by deciding which files to process, assigning nodes to the different tasks, and monitoring all tasks while they run. If a task fails, the JobTracker automatically relaunches it, possibly on a different node, up to a predefined limit on the number of retries.

2.5. TaskTracker
As with the storage daemons, the computing daemons follow a master/slave architecture: the JobTracker is the master that oversees the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node. Figure 2.2 illustrates this interaction.

Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns to it. Although there is a single TaskTracker per slave node, each TaskTracker can spawn several JVMs to handle many map or reduce tasks in parallel.

One of the responsibilities of a TaskTracker is to stay in constant contact with the JobTracker. If the JobTracker does not receive a heartbeat from a TaskTracker within a specified amount of time, it assumes that the TaskTracker has crashed and resubmits the corresponding tasks to other nodes in the cluster.
Figure 2.2. Interaction between the JobTracker and the TaskTrackers. After a client calls the JobTracker to start a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
Figure 2.3. Topology of a typical Hadoop cluster. It is a master/slave architecture in which the NameNode and JobTracker are masters and the DataNodes and TaskTrackers are slaves.

This topology has one master node running the NameNode and JobTracker daemons and a separate node running the SNN in case the master node fails. For small clusters, the SNN can reside on one of the slave nodes. For large clusters, on the other hand, the NameNode and the JobTracker are placed on two separate machines. Each slave machine hosts one DataNode and one TaskTracker, so that tasks run on the same node that stores their data.

We will set up a complete Hadoop cluster of this form by first setting up the master node and the control channel between the nodes. If you already have a Hadoop cluster, you can skip the part on setting up the Secure Shell (SSH) channel between the nodes. You also have a couple of other options for running Hadoop: on a single machine, or in pseudo-distributed mode. These are useful for development. Configuring Hadoop to run on two nodes or on a standard cluster (fully distributed mode) is covered in section 2.3.
In this part we look in more detail at each stage of a typical MapReduce program. Figure 3.1 shows a high-level diagram of the whole process, and we then dissect each part:
You can also define a custom data type by implementing Writable (or WritableComparable<T>). Example 3.2 below shows a class that represents edges in a network, such as flight routes between two cities:
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class Edge implements WritableComparable<Edge> {
        private String departureNode; // departure node
        private String arrivalNode;   // arrival node

        public String getDepartureNode() {
            return departureNode;
        }

        // Reads the two fields in the same order in which write() stores them.
        @Override
        public void readFields(DataInput in) throws IOException {
            departureNode = in.readUTF();
            arrivalNode = in.readUTF();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(departureNode);
            out.writeUTF(arrivalNode);
        }

        // Orders edges by departure node first, then by arrival node.
        @Override
        public int compareTo(Edge o) {
            return (departureNode.compareTo(o.departureNode) != 0)
                    ? departureNode.compareTo(o.departureNode)
                    : arrivalNode.compareTo(o.arrivalNode);
        }
    }
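The Edge class implements the readFields() and write() methods of the Writable interface. They work with the Java DataInput and DataOutput classes to serialize the contents of the class. It also implements the compareTo() method of the Comparable interface, which returns a negative value, 0, or a positive value depending on whether this Edge orders before, the same as, or after the given Edge. With the data type interfaces defined, we can move on to the first stage of the data flow shown in Figure 3.1: the mapper.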
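The serialization contract can be checked without Hadoop on the classpath, because write() and readFields() only rely on java.io.DataOutput and java.io.DataInput. The sketch below uses a stand-in class, EdgeSketch (a hypothetical name that deliberately does not implement the Hadoop interfaces), writes it to a byte array, and reads it back. The airport codes are made-up sample values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Stand-in for Edge: same write()/readFields() shape, but no dependency on Hadoop. */
class EdgeSketch implements Comparable<EdgeSketch> {
    String departureNode;
    String arrivalNode;

    EdgeSketch() { }

    EdgeSketch(String departureNode, String arrivalNode) {
        this.departureNode = departureNode;
        this.arrivalNode = arrivalNode;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(departureNode);
        out.writeUTF(arrivalNode);
    }

    void readFields(DataInput in) throws IOException {
        departureNode = in.readUTF();
        arrivalNode = in.readUTF();
    }

    @Override
    public int compareTo(EdgeSketch o) {
        int byDeparture = departureNode.compareTo(o.departureNode);
        return byDeparture != 0 ? byDeparture : arrivalNode.compareTo(o.arrivalNode);
    }

    /** Serializes an edge to bytes and deserializes it into a fresh object. */
    static EdgeSketch roundTrip(EdgeSketch e) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            e.write(new DataOutputStream(bytes));
            EdgeSketch copy = new EdgeSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        EdgeSketch sfoLax = new EdgeSketch("SFO", "LAX");
        EdgeSketch copy = roundTrip(sfoLax);
        System.out.println(copy.departureNode + " -> " + copy.arrivalNode);     // SFO -> LAX
        System.out.println(sfoLax.compareTo(copy));                             // 0
        System.out.println(sfoLax.compareTo(new EdgeSketch("SFO", "DFW")) > 0); // true
    }
}
```

The fields must be read in exactly the order in which they were written; swapping the two readUTF() calls would silently exchange departure and arrival.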
2.1. Mapper
To serve as a Mapper, a class implements the Mapper interface and extends the MapReduceBase class. MapReduceBase acts as the base class for both mappers and reducers. It provides two methods that effectively act as the constructor and destructor of the class:
    void configure(JobConf job): in this method you can read the parameters set either in the XML configuration files or in the main class of your application. It is called before any data is processed.
    void close(): the last action before the map task ends; this method should wrap up any loose ends, such as database connections and open files.

The Mapper interface is responsible for the data processing step. It uses Java generics in the form Mapper<K1,V1,K2,V2>, where the key classes implement WritableComparable and the value classes implement Writable. Its single method processes one (key, value) pair:

    void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) throws IOException

This method produces a (possibly empty) list of (K2, V2) pairs for a given (K1, V1) input pair. The OutputCollector receives the output of the mapping process, and the Reporter offers a way to record extra information about the mapper as the task progresses.

Hadoop provides several useful Mapper implementations, some of which are listed in Table 3.2.

Table 3.2. Some Mapper implementations predefined by Hadoop
    - IdentityMapper<K,V>: implements Mapper<K,V,K,V> and maps inputs directly to outputs
    - InverseMapper<K,V>: implements Mapper<K,V,V,K> and swaps the key and value of each pair
    - RegexMapper<K>: implements Mapper<K,Text,Text,LongWritable> and emits a (match, 1) pair for every regular expression match
    - TokenCountMapper<K>: implements Mapper<K,Text,Text,LongWritable> and emits a (token, 1) pair for each token of the input value
2.2. Reducer
As with any Mapper implementation, a reducer must first extend the MapReduceBase class to allow for configuration and cleanup. In addition, it must implement the Reducer interface, which has a single method:

    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) throws IOException

When the reducer task receives the output of the various mappers, it sorts the incoming data by the keys of the (key, value) pairs and groups together all values with the same key. The reduce() function is then called, and it produces a (possibly empty) list of (K3, V3) pairs by iterating over the values associated with the given key. The OutputCollector receives the output of the reduce process and writes it to an output file. The Reporter offers a way to record extra information about the reducer as the task progresses.

Table 3.3 lists a few basic reducer implementations provided by Hadoop.

Table 3.3. Some Reducer implementations predefined by Hadoop
    - IdentityReducer<K,V>: implements Reducer<K,V,K,V> and maps inputs directly to outputs
    - LongSumReducer<K>: implements Reducer<K,LongWritable,K,LongWritable> and computes the sum of all values corresponding to the given key

There is one important step between the map and reduce stages: directing the output of the mappers to the different reducers. This is the responsibility of the partitioner.
If you use HashPartitioner, two edges with the same departure node can be sent to two different reducers. The count for that departure node would then be processed twice, and both results would be wrong. How can we customize the partitioner for our application? In this situation, we want all edges with the same departure node to be sent to the same reducer. This is easily done by hashing the departureNode of the Edge:
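    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class EdgePartitioner implements Partitioner<Edge, Writable> {

        // Only the departure node decides the partition, so all edges that
        // leave the same node reach the same reducer.
        @Override
        public int getPartition(Edge key, Writable value, int numPartitions) {
            // Clear the sign bit: hashCode() may be negative, and a negative
            // partition index is not valid.
            return (key.getDepartureNode().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        @Override
        public void configure(JobConf conf) { }
    }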
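One detail of hash partitioning is easy to overlook: String.hashCode() can return a negative number, and in Java the % operator keeps the sign of its left operand, so hashCode() % numPartitions can be negative. A common remedy, used in this sketch, is to clear the sign bit before taking the remainder. The example below is plain Java with hypothetical names (EdgePartitionSketch, partitionFor) and made-up city names; it only illustrates the arithmetic and does not use the Hadoop Partitioner interface.

```java
/** Picks a reducer index from the departure node only (plain Java, no Hadoop classes). */
class EdgePartitionSketch {

    /**
     * Returns a partition in [0, numPartitions). The arrival node is deliberately ignored,
     * so every edge leaving the same node is routed to the same reducer.
     */
    static int partitionFor(String departureNode, String arrivalNode, int numPartitions) {
        // Clearing the sign bit avoids the negative result that hashCode() % n can produce.
        return (departureNode.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 4; // hypothetical number of reduce tasks
        int p1 = partitionFor("San Francisco", "Los Angeles", reducers);
        int p2 = partitionFor("San Francisco", "Dallas", reducers);
        System.out.println(p1 == p2); // true: same departure node, same reducer
    }
}
```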
For example, for the sequences AGTTAHTQ and ACFTAQAC, the hit between TAH and TAQ is extended as follows:

    AGTTAHTQ
    xxx||||x
    ACFTAQAC

BLAST sorts the HSPs it has found by decreasing score, prints them to the screen, and performs a statistical evaluation of them. In this evaluation, BLAST uses the score of each HSP to compute a value called the "Bit-Score". This value does not depend on the substitution matrix and is used to assess the quality of the alignments: the higher the value, the more similar the aligned sequences. In addition, BLAST computes an expected value, the E-Score (Expect-Score), which depends on the Bit-Score. The E-Score expresses how likely it is that an alignment arose by chance: the lower the value, the more likely it is that the alignment reflects a real biological relationship rather than chance.
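The empty string is denoted ε. A string x is a substring of y if y = uxv for some strings u and v (either of which may be empty); if x ≠ y, we say that x is a proper substring of y. x is a prefix of y if y = xv for some string v; if x ≠ y, we say that x is a proper prefix of y. x is a suffix of y if y = ux for some string u; if x ≠ y, we say that x is a proper suffix of y. We also say that x is the i-th suffix of y when x is the suffix of y that starts at position i.

The longest common prefix (LCP) of x and y is the longest string z such that z is a prefix of both x and y. If x and y share no prefix, then z = ε. The length of z is written lcp(x, y); when x and y are clear from the context, we simply write lcp.

The longest common extension (LCE) of x and y at position (i, j) is the length of the LCP of the i-th suffix of x and the j-th suffix of y. When x and y are clear from the context, we simply write LCE(i, j).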
5.2.2. Approximate string matching
Definition 1: Edit distance
The edit distance between two strings x and y is the minimum number of operations needed to transform x into y, where the operations are:
    Substitution: a character of x is replaced by a character of y.
    Insertion: a character of y is inserted into x.
    Deletion: a character of x is deleted.

An alignment of x and y is a representation of the operations applied to x and y. It is usually written by placing one string above the other and filling in dashes (-) at the positions in x and y where a gap is inserted, so that every character or gap in one string is opposite exactly one character or gap in the other.

Definition 2: Approximate string matching with k differences
Approximate string matching with k differences between a pattern P and a text T is the problem of finding every pair of positions (i, j) in T such that the edit distance between P and T[i..j] is at most k.
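5.2.3. Dynamic programming solution
We can find the edit distance D(i, j) between P[1..i] and T[1..j] from three previously computed distances: the one between P[1..i-1] and T[1..j], the one between P[1..i] and T[1..j-1], and the one between P[1..i-1] and T[1..j-1]:

\[
D(i, j) = \min \{\, D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + \delta(i, j) \,\},
\qquad
\delta(i, j) = \begin{cases} 0 & \text{if } P[i] = T[j] \\ 1 & \text{otherwise} \end{cases}
\]

This recurrence can be evaluated with a simple dynamic programming table of size (m+1) x (n+1), where m is the length of P and n is the length of T.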
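The dynamic programming table described above can be sketched in a few lines. The code below computes the plain edit distance between two whole strings. The boundary conditions (row 0 and column 0 filled with 0, 1, 2, ...) are the usual ones for whole-string comparison and are an assumption of this sketch, as is the class name EditDistanceSketch.

```java
/** Fills the (m+1) x (n+1) dynamic programming table for the edit distance. */
class EditDistanceSketch {

    static int editDistance(String x, String y) {
        int m = x.length();
        int n = y.length();
        int[][] d = new int[m + 1][n + 1];

        // Boundary: turning a prefix into the empty string costs one deletion per character,
        // and building a prefix from the empty string costs one insertion per character.
        for (int i = 0; i <= m; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= n; j++) {
            d[0][j] = j;
        }

        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int substitution = d[i - 1][j - 1] + (x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1);
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitution, Math.min(deletion, insertion));
            }
        }
        return d[m][n];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("kitten", "sitting")); // 3
        System.out.println(editDistance("ACGT", "AGT"));       // 1
    }
}
```

The table costs O(mn) time and space, which is the cost that a k-differences algorithm tries to avoid when k is small.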
5.2.4. Basics of the Landau-Vishkin algorithm
Landau and Vishkin presented an algorithm for the problem of approximate string matching with k differences. The algorithm has two phases: a preprocessing phase and an iteration phase. In the preprocessing phase, the pattern and the text are preprocessed so that longest common extensions can be computed.

In the iteration phase, the algorithm iterates over each diagonal of the dynamic programming table and finds all matches of the pattern with at most k differences.
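As a point of comparison, the k-differences problem can also be solved with the ordinary dynamic programming table, one text column at a time. The sketch below is not the Landau-Vishkin algorithm: it does no preprocessing and does not walk the diagonals, so it takes time proportional to the pattern length times the text length. It is given only as a simple baseline that produces the same answers. The class and method names are hypothetical, and the sequences TAH and ACFTAQAC are reused from the BLAST example as sample input.

```java
import java.util.ArrayList;
import java.util.List;

/** Baseline k-differences search with a column-by-column dynamic programming table. */
class KDifferenceSketch {

    /**
     * Returns every 1-based end position j in text such that some substring of text
     * ending at j is within k differences of pattern.
     */
    static List<Integer> matchEnds(String pattern, String text, int k) {
        int m = pattern.length();
        int[] prev = new int[m + 1];
        int[] cur = new int[m + 1];
        for (int i = 0; i <= m; i++) {
            prev[i] = i; // pattern prefix of length i against the empty text
        }

        List<Integer> ends = new ArrayList<>();
        for (int j = 1; j <= text.length(); j++) {
            cur[0] = 0; // a match may start at any position of the text
            for (int i = 1; i <= m; i++) {
                int substitution = prev[i - 1] + (pattern.charAt(i - 1) == text.charAt(j - 1) ? 0 : 1);
                int skipTextChar = prev[i] + 1;
                int skipPatternChar = cur[i - 1] + 1;
                cur[i] = Math.min(substitution, Math.min(skipTextChar, skipPatternChar));
            }
            if (cur[m] <= k) {
                ends.add(j);
            }
            int[] swap = prev;
            prev = cur;
            cur = swap;
        }
        return ends;
    }

    public static void main(String[] args) {
        System.out.println(matchEnds("TAH", "ACFTAQAC", 1)); // [5, 6]
        System.out.println(matchEnds("TAH", "ACFTAQAC", 0)); // []
    }
}
```

With k = 1 the pattern TAH matches both TA (one deletion, ending at position 5) and TAQ (one substitution, ending at position 6).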
... a hash table of fixed-length substrings, called k-mers, of the reference sequence to find seeds, and MUMmer builds a suffix tree of the reference sequence to find variable-length maximal seeds. Then, in the extension phase, the tools compute more expensive inexact alignments, for example with the Smith-Waterman algorithm, restricted to the relatively short substrings near the shared seeds. This technique can greatly reduce the time needed to align sequences at a given level of sensitivity. However, as sensitivity is increased by allowing more differences, the seed length must decrease, and the number of randomly matching seeds grows, as does the total computation time.

The Landau-Vishkin k-difference alignment algorithm is an alternative dynamic programming algorithm for deciding whether two strings align with at most k differences. Unlike the Smith-Waterman dynamic programming algorithm, which considers all possible alignments, the Landau-Vishkin algorithm considers only the most similar alignments, up to a fixed number of differences, by computing how many characters of each string can be aligned with i = 0 to k differences. The number of characters that can be aligned using i differences is computed from the (i-1) result, by computing the exact extension that is possible after introducing one mismatch, one insertion, or one deletion at the end of the (i-1) alignment. The algorithm ends when i = k+1, which shows that no k-difference alignment exists for the sequences, or when the end of the sequence has been reached. For small k, the algorithm is much faster than the full Smith-Waterman algorithm, because only a small number of potential alignments are evaluated.
The read-mapping algorithm requires three rounds of MapReduce, as shown in Figure 1. The first two rounds, MerReduce and SeedReduce, find all maximal exact matches of length at least s. The final round, ExtendReduce, extends these seeds with the Landau-Vishkin algorithm into all alignments with at most k differences.
6.3.1. MerReduce: computing shared mers
This MapReduce round finds the mers of length s that are shared between the reads and the reference sequences. The map function processes all chunks independently, and mers that occur only in the reads or only in the reference are discarded automatically. ExtendReduce needs the flanking sequences to extend exact seeds into inexact alignments, but HDFS is inefficient for random access. The flanking sequence (up to <read length> - s + k bp) is therefore stored together with the read and reference mers, so that it is available when needed.

Map: For each mer in the input sequence, the map function outputs (mer, MerPos), where MerPos is the tuple (id, position, tag, left_flank, right_flank). If the sequence is a read (tag = 0), it also outputs the MerPos records for the reverse complement sequence. In total the map function emits s(M+N) mers, where M is the total length of the reads and N is the total length of the reference sequences. After all map functions have completed, Hadoop internally sorts the key-value pairs and groups all pairs with the same mer into a single list of MerPos records.

Reduce: The reduce function outputs position information for the mers that are shared between at least one reference sequence and one read. It requires two linear passes over the list of MerPos records for each mer. It first scans the list to find the MerPos records that come from the reference sequences. It then scans the list a second time and outputs a key-value pair (read_id, SharedMer) for every mer that occurs in both a read and the reference. A SharedMer is the tuple (read_position, ref_id, ref_position, read_left_flank, read_right_flank, ref_left_flank, ref_right_flank).

Figure 1. Overview of the BlastReduce algorithm using three rounds of MapReduce. Temporary files used internally by MapReduce are shaded.
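The MerReduce map and reduce steps can be imitated in memory with a short Python sketch. This is not Hadoop code: the grouping that Hadoop performs between the two steps is replaced by a dictionary, and several details are assumptions made for illustration, namely the function names, the tag value 1 for reference sequences, the "_rc" suffix used to label reverse-complement reads, and a single `flank` length parameter.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def mer_map(seq_id, seq, tag, s, flank):
    """Yield (mer, MerPos) for every length-s mer of one sequence.
    tag 0 = read (from the text); tag 1 = reference (assumed here)."""
    strands = [(seq_id, seq)]
    if tag == 0:
        # Reads are also indexed on the reverse-complement strand.
        strands.append((seq_id + "_rc", reverse_complement(seq)))
    for sid, strand in strands:
        for pos in range(len(strand) - s + 1):
            left = strand[max(0, pos - flank):pos]
            right = strand[pos + s:pos + s + flank]
            yield strand[pos:pos + s], (sid, pos, tag, left, right)


def mer_reduce(mer_positions):
    """Two linear passes over the MerPos list of one mer: collect the
    reference records, then emit a SharedMer for each read record."""
    refs = [p for p in mer_positions if p[2] == 1]
    for read_id, read_pos, tag, read_left, read_right in mer_positions:
        if tag != 0:
            continue
        for ref_id, ref_pos, _, ref_left, ref_right in refs:
            yield read_id, (read_pos, ref_id, ref_pos,
                            read_left, read_right, ref_left, ref_right)


def run_mer_reduce(reads, refs, s, flank):
    """Stand-in for one MapReduce round: map, group by mer, reduce."""
    groups = defaultdict(list)
    for tag, seqs in ((0, reads), (1, refs)):
        for seq_id, seq in seqs.items():
            for mer, mer_pos in mer_map(seq_id, seq, tag, s, flank):
                groups[mer].append(mer_pos)
    out = []
    for mer in sorted(groups):
        out.extend(mer_reduce(groups[mer]))
    return out
```

Calling `run_mer_reduce({"r1": "ACGTA"}, {"ref": "TTACGTT"}, s=4, flank=2)` yields one SharedMer per shared mer occurrence, on both strands of the read. Mers found only in the read or only in the reference produce no output because the reduce step has nothing to pair them with.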
6.3.2. SeedReduce: merging consistent mers
This MapReduce round reduces the number of seeds by merging consistent shared mers into larger seeds. Two shared mers are consistent if they are offset by 1 bp in both the read and the reference. Consistent mers can be merged safely because they map to the same alignment.

Map: The map function outputs the same (read_id, SharedMer) pairs as its input. After the map function has finished, all SharedMer records from a given read are grouped together internally for the reduce phase.

Reduce: Each list of SharedMer records is first sorted by read position, and consistent mers are then coalesced into seeds. The final seeds are the maximal exactly matching substrings of length at least s bp. The output consists of pairs (read_id, SharedSeed), where SharedSeed is the tuple (read_position, seed_length, ref_id, target_position, read_left_flank, read_right_flank, ref_left_flank, ref_right_flank).
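The coalescing step of the SeedReduce reduce function can be sketched as follows. To keep the example short, a SharedMer is reduced to (read_position, ref_id, ref_position) and the flanking sequences are left out; this simplification, the function name `seed_reduce`, and the grouping by offset before sorting are assumptions of the sketch rather than details taken from the text.

```python
from collections import defaultdict


def seed_reduce(shared_mers, s):
    """Coalesce the shared mers of one read into maximal seeds.

    shared_mers: iterable of (read_position, ref_id, ref_position)
    s:           mer length in bp
    Returns sorted (read_position, seed_length, ref_id, target_position).
    """
    # Mers that are offset by 1 bp in both read and reference share the
    # same (ref_id, ref_position - read_position) key.
    by_offset = defaultdict(list)
    for read_pos, ref_id, ref_pos in shared_mers:
        by_offset[(ref_id, ref_pos - read_pos)].append(read_pos)

    seeds = []
    for (ref_id, offset), positions in by_offset.items():
        positions = sorted(set(positions))  # sort by read position
        start = prev = positions[0]
        for pos in positions[1:]:
            if pos == prev + 1:
                prev = pos  # consistent with the previous mer: keep growing
                continue
            seeds.append((start, prev - start + s, ref_id, start + offset))
            start = prev = pos
        seeds.append((start, prev - start + s, ref_id, start + offset))
    return sorted(seeds)
```

With s = 4, three mers at read positions 0, 1 and 2 that map to reference positions 10, 11 and 12 collapse into a single seed of length 6, while an isolated mer stays a seed of length 4.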
6.3.3. ExtendReduce: extending the seeds
This MapReduce round extends the exactly aligned seeds into inexact alignments using the Landau-Vishkin k-difference algorithm.

Map: For each SharedSeed, the map function tries to extend the shared seed into an end-to-end alignment with at most k differences. If such an alignment exists, the output is the pair (read_id, AlignmentInfo), where AlignmentInfo is the tuple (ref_id, ref_align_start, ref_align_end, num_differences). After all map functions have completed, Hadoop groups all AlignmentInfo records that belong to the same read for the reduce function.

Reduce: The reduce function filters out duplicate alignments, which arise because one alignment can contain several seeds. For each read, it first sorts the AlignmentInfo records by the ref_align_start field and then outputs only the (read_id, AlignmentInfo) pairs whose ref_align_start values differ.

The output of ExtendReduce is a file containing all alignments of each read with at most k differences. This file can be copied from HDFS to a regular file system, or the HDFS files can be processed with the bundled reporting tools.
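The duplicate filtering in the ExtendReduce reduce function can be illustrated with a few lines of Python. The function name `extend_reduce` is hypothetical, and two choices go beyond the description above and are assumptions: ref_id is included in the comparison key so that alignments to different reference sequences are never merged, and the records are sorted as whole tuples so that the surviving duplicate is chosen deterministically.

```python
def extend_reduce(read_id, alignments):
    """Drop duplicate alignments of one read.

    alignments: iterable of AlignmentInfo tuples
                (ref_id, ref_align_start, ref_align_end, num_differences)
    Returns a list of (read_id, AlignmentInfo) pairs with one entry per
    distinct (ref_id, ref_align_start).
    """
    out = []
    last_key = None
    # Tuple sort orders by ref_id, then ref_align_start, then the rest.
    for info in sorted(alignments):
        key = (info[0], info[1])
        if key != last_key:
            out.append((read_id, info))
            last_key = key
    return out
```

If two seeds of the same read extend to the same alignment starting at reference position 100, only one (read_id, AlignmentInfo) pair for that start position is written.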