Lecture notes
APPLIED BIOINFORMATICS
NGUYỄN ĐỨC BÁCH
HANOI, 8/2013
CHAPTER 2
BIOLOGICAL FOUNDATIONS OF BIOINFORMATICS
2.1. Nucleic acids and proteins
2.2. Structure of nucleic acids
2.3. Genomes and genome research
2.4. Gene discovery and determination of gene function in the genome
2.5. Gene function and the regulation of gene activity
2.6. The proteome and protein research (proteomics)
2.7. Evolution and the molecular basis of evolution in organisms
2.8. Analysis of evolutionary relationships among organisms
Summary of Chapter 2
Review questions for Chapter 2
CHAPTER 3
SEARCHING FOR AND MANAGING RESEARCH LITERATURE
3.1. Methods for searching for information
3.2. How to find literature for research
3.3. Getting started with PubMed
3.4. How to manage research literature
Summary of Chapter 3
Review questions for Chapter 3
PART 2
BIOLOGICAL DATABASES
SUBMITTING SEQUENCES TO DATABASES
Summary of Chapter 5
Review questions for Chapter 5
PART 3
ANALYSIS TOOLS
MINING AND PROCESSING BIOLOGICAL SEQUENCE DATA
CHAPTER 7
GETTING STARTED WITH TOOLS FOR ANALYZING BIOLOGICAL DATABASES
7.1. Getting started with the basic analysis tools
7.1.1. Finding and copying sequences
7.1.2. Tools for searching for similar sequences
7.2. Finding functional and conserved regions
7.2.1. Multiple sequence alignment
7.2.2. Restriction map construction
7.2.3. Predicting the secondary and tertiary structure of proteins
7.2.4. Nucleic acid sequence analysis
7.2.5. Designing PCR primers and nucleic acid hybridization probes
7.2.6. Identifying open reading frames
7.2.7. Finding scientific articles
7.2.8. Sequence assembly
7.2.9. Analysis of evolutionary relationships
7.2.10. Protein analysis
7.2.11. Gene expression studies
7.3. Groups of analysis tools
7.3.1. NCBI analysis tools
7.3.2. EMBL tools
7.3.3. ExPASy tools
7.3.4. Other tools
Summary of Chapter 7
Review questions for Chapter 7
CHAPTER 8
GETTING STARTED WITH BIOLOGICAL DATA ANALYSIS
8.1. Finding data in databases
8.1.1. Sequence data
8.1.2. Structure data
8.1.3. Other data
8.2. Sequence analysis
8.2.1. Sequence comparison
8.2.2. Analysis of open reading frames and coding regions
8.2.3. Searching for promoters and gene regulatory regions
8.2.4. Searching for functional regions of proteins (functional motif searching)
8.2.5. Predicting and modeling protein interactions
CHAPTER 9
SEQUENCE ALIGNMENT AND ITS PRINCIPLES
9.1. Introduction to sequence alignment
9.2. Principles of sequence alignment
9.3. Multiple sequence alignment and its principles
Concept
The discovery that DNA is the carrier of genetic information, together with the determination of its structural model, opened the era of molecular biology. DNA encodes mRNA and other types of RNA. Proteins translated from mRNA carry out many biological functions in the cell, including the regulation of gene activity and of other biological processes. Although sequencing the genomes of organisms has now become straightforward, elucidating the genetic information contained in a genome, the functional activity of its genes, and the interactions among them remains a major challenge. In humans, for example, each cell contains 23 pairs of chromosomes, and the genome is about 3.2 x 10^9 nucleotide pairs in size and contains about 23,000 genes (1). The processes of transcription and translation are by now essentially understood, but determining the exact number of genes, their positions, and their interactions is still a difficult question.
(1) International Human Genome Sequencing Consortium (2004). "Finishing the euchromatic sequence of the human genome." Nature 431 (7011): 931-945.
With the rapid development of new techniques and technologies, biological data, chiefly nucleotide and amino acid sequences, are being generated in ever greater quantities every day. Collecting and storing these data, providing access to them, and searching, analyzing, and comparing the relationships among data held in enormous databases are the tasks of bioinformatics. In practice, this requires bioinformaticians and computer scientists to develop new algorithms that improve accuracy and reduce the time needed by biological researchers.
Bioinformatics is a multidisciplinary field of research. To a certain extent, it rests on molecular biology (the source of the data to be analyzed), computer science (which provides the hardware for analysis and the computer networks for comparing and cross-checking results), and data analysis algorithms. These three elements are vital to bioinformatics. Molecular biology is itself a relatively new field built on many basic sciences, the most important being genetics, biochemistry, and cell biology. For this reason, the emergence of bioinformatics, research in it, and its application all require basic interdisciplinary knowledge and an understanding of computer science. Below are some important historical milestones in the development of molecular biology and bioinformatics.
Year  Discovery
1930  Tiselius introduces electrophoresis for analyzing proteins in solution.
1951  Pauling and Corey propose the alpha-helix and beta-pleated-sheet structures.
1953  Watson and Crick propose the double-helix model of DNA, based on X-ray
      diffraction data obtained by Franklin and Wilkins.
1954  Perutz's group develops the heavy-atom method to solve the phase problem
      in protein crystallography.
1955  The first protein sequence to be determined, that of bovine insulin, is
      reported by F. Sanger.
1970  The Needleman-Wunsch algorithm for sequence alignment is published.
1972  The first recombinant DNA molecule is created by Paul Berg and his group.
1973  The Brookhaven protein database is announced.
1974  Vint Cerf and Robert Kahn develop the TCP communication protocol, which
      becomes the foundation of the Internet.
1975  Two-dimensional electrophoresis is developed by P. H. O'Farrell.
      The Southern blot method is described and published by E. M. Southern.
1977  The protein database, PDB, is formally launched.
      Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical
      Research Council) publish methods for DNA sequencing.
1980  The complete genome sequence of an organism (ΦX174) is published. The
      genome contains 5,386 base pairs encoding 9 proteins.
      Multi-dimensional NMR is used to determine protein structures.
1981  The Smith-Waterman algorithm for sequence alignment is published.
1982  The Genetics Computer Group (GCG) creates many analysis tools for
      molecular biology at the Wisconsin Biotechnology Center of the
      University of Wisconsin.
1985  The FASTP algorithm is published.
      The PCR reaction is described by Kary Mullis and colleagues.
1986  The term "genomics" appears for the first time, describing the scientific
      field concerned with mapping, sequencing, and analyzing genes. The term
      was coined by Thomas Roderick and later became the name of a well-known
      journal, Genomics.
      The SWISS-PROT database is created by the Department of Medical
      Biochemistry of the University of Geneva and the European database EMBL.
1987
1988
1990
1991
1997
1998
2000
2001
2004
2004
2008
- Genome sequence analysis to detect genes, mutated genes, and cancer genes, to determine the roles of genes, and to move toward therapies (genome analysis and treatment);
- Analysis of evolutionary relationships and population genetics using software and computational tools;
- High-throughput image analysis;
- Development of algorithms and software to meet the needs of scientists in the biological sciences.
... accurate. For this reason, algorithms in bioinformatics aim at the most reasonable answer (heuristic) rather than an exact one (precise). The algorithms and models in common use today include heuristics, approximation algorithms, parsimony models, Markov chain Monte Carlo algorithms, Bayesian analysis, and probabilistic models.
Structure construction and modeling
Predicting the structure of protein molecules is one of the important applications of bioinformatics. The amino acid sequence of a protein can be determined directly or deduced from the nucleotide sequence of the corresponding gene. To model a structure, specific information about the protein is needed, ideally the crystal structure of the protein molecule. When a protein is hard to crystallize, or only its amino acid sequence is available, the sequence of the protein or polypeptide can be compared with those of known proteins in databases, using algorithms that look for homology, and an approximate model structure for the unknown protein can be derived from the result. As a rule, structure prediction can be applied to sequences that are more than 40% identical. Although sequence similarity and structural similarity are closely correlated, in many cases proteins with similar structures have different amino acid sequences. Structure determination or modeling therefore cannot rely solely on algorithms or computer programs. In many cases, modeling serves only for screening and reference.
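The 40% rule of thumb above can be illustrated with a small sketch. It assumes the two sequences have already been aligned (gaps written as "-") and counts identity over the columns where neither sequence has a gap, which is only one of several conventions in use. The function names and the threshold constant are illustrative and are not taken from any particular package.

```python
IDENTITY_THRESHOLD = 40.0  # percent; the rule of thumb quoted in the text


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def usable_as_template(aligned_query, aligned_template):
    """True when identity exceeds the threshold assumed above."""
    return percent_identity(aligned_query, aligned_template) > IDENTITY_THRESHOLD
```

For example, `percent_identity("ACDE-G", "ACDFHG")` compares five gap-free columns, four of which match. As the text stresses, passing such a threshold only suggests that a template may be useful; it does not guarantee a correct model.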
The homology between human haemoglobin and that of legumes (leghemoglobin) is one example of the relationship between sequence and structure. Both proteins are used to transport oxygen. Although their amino acid sequences are very different, their structures are remarkably similar. This also reflects the relationship between structure and function.
Modeling molecular interactions
Modeling molecular interactions means building models that describe what happens when two or more molecules come into contact. Information about an interaction includes its position, the interacting groups, and the mechanism by which the interaction forms. Molecular interactions involve thermodynamic changes and changes in molecular state (changes in charge, shifts of bonding groups, and changes in configuration and spatial geometry). Typical molecular interactions include protein-protein/peptide, enzyme-substrate, and ligand-partner interactions. The term commonly used today is docking, and the corresponding algorithms are docking algorithms.
Supporting techniques include CD (circular dichroism), X-ray crystallography, and protein nuclear magnetic resonance spectroscopy (protein NMR). An important question is whether analyzing the (3D) molecular structure alone is enough to predict a molecular interaction, or whether specific experiments are needed for each protein pair (protein-protein interaction experiments) or protein-protein docking.
Prediction of protein structure
Protein structure prediction relies on information such as the amino acid sequence, mass spectrometry (MS) results, crystallization and X-ray diffraction analysis, and biological homology (similarity in the sense of performing the same biological function, or enzymes that catalyze the same type of reaction or act on the same group of substrates).
The algorithms are all based on calculating chemical bonds, the capacity to form bonds, interactions between molecules, thermodynamics, free energy, and binding energy in order to build models of the spatial structure. At present, however, analyzing the relationships among known structures and functions and comparing them is still regarded as the foundation of protein structure prediction. New proteins whose structures have not been determined are therefore usually predicted by sequence comparison combined with physical and chemical properties.
Analysis of gene expression
Databases of mRNA, cDNA, and ESTs help detect whether genes are expressed and at what level. Protein microarray and mass spectrometry (MS) databases play a very important role in analyzing or detecting the presence of a particular protein in a biological sample. Comparing and cross-checking these databases shortens research time. However, the process often becomes complicated when handling large numbers of samples (high-throughput analysis) and data that are noisy because of experimental error.
From genome sequence analysis to therapy (from genome to therapy)
One of the main causes of cancer is the accumulation of mutations. Analyzing many sequences can identify potential mutations in cancer-related genes. Bioinformatics builds automated analysis systems to manage and store this information, which in turn supports searching, comparing, and cross-checking genes and genomes to detect polymorphisms (for example the databases dbVar, dbSNP, and Cancer Chromosomes). The results of these analyses make diagnosis and treatment easier. A typical example is the development of different drugs tailored to each individual.
New techniques now being applied include comparing nucleotide sequences to detect differences at the single-nucleotide level, in order to find point mutations (single-nucleotide polymorphism arrays) at many positions and in many different regions of the genome. The algorithms currently in use include hidden Markov models and change-point analysis methods.
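The idea of detecting single-nucleotide differences by comparing sequences can be sketched in its simplest form as a position-by-position comparison of two already-aligned sequences of equal length. This is a deliberately naive illustration, not one of the hidden Markov model or change-point methods named above, and the function name is hypothetical.

```python
def point_differences(reference, sample):
    """List (position, reference_base, sample_base) for every mismatch.

    Positions are 1-based. Both sequences are assumed to be aligned
    already and to have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    pairs = zip(reference.upper(), sample.upper())
    return [(pos, ref, alt) for pos, (ref, alt) in enumerate(pairs, start=1) if ref != alt]
```

Calling `point_differences("ACGTACGT", "ACGTTCGA")` reports the two positions at which the sample differs from the reference. Real variant detection also has to deal with insertions, deletions, and sequencing errors, which this sketch ignores.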
Evolutionary research (computational evolutionary biology)
Evolutionary research includes determining the evolutionary origins of species as well as how species change and how new ones arise over time. Information technology and bioinformatics support biological researchers in several ways, including:
- Detecting the basis of evolution by comparing DNA sequences and detecting changes in them, rather than relying heavily on morphological change.
- Comparing whole genomes, which makes it possible to study complex events in evolution such as segmental duplication, the exchange of genetic material, or the acquisition of part of another species' genetic material (for example horizontal gene transfer, including transformation, transfection, transduction, symbiosis, genome recombination, and gene transfer).
- Building computer models to predict the course and outcome of populations over time.
1.4.
... to the design of networks of linked databases and the development of web interfaces through which researchers can both access the databases and submit new sequences and data, or data that have been corrected or supplemented. Scientists' need to search and analyze data (data mining) has led to the development of search tools combined with data comparison. Using programs such as FASTA and BLAST, sequence alignment, genome assembly, gene finding, analyzing the domains of protein molecules, and determining their structures have become routine daily tasks for researchers. More advanced and complex applications, such as locating genes on chromosomes and determining their roles (positional cloning), comparing the three-dimensional structures of proteins, predicting protein structures and protein-protein interactions, pattern recognition, and gene expression profile prediction, are becoming common in well-equipped research laboratories.
From the results of studies on gene roles and gene interactions, scientists can compare the activities of normal cells and diseased cells. To do this, biological databases must be combined and cross-checked to form an overall picture that expresses the relationships among these activities through the study of metabolic pathways (metabolomics). This is also one of the great challenges facing bioinformaticians.
FASTA
FASTA is a database search tool used to compare a nucleotide or amino acid sequence against a sequence database. The program is based on the fast sequence search algorithm of Lipman and Pearson. It was also the first algorithm used to search databases for similar sequences.
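The speed of this kind of search comes from looking up short exact word matches (k-tuples) instead of comparing every pair of positions. The sketch below shows only that first idea, under simplifying assumptions: it indexes the words of the query, scans the subject, and counts hits per diagonal (query offset minus subject offset). The scoring, rescoring, and alignment stages of the real program are not shown, and the function name is illustrative.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query, subject, k=2):
    """Count shared k-letter words per diagonal (query index minus subject index)."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)  # word -> all positions in the query

    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[i - j] += 1  # hits on one diagonal suggest an ungapped match
    return diagonals
```

A diagonal that collects many hits marks a region where the two sequences match without gaps, which is where a more careful comparison would then be concentrated. For instance, `ktuple_diagonals("ACGT", "TACGT").most_common(1)` points to the diagonal on which the query matches the subject shifted by one position.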
EMBOSS
EMBOSS stands for the European Molecular Biology Open Software Suite, a free, open-source collection of analysis software for molecular biology. It contains more than 100 applications for comparing sequences, searching databases for sequences, searching for patterns, finding domains and motifs in proteins by comparing amino acid sequences, comparing nucleotide sequences to detect patterns, and analyzing codon usage frequency (codon bias analysis). A list of the applications can be found at:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/
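ClustalW
ClustalW is a program for comparing DNA and protein sequences. Its purpose is to find the regions where the sequences are similar and the regions where they differ. On that basis it supports many other applications, such as the analysis of domains, motifs, and patterns and the reconstruction of evolutionary relationships.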
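Once sequences have been aligned, finding the regions that are the same across all of them amounts to scanning the alignment column by column. The sketch below assumes an alignment is already available as a list of equal-length strings with "-" for gaps, and marks fully conserved columns with "*", in the spirit of the consensus line printed under many alignments. It does not perform the alignment itself, and the function name is illustrative.

```python
def conservation_line(aligned_sequences):
    """Return a string with '*' under columns where all sequences agree."""
    if not aligned_sequences:
        return ""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")

    marks = []
    for column in zip(*aligned_sequences):
        residues = set(column)
        # conserved: exactly one residue type in the column, and it is not a gap
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)
```

Runs of "*" in the returned line correspond to candidate conserved regions; columns with a substitution or a gap are left blank.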
RasMol
RasMol is a very effective research tool for displaying the structures of DNA, proteins, and small molecules. Protein Explorer is an easy-to-use variant of RasMol.
Programming tools for bioinformatics
- Java: Because Java is by nature platform-independent, it is an important component of bioinformatics (BioJava).
- Perl: Used to process biological data (BioPerl).
- BioXML: Part of the BioPerl project; a collection of resources in XML and DTD form.
Building literature and journal databases for research
- Articles and journals (PubMed);
- Classification systems and taxonomic keys (taxon);
- Books (book);
- Articles, journals, and documents related to biochemical assays (PubChem BioAssay);
- Documents related to chemical compounds (PubChem Compound);
- Documents on chemical substances (PubChem Substance);
- Databases: genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.
The information contained in biological databases includes gene names, gene sequences, the position of a gene on the chromosome or genome (locus tag), the structure and function of genes, the consequences of gene mutations, related genes (gene families), and their structures (in the case of proteins, RNA, and so on).
Summary of Chapter 1
Bioinformatics is a new scientific field that closely combines biology, chiefly genetics and molecular biology, with the tools of statistics, mathematics, and computer science. Chapter 1 introduces the concept and role of bioinformatics, as well as the tools that serve research problems in modern molecular biology, such as searching databases for homologous or similar biological sequences, modeling and predicting interactions between molecules, and detecting gene expression patterns and relationships among genes. The main topics of bioinformatics and the trends in its development are also covered, giving students an overview of an applied scientific field that supports researchers in molecular genetics, molecular biology, medicine, and related areas.
Review questions for Chapter 1
CHAPTER 2
BIOLOGICAL FOUNDATIONS OF BIOINFORMATICS
2.1. Nucleic acids and proteins
Nucleic acids and proteins are two biological macromolecules that play important roles in the living world. Deoxyribonucleic acid (DNA) carries genetic information, while ribonucleic acid (RNA) is involved in protein biosynthesis and takes part in regulating the life processes of the cell. The building blocks of nucleic acids are nucleotides, and those of proteins are amino acids.
2.2. Structure of nucleic acids
DNA and RNA are built from monomers: deoxyribonucleotides and ribonucleotides, respectively. In DNA, each nucleotide consists of a phosphate group, a pentose sugar, and a base. Nucleotides are joined by phosphodiester bonds between the 5'-phosphate group on the pentose of one nucleotide and the 3'-OH group on the pentose of the next. A nucleic acid molecule therefore always has a 5'-phosphate end and a 3'-OH end. By convention, a nucleic acid sequence is always written in the 5' to 3' direction, from left to right.
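One practical consequence of the 5' to 3' writing convention is that the complementary strand of a double-stranded DNA molecule, when written by the same convention, is the reverse complement of the first strand. The sketch below assumes standard Watson-Crick pairing (A with T, G with C) and sequences containing only those four letters; the function name is illustrative.

```python
# Watson-Crick pairing for DNA, upper and lower case
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq_5_to_3):
    """Return the complementary strand, also written 5' to 3'."""
    complemented = seq_5_to_3.translate(COMPLEMENT)  # pair each base
    return complemented[::-1]  # reverse so the result also reads 5' -> 3'
```

For the strand 5'-ATGC-3', the paired bases read 3'-TACG-5', which is written 5'-GCAT-3' under the convention, and this is what `reverse_complement("ATGC")` returns.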
Figure 4. The genetic code
The relationship between DNA, RNA, and protein is described by the central dogma (Crick 1970).
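The flow of information from DNA through RNA to protein can be followed on a toy sequence. The sketch below assumes the input is the coding strand written 5' to 3', so the mRNA has the same sequence with U in place of T. The codon table is deliberately partial, listing only the codons needed for the example plus the three stop codons, and any codon outside it is shown as "X". The names are illustrative.

```python
# Partial codon table for illustration only (standard genetic code entries)
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCU": "A",  # alanine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_dna):
    """mRNA sequence corresponding to a coding-strand DNA sequence."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate codon by codon from the first base until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X = not in this partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

With the coding sequence ATGTTTGGCTAA, transcription gives AUGUUUGGCUAA, and translation reads the codons AUG, UUU, and GGC before stopping at UAA.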
Tertiary and quaternary structure
Tertiary structure is formed by the further arrangement and folding of secondary structure elements. Polypeptides longer than 200 amino acids usually fold into several structural units called domains. Quaternary structure is the next level of organization above tertiary structure. Proteins with quaternary structure are usually formed from several polypeptide chains (subunits).
In quaternary structure, the interactions between amino acids include hydrogen bonds between peptide chains, disulfide bridges between cysteine residues, ionic bonds between the charged groups of residues (side chains), and hydrophobic interactions.
2.3. Genomes and genome research
Genome
The genome contains all the genetic information of an organism. This information is encoded in DNA or RNA. Taking the human genome as an example, if the genome is thought of as a book, the book is divided into 23 chapters (corresponding to the 23 pairs of chromosomes). Each chapter contains 48 to 250 million letters (A, C, G, T). The whole book has more than 3.2 billion letters and is kept in the nucleus of the cell.
The first genome sequencing project was completed in 1977 by Fred Sanger. He and his colleagues determined the sequence of phage ΦX174, which contains 5,386 bases. The first bacterial genome to be sequenced was that of Haemophilus influenzae, in 1995. After that, the first eukaryotic genome to be sequenced was that of the yeast Saccharomyces cerevisiae. Today, thanks to the rapid development of technology (Illumina/Solexa, 454 pyrosequencing, Ion Torrent, SOLiD sequencing, ...), the number of species with sequenced genomes is growing rapidly.
Genome research (genomic research)
Genome research is not merely a matter of summarizing the genomes that have been sequenced, or of stating how many genes a genome contains and which traits correspond to them. It also includes comparing genome sizes, chromosome numbers (karyotypes), gene order, codon usage frequency, GC content, and genome evolution. In addition, genome research includes comparing multiple genomes to detect conserved regions and the changes that have taken place within genomes. The results of genome research are often displayed graphically through genome browsers.
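Two of the quantities mentioned above, GC content and codon usage frequency, are simple to compute from a sequence. The sketch below assumes a DNA sequence containing only A, C, G, and T and, for codon usage, a coding sequence that starts in frame. The function names are illustrative.

```python
from collections import Counter


def gc_content(seq):
    """Percentage of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(coding_sequence):
    """Count how often each codon occurs, reading in frame from the first base."""
    cds = coding_sequence.upper()
    # a trailing partial codon (1 or 2 bases) is ignored
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
```

Comparing such counts between genomes, or between genes within one genome, is the starting point for the codon usage and base composition comparisons described in the text.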
Genomics is a discipline closely tied to genetics. It is concerned with studying the genomes of organisms, including determining the DNA sequence of the entire genome and constructing high-resolution genetic maps (in which the markers lie very close together). Genomics also studies phenomena that occur within the genome, such as hybrid vigor (heterosis), the masking effect of one gene on another (epistasis), the influence of one gene on many traits (pleiotropy), and interactions between loci and alleles within the genome. Unlike the study of the roles and functions of individual genes, genomics studies the overall relationships among the components of the genome.
Genome duplication plays a major role in the formation of new species. Duplication can range from the small scale (repetition of short segments, or short tandem repeats) to the duplication of a whole gene or gene cluster, a whole chromosome, or even the entire genome. These events are the foundation for new genetic traits and thus a basis of evolution. Horizontal gene transfer plays an important role in explaining similarities between small parts of the genomes of two organisms that do not share an evolutionary origin. Such gene exchange is fairly common among microorganisms; antibiotic resistance in microorganisms is a typical example. The transfer of genetic material from the mitochondrial and chloroplast genomes into the chromosomes of eukaryotic cells is another example of this phenomenon.
The human genome
In 2001, the first draft of the human genome was published. In 2007, the human genome sequencing project was completed with a very low error rate (about 1 in 20,000 bases). The assembled versions of the human genome sequence can be accessed with the UCSC Genome Browser or Ensembl.
Genome research on viruses (bacteriophages)
Bacteriophages play an important role in the study of bacterial genetics and molecular biology. Historically, they were used to determine gene structure and to study the mechanisms and models of gene regulation. Because their genomes are small and contain no introns, bacteriophages were chosen as the first genomes to be sequenced. Research on bacteriophages did not, however, start the genomics revolution (which began with the sequencing of bacteria). Bacteriophage genome sequences are usually determined by direct sequencing. Analysis of bacterial genomes shows that a considerable portion of bacterial DNA consists of prophage and prophage-like sequences. Mining the information in bacteriophage databases therefore helps explain the role of prophages in shaping bacterial genomes.
Genome research on cyanobacteria (cyanobacterial genomics)
At present, 24 cyanobacteria have been sequenced, 15 of which were isolated from the sea. They include 6 strains of the genus Prochlorococcus, 7 strains of the marine genus Synechococcus, Trichodesmium erythraeum IMS101, and Crocosphaera watsonii WH8501. Several studies have shown that these sequences can be very useful for inferring the physiological and ecological characteristics of marine cyanobacteria. Many more genome sequencing projects are under way, including isolates of Prochlorococcus and Synechococcus (marine), Acaryochloris and Prochloron, the nitrogen-fixing filamentous cyanobacterium Nodularia spumigena, Lyngbya aestuarii, and Lyngbya majuscula, as well as studies of the effects of bacteriophages on marine cyanobacteria. Genome research thus plays an important role in explaining the evolutionary origins of organisms and biological processes such as photosynthesis.
The relationship between C-value and gene number:
The C-value is the DNA content of an organism. It varies enormously among species. There is no clear relationship between the C-value and the number of genes an organism has. The more complex the genome, the larger the proportion of non-coding DNA, which does not carry genetic information encoding RNA. In humans, non-coding DNA makes up nearly 75% of the genome. The C-value paradox refers to the lack of proportionality between genome size and gene number.
2.4. Gene discovery and determination of gene function in the genome
... a similar number of genes. Roundworms have about 13,000 and rice about 46,000. In humans, protein-coding gene sequences make up about 1-2% of the genome.
Gene structure
... mass spectrometry, comparing the homology between peptide fragments, comparing amino acid sequences, and so on. Proteomics comprises two important areas: the study of structure and the study of function. Information on amino acid sequence, structure, and function helps researchers explain the nature of biological processes and the mechanisms of disorders and diseases, and identify and predict the functions of new proteins.
2.7. Evolution and the molecular basis of evolution in organisms
Mutation and the accumulation of mutations
Although the mechanisms and causes of evolution are still much debated, in the modern view mutation is regarded as the raw material of evolution, because it is the route by which new alleles arise and by which regions with regulatory functions are altered or newly created. Mutations can have serious consequences, but there are also neutral mutations and mutations that do not affect the phenotype (mutations in non-coding DNA).
Most mutations in structural genes affect the protein product or lead to diversity in protein products through the cutting and splicing of mRNA exons. Changes in the structure and function of molecules appear as variant forms of individuals within a population. Through successive evolutionary events these can eventually lead to speciation and the formation of new species. The question here is why small changes in genes caused by mutations, point mutations in particular, lead to the differences that separate one species from another. Answering it requires considering both space and time. Space here means the random selection acting on mutated individuals. Time is the consequence of a long process of natural selection. Space and time are closely related: if selection pressure is too strong, a new species may form, or extinction may occur, within a short time.
Gene and genome duplication
If a gene is duplicated or present in several copies, a mutation in one copy may have no effect on the life of the cell. Gene duplication in a diploid organism creates an additional pair of genes, so that one pair continues to function normally while the other can be altered or exist in different combinations. What, then, is the benefit of gene duplication? Over time, one copy can acquire a new function, providing a basis for adaptation during evolution. Even when two copies of a gene remain paralogous, that is, similar in sequence and function, the existence of the copies is a form of redundancy (gene redundancy). This explains why, in some cases, knocking out a gene in mice or yeast has no effect, or only a mild effect, on the phenotype. The function of the knocked-out gene may be compensated for by a corresponding paralog.
After a gene is duplicated, one copy may be altered or lost through evolutionary events. Changes occurring in many genes and at many positions in the genome lead to barriers (post-zygotic isolating mechanisms) to mating and reproduction between populations. These barriers can gradually give rise to speciation.
Mutations in regulatory regions
Ortholog
Homologous sequences are considered orthologous when they were separated by a speciation event, while still sharing a most recent common ancestor. When a species diverges or splits into two separate species, the copies that descend from a single gene are called orthologous. Orthologous genes are genes in different species that are similar because they are direct descendants of a single gene. For example, the regulatory protein Flu is found in both Arabidopsis (a higher multicellular plant) and Chlamydomonas (a unicellular green alga). In Chlamydomonas the protein is more complex: it spans the membrane twice rather than once as in Arabidopsis. When the gene is transferred from the green alga into the plant genome by genetic engineering, it functions much as it does in its original cell. This result shows that the two genes are orthologous and inherited from a common ancestor.
To determine whether two similar genes are orthologous, it is enough to analyze their evolutionary origin. If the genes lie on the same branch, they are orthologs and descendants of a common ancestor. Orthologous genes usually have the same biological function.
Paralogous
Homologous sequences are called paralogous when they were separated by a gene duplication event. If a gene in an organism is duplicated and occupies two different positions in the same genome, the two copies are called paralogous (para means parallel) and may perform the same function. Paralogs usually have the same or similar functions, but not always. The reason is a lack of selection pressure: selection pressure acts on only one copy of the duplicated gene, leaving the other copy free to mutate, change, and acquire a new function.
Paralogous sequences provide much useful information within genomes. The genes encoding myoglobin and haemoglobin are regarded as the most ancient paralogs. The four haemoglobin groups known so far (A, A2, B, F) are paralogs of one another. While all of these proteins perform the same function of transporting oxygen, a small change such as that in haemoglobin F results in a much higher oxygen affinity than that of the adult haemoglobins. The functions of paralogous genes are not necessarily preserved. Paralogous genes usually belong to the same species, but not always. For example, the human haemoglobin gene and the monkey myoglobin gene are still paralogs. This is a problem often encountered in bioinformatics. When the genomes of different species are sequenced and compared, it is easy to conclude that genes are homologous, yet they may be paralogs whose functions have changed.
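The distinction drawn above depends on the event at which two genes last shared an ancestor: a speciation event makes them orthologs, a duplication event makes them paralogs, even when the genes sit in different species. The sketch below makes this concrete on a small, hand-written gene tree in which each internal node is labeled with its event. The tree, the gene labels, and the function names are illustrative assumptions; in practice the event labels have to be inferred by phylogenetic analysis.

```python
# Gene tree as nested tuples: (event, left_subtree, right_subtree).
# Leaves are gene names. An ancient duplication produced two gene lineages,
# each of which was later split by the same speciation event.
GLOBIN_TREE = (
    "duplication",
    ("speciation", "human_HBA", "mouse_Hba"),
    ("speciation", "human_HBB", "mouse_Hbb"),
)

RELATION = {"speciation": "orthologs", "duplication": "paralogs"}


def leaves(node):
    """Set of gene names found below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def divergence_event(tree, gene_a, gene_b):
    """Event label at the last common ancestor of two different genes in the tree."""
    event, left, right = tree
    for child in (left, right):
        if not isinstance(child, str) and {gene_a, gene_b} <= leaves(child):
            return divergence_event(child, gene_a, gene_b)  # both genes lie deeper
    return event


def relation(tree, gene_a, gene_b):
    return RELATION[divergence_event(tree, gene_a, gene_b)]
```

Here `relation(GLOBIN_TREE, "human_HBA", "mouse_Hba")` gives orthologs, while both `relation(GLOBIN_TREE, "human_HBA", "human_HBB")` and `relation(GLOBIN_TREE, "human_HBA", "mouse_Hbb")` give paralogs, the last case being a pair of paralogs found in two different species.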
Ohnology
Genes are called ohnologous when they originate from a whole-genome duplication. The term was used by Ken Wolfe in honor of Susumu Ohno. Ohnologs are interesting for evolutionary analysis because they have all been diverging for the same length of time since their common origin (the whole-genome duplication).
Xenology
7. What is the regulation of gene activity?
8. Why study the evolutionary relationships among organisms?
CHAPTER 3
SEARCHING FOR AND MANAGING RESEARCH LITERATURE
3.1. Methods for searching for information
The rapid growth of the Internet and of the number of web pages has created an enormous amount of information that increases by the day. Finding the information you need in this vast store of data requires search tools combined with a suitable method. Chapter 3 introduces some tools and methods for finding general information on the Internet for study and research.
When you need to find web pages containing specific words or phrases, search engines such as Google return results quickly and effectively. The results, however, sometimes include a great deal of information not directly related to the topic or scope of the search, so much time is lost in filtering. For a targeted search in a specific field or on a specific topic, subject directories such as the World Wide Web Virtual Library (http://vlib.org/) can be used to narrow the field of the search. In practice, however, search tools cover only about one third of the information that actually exists. The reason is that they cannot reach the remaining sources, mainly because of network security and access barriers that search tools are not allowed to cross.
There are two kinds of information search: searching with general search engines (such as Google) and searching for specific data according to a research purpose or field. Whichever tool is used, the search involves these steps: (i) identify the search tool or the websites that support the search; (ii) define the information to be found; (iii) build keywords that represent the content of the search (keywords should be phrases rather than single words; in English, avoid articles and prefer nouns); (iv) combine logical (Boolean) operators such as and, or, not, or +, -, double quotation marks, and the asterisk (*), to filter and narrow the search results.
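Steps (iii) and (iv) can be made concrete with a small helper that assembles a query string from key phrases and Boolean operators. This is only a sketch: it uses the AND / OR / NOT keywords and quoted phrases, whereas the exact operator syntax differs between search tools, so the output may need adapting to the tool at hand. The function names are illustrative.

```python
def quote(term):
    """Wrap multi-word key phrases in double quotation marks."""
    return f'"{term}"' if " " in term else term


def build_query(all_of=(), any_of=(), none_of=()):
    """Combine key phrases with Boolean operators into one query string."""
    clauses = [quote(term) for term in all_of]  # every one of these must appear
    if any_of:
        alternatives = " OR ".join(quote(term) for term in any_of)
        clauses.append(f"({alternatives})")  # at least one of these must appear
    query = " AND ".join(clauses)
    for term in none_of:
        query += f" NOT {quote(term)}"  # exclude these
    return query.strip()
```

For example, `build_query(["gene expression"], ["microarray", "RNA-seq"], ["review"])` produces a query that requires the phrase "gene expression", accepts either technology, and excludes the word review.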
3.2. How to find literature for research
Google is currently regarded as the fastest and most effective search tool and is the one most people use. For general information searches, and even for directory searches, Google remains the dominant tool. In some cases Google can reach some protected websites and display information from them in the search results, but access to those sources will then be blocked for network security reasons. Even so, for broad searches Google can be regarded as the tool of first choice.
A search begins by defining the information to be found, followed by building keywords. For researchers in biology, and molecular biology in particular, information comes mainly from foreign-language literature, so proficiency in English is almost mandatory. Keywords are built by combining words, mainly nouns, into key phrases. Google usually returns a very large number of results, so users must filter them by making the keywords longer, grouping keywords into phrases, and combining them with logical (Boolean) operators, or by using the advanced search functions. However, using Google solves only the problem of finding information
PART 2
BIOLOGICAL DATABASES
SUBMITTING SEQUENCES TO DATABASES
CHAPTER 4. BIOLOGICAL DATABASES
Databases
The most important foundation of applied bioinformatics is the database. Most of the data in biological databases are biological sequences accompanied by detailed descriptive information. Data from genome sequencing projects, for example, are generated daily on a worldwide scale. To make use of these data, they must be organized and arranged sensibly so that they can be stored, grouped, accessed, searched, and compared. In addition, because of the particular nature of biological data, there are structure and function databases alongside the usual sequence databases.
Because the databases are complex and interrelated, it is hard to sort them into separate categories. By the origin of the data, they can be divided into primary and secondary databases. Primary databases contain nucleotide or amino acid sequences, or structures, determined experimentally, together with descriptive information on function, related publications, and cross-links to other databases. Secondary databases contain data selected from primary databases and arranged according to particular criteria. By the type of data, they can be divided into sequence databases, structure databases, and other databases (Figure 18). Databases are extremely important as the basis for searching, analyzing, and comparing data. Combined with analysis tools and cross-links between databases, they allow researchers to identify, predict, and analyze the information contained in sequences and to determine the properties and functions of new biological sequences.
4.1. Primary databases
Experimental method  Proteins  Nucleic acids  Protein/DNA complexes  Other molecules  Total
X-ray diffraction       74593           1457                   3864                2  79916
NMR                      8700           1029                    192                7   9928
Electron microscopy       374             45                    126                0    545
Hybrid                     46              3                      2                1     52
Other                     147              4                      6               13    170
Total                   83860           2538                   4190               23  90611
4.3.3. PubChem
PubChem is an NCBI database that stores small molecules and information on their biological activities. It has three components: PubChem Compound, PubChem Substance, and PubChem BioAssay. PubChem Compound contains more than 11 million molecules (2007) together with their two-dimensional structures.
... EMBL, DDBJ) but NCBI of the United States. The gene bank currently contains more than 189,000,000 sequences totalling more than 299,000,000,000 bases from more than 380,000 organisms (as of December 2010).
a) NCBI
GenBank is the sequence database of the NIH. As of April 2011 it held about 126,551,501,141 bases in 135,440,924 sequence records, plus 191,401,393,188 bases in 62,715,288 sequence records in the WGS (whole genome shotgun) division.
Accessing GenBank
There are several ways:
- Search GenBank for sequences (sequences that have been determined and annotated)
with Entrez Nucleotide. The sequences are divided into three groups: CoreNucleotide
(the main GenBank collection), dbEST (Expressed Sequence Tags) and dbGSS (Genome
Survey Sequences).
- Search and align sequences in GenBank against a query sequence with BLAST (Basic
Local Alignment Search Tool). BLAST searches the CoreNucleotide, dbEST and dbGSS
databases independently.
- Find links and download sequences with the NCBI utilities (NCBI e-utilities).
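The third access route can be sketched in a few lines. This is a minimal illustration, assuming the public E-utilities `efetch` endpoint and its `db`, `id`, `rettype` and `retmode` parameters; the helper name `efetch_url` is hypothetical, and U54469 is simply the accession used as an example later in these notes. The code only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an efetch URL for one record (hypothetical helper)."""
    params = {
        "db": db,            # Entrez database to query
        "id": accession,     # accession number or UID
        "rettype": rettype,  # record format, e.g. "fasta" or "gb"
        "retmode": "text",   # plain-text response
    }
    return EFETCH + "?" + urlencode(params)


url = efetch_url("U54469")
print(url)
```

Opening the printed URL in a browser, or passing it to `urllib.request.urlopen`, should return the record in FASTA format, subject to NCBI's usage policy.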
Using GenBank data
The GenBank database is designed to give researchers access to DNA sequence
information and to encourage them to use it. For that reason NCBI places no
restrictions on users. Some submitted sequences, however, are covered by copyright or
patents, and their use is subject to certain conditions and regulations.
New features
NCBI continually develops new tools that make it easier to access GenBank and to
submit sequences to it. Users who register an NCBI account receive regular updates by
email.
b) EMBL
The European Molecular Biology Laboratory is one of the world's leading research
centres in the life sciences. EMBL has more than 20 member states: Austria, Belgium,
Croatia, Denmark, France, Germany, Greece, Iceland, Ireland, Israel, Italy,
Luxembourg, the Netherlands, Norway, Spain, Portugal, Sweden, Switzerland and the
United Kingdom. Australia has recently joined as a new member.
The EMBL database, also called EMBL-Bank, is Europe's primary nucleotide sequence
resource. Its DNA and RNA sequences come mainly from submissions by individual
researchers, from sequencing projects and from patent applications. Sequence data are
exchanged daily with the other two banks.
c) DDBJ
The DNA Data Bank of Japan (DDBJ), founded in 1986, is a DNA sequence bank at the
National Institute of Genetics (NIG) in Shizuoka. It is funded by the Japanese
Ministry of Education, Culture, Sports, Science and Technology (MEXT) and is one of
the three members of the International Nucleotide Sequence Database Collaboration
(INSDC). DDBJ exchanges data daily with EMBL at the EBI and with GenBank at NCBI, so
the three banks hold the same set of sequences at any given time.
Some specific genome databases
Rice genome database
The rice genome is about 430 Mb, the smallest of the sequenced cereal genomes. It is
about one seventh the size of the human genome and 3.5 times the size of the
Arabidopsis genome. The rice genome project was launched right after the human genome
project. Chromosome 1 was completed in 1997. In April 2000 and February 2001 the
project was completed but not yet published. The rice genome can now be accessed at
http://rice.genomics.org.cn/rice/link/ar.jsp or through the NCBI genome database.
Arabidopsis genome database
The current release of the Arabidopsis genome database includes information on genes
with splice variants. In other words, one gene can be expressed in different forms
depending on how its transcript is spliced. The database can be accessed at
http://www.atgc.org/Arabidopsis_Genome/
Plant genome database
Three main sources contribute to the plant genome database: WGS (whole genome
sequencing), GSS (genome survey sequencing) and ESTs (expressed sequence tags). The
species covered include Arabidopsis, rice, maize and Medicago truncatula. The
sequences comprise draft assemblies from sequencing projects, followed by ESTs and
cDNAs. Plant genomes can be accessed at http://www.plantgdb.org/
Other genome resources
NCBI provides the genomes of more than 3,200 organisms, both completed and in
progress. They can be accessed at
http://www.ncbi.nlm.nih.gov/About/tools/restable_org.html
Summary of chapter 4
1. A database stores data from many sources, classified by defined criteria so that
users can easily access, search, cross-check and compare them. Biological
databases are extremely important because they are the foundation for searching,
data mining, analysis and prediction.
2. Biological databases are diverse and complex and are hosted in data centres.
Sequence data (nucleotide, amino acid) are stored in gene banks, typically
GenBank (USA), EMBL (Europe) and DDBJ (Japan). Besides sequence databases there
are many other kinds: databases of macromolecular structures; genome databases
for particular species; databases of journals and articles; databases of chemical
compounds, images, protocols and so on.
3. Today these databases are open resources. Researchers can search and use them
free of charge and can also submit data from their own research to help build
them. Databases usually come with tools and software that support the user, such
as search tools, graphical viewers, and programs for comparing and aligning
biological sequences.
Review questions for chapter 4
1. What is a biological database? What kinds of data do biological databases contain?
2. Describe the roles of the gene banks (GenBank, EMBL, DDBJ) and how they relate to
one another.
3. What are primary and secondary databases? State the differences between the two.
4. How is a biological sequence submitted to the gene banks? Name a few typical
tools.
5. What is a species-specific genome database? Name a few genome databases and
explain their significance.
6. Describe the role and significance of the PubMed database.
7. List the main NCBI databases and briefly describe the significance of each.
8. Find out about Ensembl. What is the role of the Wellcome Trust Sanger Institute
(WTSI)?
CHAPTER 5
SEQUENCING AND SUBMITTING SEQUENCES TO GENE BANKS
5.1. Nucleotide sequencing
DNA sequencing is the process of determining the exact order of the nucleotides in a
DNA molecule. Knowledge of DNA sequences has become indispensable in biological
research and in related applications such as disease diagnosis, mutation analysis and
early cancer detection. Sequencing techniques keep improving, so that anything from a
single DNA fragment to a whole genome can now be sequenced quickly. Sequencing began
in the 1970s. Two methods were in use at that time: the chemical degradation method
of Maxam and Gilbert, and Sanger's method based on synthesis with chain termination.
Although Sanger's method was proposed first, technical limitations at the time meant
that the Maxam and Gilbert method initially dominated. Later technical advances made
Sanger's method the standard, and the sequencing process was subsequently automated.
More recently, a range of next-generation sequencing techniques has appeared, such as
454 pyrosequencing and Illumina (Solexa) sequencing. They allow a whole genome to be
sequenced in a short time at relatively low cost. Although these new methods receive
a great deal of attention, they still have many limitations compared with the earlier
methods. Individual DNA fragments are still sequenced manually, either after cloning
or by direct sequencing. The details of the sequencing techniques are outside the
scope of these lecture notes.
5.2. Genome sequencing
Concept
Genome sequencing is the process of determining the complete DNA sequence of an
organism's genome, including the mitochondria and (in plants) the chloroplasts. In
principle any sample, whether epithelial cells, bone marrow, hair roots, seeds or
leaves, contains the full genetic information in its DNA. For diploid organisms,
whose chromosomes occur as homologous pairs, the DNA sequence is arranged chromosome
by chromosome for the haploid set.
Whole-genome sequencing has its own history. In 1977 the first complete genome was
sequenced, that of bacteriophage ΦX174, with a size of 5,386 bp. In 1995 the first
bacterium, Haemophilus influenzae, was sequenced, with a genome of 1.8 Mbp. In 2000
the first plant genome was sequenced, that of Arabidopsis thaliana, with a size of
157 Mb. In early 2003 the human genome, 3.2 Gbp in size, was completed. In 2008 the
1000 Genomes Project was launched. To date, tens of thousands of genomes from
different species have been sequenced. With the rapid development of next-generation
sequencing, genome sequencing can now target individual organisms at low cost and in
a short time.
[Figure: short sequence reads overlapping to form a contig (contiguous sequence), with the coverage at each position]
c/ Before starting:
Prepare the nucleotide sequence data and the amino acid sequence data. Sequin
normally accepts sequences in FASTA format; PHYLIP, NEXUS, MACAW and FASTA+GAP are
also possible. For details of the file formats see
http://www.ncbi.nlm.nih.gov/Sequin/faq.html#Orgnameforphyl
Characters must be ASCII, saved as plain text.
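Since Sequin expects plain-text FASTA input, it can help to check a file before submission. The following is a minimal sketch, not part of Sequin itself: the function names `parse_fasta` and `format_fasta` and the example record are hypothetical, and the 60-character line width is only a common convention.

```python
def parse_fasta(text):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()     # a new record starts here
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


def format_fasta(name, sequence, width=60):
    """Write one record as FASTA text, wrapping the sequence at `width`."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


example = ">seq1 hypothetical example\nATGGCG\nTACCTGA\n"
records = parse_fasta(example)
print(records)
print(format_fasta("seq1", records["seq1 hypothetical example"], width=5))
```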
d/ Submitting the sequence
The example used here is the genomic sequence from D. melanogaster that encodes the
two initiation factors 4E-I and 4E-II (GenBank accession number U54469).
Once the sequence files are ready, start Sequin. The first Sequin form appears as
follows:
Summary of chapter 5
1. Thanks to technical progress, nucleotide sequences can now be determined very
quickly, and the whole genome of an organism can be sequenced within a few days.
The modern techniques in use today are called next-generation sequencing
techniques (to distinguish them from the Maxam-Gilbert technique and the
automated Sanger method). They include pyrosequencing, Illumina (Solexa)
sequencing and SOLiD sequencing.
2. The short sequences that are read (reads) are assembled on the basis of sequence
alignment. Assembly is now carried out by software running on powerful
computers.
3. Nucleotide sequences, or the complete genome sequences of organisms, are
submitted to the gene banks with dedicated submission tools. This work matters
not only because it publishes the researchers' results but also because it
builds up a worldwide repository of genome sequences.
Review questions for chapter 5
1. Describe the principle of the Maxam-Gilbert sequencing method.
2. Describe the principle of automated Sanger sequencing. By which principle and
method was the human genome sequenced?
3. Describe the principles of pyrosequencing, Illumina (Solexa) sequencing and SOLiD
sequencing.
4. Use search tools to find out about other sequencing methods and describe their
principles.
5. Explain the principle of sequence assembly. Give an example of a tool that
supports sequence assembly.
6. Name the tools that support sequence submission to the gene banks.
7. What is the significance of submitting sequences?
8. Use Sequin to submit a sequence (any sequence) to a gene bank.
PART 3
ANALYSIS TOOLS
MINING AND PROCESSING BIOLOGICAL SEQUENCE DATA
CHAPTER 6. GENOME BROWSER
6.2.1. Ensembl
Ensembl is a joint project of EMBL-EBI and the Sanger Institute to develop a software
system that produces and maintains annotation for selected eukaryotic genomes.
Ensembl was initially funded by the Wellcome Trust. Its website gives free access to
all data and software from the project.
The Ensembl project provides genome databases for vertebrates and other eukaryotic
species and puts the information online for free access.
In Ensembl the human genome currently comprises about 3.2 billion base pairs, coding
for about 20,000 to 25,000 genes. A genome browser provides not only the genome
sequence but, most importantly, the positions of and relationships between the genes
that have been annotated and located on the chromosomes. In the early days, when
sequence information was still limited, scientists annotated genomes manually,
locating genes with data from experiments, the scientific literature and databases.
Because manual annotation is tightly controlled and reviewed by experts, the
resulting data are very accurate, but the process demands a great deal of time and
effort. As data accumulated and more and more genome sequences became available,
manual annotation could no longer keep up, so algorithms for automatic genome
annotation were developed. In the Ensembl project, sequence data are fed into a
software pipeline written in Perl that produces a set of predicted gene locations and
stores them in a MySQL database for later analysis and display. Ensembl makes these
data freely accessible and downloadable worldwide.
Genome data in Ensembl are currently divided into several groups. The most commonly
used group contains the human, mouse and zebrafish genomes. The primate group
contains 10 genera. There are also groups for birds, reptiles, amphibians, fungi and
other organisms.
6.2.2. UCSC
The University of California, Santa Cruz (UCSC) set up a resource called the UCSC
Genome Browser, which holds the genome sequences of a large number of organisms, both
vertebrates and invertebrates. The sequences are organized and annotated in detail.
The browser is a graphical tool that supports fast searching of and access to the
database at many levels of detail. UCSC has recently expanded the number of genomes
in the database, which now totals several hundred species.
or a combination of several factors. The traditional method is drag and drop, which
lets the user select any region of the whole genome and zoom in until that region
fills the screen.
Researchers can also use the Genome Browser to display their own data through the
Custom Tracks tool. It lets the user upload a file containing their own sequences and
view the data at different levels of detail. Users can also build data sets from UCSC
data with the Table Browser tool (for example, SNPs that change the amino acid
sequence of a protein) and display them in the browser as a Custom Track.
UCSC is more than a genome browser. It also hosts a suite of genome analysis tools,
including a full-featured GUI for mining the browser data (Table Browser) and a fast
sequence alignment tool (BLAT) that is very effective at finding sequences in a very
large sequence collection. The liftOver tool uses whole-genome alignments to convert
sequences from one assembly to another or between species. The Genome Graphs tool
lets the user view all chromosomes at once and display the results of genome-wide
association studies (GWAS). The Gene Sorter tool displays genes grouped by criteria
or parameters unrelated to genomic position, such as gene expression patterns across
tissues.
Summary of chapter 6
1. A genome browser is a Web-based viewer for searching and displaying information
about the genomes of organisms through a graphical interface. It provides basic
information including genome size, number of chromosomes, chromosome maps,
number of genes, and the positions of and absolute distances between genes.
Detailed information on genes, gene functions and loci is also provided through
links to other databases.
2. The three genome browsers Ensembl, UCSC and NCBI MapViewer give full information
on the genomes of many species. Their databases currently cover the genomes of
more than 1000 species.
3. Genome browsers provide a great deal of useful information for researchers, such
as the position of a gene in the genome, comparisons of gene loci between
species, closely related genes (gene families), molecular markers associated
with a gene, the corresponding locus of a gene in closely related species, and
the function of a gene and its protein product as well as the related protein
family.
Review questions for chapter 6
8. What is the In silico PCR tool in UCSC used for? Give an example.
9. Explore the genome browsers and explain how to identify genes that share an
evolutionary origin with a given gene. Give an example.
10. Use the genome browser tools to compare the locus of a corresponding gene in the
human and mouse genomes.
11. State the differences between the genome browsers and the 1000 Genomes
sequencing project.
CHAPTER 7
INTRODUCTION TO TOOLS FOR ANALYSING BIOLOGICAL DATABASES
[Table: frequency of use (%); row labels missing. Values: 35, 9, 12, 2, 3, 9, 11, 8.5, 7, 6, 4.5, 4.5, 4, 3.5, 3, 3, 3, 2.5, 2.5, 2.5; total 100]
databanks, most typically the GenBank gene bank, the European databank EMBL and the
Japanese gene bank (DDBJ).
To search for a sequence, the researcher uses a tool called a browser and enters the
sequence name, gene name, gene product or other information related to the sequence.
In the past, when few genes and proteins had been discovered and deposited in the
databanks, naming was fairly simple and easy to manage. The rapid development of
sequencing techniques and of methods for determining gene and protein function has
since produced a very large number of sequence names, including those of homologous
sequences and homologous genes. Managing these data becomes even harder when
databases are built on cross-links (cross database). One of the main difficulties in
sequence searching today is therefore the standardization of names. The term ontology
was introduced for the field concerned with standardizing the names of biological
sequences, genes and proteins.
Figure 33 shows the NCBI database interface. Depending on the kind of data required,
the user selects the corresponding database, types the name or other information
about the sequence, gene or protein into the search box and clicks Search. The
matching records are returned. To search for a DNA or protein sequence, the user can
choose the nucleotide, gene, protein, EST or STS database, among others. Once NCBI
returns the results, the user can copy the sequence or download it to the computer.
7.2.6. Identifying open reading frames
Identifying open reading frames is important for detecting or predicting genes. An
open reading frame (ORF) is defined as a stretch of sequence that begins with the
start codon AUG, continues with an uninterrupted series of coding triplets and ends
with one of the three stop codons (UAA, UAG or UGA). The number of nucleotides in an
ORF is always a multiple of 3. Note that any given DNA sequence has 6 reading frames,
3 on the forward (+) strand and 3 on the reverse (-) strand. Each reading frame may
contain no ORF, one ORF or several.
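The definition above translates directly into a short scan over the six reading frames. The following is a minimal sketch under stated assumptions: the input is DNA, so the codons are ATG, TAA, TAG and TGA rather than their RNA forms; the function names and the test sequence are hypothetical; and no minimum ORF length is applied, unlike in practical ORF finders.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse strand read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for each ORF in one frame; end includes the stop codon."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                 # first start codon opens the ORF
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3        # stop codon closes it
            start = None


def find_orfs(dna):
    """Scan 3 forward (+) and 3 reverse (-) frames; return (strand, frame, orf)."""
    dna = dna.upper()
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame):
                found.append((strand, frame + 1, seq[start:end]))
    return found


orfs = find_orfs("CCATGAAATAGGG")
print(orfs)
```

Every ORF returned has a length divisible by 3, because the scan only ever moves in whole codons from the start codon.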
7.2.7. Finding scientific papers
Searching the scientific literature is an essential part of a researcher's work.
Articles and research papers published in journals are also organized into databases
where researchers can find and download them, either free of charge or for a fee.
Many databases archive scientific journals from different fields. PubMed, hosted at
NCBI in the United States, is one of the databases that index journals in the life
sciences, biomedicine and related disciplines. The number of article records in
PubMed has reached tens of millions (see the section on the PubMed database), and
PubMed links to thousands of journals. Large universities, institutes and research
centres usually buy accounts or portals that allow their members to access and
download papers. In Vietnam, users can also buy an account to access online journal
platforms such as ScienceDirect and SpringerLink and download papers. Within NCBI,
PMC is the database that holds articles available for free download.
7.2.8. Sequence assembly
Sequencing has become straightforward, and the cost of sequencing a whole genome
keeps falling. The hard problem is now assembling the individual DNA sequences into a
complete genome. The principle of assembly is simple: it relies on overlaps between
DNA fragments that share identical stretches of sequence. When the genome (the
chromosomes) is cut at random, the cuts fall at random positions and produce a large
number of fragments. After the short fragments have been sequenced, they must be
joined back together by overlapping them to find the regions of identical sequence.
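The overlap principle can be illustrated with a toy greedy assembler. This is a sketch only, assuming error-free reads from a single strand and exact suffix-prefix overlaps; real assemblers must handle sequencing errors, repeats and both strands. The function names, the `min_len` threshold and the three reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: the reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(contigs)
```

Here the first two reads share the four bases GCGT and the last two share TACC, so the three reads collapse into one contig.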
analysis and comparison of biological sequences and of whole genomes. Comparative
studies of ribosomal DNA sequences, cytochrome c, mitochondrial genes and the gene
encoding ribulose-1,5-bisphosphate carboxylase oxygenase (RuBisCO) are now widely
used to identify and classify organisms and assign them to taxonomic units (taxa).
Comparisons of the sequences or structures of macromolecules show that DNA, RNA or
protein molecules with similar sequences have similar or identical structures and
carry out the same function. During evolution, changes in biological sequences can
arise at random, either from within the organism itself or under the influence of
mutagens. Such sequence changes occur at random throughout the genome of each
individual. Under the pressure of external conditions, these changes bear directly on
the organism's ability to adapt and survive. This process changes allele frequencies
in the population and lays the foundation for the formation of new species. From an
evolutionary point of view, new species arise from the ancestral species most closely
related to them. Comparing genome sequences, or particular genes, can therefore help
to establish evolutionary relationships and the position of an organism in the
classification system.
Multiple sequence alignment is the main tool for assessing changes in DNA and protein
sequences, and all phylogenetic analysis software is built on sequence alignment.
Widely used programs include Mega5, ClustalX combined with Treeview, and Phylip
(University of Washington,
http://evolution.genetics.washington.edu/phylip/software.html). Tree-building methods
fall into two groups: the first is based on distances (distance based methods) and
the second on the shared characters of the sequences (character based methods). The
first group includes UPGMA, Neighbor Joining (NJ), Weighted Neighbor-Joining
(Weighbor), Fitch-Margoliash (FM) and Minimum Evolution (ME). The second group
includes Maximum Parsimony (MP) and Maximum Likelihood (ML).
These tool groups let researchers align DNA, RNA and protein sequences (pairwise or
multiple alignment).
Summary of chapter 7
1. Analysing biological sequences and molecular structures requires supporting tools
or software. The analysis tools cover: (i) searching for similar sequences, (ii)
identifying functional and conserved regions, (iii) sequence alignment, (iv)
restriction mapping, (vi) analysing the physical and chemical properties of
proteins and predicting secondary structure, tertiary structure and protein
interactions, (vii) analysing evolutionary relationships, (viii) comparing genomes
and finding genes in genomes.
2. Sequence alignment is a step in the search for similar and homologous sequences
and is regarded as the core problem of bioinformatics. From an alignment one can
infer information about a DNA fragment or protein and find relationships between
genes, gene families or functional regions. BLAST is a fast and effective tool for
finding and analysing homologous sequences by local alignment. There are many
variants of BLAST for different purposes.
3. The ExPASy database and its analysis tools support proteomics research, including
the analysis of physical and chemical properties and the prediction of structure
and protein interactions. Other analysis tools, for finding open reading frames,
restriction mapping, phylogenetic analysis and so on, also support researchers.
Review questions for chapter 7
1. What is a tool for analysing biological databases? How many groups of analysis
tools are there? Give examples of tools in each group and their specific uses.
2. Which databank and which tool are used to find and copy a gene sequence? What is
the FASTA format for biological sequences?
3. What are similar sequences and homologous sequences? Which tool group is used to
search for them, and what is such a search used for?
4. Describe the BLAST family of tools and its applications.
5. Which tools belong to the specialized BLAST group? Give examples.
6. Describe the EMBL tool groups. For each group choose one typical tool and give an
example of its use.
7. What is ExPASy? Describe its tool groups and their applications.
8. What is sequence alignment? Which tools perform multiple sequence alignment, and
what is multiple alignment used for?
9. What is the restriction map of a DNA fragment? Which tools build restriction
maps? What are restriction maps used for?
10. Why is it necessary to predict protein structure? What approaches are used to
predict it? Which tools support prediction of protein secondary structure?
11. Which tools support the design of primers or probes for nucleic acid
hybridization techniques? Choose one tool and discuss its use in designing
primers and probes.
12. What is an open reading frame? Why is it necessary to identify open reading
frames? Which tools support this analysis?
CHAPTER 8
Summary of chapter 8
Biological data include sequence data, structure data and other kinds of data such as
articles, books, genotype and phenotype data, chemical compounds and metabolic
pathways. They are stored in the gene banks, NCBI, PDB, ExPASy and elsewhere.
Sequence analysis consists of comparisons made through sequence alignment. Assessing
the similarity and homology between sequences is important for analysing structure,
function and evolutionary relationships. Sequence comparison reveals mutations, SNPs,
conserved regions and so on. Protein sequence analysis and the prediction of
secondary and tertiary structure are supported by the ExPASy tools and the Protein
Data Bank (PDB). Structure comparison helps to predict function and interactions
between molecules. Open reading frame analysis helps to identify coding regions and
genes. The search for functional regions of proteins (motif, pattern, domain) is
supported by many tools, such as NCBI's Conserved Domain and Protein Classification
and Motif Scan. Identifying promoter sequences and the sequences involved in the
regulation of gene expression is a basis for locating genes in a genome.
Review questions for chapter 8
1. How do you find the nucleotide sequence of a gene, or the amino acid sequence of
a protein of interest, in a databank?
2. What is the accession number of a gene or protein? How is an accession number
obtained?
3. What is ontology? Why is ontology important?
4. Which databases hold protein structure data? Give examples.
5. What is sequence comparison? Why is it necessary to compare sequences?
6. State the evolutionary basis of sequence comparison and give an example.
7. Distinguish between the concepts of homology, similarity and identity.
8. What is an open reading frame? On what basis is it identified? What is the
significance of open reading frame analysis?
9. What is a promoter? How can the promoter sequence controlling the activity of a
gene in a eukaryotic organism be identified?
10. What is a coding sequence (CDS)? How are open reading frames and CDSs related?
11. On what basis are transcriptional regulatory sequences (TF) identified? What is
the significance of identifying TFs?
12. What is a functional region of a protein? On what basis can the functional
regions of a protein be identified?
13. Show by analysis the close relationship between the structure and the function
of a protein.
14. What is a molecular interaction? On what basis are molecular interactions
predicted or simulated?
CHAPTER 9
SEQUENCE ALIGNMENT AND ITS PRINCIPLES
9.1. Introduction to sequence alignment
The goal of pairwise alignment is to find the best possible matching of the
nucleotides or amino acids of two sequences. Suppose the two sequences to be compared
are GAATTCAG and GGATCGA. Many alignments are possible. Case (A) has 5 identical
nucleotides (matches), 1 mismatched nucleotide (mismatch) and 2 gaps. Case (B) is
similar, with 5 matches, 1 mismatch and 2 gaps. Case (C) has 5 matches, 1 mismatch
and 1 gap in each sequence. Case (D) is similar, but the gaps are in different
positions. Which of these cases is the correct alignment? There is no right or wrong
answer, only the result that fits best under a chosen criterion.
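One way to make "best under a chosen criterion" concrete is global alignment by dynamic programming (the Needleman-Wunsch algorithm). The sketch below is illustrative only: the scoring values (match +1, mismatch -1, gap -2) are assumptions, not values from these notes, and a different scoring scheme can select a different one of the candidate alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # prefix of a against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # prefix of b against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("GAATTCAG", "GGATCGA")
print(result[1])
print(result[2])
print("score:", result[0])
```

Changing `gap` or `mismatch` and rerunning shows how the preferred alignment depends on the criterion.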
Gap penalties
Optimizing an alignment usually involves introducing gaps that correspond to deletion
or insertion mutations. Because insertions and deletions are relatively rare in
natural evolution compared with substitutions, gaps must be introduced with great
care. Algorithmically this is not easy, because the gaps have to reflect real
insertion and deletion events during evolution.
The assignment of penalty values is more or less arbitrary, because no evolutionary
theory puts a cost on a deletion or an insertion. If the penalty is too low, the
alignment will contain many gaps, so that even unrelated sequences can be matched
with high similarity and with similar scores. If the penalty is too high, gaps are
suppressed and a reasonable alignment becomes hard to obtain.
Many empirical studies on globular proteins have tried different penalty values in
order to find the most suitable alignment settings. These penalty values are usually
the defaults in most alignment programs. Another important factor is the position of
a gap within a run of consecutive gaps, as arises from the deletion of a segment. The
first gap position introduced is called the opening gap (or introducing gap), the
following positions are called extending gaps, and the last is called the closing
gap. Adding a gap position after an opening gap is clearly easier than deciding where
to introduce the first one. The first gap position should therefore be penalized more
heavily than the following ones. For example, with the scheme -12/-1 the first gap
position is penalized 12 points and each following position 1 point. The total
penalty W is calculated as $W = \gamma + \delta (k - 1)$, where $k$ is the length of
the gap, $\gamma$ is the penalty for introducing the gap and $\delta$ is the penalty
for extending it.
Other penalty schemes can also be used, for example a constant gap penalty, in which
every gap position receives the same penalty wherever it lies. The penalty values can
also be changed to suit the user's purpose.
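The two schemes can be compared numerically. This is a small sketch using the -12/-1 example from the text as default values; the function names are hypothetical, and penalties are returned as positive numbers to be subtracted from the alignment score.

```python
def affine_gap_penalty(k, opening=12, extension=1):
    """Total penalty W = opening + extension * (k - 1) for a gap of length k."""
    if k <= 0:
        return 0
    return opening + extension * (k - 1)


def constant_gap_penalty(k, per_position=2):
    """Every gap position costs the same, wherever it lies."""
    return per_position * max(k, 0)


for k in (1, 2, 5, 10):
    print(k, affine_gap_penalty(k), constant_gap_penalty(k))
```

With the affine scheme, one gap of length 5 costs 16, whereas five separate single-position gaps cost 60, so the alignment is steered toward a few long gaps rather than many short ones.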
c) Word methods
Word methods, also called k-tuple methods, are heuristic: they are not guaranteed to
find an optimal alignment, but they are much more efficient than dynamic programming.
They are especially useful for long sequences or large numbers of sequences, and are
best known from the database search tools of the FASTA and BLAST families. Word
methods take a series of short, non-overlapping words from the query sequence and
match them against the sequences in the database. The relative positions of a word in
the two sequences are subtracted to give an offset, and a region of alignment is
indicated when several different words give the same offset. Only the regions found
in this way are then processed with more sensitive alignment settings. All other
sequences are discarded immediately, which greatly shortens the analysis time.
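The offset idea can be sketched as follows. This is an illustration of the general word-method principle, not the actual FASTA or BLAST implementation: the function names, the word length `k=4` and the two test sequences are hypothetical.

```python
from collections import Counter, defaultdict


def index_words(subject, k):
    """Map every k-letter word of the subject to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def offset_counts(query, subject, k=4):
    """Count, for each offset, how many non-overlapping query words hit the subject."""
    index = index_words(subject, k)
    counts = Counter()
    for q_pos in range(0, len(query) - k + 1, k):    # non-overlapping query words
        word = query[q_pos:q_pos + k]
        for s_pos in index.get(word, []):
            counts[s_pos - q_pos] += 1               # same offset = same diagonal
    return counts


counts = offset_counts("ACGTTGCA", "TTTACGTTGCATT", k=4)
print(counts.most_common(1))
```

Both query words, ACGT and TGCA, hit the subject with offset 3, which marks the region starting at subject position 3 as the candidate for a more careful alignment.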
In the FASTA method the user can set a value k as the word length. Adjusting k is
particularly useful when searching for sequences
Tblastn: compares a protein query sequence against a DNA database in all 6 reading
frames. In other words, the DNA database is translated into amino acid sequences in
all 6 frames and then compared with the protein query.
Tblastx: compares the protein sequences encoded by a DNA query with the protein
sequences encoded by a nucleotide database. The number of comparisons is very large,
because the DNA query yields 6 protein sequences and each database sequence is also
translated in 6 reading frames. That gives 36 combinations in total, so this method
is demanding in time and computing resources.
BLAST variants

Query length                Database    Purpose                          Program
20 bp or longer (28 bp or   Nucleotide                                   discontiguous megablast,
longer for megablast)                                                    megablast or blastn
7 to 20 bp                  Nucleotide  Find primer binding sites or     Search for short, nearly
                                        short contiguous motifs          exact matches
                            Peptide
Megablast:
This tool determines whether an unknown sequence is already present in the database.
It is one of three tools with similar functions: MEGABLAST, discontiguous megablast
and blastn. MEGABLAST is designed specifically for long sequences and for finding
sequences with very high similarity. In addition to parameters such as the cut-off
(a threshold on the expect value), the program lets the user set the percentage
identity required between the hits and the query. Megablast can also search with
several query sequences at once.
Discontiguous megablast:
This tool finds more divergent sequences by cutting the query into short sequences
called words. The program finds exact matches to the query words, called word hits,
and then extends them in several steps to produce a final alignment that may contain
gaps. The larger the initial word size (or simply word size), the narrower the
search, which then returns only sequences with very high similarity, and vice versa.
Discontiguous megablast uses a word size of 11, whereas blastn can use lower values,
down to a minimum of 7.
Blastn: optimized for speed rather than sensitivity. The search returns sequences
ranked from high to low similarity to the query.
[Table: BLAST parameter settings (columns Database, Purpose, Program; includes RPS-BLAST, "Search for short, nearly exact matches" and "Find motifs"). Recoverable entries:]

Program                                  Word size  DUST setting
Standard blastn                          11         On
Search for short, nearly exact matches   7          Off

Filter setting  Expect value  Score matrix
On              10            BLOSUM62
Off             20000         PAM30
     A     B     C     D
A    0
B    0.1   0
C    0.3   0.5   0
D    0.6   0.6   0.66  0
The pairs A:A, B:B, C:C and D:D have a distance of 0, so the table is rewritten
without them. In this table the distance between sequences A and B is the smallest
(0.1), so A and B are joined into one pair.
The value d(AB)C/2 = 0.4/2 = 0.2 is the evolutionary distance from a common ancestor
to AB and to C.
From the calculated distances, d(AB)C < d(AB)D, so sequence C is grouped with AB to
form the group ABC.
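The clustering steps described here can be reproduced with a short UPGMA sketch. The distances below are an assumption: they are read from the partly damaged distance table in these notes and chosen to be consistent with the values quoted in the text (A:B = 0.1, d(AB)C = 0.4, d(AB)C < d(AB)D). The function name and data layout are hypothetical.

```python
from itertools import combinations


def upgma(labels, distances):
    """Return the merges as (cluster_x, cluster_y, distance, branch_height)."""
    sizes = {name: 1 for name in labels}       # cluster name -> number of sequences
    d = dict(distances)                        # frozenset({x, y}) -> distance
    merges = []
    while len(sizes) > 1:
        # Join the two closest clusters.
        x, y = min(combinations(sorted(sizes), 2), key=lambda p: d[frozenset(p)])
        dxy = d[frozenset((x, y))]
        nx, ny = sizes.pop(x), sizes.pop(y)
        new = x + y
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((new, other))] = (
                nx * d[frozenset((x, other))] + ny * d[frozenset((y, other))]
            ) / (nx + ny)
        sizes[new] = nx + ny
        merges.append((x, y, dxy, dxy / 2))
    return merges


# Assumed pairwise distances for the four sequences A, B, C and D.
pairwise = {
    frozenset("AB"): 0.1, frozenset("AC"): 0.3, frozenset("BC"): 0.5,
    frozenset("AD"): 0.6, frozenset("BD"): 0.6, frozenset("CD"): 0.66,
}
merges = upgma("ABCD", pairwise)
for step in merges:
    print(step)
```

Under these assumed distances A and B are joined first, C then joins AB at distance 0.4 (branch height 0.2), and D joins last.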
Because the calculated distances depend on the method used, the same data can give
different results with different algorithms. In practice the distance values must be
corrected with a model such as the Jukes-Cantor model, the Kimura two-parameter model
or the Kimura three-parameter model.
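As an illustration of such corrections, the sketch below implements the standard Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), and the Kimura two-parameter formula, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where p is the observed proportion of differing sites, P the proportion of transitions and Q the proportion of transversions. The function names and the example values are hypothetical.

```python
import math


def jukes_cantor(p):
    """Corrected distance from the observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(transitions, transversions):
    """Corrected distance from the proportions of transitions (P) and transversions (Q)."""
    P, Q = transitions, transversions
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


p = 0.1
print(round(jukes_cantor(p), 4))
print(round(kimura_2p(0.06, 0.04), 4))
```

The corrected distance is always at least as large as the observed proportion, because the models account for multiple substitutions at the same site.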
NJ (Neighbour Joining)
The NJ method is basically similar to UPGMA. NJ applies the principle of minimum
evolution. Its name refers to the way neighbouring groups are joined. The advantage
of this method is that it does not assume the same rate of evolution on all branches
of the tree.
Summary of chapter 10
A phylogenetic tree is a diagram of the evolutionary relationships between organisms.
Depending on the type of data and the method of analysis, it can be built from
morphological data, distribution data, physiological and biochemical data, or
nucleotide and amino acid sequences. A phylogenetic tree must show the evolutionary
relationships and branching order of the taxonomic units (OTUs) and the amount of
change or the evolutionary time. Morphological, distributional, physiological and
biochemical data often do not fully reflect the degree of evolution. Biological
sequence data are now widely used to evaluate and build phylogenetic trees.
When building a tree from sequence data, attention must be paid to the choice of
sequences, the evolutionary model, the molecular clock and the method for testing the
reliability of the tree. Depending on how similar or divergent the sequences are, a
distance-based or a character-based method can be chosen. UPGMA and NJ are commonly
used in the first group; the second group includes ML and MP. A tree can be rooted by
midpoint rooting or with an outgroup. An outgroup is usually more effective, but it
must be closely related to all the taxa analysed and yet clearly separate from them
as a distinct group. On that basis the taxa analysed can be placed in their correct
evolutionary positions.
In theory any biological sequence can be used for phylogenetic analysis. In most
cases, however, the analysis of some sequences