
HANOI UNIVERSITY OF AGRICULTURE

FACULTY OF BIOTECHNOLOGY

Lecture notes
APPLIED BIOINFORMATICS

NGUYỄN ĐỨC BÁCH

HANOI, 8/2013

PART 1. GENERAL INTRODUCTION

CHAPTER 1. INTRODUCTION TO BIOINFORMATICS
1.1. Concepts
1.2. Biological foundations and the development of bioinformatics
1.3. The role of bioinformatics in biological research
1.4. Tasks and research directions of bioinformatics
1.5. Development trends in bioinformatics
Chapter 1 summary
Chapter 1 review questions

CHAPTER 2. BIOLOGICAL FOUNDATIONS OF BIOINFORMATICS
2.1. Nucleic acids and proteins
2.2. Structure of nucleic acids
2.3. Genomes and genome research
2.4. Gene discovery and determination of gene function in the genome
2.5. Gene function and the regulation of gene activity
2.6. The proteome and protein research (proteomics)
2.7. Evolution and the molecular nature of evolution in organisms
2.8. Analysing evolutionary relationships among organisms
Chapter 2 summary
Chapter 2 review questions

CHAPTER 3. SEARCHING FOR AND MANAGING RESEARCH LITERATURE
3.1. Methods for searching for information
3.2. How to find literature for research
3.3. Getting to know PubMed
3.4. How to manage research literature
Chapter 3 summary
Chapter 3 review questions

PART 2. BIOLOGICAL DATABASES AND SEQUENCE SUBMISSION TO DATABASES

CHAPTER 4. BIOLOGICAL DATABASES
4.1. Primary databases
4.1.1. Nucleotide sequence databases
4.1.2. Protein sequence databases
4.1.3. Molecular structure databases
4.2. Secondary databases
4.3. Other databases
4.3.1. Genotype and phenotype databases
4.3.2. Genotype database (PhenomicDB)
4.3.3. PubChem
4.4. Gene banks
Chapter 4 summary
Chapter 4 review questions

CHAPTER 5. SEQUENCING AND SUBMITTING SEQUENCES TO GENE BANKS
5.1. Nucleotide sequencing
5.2. Genome sequencing
5.3. Sequence assembly
5.4. Sequence submission
5.5. Sequence submission tools
5.5.1. Information to prepare before submitting a sequence
5.5.2. Example: submitting a sequence with WebIn
5.5.3. Example: submitting a sequence with Sequin
Chapter 5 summary
Chapter 5 review questions

PART 3. ANALYSIS TOOLS: MINING AND PROCESSING BIOLOGICAL SEQUENCE DATA

CHAPTER 6. GENOME BROWSERS
6.1. The genome browser concept
6.2. Introduction to some important genome browsers
6.2.1. Ensembl
6.2.2. UCSC
6.2.3. NCBI Genomes and MapViewer
6.3. Features and applications of genome browsers
Chapter 6 summary
Chapter 6 review questions

CHAPTER 7. GETTING TO KNOW THE TOOLS FOR ANALYSING BIOLOGICAL DATABASES
7.1. Getting to know the basic analysis tools
7.1.1. Finding and copying sequences
7.1.2. Tools for finding similar sequences
7.2. Finding functional and conserved regions
7.2.1. Multiple sequence alignment
7.2.2. Restriction map construction
7.2.3. Predicting the secondary and tertiary structure of proteins
7.2.4. Nucleic acid sequence analysis
7.2.5. Designing PCR primers and nucleic acid hybridization probes
7.2.6. Identifying open reading frames
7.2.7. Finding scientific articles
7.2.8. Sequence assembly
7.2.9. Analysing evolutionary relationships
7.2.10. Protein analysis
7.2.11. Gene expression studies
7.3. Groups of analysis tools
7.3.1. NCBI analysis tools
7.3.2. EMBL tools
7.3.3. ExPASy tools
7.3.4. Other tools
Chapter 7 summary
Chapter 7 review questions

CHAPTER 8. GETTING TO KNOW BIOLOGICAL DATA ANALYSIS
8.1. Finding data in the databases
8.1.1. Sequence data
8.1.2. Structure data
8.1.3. Other data
8.2. Sequence analysis
8.2.1. Sequence comparison
8.2.2. Analysing open reading frames and coding regions
8.2.3. Searching for promoters and gene regulatory regions
8.2.4. Searching for functional regions of proteins (functional motif searching)
8.2.5. Predicting and modelling protein interactions

CHAPTER 9. SEQUENCE ALIGNMENT AND ITS PRINCIPLES
9.1. Introduction to sequence alignment
9.2. Principles of sequence alignment
9.3. Multiple sequence alignment and its principles
9.4. Tools for finding homologous sequences

CHAPTER 10. ANALYSING EVOLUTIONARY RELATIONSHIPS
10.1. Concepts
10.2. Data used to build phylogenetic trees
10.2.1. Distance-based methods
10.2.2. Character-based methods
10.3. Choosing an evolutionary model
10.4. Evaluating phylogenetic trees

PART 1. GENERAL INTRODUCTION

CHAPTER 1. INTRODUCTION TO BIOINFORMATICS

1.1. Concepts

Bioinformatics is the science that applies mathematics and computer science to biology, in particular to molecular biology and medicine. The term was first introduced by Paulien Hogeweg in 1979 to describe the study of processes in biological systems. In the late 1980s it was adopted in genetics and genome research. Bioinformatics covers sequencing and the management, analysis and mining of biological databases. Today it involves building and developing databases, algorithms, statistics and computing techniques to solve theoretical and experimental problems in the management and analysis of biological data. It also includes the simulation and prediction of interactions between molecules and of biological processes.

Figure 1: Bioinformatics and its links to related fields


1.2. Biological foundations and the development of bioinformatics

The discovery that DNA is the carrier of genetic information, and the determination of a structural model for DNA, opened the era of molecular biology. DNA encodes mRNA and the other classes of RNA. Proteins translated from mRNA carry out many biological functions in the cell, including the regulation of gene activity and of biological processes. Although sequencing the genome of an organism has now become straightforward, elucidating the genetic information held in the genome, the functional activity of genes and the interactions among them remains a major challenge. In humans, for example, each cell contains 23 pairs of chromosomes and a genome of about 3.2 × 10^9 base pairs containing about 23,000 genes (1). Transcription and translation are by now understood in outline, but determining exactly how many genes there are, where they lie and how they interact is still a hard question.

(1) International Human Genome Sequencing Consortium (2004). "Finishing the euchromatic sequence of the human genome". Nature 431 (7011): 931-945.

With the rapid development of new techniques and technologies, biological data, mainly nucleotide and amino acid sequences, are being produced in ever larger quantities every day. Collecting and storing these data, providing access to them, and searching, analysing and comparing the relationships among records in very large databases are the tasks of bioinformatics. In practice, bioinformaticians and computer scientists must keep developing new algorithms to improve accuracy and to reduce the time biological researchers spend on analysis.

Bioinformatics is a multidisciplinary field. To a certain extent it rests on molecular biology (the source of the data to be analysed), on computer science (which provides the hardware for analysis and the computer networks for comparing and cross-checking results) and on data analysis algorithms. These three elements are vital to bioinformatics. Molecular biology is itself a relatively new field built on several basic sciences, the most important being genetics, biochemistry and cell biology. For this reason, creating, studying and applying bioinformatics requires basic interdisciplinary knowledge together with an understanding of computer science. Some important historical milestones in the development of molecular biology and bioinformatics are listed below.
Year  Event
1930  Tiselius introduces electrophoresis for analysing proteins in solution.
1951  Pauling and Corey propose the alpha-helix and beta-sheet structures.
1953  Watson and Crick propose the double-helix model of DNA, based on the X-ray diffraction data of Franklin and Wilkins.
1954  Perutz's group develops heavy-atom methods to solve the phase problem in protein crystallography.
1955  The first protein sequence to be analysed, bovine insulin, is determined by F. Sanger.
1970  The Needleman-Wunsch algorithm for sequence alignment is published.
1972  The first recombinant DNA molecule is created by Paul Berg and his group.
1973  The Brookhaven Protein Data Bank is announced.
1974  Vint Cerf and Robert Kahn develop TCP, the communication protocol that underlies the internet.
1975  Two-dimensional electrophoresis is developed by P. H. O'Farrell.
1975  The Southern blot method is described and published by E. M. Southern.
1977  The protein database PDB is formally launched.
1977  Allan Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical Research Council) publish methods for DNA sequencing.
1980  The complete genome sequence of an organism (phage ΦX174) is published. The genome has 5,386 base pairs encoding 9 proteins.
1980  Multi-dimensional NMR is used to determine protein structure.
1981  The Smith-Waterman algorithm for sequence alignment is published.
1982  The Genetics Computer Group (GCG) creates a suite of molecular biology analysis tools at the University of Wisconsin Biotechnology Center.
1985  The FASTP algorithm is published.
1985  The PCR reaction is described by Kary Mullis and co-workers.
1986  The term "genomics" appears for the first time, describing the scientific discipline of mapping, sequencing and analysing genes. It was coined by Thomas Roderick and later became the name of a well-known journal, Genomics.
1986  The SWISS-PROT database is created by the Department of Medical Biochemistry of the University of Geneva and the European Molecular Biology Laboratory (EMBL).
1987  The yeast artificial chromosome (YAC) is introduced.
1987  The physical map of E. coli is published.
1987  The Perl programming language is developed by Larry Wall.
1988  NCBI (National Center for Biotechnology Information) is established at the National Library of Medicine, National Institutes of Health.
1988  The Human Genome Project is initiated (Commission on Life Sciences, National Research Council. Mapping and Sequencing the Human Genome, National Academy Press: Washington, D.C., 1988).
1988  The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
1988  Des Higgins and Paul Sharp announce the development of the CLUSTAL program.
1990  The BLAST program is released (Altschul et al.).
1990  Molecular Applications Group is founded in California by Michael Levitt and Chris Lee. Its products, Look and SegMod, are used to design molecular and protein models.
1990  InforMax is founded in Bethesda, MD. Its products are software for sequence analysis, database management and analysis, searching, publication-quality graphics display, clone construction, mapping and primer design.
1991  The research institute in Geneva (CERN) announces the creation of the protocols that make up the World Wide Web.
1997  The genome of E. coli (4.7 Mbp) is published.
1998  The genomes of Caenorhabditis elegans and baker's yeast are published.
1998  The Swiss Institute of Bioinformatics is established as a non-profit research foundation.
2000  The genome of Pseudomonas aeruginosa (6.3 Mbp) is published.
2000  The genome of Arabidopsis thaliana (100 Mb) is sequenced.
2000  The genome of Drosophila melanogaster (180 Mb) is sequenced.
2001  The human genome (3,000 Mbp) is published.
2004  The draft genome of the rat, Rattus norvegicus, is published.
2004  Next-generation sequencing formally begins with the 454 sequencing technique.
2008  The 1000 Genomes Project is launched (http://www.1000genomes.org/).

1.3. The role of bioinformatics in biological research

Over the last few decades, genomics and molecular biotechnology have developed rapidly and produced a very large volume of information that serves as the basis for comparative analysis and cross-referencing. Analysing these databases requires algorithms combined with computer science. Through the close integration of databases, algorithms and computer science, bioinformatics helps to elucidate the nature of biological processes. Its role can be summarized as follows:
- Collecting, organizing and managing biological data (databases);
- Developing data search tools (search tools, data mining);
- Sequence analysis, genome annotation and genomic comparison;
- Modelling structures, modelling molecular interactions and predicting protein structure;
- Protein function analysis, protein interactions and metabolic pathways, modelling biological systems and analysis of gene expression profiles;
- Analysing genome sequences to detect genes, mutated genes and cancer, to determine the roles of genes and to move towards therapies (genome analysis and treatment);
- Analysing evolutionary relationships and population genetics using software and computational tools;
- High-throughput image analysis;
- Developing algorithms and software that meet the needs of scientists in biology.

Sequence analysis

Sequence analysis is a process made up of many operations: searching for sequence data, comparing sequences with one another, and combining the results with other tools to find the information contained in the sequence under study. The information obtained includes similarity, functional regions (domains), characteristic regions (motifs), the positions of genes in the genome (gene finding) and the elements that regulate gene activity (promoters, introns, exons, transcription regulatory regions).
In 1977 the first genome to be sequenced was that of phage ΦX174. Since then the genomes of thousands of organisms have been sequenced and stored in gene banks. Many important bioinformatics tools and programs that support the analysis and comparison of biological sequences have been developed and are widely used.
Genome annotation
In genome research, the process of marking DNA sequences and attaching biological information to them is called annotation. The first software system for genome annotation was built by Dr. Owen White in 1995, and its first subject was the bacterium Haemophilus influenzae. The initial goal of the system was to find the genes, tRNAs and other elements in the genome and then to assign known biological functions to them. Many genome annotation systems have been developed since. They are fundamentally similar but differ in their algorithms and computer programs.
Genome comparison
The focus of genome comparison is to identify similarities or relationships between genes (orthology analysis) or shared features in the genomes of different organisms. Comparisons are displayed as maps of correspondence between genomes, which make it possible to detect the events and the extent of genome change during evolution that led to differences between genomes, between gene regions or between genes.
Complex evolutionary events occur at many levels and together drive genome evolution. At the lowest (molecular) level, point mutations change the genome at single nucleotides. Such a change can be seriously harmful, neutral or without any effect. At a higher level, duplications, inversions, deletions and changes in the position of DNA sequences within chromosomes (jumping genes, transposable elements) alter the physical organization of the genome. Over time, whole genomes eventually take part in hybridization, polyploidization and endosymbiosis, leading to speciation. The complexity of genome evolution makes it difficult to develop algorithms and mathematical models that simulate it exactly. For this reason the algorithms used in bioinformatics are generally heuristic, giving the most plausible answer, rather than exact. The algorithms and models in common use today include heuristics, approximation algorithms, parsimony models, Markov chain Monte Carlo algorithms, Bayesian analysis and probabilistic models.
Building and modelling structures
Predicting the structure of protein molecules is one of the important applications of bioinformatics. The amino acid sequence of a protein can be determined directly or inferred from the nucleotide sequence of the gene that encodes it. Modelling a structure requires specific information about the protein, ideally its crystal structure. When the protein is hard to crystallize, or only the amino acid sequence is available, the sequence of the protein or polypeptide can be compared with known proteins in the databases, using algorithms that detect similarity, and an approximate model of the unknown protein can be derived from the matches. As a rule, sequences that are more than 40% identical can be used for structure prediction. Although sequence similarity and structure are closely correlated, in many cases proteins with similar structures have different amino acid sequences. Determining or modelling a structure therefore cannot rely on algorithms or computer programs alone. In many cases modelling is used only for screening and as a reference.
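The 40% rule of thumb above can be illustrated with a short calculation. The following is a minimal Python sketch, not a method taken from these notes: it assumes the two sequences have already been aligned (gaps written as "-"), and the function names percent_identity and usable_as_template are hypothetical. Tools differ in how they choose the denominator; here every column that is not a gap in both sequences is counted.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of compared alignment columns in which the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue  # a column that is a gap in both sequences carries no information
        compared += 1
        if x == y:
            matches += 1  # x cannot be a gap here, because y would be one too
    return 100.0 * matches / compared if compared else 0.0


def usable_as_template(aligned_a: str, aligned_b: str, threshold: float = 40.0) -> bool:
    """Apply the rule of thumb: identity above the threshold (assumed 40%)."""
    return percent_identity(aligned_a, aligned_b) > threshold


if __name__ == "__main__":
    # Toy sequences made up for illustration only.
    print(percent_identity("MKTAYIAK", "MKTAHIAK"))  # 87.5
    print(usable_as_template("AAAAA", "AGGGG"))      # False (20% identity)
```

A high identity only suggests that a known structure may serve as a template; as the text notes, the result is best treated as a screening aid.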
The similarity between human haemoglobin and the haemoglobin of legumes (leghemoglobin) is one example of the relationship between sequence and structure. Both proteins transport oxygen. Their amino acid sequences are very different, yet their structures are remarkably alike. This also reflects the relationship between structure and function.
Modelling molecular interactions
Modelling molecular interactions means building models that describe what happens when two or more molecules come into contact. Information about an interaction includes its position, the interacting groups and the mechanism by which the interaction forms. Molecular interactions involve thermodynamic changes and changes in molecular state (changes in charge, shifts of bonding groups, changes in conformation and spatial geometry). Typical examples are protein-protein or protein-peptide, enzyme-substrate and ligand-partner interactions. The term in common use today is docking, and the corresponding algorithms are called docking algorithms.
Supporting techniques include circular dichroism (CD), X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). One important question is whether analysing the 3D structure of molecules is enough to predict their interactions, or whether specific experiments must be carried out for each protein pair (protein-protein interaction experiments) or protein-protein docking performed.
Prediction of protein structure
Protein structure prediction relies on information such as the amino acid sequence, mass spectrometry (MS) results, crystallization and X-ray diffraction analysis, and homologous biological features (similarity based on carrying out the same biological function, or enzymes that catalyse the same type of reaction or act on the same group of substrates).
The algorithms are all based on calculating chemical bonds, the propensity to form bonds, interactions between molecules, thermodynamics, free energy and bond energy in order to build models of the spatial structure. At present, however, analysing and comparing known structures and functions is still regarded as the foundation of protein structure prediction. New proteins whose structure has not been determined are therefore usually predicted by sequence comparison combined with physical and chemical properties.
Analysis of gene expression
Databases of mRNA, cDNA and EST sequences help to detect whether genes are expressed and at what level. Protein microarray and mass spectrometry (MS) databases play a very important role in analysing or detecting the presence of a given protein in a biological sample. Comparing and cross-referencing these databases shortens research time. The process often becomes complex, however, when large numbers of samples are processed (high-throughput analysis) and the data are noisy because of experimental error.
From genome sequence analysis to therapy
One of the main causes of cancer is the accumulation of mutations. Analysing many sequences makes it possible to identify hidden mutations in cancer-related genes. Bioinformatics builds automated analysis systems to manage and store this information, which in turn support searching, comparing and cross-referencing genes and genomes to detect polymorphism (for example the databases dbVar, dbSNP and Cancer Chromosomes). The results of these analyses make diagnosis and treatment easier. A typical example is the development of different drugs tailored to each individual.
New techniques now in use include comparing nucleotide sequences to detect differences at the level of single nucleotides, in order to find point mutations (single-nucleotide polymorphism arrays) at many positions and in many sequence regions of the genome. The algorithms currently used include hidden Markov models and change-point analysis methods.
Computational evolutionary biology
Evolutionary research includes determining the evolutionary origin of species as well as how species change and new species arise over time. Information technology and bioinformatics support biologists in many ways, including:
- Detecting evolution by comparing DNA sequences and finding changes in them, rather than relying heavily on morphological change.
- Comparing whole genomes, which makes it possible to study complex evolutionary events such as duplication, exchange of genetic material or the acquisition of part of the genetic material of another species (for example horizontal gene transfer, including transformation, transfection, transduction, symbiosis, genome recombination and gene transfer).
- Building computer models to predict the course and the outcomes of populations over time.
- Tracking and sharing information on large numbers of species and individuals.
- Building an overall picture of the phylogenetic tree.

Image analysis

Modern computing, together with large-scale automated experiments, produces very large numbers of images of very large size. In addition, images that carry a great deal of information, such as images of samples and diseased tissue and medical and clinical images, must be analysed carefully at many levels. Storing these images is valuable when they need to be compared and cross-referenced to extract information for diagnosis and treatment. Some examples of bioinformatics applications in image processing and analysis:
- Quantitative analysis of features within images, such as organelles, the size, shape and distribution of molecules, or tomography results for tissues and organs.
- Determining real-time models and patterns of airflow in animal lungs and of the transport of substances across cell membranes and tissues (drug delivery).
- Predicting the size of particles and clots that form during surgery (real-time imagery) and following recovery after arterial injury.
- Analysing infrared images to determine metabolic activity.
- Analysing fluorescence images, for example in next-generation sequencing techniques, fluorescent labelling techniques and real-time analysis.
Protein function analysis
MS, sequence, structure, protein-protein interaction and protein docking databases are the foundation of protein function analysis. Sequence comparison and alignment are very effective for detecting motifs, domains and patterns, and hence for discovering and analysing protein function. Protein families, or proteins that carry out the same function, are also identified on the basis of these comparisons.
Protein interactions and metabolic pathways
The study of interactions among proteins and enzymes in biological processes has great practical value, for example in finding substrates for enzymes or identifying antigen and antibody proteins. Building models of protein interactions helps to determine the role of each participant and the mechanisms that regulate the expression of the genes in the network. Disorders or changes in these interactions lead to disease. Treatment based on an understanding of the relationships among many factors is expected to be highly effective. This is also a direction on which biologists and bioinformaticians are currently concentrating.
Modelling biological systems
In essence this is the computer simulation of the biological processes that take place in living systems (cells, tissues or whole organisms). It requires a combination of systems biology and mathematical biology. Examples include cellular systems, organelles, and the metabolites and enzymes that form metabolic pathways, signal transduction pathways and gene regulation. All these processes need to be analysed and displayed in the context of the complex of components inside the cell or inside its organelles. In addition, bioinformatics and computational biology make it possible to simulate artificial life in connection with the evolution of organisms.

Development of software and analysis tools

Algorithms and the challenges in computer science
Software and computer programs are built on many algorithms. Their accuracy and processing speed depend on the algorithm and on the computer hardware. Developing new algorithms optimizes analyses, shortens analysis time, minimizes the use of computing resources and increases the reliability of analyses and simulations.
Tools for finding similar and homologous sequences:
Homology: DNA sequences, or the traits under analysis, that share a common origin and are related by descent from a common ancestor. The degree of similarity between two (or more) sequences can be used to decide whether the homology is real or merely due to chance.
The tools in this group determine the similarity between a novel query sequence, whose structure and function are unknown, and the entire database of known sequences. The main tools are FASTA, BLAST and their variants (see the later chapters).
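To make "similarity between a query and the database" concrete, the sketch below scores a query against each database entry with the Needleman-Wunsch global alignment recurrence (the algorithm listed under 1970 in the timeline). It is an illustrative assumption, not how FASTA or BLAST work internally, since those tools use faster heuristics. The scoring values (match +1, mismatch -1, gap -2), the function names and the toy database are all hypothetical.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b (Needleman-Wunsch, linear gaps)."""
    # Row 0: aligning an empty prefix of a against the first j letters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first i letters of a against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def rank_database(query: str, database: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, global_alignment_score(query, seq))
              for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_db = {"s1": "GATTACA", "s2": "GCTTACA", "s3": "TTTT"}  # made-up entries
    for name, score in rank_database("GATTACA", toy_db):
        print(name, score)
```

Scoring every database entry exhaustively like this costs time proportional to the product of the sequence lengths for each comparison, which is the practical reason heuristic search tools exist.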
Protein function analysis:
- Function analysis
Identifying the function of, and mapping, the functional components, including the coding and non-coding parts of genes in the genome. This requires programs and computational tools that compare the query protein sequence with secondary protein databases containing information on motifs and domains. The search returns a list of similar proteins, from which the function of the unknown protein can be predicted.
- Structure analysis
Allows unknown structures to be compared with databases of known structures. The function of a protein can be determined more accurately by comparing its structure than by comparing its amino acid sequence alone, because similar structures usually go together with similar function. Determining the 2D/3D structure of a protein is therefore extremely important for studying its function. The work goes hand in hand with protein purification and crystallization, combined with methods of crystal analysis.
- Sequence analysis
The tools in this group allow deeper analyses of an unknown sequence, including evolutionary analysis, identification of mutations, hydrophilic regions, CpG islands and compositional biases in the use of bases within the genetic code. The results support research aimed at elucidating the function of the unknown sequence.
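Two of the simplest compositional measures mentioned above, base composition and CpG enrichment, can be computed directly from a sequence. The Python sketch below uses the widely used observed/expected CpG ratio, (number of CG dinucleotides × sequence length) / (number of C × number of G). The function names are hypothetical, and no particular tool's thresholds for calling a CpG island are implied.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq: str) -> float:
    """Observed/expected ratio of CG dinucleotides in seq."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CG dinucleotide is possible
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg_count * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    print(gc_content("ATGC"))             # 0.5
    print(cpg_observed_expected("CGCG"))  # 2.0
```

In practice these measures are computed in a sliding window along a genome, so that regions with unusual composition stand out against the background.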

1.4. Tasks and research directions of bioinformatics

In the early period of the genomics revolution, bioinformatics concentrated on gathering and storing biological information and data to form databases (mainly of amino acid and nucleotide sequences). This involved designing networks of linked databases and developing web interfaces through which researchers could both access the databases and submit new sequences and data, or data that had been revised and supplemented. Scientists' need to search and analyse data (data mining) then led to the development of search tools combined with data comparison. Using programs such as FASTA and BLAST, aligning sequences (sequence alignment), assembling sequences (genome assembly), finding genes in genomes (gene finding), analysing the domains of protein molecules and determining their structure have become routine daily operations for researchers. More advanced and more complex applications are becoming common in strong research laboratories: determining the position and role of genes on chromosomes (positional cloning), comparing the three-dimensional structures of proteins, predicting protein structure and protein-protein interactions, pattern recognition and gene expression profile prediction.
From the results of research into the roles of genes and the interactions among them, scientists can compare the activity of normal cells with that of diseased cells. To do so, biological databases must be combined and cross-referenced to form an overall picture that captures the relationships among these activities through the study of metabolic pathways (metabolomics). This is also one of the greatest challenges facing bioinformaticians.

Figure 2. Relationships among transcriptomics, proteomics and metabolic pathways (metabolomics) (Goodacre (2005) J Exp Bot 56: 245)

A further direction is to build models of metabolism and of the interactions among them. On this basis, gene expression patterns and the interactions among genes and groups of genes can be elucidated. These results will contribute to controlling gene activity and to developing effective therapies.

Figure 3. Network of genes related to human diseases (The human disease network. PNAS. vol. 104, no. 21, 8685-8690)

Research into new algorithms, software and analysis tools (software and tools) includes, for example, support for determining the presence and position of genes in a DNA sequence or on a chromosome, predicting the structure and function of proteins, and analysing and grouping protein sequences into families of related sequences.
The main bioinformatics tools
BLAST
BLAST stands for Basic Local Alignment Search Tool. It is a group of tools for comparing DNA and protein sequences with the other sequences in a database. Several variants exist, such as PSI-BLAST, PHI-BLAST and DELTA-BLAST. There are also specialized BLAST tools for the human genome, microbial genomes, the malaria parasite and other genomes, as well as tools that help detect sequences contaminated with vector sequence (especially before submission to a gene bank), immunoglobulin sequences and conserved sequences.

FASTA
A database search tool used to compare a nucleotide or amino acid sequence with a sequence database. The program is based on the fast sequence search algorithm of Lipman and Pearson, which was also the first widely used algorithm for finding similar sequences in databases.
EMBOSS
EMBOSS (European Molecular Biology Open Software Suite) is a free, open-source package of analysis software for molecular biology. It contains more than 100 applications for comparing sequences, finding sequences in databases, searching for patterns, finding domains and motifs in proteins by comparing amino acid sequences, comparing nucleotide sequences to detect patterns, and codon bias analysis.
A list of the applications can be found at:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/
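Codon bias analysis, listed among the EMBOSS applications, starts from a table of how often each codon occurs in a coding sequence. The Python sketch below shows that counting step only. It is an independent illustration, not EMBOSS code, and it assumes the input is a coding sequence already in reading frame (one or two trailing bases that do not complete a codon are ignored). The function names are hypothetical.

```python
from collections import Counter


def codon_counts(cds: str) -> Counter:
    """Count each in-frame triplet of a coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(cds: str) -> dict:
    """Relative frequency of each codon (empty dict if there are no codons)."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    # Toy sequence made up for illustration: ATG GCT GCT GCA TAA
    print(codon_counts("ATGGCTGCTGCATAA"))
```

Comparing such tables between genes or between organisms, usually after grouping codons that encode the same amino acid, is what reveals a preference for particular codons.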
ClustalW
ClustalW is a program for aligning and comparing multiple DNA or protein sequences. Its purpose is to find the regions in which the sequences are similar and those in which they differ. On this basis it supports many other applications, such as the analysis of domains, motifs and patterns and the reconstruction of evolutionary relationships.
RasMol
A very effective research tool for displaying the structure of DNA, proteins and small molecules. Protein Explorer is an easier-to-use derivative of RasMol.
Programming languages used in bioinformatics
- Java: because Java is platform-independent, it is an important component of bioinformatics (BioJava)
- Perl: used to process biological data (BioPerl)
- BioXML: part of the BioPerl project, a collection of documents in XML and DTD form
Building databases of literature and journals for research
- Articles and journals (PubMed);
- Classification systems and taxonomic keys (taxon);
- Books (book);
- Articles, journals and documents on biochemical assays (PubChem BioAssay);
- Documents on chemical compounds (PubChem Compound);
- Documents on chemical substances (PubChem Substance);
- Databases: genomics, proteomics, metabolomics, microarray gene expression and phylogenetics.
The information contained in biological databases includes the gene name, the gene sequence, the position of the gene on the chromosome or in the genome (locus tag), the structure and function of genes, the consequences of gene mutations, related genes (gene families) and the structure of their products (for proteins, RNA and so on).

The data include gene sequences, descriptions of gene features (genes encoding mRNA, tRNA, rRNA), taxonomic terms (the origin of the gene, the organism that carries it), citations (articles related to the gene or protein) and data tables (if any).
Database formats
Biological data come in several forms: text, sequence data, protein structures and links.
- Text: PubMed and OMIM.
- Sequence: GenBank (DNA) and UniProt (protein).
- Structure: PDB, SCOP and CATH.
Issues concerning protein databases
Protein structure databases are usually much harder and slower to develop than DNA sequence databases, because the three-dimensional structure of a protein is very difficult to determine. To determine it, the protein must first be isolated or purified in large quantity, then suitable conditions must be found for it to crystallize, and finally structure determination techniques are applied, such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, circular dichroism (CD) and electron microscopy. Structural data are deposited in, and can be accessed through, the member databases of wwPDB (PDBe, PDBj and RCSB PDB) and classification databases such as SCOP and CATH.
Species-specific databases
A number of species-specific databases have been published, mainly for research purposes, for example coliBASE (a database for E. coli). Others include FlyBase for Drosophila and WormBase for nematodes (Caenorhabditis elegans and Caenorhabditis briggsae). There are also databases for rice (Oryza sativa), Arabidopsis and other organisms.
1.5. Development trends in bioinformatics
Bioinformatics is developing mainly in the following directions:
- Developing algorithms and computing (algorithms and computational challenges)
- Protein function analysis (protein function)
- Protein interactions and metabolic pathways (protein interactions and pathways)
- Clinical applications, research into new drugs, and prediction of risk.
Current trends in bioinformatics:
- Algorithms: 27%
- Machine learning: 21%
- Statistics: 18%
- Biology: 10%
- Databases: 10%
- Other directions: 14%

Current research topics:
- Methods: 26%
- Sequence analysis (motifs, domains) and sequence comparison: 25%
- Protein structure modelling: 19%
- Models of gene structure and gene regulation: 12%
- Sequence analysis related to evolution: 12%
- Modelling and building metabolic networks (metabolome): 6%

Skills and human factors needed to develop bioinformatics:
- A broad and deep understanding of both fields, biology and informatics
- A grasp of the questions that matter in both fields
- Competence in computer science and software: advising on and developing algorithms
To a certain extent, bioinformatics can be described as a field that is interesting, attractive, new, challenging and accessible, that leaves room to extend research, that has wide influence and that offers opportunities for people with a computing background.
Topics still to be explored:
- Database techniques for bioinformatics data
- Molecular genetics (a foundation that belongs mainly to biology)
- Sequence comparison, patterns and profiles
- Pattern discovery
- Gene expression arrays
- Building protein structures (a foundation that belongs mainly to biology)
- Building the spatial geometry (stereochemistry) of proteins (computing techniques and tools)
- Protein structure prediction
- Building biochemical networks and the metabolome (a foundation that belongs mainly to biology)
- Building metabolic pathways, regulatory pathways and gene regulation signals: databases, computing techniques and tools

Tm tt chng 1
Tin sinh hc l mt lnh vc khoa hc mi c s kt hp cht ch ca sinh hc
m ch yu l di truyn hc, sinh hc phn t vi cc cng c thng k, ton hc v
khoa hc my tnh. Chng 1 gii thiu khi nim, vai tr ca tin sinh hc cng nh
cc cng c phc v cho nhng vn nghin cu ca sinh hc phn t hin i chng
hn nh tm kim cc trnh t sinh hc tng ng hoc ging nhau trong cc ngn
hng c s d liu, m phng v d on s tng tc gia cc phn t, pht hin cc
m hnh biu hin gene v cc mi lin h gia cc geneCc ni dung chnh ca tin
sinh hc cng nh xu hng pht trin ca lnh vc ny cng c cp qua gip
sinh vin c mt ci nhn bao qut v mt lnh vc khoa hc mang tnh ng dng, h
tr cho cc nh nghin cu trong cc lnh vc di truyn phn t, sinh hc phn t, y
hc
Review questions for chapter 1

1. Present the concept of bioinformatics.
2. Summarize the role of bioinformatics in biological research.
3. What is a biological sequence? Give a few examples of biological sequence
analysis.
4. What is sequence comparison? What is the purpose of comparing sequences?
5. Why must the structure of macromolecules be studied? How does bioinformatics
support the prediction of molecular structure?
6. What role does knowledge of gene function and of the relationships between genes
play in modern medicine?
7. What is an evolutionary relationship between organisms? How does bioinformatics
support research on evolution?
8. State the tasks and research directions of bioinformatics today.
9. State the topics on which bioinformaticians are currently concentrating.
10. What do we need in order to become researchers in the field of bioinformatics?


CHAPTER 2
BIOLOGICAL FOUNDATIONS OF BIOINFORMATICS
2.1. Nucleic acids and proteins
Nucleic acids and proteins are the two biological macromolecules that play the central
roles in the living world. Deoxyribonucleic acid (DNA) carries genetic information, while
ribonucleic acid (RNA) is involved in protein biosynthesis and takes part in regulating
the life of the cell. The building blocks of nucleic acids are nucleotides, and those of
proteins are amino acids.
2.2. Structure of nucleic acids
DNA and RNA are built from monomers: deoxyribonucleotides and ribonucleotides
respectively. In DNA, each nucleotide consists of a phosphate group, a pentose sugar and
a base. Nucleotides are joined by phosphodiester bonds between the 5' phosphate group on
the pentose of one nucleotide and the 3' hydroxyl group on the pentose of the next. A
nucleic acid molecule therefore always has a 5' phosphate end and a 3' hydroxyl end. By
convention, a nucleic acid sequence is always written in the 5' to 3' direction, from left
to right.

Figure 4. DNA structure


Nucleic acids are built from five different bases: cytosine (C), uracil (U), thymine (T),
adenine (A) and guanine (G). However, U occurs only in RNA and T occurs only in DNA.
DNA and RNA differ not only in base composition but also in their sugar: RNA contains
ribose, whereas DNA contains 2-deoxyribose. A DNA molecule consists of two
polynucleotide chains wound around each other in antiparallel orientation. DNA can
exist in single-stranded (ssDNA) and double-stranded (dsDNA) form. In double-stranded
DNA the two strands are held together by hydrogen bonds between the bases: two
hydrogen bonds between A and T, and three between C and G. Because the two strands are
complementary, knowing the sequence of one strand is enough to deduce the sequence of
the other.
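
The pairing rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `reverse_complement` and `hydrogen_bonds` and the example sequence are assumptions made for this illustration, not part of the lecture material.

```python
# Watson-Crick pairing: A-T (two hydrogen bonds), G-C (three hydrogen bonds).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def hydrogen_bonds(strand):
    """Count the hydrogen bonds that hold a duplex of this strand together."""
    return sum(2 if base in "AT" else 3 for base in strand.upper())


print(reverse_complement("ATGCCG"))  # CGGCAT
print(hydrogen_bonds("ATGCCG"))      # 16
```

The complementary strand is reversed because the two strands are antiparallel and sequences are written 5' to 3' by convention.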
Storage of genetic information

The sequence of bases carries the information that encodes proteins. Proteins are built
from 20 amino acids, and each amino acid is encoded by a triplet of three nucleotides on
the DNA molecule. Each such triplet is called a codon. Different organisms tend to use
codons differently. For example, some prokaryotic species use codons differently from
eukaryotes. The genetic code of the mitochondrial genome also differs in some respects
from the genetic code of the nuclear genome.
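
As a minimal sketch of how codons map to amino acids, the Python below builds the standard genetic code and a vertebrate mitochondrial variant (in which TGA codes for tryptophan, ATA for methionine, and AGA/AGG are stop codons). The function name `translate` and the example sequence are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Vertebrate mitochondrial code: four codons are read differently.
MITOCHONDRIAL = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")


def translate(dna, table=STANDARD):
    """Translate a coding DNA sequence codon by codon until a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


sequence = "ATGATATGAGGC"  # hypothetical example sequence
print(translate(sequence))                 # MI   (TGA read as stop)
print(translate(sequence, MITOCHONDRIAL))  # MMWG (ATA -> M, TGA -> W)
```

The same DNA yields different proteins under the two tables, which is why the correct code must be chosen before translating mitochondrial sequences.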

Figure 4. The genetic code
The relationship between DNA, RNA and protein is described by the central dogma
(Crick 1970).


Figure 5. The central dogma


All the genetic information contained in the nucleus (or karyotype) of an organism is
called its genome. Except in viruses such as retroviruses, whose genome is RNA, genetic
information is held in the nucleotide sequence of DNA. Apart from reverse transcription
from RNA to DNA in some RNA viruses, information flows in one direction, from genome to
transcriptome to proteome, through transcription and translation. The complete set of RNA
transcripts of an organism (mRNA, tRNA, rRNA and other non-coding RNAs) is called the
transcriptome. The complete set of proteins that can be translated from the mRNAs is
called the proteome. The amino acid sequence of a protein is thus determined by the DNA
sequence, and information passes from DNA to protein via mRNA.
The genomes of eukaryotes and prokaryotes differ in many ways. In prokaryotes, genetic
information is encoded on a continuous stretch of DNA, whereas in eukaryotes the coding
sequences (exons) are separated by non-coding sequences called introns. In addition,
producing a mature mRNA from DNA is far more complex in eukaryotes; for example, introns
are removed during mRNA splicing. Because of this process, a single gene can give rise to
several mRNAs and hence to several corresponding proteins. This explains why the genome
of a higher organism contains a limited number of genes (humans have about 25,000) while
the number of proteins actually produced is much larger, on the order of 1 million.
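
To illustrate how one gene can yield several mRNAs, the sketch below enumerates exon-skipping isoforms of a made-up pre-mRNA. It is a deliberately simplified model: the segment sequences are hypothetical, and it assumes that the first and last exons are always kept and that any subset of internal exons may be retained, which real splicing does not guarantee.

```python
from itertools import combinations

# Hypothetical pre-mRNA: exons in upper case, introns in lower case.
SEGMENTS = [
    ("exon", "AUGGCC"), ("intron", "guaag"),
    ("exon", "UUU"), ("intron", "guaag"),
    ("exon", "CCC"), ("intron", "guaag"),
    ("exon", "UAA"),
]


def exon_coordinates(segments):
    """Return the full pre-mRNA and the (start, end) span of each exon."""
    pre_mrna, exons, position = "", [], 0
    for kind, seq in segments:
        if kind == "exon":
            exons.append((position, position + len(seq)))
        pre_mrna += seq
        position += len(seq)
    return pre_mrna, exons


def splice_isoforms(pre_mrna, exons):
    """Join the first and last exon with every subset of internal exons."""
    first, *internal, last = exons
    isoforms = set()
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):
            chosen = [first, *kept, last]
            isoforms.add("".join(pre_mrna[s:e] for s, e in chosen))
    return sorted(isoforms)


pre_mrna, exons = exon_coordinates(SEGMENTS)
for mrna in splice_isoforms(pre_mrna, exons):
    print(mrna)
```

With two internal exons this toy gene already produces four distinct mature mRNAs, and the count doubles with each additional optional exon.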


Figure 6. Structure of the gene region in prokaryotes and eukaryotes


Structure of protein molecules
Primary structure
Proteins are biological macromolecules built from about 20 kinds of amino acids. Under
appropriate conditions a protein folds into a three-dimensional structure that carries
its full biological properties and function. The amino acid residues in the polypeptide
chain determine chemical properties of the protein such as hydrophobicity, polarity,
acidity and basicity. The primary structure of a protein is the order of the amino acids
in the polypeptide chain. The primary structure determines the spatial structures of the
protein.
In a protein, amino acids are linked together to form a polypeptide chain. They are joined
by an amide (peptide) bond between the carboxyl group of one amino acid and the amino
group of the next. A polypeptide chain therefore has two ends, the N-terminus and the
C-terminus. By convention, the N-terminus is written on the left and the C-terminus on
the right.


Figure 7. The amino acids found in proteins


Secondary structure
The term secondary structure refers to local spatial regions of the polypeptide chain.
Secondary structure involves alpha helices (α-helix), beta strands (β-strand) and loops.
These structures arise from the geometry of the bond angles in the amino acids. In the
1930s and 1940s, Linus Pauling and Robert Corey described the peptide bond as a planar,
rigid (non-rotating) structure. A polypeptide chain can therefore be viewed as a series of
linked units, each lying in a plane. Alpha helices, beta sheets and loops together make up
the secondary structure. Alpha helices and beta sheets are stabilized by hydrogen bonds.
Beta sheets occur in two forms, parallel and antiparallel (figure 8).


Figure 8. Secondary structure of a protein molecule

Alpha helices and beta sheets. Disulfide bridges stabilize the tertiary structure and the
regions involved in catalytic activity (yellow).

Tertiary and quaternary structure
Tertiary structure is formed by the further arrangement and folding of the secondary
structure elements. Polypeptides longer than 200 amino acids usually fold into several
structural units called domains. Quaternary structure is the next level of organization
above tertiary structure. Proteins with quaternary structure are usually formed from
several polypeptide chains (subunits).
In quaternary structure, the interactions between amino acids include hydrogen bonds
between peptide chains, disulfide bridges between cysteine residues, ionic bonds between
charged groups of the residues (side chains), and hydrophobic interactions.
2.3. The genome and genome research
Genome

The genome contains all the genetic information of an organism. This information is
encoded in DNA or RNA. Taking the human genome as an example, if the genome is a book,
the book is divided into 23 chapters (corresponding to the 23 pairs of chromosomes). Each
chapter contains 48 to 250 million letters (A, C, G, T). The whole book has more than
3.2 billion letters and sits in the nucleus of the cell.
The first genome sequencing project was completed in 1977 by Fred Sanger. He and his
colleagues sequenced phage φX174, which contains 5,386 bases. The first bacterial genome
to be sequenced was that of Haemophilus influenzae, in 1995. After that, the first
eukaryotic genome to be sequenced was that of the yeast Saccharomyces cerevisiae. Today,
thanks to the rapid development of technology (Illumina/Solexa, 454 pyrosequencing,
Ion Torrent, SOLiD sequencing...), the number of species with sequenced genomes is
growing quickly.
Genome research (genomic research)
Genome research is not merely a matter of cataloguing sequenced genomes or reporting the
number of genes in a genome and the corresponding traits. It also includes comparing
genome size, chromosome number (karyotype), gene order, codon usage frequency, GC
content, and genome evolution. In addition, genome research includes comparing multiple
genomes to detect conserved regions and the changes that have occurred within genomes.
The results of genome research are usually displayed graphically through genome
browsers.
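
Two of the measures listed above, GC content and codon usage frequency, are simple enough to sketch directly. The function names `gc_content` and `codon_usage` and the example sequence are assumptions made for this illustration; real analyses would typically read sequences from files and handle ambiguous bases.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds):
    """Relative frequency of each codon in a coding sequence (frame 0)."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


cds = "ATGGCCGCCTAA"  # hypothetical coding sequence
print(round(gc_content(cds), 3))
print(codon_usage(cds))
```

Comparing these values between genomes, or between genes within one genome, is one of the simplest forms of comparative genome analysis.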
Genomics is a discipline closely tied to genetics. It concerns the study of the genomes of
organisms, including determining the DNA sequence of the whole genome and building
high-resolution genetic maps (maps in which the markers are very close together).
Genomics also studies phenomena that occur within the genome, such as hybrid vigour
(heterosis), the masking effect of one gene on another (epistasis), the effect of one
gene on several traits (pleiotropy), and interactions between loci and alleles within the
genome. In contrast to studying the role and function of individual genes, genomics
studies the overall relationships among the components of the genome.
Genome duplication plays a major role in the formation of new species. Duplication can
range from small scale (short tandem repeats) to the duplication of whole genes or gene
clusters, whole chromosomes, and even the entire genome. These events are the basis for
new genetic properties and thus for evolution. Horizontal gene transfer is important in
explaining similarities between small parts of the genomes of two organisms that do not
share an evolutionary origin. Such gene exchange is relatively common among
microorganisms; antibiotic resistance in microorganisms is a typical example. The transfer
of genetic material from the mitochondrial and chloroplast genomes into the chromosomes
of eukaryotic cells is another example of this phenomenon.
The human genome
In 2001 the first draft of the human genome was published. By 2007 the human genome
sequence was declared finished, with a very low error rate (about 1 in 20,000 bases). The
assembled versions of the human genome sequence can be accessed with the UCSC Genome
Browser and Ensembl.
Genome research on viruses (bacteriophages)
Bacteriophages play an important role in the study of bacterial genetics and molecular
biology. Historically, they were used to determine gene structure and to study the
mechanisms and models of gene regulation. Because their genomes are small and contain no
introns, bacteriophages were the first to be chosen for sequencing. However, research on
bacteriophages did not set off the genomics revolution (that revolution began with the
sequencing of bacteria). Bacteriophage genome sequences are usually determined by direct
sequencing. Analysis of bacterial genomes shows that a considerable part of bacterial DNA
consists of prophage and prophage-like sequences. Mining the information in bacteriophage
databases therefore helps to explain the role of prophages in shaping bacterial genomes.
Genome research on cyanobacteria (Cyanobacteria genomics)
At present 24 cyanobacteria have been sequenced, 15 of which were isolated from the sea.
These comprise 6 strains of the genus Prochlorococcus, 7 marine strains of the genus
Synechococcus, Trichodesmium erythraeum IMS101 and Crocosphaera watsonii WH8501.
Several studies show that these sequences can be very useful for inferring the
physiological and ecological properties of marine cyanobacteria. Many more genome
sequencing projects are under way, including isolates of Prochlorococcus and
Synechococcus (marine), Acaryochloris and Prochloron, the nitrogen-fixing filamentous
cyanobacteria Nodularia spumigena, Lyngbya aestuarii and Lyngbya majuscula, as well as
studies of the effect of bacteriophages on marine cyanobacteria. Genome research thus
plays an important role in explaining the evolutionary origin of organisms and biological
processes such as photosynthesis.
The relationship between C-value and gene number:
The C-value is the DNA content of an organism. It varies enormously between species.
There is no clear relationship between the C-value and the number of genes an organism
has. The more complex the genome, the larger the proportion of non-coding DNA, that is,
DNA that does not carry information encoding RNA. In humans, non-coding DNA makes up
nearly 75% of the genome. The C-value paradox refers to this lack of proportionality
between genome size and gene number.
2.4. Gene discovery and determination of gene function in the genome

Figure 10. Organization of the human genome

When a genome sequencing project ends, the result is a set of sequences arranged along
the chromosomes. The next problem is to decode the information those sequences contain.
Decoding essentially means answering questions such as: (i) how many genes does the
genome of the organism contain, (ii) where are those genes located on the chromosomes,
(iii) what are their functions, and (iv) how is their activity regulated and how do the
genes relate to one another in producing a phenotype or a disease. Answering these
questions takes a great deal of time and effort, and in some cases no answer has yet been
found. There are many approaches to decoding a genome, and bioinformatics tools play a
very large part in them. For example, to determine the number of genes one must rely on
characteristic features of genes, including coding sequences or open reading frames,
promoter sequences, exon-intron junction sequences, and the sequences that control gene
activity (the 5' UTR and 3' UTR regions). Comparing genomes and comparing DNA sequences
are the first important steps in detecting genes and predicting their function.
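
One of the gene features mentioned above, the open reading frame, can be searched for with a very small program. The sketch below is a simplified illustration only: it scans the three forward reading frames for an ATG followed in frame by a stop codon, and ignores the reverse strand, introns and promoter signals that real gene finders take into account. The function name `find_orfs`, the `min_codons` parameter and the example sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop ORF on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # look for the next ORF
    return sorted(orfs)


dna = "CCATGAAATTTTAGGATGCCCTGA"  # hypothetical example sequence
for start, end, orf in find_orfs(dna):
    print(start, end, orf)
```

For this example the scan reports two ORFs, one in frame 2 (positions 2 to 14) and one in frame 0 (positions 15 to 24).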
Building a physical map from the order of the genes and from what is already known about
them is also a first step in genome research. This information is displayed graphically in
genome browsers. Determining gene function is regarded as one of the great challenges for
genome researchers. Although more and more information on the sequence, structure and
biological function of genes and other biological sequences is being published, predicting
gene function is usually very complex. There are several approaches to the problem: one
can start from the genome, from the gene product (protein), or from the phenotype.
Suppose we want to know which gene encodes plant height, pest and disease resistance,
flower colour, or milk protein content. If the trait is controlled by a single gene, the
task is relatively simple. If the trait is controlled by many genes (a quantitative
trait), the task becomes extremely complex. The problem is to pinpoint which gene or
genes, located where in the genome (on which chromosome), directly encode or contribute
to the trait, and then to establish how those genes operate and under what mechanisms and
conditions they are expressed.
In practice, whatever method or approach is used, it must finally be confirmed that the
gene really does contribute to the trait. This verification is a very hard problem,
especially for quantitative traits in higher organisms, because knock-out, knock-down and
RNAi-based gene silencing techniques cannot always be applied, or applied successfully.
Another approach to determining gene function uses microarray technology to detect the
appearance of mRNAs, or changes in their expression level, under particular conditions;
this also contributes to identifying genes and studying their function. Comparative
studies of genomes, sequences and structures (data mining and analysis) are a further
trend, and they become the first step as databases hold more and more biological sequence
information. However, the accuracy and reliability of the results depend heavily on the
algorithms used and on how rich the information in the databases is.
Number of genes in organisms
The human genome was initially predicted to contain about 50,000 to 100,000 genes. More
recently the number of genes has been put at just over 20,000. Mouse and fruit fly have
similar numbers of genes. Nematodes have about 13,000 and rice about 46,000. In humans,
protein-coding gene sequences make up about 1-2% of the genome.
Gene structure

Figure 11. Diagram of the structure of a prokaryotic gene

In prokaryotes, by convention the 5' end of a gene is drawn on the left and the 3' end on
the right. The structure of a typical gene is illustrated below.

Figure 12. Diagram of the structure of the promoter region in prokaryotes

Figure 13. Gene structure in eukaryotes (top) and the promoter region (bottom)


2.5. Gene function and the regulation of gene activity

Gene function is a complex process involving many components of the cell. In
prokaryotes, gene function and its regulation are relatively simple. In eukaryotes,
however, gene regulation is extremely complex and involves many processes: chromosome
structure and the associated epigenetic mechanisms (methylation, acetylation,
phosphorylation), transcription initiation, transcription, post-transcriptional
modification, translation, post-translational modification and targeted transport. If
studying the activity of a single gene is already complex, the regulation of a metabolic
pathway is far more so, because it involves many genes and the interaction of many other
proteins and enzymes in the cell. For this reason, studying gene function requires
comparison and cross-checking against many databases and many different genomes.

Figure 14. Processes that regulate gene activity in eukaryotes


2.6. The proteome and the field of protein research (proteomics)
The proteome is the complete set of proteins expressed by a genome, cell, tissue or
organism at a given time or under given conditions. In terms of diversity, the proteome is
much larger than the genome, especially in eukaryotes. In other words, the number of
proteins is far greater than the number of genes in the genome. The reasons are the
splicing and editing of the pre-mRNA of genes and post-translational modifications such as
phosphorylation and glycosylation. Whereas genome data consist mainly of DNA and RNA
sequences, proteome data are more complex because, in addition to amino acid sequences,
they include data on structure, function and protein-protein interactions.
Proteome research involves many complex techniques, such as protein extraction and
purification, protein analysis by two-dimensional electrophoresis, mass spectrometry,
comparison of the similarity of peptide fragments, and comparison of amino acid sequences.
Proteomics has two important components: the study of structure and the study of
function. Information on amino acid sequence, structure and function helps researchers
explain the nature of biological processes and the mechanisms of disorders and diseases,
and identify new proteins and predict their function.
2.7. Evolution and the molecular basis of evolution in organisms
Mutation and the accumulation of mutations
Although the mechanisms and causes of evolution are still much debated, the modern view
is that mutation is the raw material of evolution, because it is the route by which new
alleles arise and by which regions with regulatory function are altered or created.
Mutations can have serious consequences, but some are neutral or have no effect on the
phenotype (mutations in non-coding DNA).
Most mutations in structural genes affect the protein product or lead to diversity in
protein products through the splicing and joining of exons in the mRNA. Changes in the
structure and function of molecules show up as variant forms of individuals in a
population. Over the course of evolutionary events this can eventually lead to divergence
and the formation of new species. The question is why small changes in genes caused by
mutation, point mutations in particular, lead to the separation of one species from
another. To answer it, both space and time must be considered. Space here refers to the
random selection acting on the mutated individuals. Time is the consequence of a long
process of natural selection. Space and time are closely related: if selection pressure
is very strong, a new species may form, or extinction may occur, within a short time.
Gene and genome duplication (gene/genome duplication)
If a gene is duplicated, or exists in several copies, a mutation in one copy may have no
effect on the life of the cell. Gene duplication in a diploid organism creates an extra
pair of genes, so one pair continues to function normally while the other can change or
exist in different combinations. What, then, is the benefit of gene duplication? Over
time, one copy may acquire a new function, providing a basis for adaptation during
evolution. Even when the two copies persist as paralogs, with similar sequence and
function, the existence of the copies is a form of redundancy (gene redundancy). This
explains why, in some cases, knocking out a gene in mouse or yeast has no effect, or only
a mild effect, on the phenotype: the function of the knocked-out gene can be compensated
for by a corresponding paralog.
After a gene has been duplicated, evolutionary events may alter one copy or cause it to be
lost. Changes occurring in many genes and at many positions in the genome lead to barriers
(post-zygotic isolating mechanisms) to mating and reproduction between lineages. These
barriers can gradually cause the lineages to diverge.
Mutations in regulatory regions
Although the number of genes can be regarded as the same in all cells, not all genes are
expressed equally in every cell. The difference depends on the cell type, the interaction
of extracellular signals, transcription factors and so on.
There is considerable evidence that mutations in regulatory regions play an important
role in evolution. For example, humans have a gene (LCT) encoding lactase, the enzyme
that breaks down lactose. In most people worldwide this gene is active in infancy but
becomes inactive in adults. In northern Europeans and in three African tribes, however,
the gene stays active because milk remains part of their diet. The cause is a mutation in
the regulatory region of the lactase gene that allows it to continue to be expressed.
Another example is the gene Prx1, which encodes a transcription factor that determines
forelimb formation in mammals. When the enhancer of Prx1 in mice was replaced by the
corresponding enhancer from bats (whose forelimbs are wings), the forelimbs grew 6% longer
than normal. Some morphological changes are therefore governed not by changes in the Prx1
protein but by changes in the expression level of the gene.
2.8. Analysing the evolutionary relationships of organisms
Evolution is a process that leads to change in the gene pool of a population over time.
Although evolution by nature takes place at the population level, evolutionary
relationships can be identified and analysed at many levels: population, species, group
of individuals, cell, organelle and molecule. In applied bioinformatics, the analysis of
evolutionary relationships relies mainly on the molecular level, that is, on molecular
evolution. For example, DNA sequences encoding ribosomal RNA, cytochrome c, ribulose
bisphosphate carboxylase (RuBisCO) and mitochondrial genes have recently been used to
classify organisms and assign them to taxa. Of course, analysis at the molecular level is
not sufficient on its own and must be combined with the results of other studies.
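
A very simple measure used in molecular evolutionary analysis is the p-distance, the proportion of aligned positions at which two sequences differ. The sketch below is illustrative only: the three aligned fragments and the species labels attached to them are invented, and real studies would use proper alignments and substitution models.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical aligned fragments of one gene from three species.
aligned = {
    "human": "ATGGCCAAGT",
    "chimp": "ATGGCCAAGC",
    "mouse": "ATGACCGAGC",
}

names = sorted(aligned)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, p_distance(aligned[first], aligned[second]))
```

Smaller distances suggest a closer relationship; a matrix of such distances is the usual input to distance-based tree-building methods.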
Analogous
Put simply, analogous features are similar features observed in two or more species that
are not themselves related by ancestry. Such similar biological features are usually the
result of convergent evolution, in which changes in certain features during evolution are
merely adaptations to particular conditions. For example, the wings of birds and bats
have similar structures suited to flight, but they are fundamentally different in nature.
Homologous
Homologous traits (homology) share a common evolutionary origin. A homologous trait may
be:
- Homoplasious: evolved independently, but from the same ancestral structure
- Plesiomorphic: present in a common ancestor, but lost during evolution in some of
the descendants
- (Syn)apomorphic: present in a common ancestor and in all of its descendants


Ortholog
Homologous sequences are considered orthologous when they were separated by a speciation
event, while still sharing the same most recent common ancestor. When a species diverges
into two separate species, the copies of a single gene in the two resulting species are
called orthologous. Orthologous genes are genes in different species that are similar
because they descend directly from a single gene. For example, the regulatory protein Flu
is present in both Arabidopsis (a higher multicellular plant) and Chlamydomonas (a
unicellular green alga). In Chlamydomonas the protein is more complex in that it crosses
the membrane twice, rather than once as in Arabidopsis. When the gene is transferred from
the green alga into the plant genome by genetic engineering, it functions just as it does
in its original cell. This result shows that the two genes are orthologous and were
inherited from one common ancestor.
To determine whether two similar genes are orthologous, it is enough to analyse the
evolutionary origin of the genes. If the genes fall within one clade, they are orthologs
and descendants of a common ancestor. Orthologous genes usually have the same biological
function.
Paralogous
Homologous sequences are called paralogous when they were separated by a gene duplication
event. If a gene in an organism is duplicated and occupies two different positions in the
same genome, the two copies are called paralogous (para means parallel) and may carry out
the same function. Paralogs usually have the same or a similar function, but not always.
The reason is relaxed selection pressure: selection acts on only one copy of the
duplicated gene, leaving the other copy free to mutate, change and acquire a new function.
Paralogous sequences provide much useful information about what happens within genomes.
The genes encoding myoglobin and haemoglobin are regarded as ancient paralogs. Four
haemoglobin groups (A, A2, B, F) are known to be paralogs of one another. While each of
these proteins performs the same function of transporting oxygen, a variant such as
haemoglobin F has a much higher affinity for oxygen than adult haemoglobins. The function
of paralogous genes is therefore not necessarily conserved. Paralogous genes usually
belong to the same species, but not always; for example, the human haemoglobin gene and
the myoglobin gene of a monkey are paralogs. This is a common problem in bioinformatics:
when the genomes of different species are sequenced and compared, it is easy to conclude
that genes are homologous, yet they may still be paralogs whose functions have diverged.
Ohnology
Genes are called ohnologous when they originate from a whole-genome duplication. The term
was coined by Ken Wolfe in honour of Susumu Ohno. Ohnologs are of particular interest in
evolutionary analysis because they have all been diverging for the same length of time
since their common origin (the whole-genome duplication).
Xenology
Homologs that arise from horizontal gene transfer between two organisms are called
xenologs. Most xenologs have similar functions.
Gametology
Gametology describes the relationship between homologous genes on non-homologous
chromosomes (for example the X and Y chromosomes in humans). Gametologs result from the
origin of genetic sex determination and from barriers to recombination between the sex
chromosomes.
Summary of chapter 2
1. Bioinformatics is founded on biology, and on molecular biology in particular.
Molecular biology studies the structure and function of molecules and the
activities of cells, tissues, organs and organisms at the molecular level. In
bioinformatics, molecular research focuses on determining the sequences of
nucleic acids (DNA, RNA) and of amino acids (proteins), and on studying the
structure, function and interactions of these molecules.
2. Genetic information is stored in DNA and RNA and is expressed through
transcription, translation and modification (post-transcriptional and
post-translational). This is also the content of the central dogma of molecular
biology.
3. With the rapid development of technology, sequencing genes and genomes has
become routine laboratory work. Once a genome has been sequenced, describing the
DNA sequences and attaching biological information to them is a task for both
biologists and bioinformaticians. Biological findings on the composition and
structure of genes in prokaryotes and eukaryotes form the basis for building
algorithms and computer simulation models.
4. Studies of the relationship between sequence and structure in nucleic acids and
proteins, and between structure and biological function, provide the foundation
for simulating, predicting and comparing structures and for predicting function
by sequence comparison.
5. Mutations and changes in the sequence and structure of genes and genomes during
evolution provide the basis for studying relationships between species,
speciation, and the function of genes and genomes across organisms. By analysing
and comparing biological sequences, genetic relationships, evolutionary origins
and evolutionary trends can be determined at the level of individual genes,
genomes, proteomes and species.
Review questions for chapter 2
1. Describe the composition and structure of nucleic acids.
2. What is the genetic code? What are its characteristics?
3. Present the content of the central dogma.
4. Describe the relationship between the structure and function of proteins.
5. What is a genome? What is the significance of genome research?
6. Describe the gene structure of prokaryotes and eukaryotes.
7. What is the regulation of gene activity?
8. Why must the evolutionary relationships of organisms be studied?


CHAPTER 3
SEARCHING FOR AND MANAGING RESEARCH LITERATURE
3.1. Methods of searching for information
The rapid growth of the Internet and of the number of web pages has created an enormous
amount of information that increases every day. Finding what is needed in this huge store
of data requires search tools combined with a suitable method. Chapter 3 introduces some
general tools and methods for finding information on the Internet for study and research.
When looking for web pages that contain specific words or phrases, search engines such as
Google return results quickly and very effectively. However, the results sometimes include
a great deal of information that is not directly related to the topic or scope of the
search, so much time is spent filtering. For a search targeted at a specific field or
topic, subject directories such as the World Wide Web Virtual Library (http://vlib.org/)
can be used to narrow the scope. In practice, however, search engines provide only about
one third of the information that actually exists, because they cannot reach the rest.
This inaccessibility is mainly a matter of network security and access barriers, which
search engines are not permitted to cross.
There are two kinds of information search: searching with general search engines (such as
Google) and searching for specific data according to the purpose or field of research.
Whichever tool is used, a search involves the following steps: (i) choose the search tool
or the websites that support searching, (ii) define the information to be found,
(iii) build keywords that represent the content sought (use phrases rather than single
words; in English, avoid articles and prefer nouns), and (iv) combine the keywords with
logical (boolean) operators such as and, or, not, or +, -, double quotation marks and *,
to filter and narrow the results.
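
Step (iv) can be sketched as a small helper that assembles a boolean search string from lists of keywords. The function name `build_query` and its parameters are hypothetical, and the exact operator syntax differs between search tools, so the output should be adapted to the tool in use.

```python
def build_query(all_of=(), any_of=(), none_of=()):
    """Combine keywords with AND / OR / NOT; multi-word phrases are quoted."""
    def quote(term):
        return f'"{term}"' if " " in term else term

    parts = []
    if all_of:
        parts.append(" AND ".join(quote(t) for t in all_of))
    if any_of:
        parts.append("(" + " OR ".join(quote(t) for t in any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {quote(term)}"
    return query.strip()


print(build_query(
    all_of=["herbicide resistance", "tobacco"],
    any_of=["transgenic", "gene transfer"],
    none_of=["review"],
))
# "herbicide resistance" AND tobacco AND (transgenic OR "gene transfer") NOT review
```

Quoting phrases keeps the words together, while the OR group widens the search to synonyms and NOT removes unwanted result types.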
3.2. How to find literature for research
Google is currently regarded as the fastest and most effective search engine and is the
one most people use. For general information searches, and even for directory-based
searches, Google remains the dominant tool. In some cases Google can index secured
websites and display them among the search results, but access to those sources will still
be blocked for security reasons. Even so, for a broad search Google can be considered the
first tool of choice.
A search starts by defining the information needed and then building keywords. For
biologists, especially in molecular biology, information comes mainly from foreign
literature, so fluency in English is almost mandatory. Keywords are built by combining
words, mainly nouns, into key phrases. Google usually returns a very large number of
results, so the user has to filter them by lengthening the keywords, grouping keywords
into phrases, combining them with logical (boolean) operators, or using the advanced
search functions. However, Google only solves the problem of finding general, broad
information. Finding specific information for a research purpose requires searching again
within the results already obtained, which takes a great deal of time and effort.
In biology, a large part of the literature used for research and study consists of
scientific papers published in specialist journals. Using information from papers ensures
that it is accurate and specific. PubMed is one of NCBI's databases, built on MEDLINE. It
lets users find a very large number of research results in biology and medicine in the
form of full-text papers or abstracts. Many additional journals have recently been
registered in the PubMed index, so its coverage of published papers is no longer limited
to biomedicine but extends to many other fields such as chemistry, physics, materials
technology and information technology. Full-text papers that can be downloaded free of
charge are available in NCBI's PMC database.
Search results in PubMed are presented as papers with related information. Figure 15 shows
a typical PubMed search result. In terms of format, PubMed provides the title of the
paper, the author or authors, the name of the journal, the issue and the page numbers. In
addition, PubMed provides a link to the source of the paper, which the reader can access
either free of charge or with the permission of the site that hosts it.

Figure 15. Searching for research literature in the PubMed database


3.3. Getting to know PubMed
PubMed is a free resource developed and maintained by NCBI, which is part of NIH. PubMed
contains more than 20 million citations for biomedical literature from MEDLINE, life
science journals and online books. It is a large database that brings together papers,
abstracts, citations and links to other databases. The MEDLINE database originally
contained journals and abstracts related to the life sciences and biomedical topics. The
United States National Library of Medicine (NLM) at NIH maintains this database as part of
its information management and storage system.
PubMed was launched in January 1996.
Counting from 1966 to the present, PubMed contains more than 22.7 million papers, some
dating back as far as 1809. About 0.5 million new papers are added each year. Of the
records in PubMed, about 13.1 million include abstracts and 14.2 million include links to
full-text papers, of which 3.8 million can be downloaded free of charge.
PubMed also applies logical operators during a search, but it does so automatically. The
keywords entered are translated into variant forms of each word and into terms commonly
used in association with them, and these are combined with logical operators.
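
Besides the web interface, PubMed can be queried programmatically through NCBI's E-utilities. The sketch below only builds the request URL for the ESearch service and does not send it; the helper name `pubmed_search_url` and the example query are assumptions made for illustration, while `db`, `term`, `retmax` and `retmode` are standard ESearch parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=10):
    """Build (but do not send) an ESearch request URL for the PubMed database."""
    params = {
        "db": "pubmed",         # database to search
        "term": term,           # query, may contain AND / OR / NOT
        "retmax": max_results,  # maximum number of record IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


print(pubmed_search_url("HIV AND hepatitis"))
```

Opening the printed URL in a browser, or fetching it with an HTTP client, would return the identifiers of matching records; NCBI's usage guidelines on request frequency should be respected when doing so.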

Figure 16. Search results from the PubMed database

3.4. How to manage research literature

Finding literature that suits a research purpose takes a great deal of time and effort.
Even once the papers relevant to the research topic have been found, managing them
effectively for reading, look-up and citation still requires the researcher to sort and
organize this information well.
There are many ways to manage information and data on papers. EndNote is a fairly
effective tool that lets researchers access and cite literature for many different
purposes. One of its advantages is that it recognizes the search result formats of several
tools, most typically NCBI's MEDLINE format. In addition, EndNote can search the library
that has been built and insert citations automatically into scientific papers, master's
theses and doctoral dissertations. An illustration of the EndNote program is shown below.
The use of EndNote is introduced in detail in the practical classes that accompany these
lecture notes.
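
To show what a reference manager does when it imports MEDLINE-format results, the sketch below parses a small tagged text into records. The two records are invented for illustration, and the parser is a simplified assumption about the layout (a tag of up to four characters, a hyphen and a space, with continuation lines indented by six spaces); it is not EndNote's own import filter.

```python
SAMPLE = """\
PMID- 12345678
TI  - A hypothetical study of herbicide resistance in
      transgenic tobacco.
AU  - Nguyen A
AU  - Tran B
DP  - 2012

PMID- 23456789
TI  - Another made-up record.
AU  - Le C
DP  - 2013
"""


def parse_medline(text):
    """Parse MEDLINE-style tagged text into a list of {tag: [values]} dicts."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():                 # a blank line ends the current record
            if current:
                records.append(current)
            current, tag = {}, None
        elif line.startswith("      ") and tag is not None:
            current[tag][-1] += " " + line.strip()   # continuation of previous field
        else:
            tag, value = line[:4].strip(), line[6:].strip()
            current.setdefault(tag, []).append(value)
    if current:
        records.append(current)
    return records


for record in parse_medline(SAMPLE):
    print(record["PMID"][0], record["DP"][0], "; ".join(record["AU"]))
    print("   ", record["TI"][0])
```

Each record becomes a dictionary keyed by field tag, which makes it straightforward to search a personal library by author, title or year.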

Figure 17: Managing a database of scientific papers with the EndNote program


Summary of chapter 3
1. The Internet holds an enormous store of information; search tools are needed to
exploit it.
2. Searching for information involves identifying the information source, building
keywords and search expressions, and finally choosing a search tool.
3. The reliability of information must be judged by criteria such as the purpose of
the person who posted it, the date it was posted, and its links.
4. PubMed is one of the important databases of NCBI. Here researchers can find and
download many works and research papers published in reputable journals.
5. Managing literature with software tools helps researchers organize and sort their
references systematically. Citing literature in papers, theses and dissertations
with EndNote saves researchers time and effort.
Review questions for chapter 3

1. State the main steps in searching for information with a search tool. On what
basis is the reliability of the information found assessed? Give a specific
example of the steps for searching a research topic (for example, research on
transferring a herbicide resistance gene into tobacco) using Google.
2. Find some images of the bacterium E. coli, of the bacterial leaf blight pathogen
Xanthomonas oryzae pv. oryzae, and of the principle of the PCR technique.
3. Using search tools, find literature on the PCR technique and its applications.
Requirements: state the keywords and the number of results found. From the
results, select the single most reliable document.
4. Using what you have learned, find the addresses of, and visit, the home pages of
the world gene banks NCBI, EMBL, EBI and DDBJ, of PubMed, and of the
International Rice Research Institute (IRRI).
5. Go to PubMed and search for literature related to HIV or hepatitis. Find more
than 10 full-text papers in the PubMed database, then use EndNote to store and
manage them in the form of a library.
6. Using the library you have just built, search for papers by field (author name,
paper title, year of publication, keyword). From the library, use EndNote to
cite the papers and research works automatically in a graduation thesis.


PART 2
BIOLOGICAL DATABASES
SUBMITTING SEQUENCES TO DATABASES
CHAPTER 4. BIOLOGICAL DATABASES
Databases
The most important foundation of applied bioinformatics is the database. Most of the data in biological databases are biological sequences accompanied by detailed annotation. Data from genome sequencing projects, for example, are generated daily on a worldwide scale. To use these data there must be a system that organizes and arranges them sensibly so that they can be stored, grouped, accessed, searched, and compared. In addition, because of the specific nature of biological data, there are structure and function databases besides the ordinary sequence databases.
Because the databases are complex and interconnected, it is hard to sort them into cleanly separated categories. By the origin of the data they can be divided into primary and secondary databases. Primary databases contain nucleotide or amino acid sequences, or structures, that were determined experimentally, together with annotation on function, related publications, and cross-links to other databases. Secondary databases contain data that have been filtered and arranged according to particular criteria from the data of primary databases. By the nature of the data they can be divided into sequence databases, structure databases, and other databases (Figure 18).
Databases play an extremely important role as the basis for searching, analysing, and comparing data. By combining them with analysis tools and with the cross-links between databases, researchers can identify, predict, and analyse the information contained in sequences, and determine the properties and functions of new biological sequences.

Figure 18. Classification of biological databases


4.1. Primary databases

4.1.1. Nucleotide sequence databases

GenBank
GenBank is regarded as the best-known and most widely used database of NCBI (the National Center for Biotechnology Information, USA). It is a freely accessible database containing more than 189,000,000 sequences, totalling more than 299,000,000,000 bases, from more than 380,000 organisms (as of December 2010). GenBank works together with two other large banks, the European Molecular Biology Laboratory (EMBL) database hosted at the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ), to form the International Nucleotide Sequence Database Collaboration (INSDC).
Sequences submitted to NCBI must be at least 50 bases long and carry detailed annotation, including an accession number (AN). The accession number stays unchanged even when the sequence is updated. In some cases a version number is placed after the accession number, separated from it by a period. Sequences enter GenBank through sequence submission, either via a web interface (BankIt) or by email (Sequin). Submission is described in detail in the next chapter.
Each sequence stored in GenBank is called an entry and begins with the keyword LOCUS followed by the locus name. Like the accession number, the locus name is unique; unlike the accession number, it may change after review or revision. The locus name consists of 8 characters: the first letters of the genus and species names, followed by the 6 digits of the accession number.
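
The accession-plus-version notation described above can be illustrated with a short sketch. It assumes only the two classic nucleotide accession layouts (one letter plus five digits, or two letters plus six digits) followed by an optional ".version"; other layouts are not handled. The function name `split_accession` is hypothetical and the identifiers are used purely as examples.

```python
import re

# Assumption: classic GenBank nucleotide accessions only (1 letter + 5 digits,
# or 2 letters + 6 digits), optionally followed by ".<version>".
ACCESSION_RE = re.compile(r"^([A-Z]{1,2}\d{5,6})(?:\.(\d+))?$")


def split_accession(identifier):
    """Return (accession, version); version is None if it is not given."""
    match = ACCESSION_RE.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognised accession: {identifier!r}")
    accession, version = match.groups()
    return accession, (int(version) if version else None)


print(split_accession("U49845.1"))
print(split_accession("AF123456"))
```

The accession part stays the same across updates, while the number after the period increases whenever the sequence itself changes.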
EMBL and DDBJ
The European and Japanese partners of GenBank are EMBL/EBI and DDBJ; both are also primary sequence repositories. The three databases GenBank, EMBL, and DDBJ are linked to form INSDC. The partners exchange data with one another daily, so a sequence search can be run at any of the three banks. Although the entry format used by NCBI and DDBJ differs from that of EMBL, the information held in each entry is the same.

4.1.2. Protein sequence databases

SWISSPROT
One of the largest databases of protein sequences, and the one with the most detailed annotation, is SWISSPROT, hosted at the Swiss Institute of Bioinformatics (SIB). Its server system is called Expasy (Expert Protein Analysis System). SWISSPROT contains manually curated sequences: every record is reviewed by experts and, where necessary, checked against the published literature. For this reason the database is of very high quality and is regarded as the gold standard for analysing and looking up protein information. SWISSPROT is also one component of the UniProt database.
Because new sequences and new information are produced continuously, the SIB experts cannot keep pace, so a second database, TrEMBL, was created alongside SWISSPROT. TrEMBL stands for Translated EMBL: it contains all protein sequences translated from DNA sequences. All of its annotation is generated automatically by computer rather than by experts, so TrEMBL is less reliable. Both databases can be reached through the main SWISSPROT interface, where simple query sequences can be typed into the search box. The search and analysis tools for these databases are supported by SIB.
NCBI Protein database
Another very important sequence database maintained at NCBI is the Protein database. It is not simply a set of sequence data but a collection of entries drawn from many other protein sequence databases, for example UniProt, PIR, and PDB.
UniProt
The amount of protein information in UniProt continues to grow rapidly. Besides sequence information, expression patterns, predicted secondary structures, and biological functions are also stored and described. All these data are held in databases, some of which are specialized (devoted to a single field). Gathering everything known about one protein of interest can therefore take a great deal of time. For this reason EBI, SIB, and Georgetown University built a central resource for protein information called the Universal Protein Resource, or UniProt. UniProt was established in 2002 by combining the protein databases Swissprot, TrEMBL, and PIR. It has three parts: (i) the UniProt Knowledgebase (UniProtKB), (ii) the UniProt Reference Clusters database (UniRef), and (iii) the UniProt Archive (UniParc), a collection of protein sequences together with their history.
Of these three, UniProtKB is the best annotated; it combines Swissprot and TrEMBL. Proteins can be searched in UniProtKB with long keywords or with combinations of keywords. UniRef is a non-redundant sequence database, meaning that each sequence appears only once, which makes it well suited to similarity searches. It exists in three forms, UniRef100, UniRef90, and UniRef50, which group sequences that are 100% identical, at least 90% identical, and at least 50% identical, respectively.
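
The effect of the three identity thresholds can be illustrated with a toy clustering sketch. It is a simplification under stated assumptions: the sequences are already aligned and of equal length, and a naive greedy procedure is used. It is not the procedure used to build UniRef, and the names `percent_identity` and `greedy_cluster`, as well as the sequences, are invented for illustration.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first representative it matches at >= threshold."""
    representatives, clusters = [], []
    for seq in sequences:
        for index, rep in enumerate(representatives):
            if percent_identity(seq, rep) >= threshold:
                clusters[index].append(seq)
                break
        else:
            # No representative is similar enough: start a new cluster.
            representatives.append(seq)
            clusters.append([seq])
    return clusters


# Invented toy sequences, 10 residues each.
toy = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG"]
for threshold in (100, 90, 50):
    print(threshold, len(greedy_cluster(toy, threshold)))
```

Lowering the threshold merges more sequences into each cluster, so a search against the clustered set covers the same sequence space with fewer entries.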
PIR
The Protein Information Resource (PIR) provides scientists with reliable databases of protein sequences together with accurate information on their functions. PIR strongly supports research in genomics, proteomics, and systems biology.
PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) to help researchers identify and annotate protein sequence information. This includes comparing protein sequences and identifying evolutionarily related sequences on the basis of sequence alignment.

Figure 19. The PIR database

Over more than four decades, starting with the Atlas of Protein Sequence and Structure, PIR has provided protein databases and analysis tools that scientists can use and access free of charge, including the Protein Sequence Database (PSD).

4.1.3. Molecular structure databases

PDB
The Protein Data Bank (PDB) contains three-dimensional structural data for biological macromolecules such as proteins and nucleic acids. The data usually come from experimental studies using crystallization and X-ray crystallography, or NMR spectroscopy. They are collected from the results of scientists and research groups all over the world. PDB is regarded as the largest source of biological structure data and is linked to other major databases such as GenBank, EMBL, and SwissProt.
Starting in 1976 with only 3 determined protein structures, the PDB held a total of 90,611 molecular structure entries by mid-May 2013.

Experimental method   Proteins   Nucleic acids   Protein/DNA complexes   Other molecules   Total
X-ray diffraction        74593            1457                    3864                 2   79916
NMR                       8700            1029                     192                 7    9928
Electron microscopy        374              45                     126                 0     545
Hybrid                      46               3                       2                 1      52
Other                      147               4                       6                13     170
Total                    83860            2538                    4190                23   90611

Figure 20. The PDB protein structure database


PDB files can be displayed with freely available programs such as Pymol, UCSF Chimera, Rasmol, and Swiss-PDB Viewer, and some viewers are built into the website. The web-based viewers usually require an up-to-date browser with JavaScript support.
Besides storing molecular structure data, PDB provides tools that let researchers compare protein sequences, model structures, and compare protein structures.
SCOP
SCOP (Structural Classification of Proteins) classifies proteins of known structure in a hierarchical system. Proteins that perform similar biological functions and are closely related in evolution have similar structures, at least in the regions around the active centre. The function of an unknown protein can therefore be predicted by comparing its structure with the structures of known proteins. SCOP supports such predictions and is organized into three levels: protein families, superfamilies, and folds. A family contains proteins with a clear and close evolutionary relationship, defined by sequence identity of at least 30% over the full length of the proteins. Proteins that do not meet this criterion are still placed in the same family if they nevertheless resemble one another in structure and function. Proteins with low sequence identity that are related through structural and functional features are grouped into superfamilies. Proteins that share the same type and arrangement of secondary structures are placed in the same fold.
CATH (Class, Architecture, Topology, and Homologous Superfamily)
The CATH database groups protein structures hierarchically into 4 levels: Class (C), Architecture (A), Topology (T), and Homologous Superfamily (H). Assignment of proteins to a Class is carried out mostly automatically: the secondary-structure content is computed without regard to how the secondary structures are arranged and connected. Four classes are distinguished: (i) proteins built mainly from helices (chiefly alpha helices), (ii) proteins built mainly from beta sheets, (iii) proteins with both helices and sheets (alpha-beta), and (iv) proteins with very little secondary structure.
The Architecture (A) level describes the arrangement of the secondary-structure elements and is assigned manually. The Topology level describes the shape of the protein and the connectivity of its secondary-structure elements. Topology grouping relies on an algorithm that uses empirically derived parameters for clustering domains. The Homologous Superfamily (H) level contains homologous domains, that is, domains sharing a common ancestor. Similarity is established by sequence comparison followed by structure comparison, depending on the Topology classification. Beyond these four levels there is a fifth, the sequence family (S). At this level domains are grouped by high sequence identity (at least 35% identity over more than 60% of the length of the larger domain), so these proteins usually have similar functions.
4.2. Secondary databases
PROSITE
PROSITE is a secondary database in which proteins are grouped by conserved motifs (short sequence regions of 10 to 20 amino acids that are highly conserved among closely related proteins). It is an important basis for studying protein function.
Searching for proteins that share the same motifs makes it possible to infer their function. This is very useful when studying an unknown protein: the motifs detected in it can suggest its function and some of its biological properties. Motif detection relies on the principle of sequence alignment (see Chapter 8).
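
As an illustration of motif searching, the sketch below converts a pattern written in PROSITE pattern syntax into a Python regular expression and scans a sequence with it. It handles only the basic syntax elements (`-` separators, `x` for any residue, `[..]` allowed residues, `{..}` excluded residues, `(n)` or `(n,m)` repeats, `<` and `>` anchors). The function name `prosite_to_regex` and the test sequences are invented; the pattern N-{P}-[ST]-{P} is the well-known N-glycosylation site pattern, used here only as an example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")          # patterns may end with a period
    regex = regex.replace("-", "")       # element separators
    regex = regex.replace("{", "[^").replace("}", "]")  # excluded residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("x", ".")      # any residue
    regex = regex.replace("<", "^").replace(">", "$")   # terminal anchors
    return regex


glyco = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(glyco, "AANGSAA")        # invented sequence containing the motif
miss = re.search(glyco, "AANPSAA")       # proline after N breaks the motif
print(glyco, hit.start(), miss)
```

A regular-expression scan like this finds exact pattern matches only; profile-based methods are needed for more divergent motifs.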
PRINTS
Sequences in the PRINTS database are distinguished by the fingerprinting principle. A fingerprint consists of several sequence motifs. PRINTS exploits the fact that proteins containing the same functional regions share several sequence motifs. By comparing several regions of a fingerprint, the relationship between a protein and a known protein family can be established even when some motifs are missing.
PRINTS is cross-linked to the entries of related databases, which gives users access to many sources of information about a protein family. Like Prosite, Prints holds information about each protein family and, where possible, about the biological function of each motif in the fingerprints.
Pfam
The Pfam database classifies proteins by profiles. Each profile is defined by the probability, at each position of a protein sequence, that a given amino acid occurs, that a residue is inserted, or that a residue is deleted. Proteins in Pfam are grouped on the basis of sequence alignment. The alignment results make it possible to relate function, structure, and evolutionary relationship to one another.
4.3. Other databases

4.3.1. Genotype and phenotype databases

The relationship between genotype and phenotype is studied through the phenotypic changes caused by mutated genes. Several genotype/phenotype databases have been created to store the relationships between genes and the biological traits of organisms. One of them is OMIM (Online Mendelian Inheritance in Man) at NCBI. Another is dbGaP (database of Genotypes and Phenotypes), also at NCBI, whose data are used to analyse the statistical significance of genotype-phenotype relationships. In addition, OMIA (Online Mendelian Inheritance in Animals) at NCBI contains genotype-phenotype relationships for many animal species other than mouse and human. For mouse the corresponding database is MGD (Mouse Genome Database). For two important model organisms, the fruit fly (D. melanogaster) and the nematode (C. elegans), genotype-phenotype relationships are stored in FlyBase and WormBase.

4.3.2. The PhenomicDB genotype database

PhenomicDB stores genotype and phenotype information for many species, from human to well-studied organisms such as mouse, fish, fruit fly, nematode, yeast, and Arabidopsis thaliana. It combines data from many other databases.
A distinctive feature of PhenomicDB is cross-species comparison based on genotype-phenotype relationships. The comparison is made by combining data on orthologous genes (genes that diverged from a common ancestor).

4.3.3. PubChem
PubChem is an NCBI database that stores small molecules and information on their biological activities. It has 3 components: PubChem Compound, PubChem Substance, and PubChem BioAssay. PubChem Compound contains more than 11 million molecules (2007) together with their two-dimensional structures.
PubChem Substance allows searches for substances supplied by many manufacturers, for compounds of unknown composition, and for natural compounds whose two-dimensional structure is unknown. PubChem BioAssay provides data on biological assays. The database can be searched with query keywords. PubChem is very useful because of the links among the data inside it and to outside databases such as PubMed. For example, once a substance is known to inhibit a given enzyme, many substances with similar inhibitory activity can be found. Furthermore, small molecules with different structures may show the same biological activity in assays. This is the basis for discovering and developing new drug structures.
Specialized databases
Besides the databases described above, there are now thousands of databases storing information on biological sequences, molecular structures, gene maps, and genotype-phenotype relationships. With the rapid development of next-generation genome sequencing, tens of thousands of genomes have been sequenced. Annotated genome databases are of great value for mining genome information, comparing genomes, and studying the function of genes and proteins through comparisons made not only at the level of single molecules but across whole genomes.
For some thoroughly studied organisms, detailed information on every gene and on gene regulation mechanisms is available. Typical examples are the Arabidopsis thaliana database, the rice database, and the databases of several important crops.
The rapid growth in the number of genomes, and the results of genome comparison, gave rise to databases of single nucleotide polymorphisms (SNP). SNP databases are important for analysing polymorphism among organisms and the relationship between SNPs and traits, including disease. SNP research also helps explain why individuals respond differently to environmental influences or to therapeutic drugs. For livestock, SNP databases also supply molecular markers for use in breeding.
Research on genes and gene function gave rise to EST (expressed sequence tag) databases, which are important for studying gene expression patterns. Alongside the EST databases, STS (sequence-tagged site) databases contain DNA sequences that are unique in the genome and whose positions on the chromosomes are known. These databases serve many purposes, including gene mapping, development of molecular markers, and support for sequence assembly.
4.4. Gene banks
Concept

A gene bank such as GenBank holds a database of biological sequences that are annotated in detail: information on the source organism, features of the sequence (whether the gene encodes a protein or an RNA), gene size, gene product, and the function of the gene and its product. The international gene bank system draws on 3 sources: the National Center for Biotechnology Information (NCBI), the European Molecular Biology Laboratory (EMBL) Data Library at the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ). Together these three databases form the International Nucleotide Sequence Database Collaboration (INSDC).
GenBank, maintained by NCBI in the USA, is one of the three large DNA sequence databases (NCBI, EMBL, DDBJ). It currently contains more than 189,000,000 sequences, totalling more than 299,000,000,000 bases, from more than 380,000 organisms (as of December 2010).

Figure 21. Diagram of the relationship among the 3 gene banks


The gene banks

a) NCBI
GenBank is the sequence database of the NIH. As of April 2011 it held about 126,551,501,141 bases in 135,440,924 sequence records, plus 191,401,393,188 bases in 62,715,288 sequence records in the WGS (whole genome shotgun) division.
Accessing GenBank
There are several ways:
- Search for sequences in GenBank (sequences that have been determined and annotated) with Entrez Nucleotide. The sequences are divided into 3 groups: CoreNucleotide (the main GenBank collection), dbEST (Expressed Sequence Tags), and dbGSS (Genome Survey Sequences).
- Search and align against GenBank with a query sequence using BLAST (Basic Local Alignment Search Tool). BLAST searches the CoreNucleotide, dbEST, and dbGSS databases independently.
- Find links and download sequences with the NCBI e-utilities.
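
As a small illustration of the third route, the sketch below builds a request URL for the EFetch e-utility. It only constructs the URL and does not contact NCBI. The base address and the parameter names (`db`, `id`, `rettype`, `retmode`) follow the public E-utilities documentation as far as known here and should be checked against it before use. The helper name `efetch_url` is hypothetical and the accession is used purely as an example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (verify against the current documentation).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL that would return one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = efetch_url("U49845")
print(url)
```

Opening such a URL in a browser, or fetching it from a script, returns the record in the requested format; bulk use should respect the request-rate guidelines published by NCBI.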
Using GenBank data
GenBank is designed to provide, and to encourage, access to DNA sequence information for researchers. NCBI therefore places no restrictions on users. Some submitted sequences, however, may be covered by copyright or patent claims, and using them requires compliance with certain constraints and regulations.
Development of new features
NCBI continually develops new tools that improve access to GenBank and sequence submission. Users who register an NCBI account receive news of these updates by email.


b) EMBL
The European Molecular Biology Laboratory is one of the world's leading research centres for the life sciences. EMBL has more than 20 member states in Europe, among them Austria, Belgium, Croatia, Denmark, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Spain, Portugal, Sweden, Switzerland, and the United Kingdom. Australia has recently joined as a new member.
The EMBL database, also called EMBL-Bank, is Europe's primary nucleotide sequence resource. Its DNA and RNA sequences come mainly from submissions by researchers, from sequencing projects, and from patent applications. The sequence data are exchanged daily with the other two banks.
c) DDBJ
The DNA Data Bank of Japan (DDBJ), established in 1986, is a DNA sequence bank at the National Institute of Genetics (NIG) in Shizuoka. It is funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT). It is also 1 of the 3 members of the International Nucleotide Sequence Database Collaboration (INSDC). DDBJ exchanges data daily with EMBL-EBI and with GenBank at NCBI, so the three banks hold the same set of sequences at any given time.
Some specialized gene banks
Rice genome database
The rice genome is about 430 Mb in size, the smallest among the cereal crops sequenced so far. This is roughly 1/7 the size of the human genome and 3.5 times the size of the Arabidopsis genome. The rice genome project followed directly after the human genome project. In 1997, chromosome 1 was completed. By April 2000 and February 2001 the project had been completed but not yet published. The rice genome can now be accessed at http://rice.genomics.org.cn/rice/link/ar.jsp or through the NCBI genome database.
Arabidopsis gene bank
The current version of the Arabidopsis gene bank contains information on genes together with the variants that arise from splicing. In other words, the same gene can be expressed in different forms depending on its splice variants. The gene bank can be accessed at: http://www.atgc.org/Arabidopsis_Genome/
Plant genome database
Three main sources contribute to the plant genome bank: WGS (whole genome sequencing), GSS (genome survey sequencing), and ESTs (expressed sequence tags). The species in focus include Arabidopsis, rice, maize, and Medicago truncatula. The sequences comprise drafts from sequencing projects, followed by ESTs and cDNAs. The website giving access to plant genomes is: http://www.plantgdb.org/
Other genome resources
NCBI provides the genomes of more than 3,200 organisms, including both completed sequences and those still in progress. They can be accessed at:
http://www.ncbi.nlm.nih.gov/About/tools/restable_org.html

Role of the gene banks

- Storing sequence data: once received, sequences are checked, grouped, and placed in the appropriate databases. The result is a shared resource for everyone.
- Gene banks allow sequence data to be accessed, copied, and downloaded, and support data mining and data analysis.
- Providing a basis for publishing sequences: by submitting their biological sequences to a bank, scientists also publish their research results and create a citable record for the articles they will publish.

Summary of Chapter 4
1. A database is a store of data from many sources, classified by particular criteria so that users can access, search, cross-check, and compare the data easily. Biological databases are extremely important because they are the foundation for searching, mining, analysis, and prediction.
2. Biological databases are diverse and complex and are held in data centres. Sequence data (nucleotide and amino acid) are stored in gene banks, typically GenBank (USA), EMBL (Europe), and DDBJ (Japan). Besides sequence databases there are many other kinds: structure databases of macromolecules; gene banks for individual species; databases of journals and articles; and databases of chemicals, images, and protocols.
3. Today the databases are open resources: researchers can search and mine them free of charge and can also submit data from their own results to help build them. Databases usually come with supporting tools and software, such as search tools, graphical display tools, and software for comparing and cross-checking biological sequences.

Review questions for Chapter 4
1. What is a biological database? What kinds of data do biological databases contain?
2. State the role of the gene banks (GenBank, EMBL, DDBJ) and the relationship among them.
3. What are primary and secondary databases? State the differences between the two.
4. How are biological sequences submitted to the gene banks? Name a few typical tools.
5. What is a specialized genome database? Name a few genome databases and state their significance.
6. State the role and significance of the PubMed database.
7. List the main databases of NCBI and summarize the significance of each.
8. Find out about Ensembl. State the role of the Wellcome Trust Sanger Institute (WTSI).
9. What is a macromolecular structure database? What is its significance?


CHAPTER 5
DETERMINING SEQUENCES AND SUBMITTING SEQUENCES TO THE GENE BANK
5.1. Nucleotide sequencing
DNA sequencing is the process of determining the exact order of the nucleotides in a DNA molecule. Knowledge of DNA sequences has become indispensable in biological research and in related applications such as disease diagnosis, mutation analysis, and early detection of cancer. Sequencing techniques have become steadily more advanced, making it possible to sequence quickly anything from a single DNA fragment to an entire genome. Sequencing began in the 1970s. Two methods were in use at that time: the chemical degradation method of Maxam and Gilbert, and the chain-termination synthesis method of Sanger. In that period, although the Sanger method had been proposed first, the technical limitations of the time meant that the Maxam and Gilbert method predominated. Later, thanks to technical advances, the Sanger method came into general use, and the sequencing process was subsequently automated.
More recently many next-generation sequencing techniques have appeared, such as 454 pyrosequencing and Illumina (Solexa) sequencing, which allow a whole genome to be sequenced in a short time at relatively low cost. Although these new methods are much discussed, they still have many limitations compared with the earlier method. Single DNA sequences are still sequenced manually, either after cloning or by direct sequencing. The details of the sequencing techniques are beyond the scope of these lecture notes.
5.2. Genome sequencing
Concept
Genome sequencing is the determination of the entire DNA sequence in the genome of an organism, including mitochondria and (in plants) chloroplasts. In theory any sample, whether from epithelial cells, bone marrow, hair roots, seeds, or leaves, contains the complete genetic information in the form of DNA. For diploid organisms, whose chromosomes exist as homologous pairs, the DNA sequence is organized by chromosome within the haploid set.
Whole-genome sequencing has its own history. In 1977 the first complete genome to be sequenced was that of bacteriophage ΦX174, 5,386 bp in size. In 1995 the first bacterium, Haemophilus influenzae, was sequenced, with a genome of 1.8 Mbp. In 2000 the first plant genome was sequenced, that of Arabidopsis thaliana, 157 Mb in size. In early 2003 the human genome, 3.2 Gbp in size, was completed. In 2008 the 1000 Genomes Project was launched. To date the genomes of tens of thousands of species have been sequenced. With the rapid development of next-generation sequencing, genome sequencing can now be aimed at individuals, in little time and at low cost.

Principle of genome sequencing

Put simply, sequencing a whole genome means determining the order of all nucleotides, from start to end, of every chromosome in the haploid set of a species. Because chromosomes are very large, no technique so far can read the full sequence of a chromosome in a single run. To obtain the full-length sequence, the genome (the chromosomes) must be cut at random into fragments of a size suited to the technique in use. Each short fragment is then sequenced, and finally the fragments are joined back together into a complete chromosome. The shotgun technique was used to sequence the human genome over a period of about 10 years. The genome was first cut into relatively large pieces of several hundred kilobase pairs (kbp). These large DNA fragments were then inserted into vectors or plasmids able to carry large inserts, such as YAC, BAC, and cosmid vectors, which were transformed into and maintained in bacterial cells. Each bacterial clone (colony) carries one of the large DNA fragments. From such a clone, the DNA held in the plasmid is cut again into fragments of a size suited to the sequencing technique. Each small fragment is in turn inserted into a plasmid and introduced into bacterial cells, producing different clones that each carry 1 small DNA fragment. These fragments are then sequenced individually (reads) and finally matched and joined to reconstruct the full length of the chromosome. This reconstruction is called sequence assembly.
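
The random fragmentation step can be mimicked in a few lines. This is a toy simulation under simplifying assumptions: reads all have the same length, come from one strand only, and contain no errors. The function names `shotgun_reads` and `mean_coverage` and the 40-base "genome" are invented for illustration.

```python
import random


def shotgun_reads(genome, read_length, n_reads, seed=0):
    """Sample fixed-length reads from random start positions (toy model)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(genome) - read_length + 1)
        reads.append(genome[start:start + read_length])
    return reads


def mean_coverage(genome_length, read_length, n_reads):
    """Average number of reads covering each base: N * L / G."""
    return n_reads * read_length / genome_length


GENOME = "ATGCGTACGGATTTACCGGATAGCTAGCTTAGGCATCGAT"  # invented 40-base sequence
reads = shotgun_reads(GENOME, read_length=10, n_reads=20)
print(reads[:3], mean_coverage(len(GENOME), 10, 20))
```

Because start positions are random, some bases are covered by many reads and others by few or none, which is why real projects sequence to a mean coverage well above 1.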
More recently, next-generation sequencing techniques have been based on cutting the genome of an organism into short fragments and then selecting fragments of between 50 and 500 bp (depending on the technique and the sequencing instrument). These short fragments are sequenced and then joined to reconstruct the original genome. To ensure accuracy and efficiency, however, sequenced genomes of the same species are needed as reference data. Where no genome of the same species is available, genomes of evolutionarily close species can be used instead.
5.3. Sequence assembly
In principle, sequencing short DNA fragments is relatively simple. Sequence assembly, however, is a very complex process. Sequencing yields a large number of short sequences whose size varies with the method used. These short sequences are called reads. In principle, the reads are arranged and lined up against one another to detect overlapping regions, and are merged into longer stretches called contigs (Figure 22). The basic principle of assembly comprises 3 steps.
- Align the sequenced reads to detect the regions where they overlap.
- Place the pieces next to one another, each in its correct orientation.
- Arrange the larger pieces obtained in the previous step to infer the original sequence.
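
The three steps above can be sketched with a naive greedy overlap-merge procedure. This is a teaching sketch, not how production assemblers work: it assumes error-free reads from a single strand, ignores reverse complements and repeats, and compares every pair of reads, which is far too slow for real data. The function names `overlap` and `greedy_assemble` and the three reads are invented for illustration.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))       # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_length)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break                            # no overlaps left: separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ATGCGTAC", "GTACGGA", "CGGATTT"])
print(contigs)
```

Here the first two reads share the overlap GTAC and the merged piece shares CGGA with the third, so a single contig results. With repeated sequences the greedy choice can merge the wrong pair, which is one reason real genomes are much harder to assemble.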


Figure 22. Principle of sequence assembly (labels in the figure: coverage; short sequence reads; one contig)


During assembly, many reads are joined into contigs, and contigs are joined into scaffolds. A scaffold may contain regions whose sequence is unknown but whose length is broadly known, each flanked by 2 reads of known sequence. There may also be gaps between different scaffolds. These gaps are sequenced later on the basis of an analysis of the ends of the known scaffold sequences (Figure 23). The scaffolds are in turn assembled in the same way to give the complete genome sequence. A genome sequence is complete when all scaffolds have been joined in the correct orientation and no gaps or unknown regions remain. Besides complete genomes, the databases currently hold many genome sequences that are still at the stage of joining scaffolds or scaffolds and BACs.


Figure 23. Joining contigs to form scaffolds


With the previous generation of automated sequencing, reads are usually 800 bp to 1000 bp long. With next-generation methods such as Illumina (Solexa), each read is about 50 to 500 bp long, or longer, depending on the instrument generation. The problem is how to place the short sequences at their correct positions in the original DNA strand. Assembling short reads is harder than assembling long ones, especially for eukaryotic genomes. The reason is that these genomes contain many repeated sequences scattered through the genome, genes encoding rRNA, and repeats in heterochromatin. These regions can leave gaps that are hard to fill, and in some cases it may be difficult to guarantee a complete sequence with the required accuracy and reliability. CAP3 is a commonly used assembly program for small genomes (see the practical session); for other tools see Wikipedia:
http://en.wikipedia.org/wiki/Sequence_assembly
5.4. Sequence submission
Why submit sequences?
Submitting a sequence to a database is also a way of publishing research results. Scientists who submit sequences to GenBank can ask for their data to be held confidential for a requested period. This provides evidence that the data were deposited in GenBank before the research was published. Citing the accession number of a sequence is one of the requirements for publishing an article. Once the article citing the sequence or its accession number has been published, the sequence is released and anyone can access it. Sequence submission also helps to build the shared databases and contributes to genome sequencing projects and to genome annotation (part of decoding a genome).
Direct submissions
A typical GenBank submission consists of a single DNA or RNA sequence accompanied by detailed annotation. Annotation here means the biological information supplied with the sequence; it must follow the standards of INSDC (International Nucleotide Sequence Database Collaboration). Sequences can be submitted one at a time or several at once. Submitting several sequences at once is called a batch submission.

Submitting large numbers of sequences (HTGS)

Large-scale HTGS (High-Throughput Genomic Sequence) submissions are usually made by genome sequencing centres with automated systems. About 30 genome centres are currently submitting sequences from many species, such as human, mouse, rice, and the malaria parasite.
HTGS data are submitted in 4 phases: 0, 1, 2, and 3. In phase 0, the sequences are one-to-few reads of a single clone and are usually not assembled into contigs. These are low-quality sequences, often used to check whether another centre has already sequenced part of the clone. In phase 1, the entries are assembled into contigs separated by sequence gaps, and the relative order and orientation of the contigs are not yet known (see Figure 24). In phase 2, the entries are unfinished sequences that may or may not contain gaps; if gaps are present, the contigs are in the correct order and orientation. In phase 3, the sequences are finished, of high quality, and contain no gaps.

Figure 24. Diagram of the orientation and the gaps that can be seen in the HTGS phases

Phases 0, 1, and 2 are records in the HTG division of GenBank, whereas phase 3 entries go into the taxonomic division of the organism, for example PRI (primate) for human. An entry keeps its accession number as it passes from one phase to the next, but it receives a new version number and a new gi number each time the sequence changes.
Submitting Data to the HTG Division
To submit a batch of sequences to HTG, the submitter needs an FTP account. To obtain one, write to htgs-admin@ncbi.nlm.nih.gov. Two tools can be used for submission: Sequin or fa2htgs. Both require the sequence to be in FASTA format. The HTG processing pathway is as follows: the submitter creates the submission file with Sequin or fa2htgs and sends it through the FTP account. Processing of a submission may take several hours or a full day.
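
Since both tools expect FASTA input, a minimal sketch of reading and writing this format may help. It covers only the basic layout (a header line starting with ">" followed by sequence lines) and is not a replacement for the validation done by Sequin or fa2htgs. The function names `parse_fasta` and `format_fasta` and the record name `clone_A` are hypothetical.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""   # first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def format_fasta(name, sequence, width=60):
    """Write one record with the sequence wrapped at a fixed line width."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


example = format_fasta("clone_A", "ACGT" * 20, width=30)
parsed = parse_fasta(example)
print(example)
```

Wrapping the sequence at a fixed width is a convention rather than a requirement, but it keeps files readable and is what most tools produce.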
Three common errors in submission
- Format: the file is not in the correct Seq-submit format.
- Parameters: the genome center tag, sequence name, accession number, or other supplied information is incorrect.
- Data: the data are corrupted.


Khi qu trnh x l ng k HTG khng thnh cng, GenBank s gi email ti trung


tm trnh t, thng bo cc li. iu ny gip cho cc nhn vin ca trung tm ng k
genome sa cha cc vn trong CSDL ca h.
For a successful submission, two files are produced: one contains the submitted
record in GenBank flat file format (without the sequence), and the other is a status
report. The status report lists the genome center, sequence name, accession number,
phase, creation date, and update information for the submission. A failed submission
receives an error file with a description of the errors. GenBank staff also send an
email explaining the errors in more detail.
Review process
After a submission has been processed successfully and entered into GenBank, it
goes through a review. If GenBank staff find errors or incomplete information, they
write to the submitter and ask for the errors to be corrected and an update to be
submitted.
Submitting whole genome shotgun sequences (Whole Genome Shotgun Sequences/WGS)
Genome centers use several approaches to determine the complete genome sequence
of an organism. In addition to the traditional clone-based approach described above,
whose data are regularly submitted to HTGS, these centers now often use the WGS
approach. The reads obtained by shotgun sequencing are assembled into contigs, and
these contigs are now accepted for inclusion in GenBank. WGS contig assemblies can
be updated as the sequencing project progresses or when new assemblies are computed.
WGS sequences can also carry annotation, just like other sequence records in
GenBank.
Each sequencing project is assigned a fixed project ID made up of 4 characters.
The accession number of a WGS sequence contains the project ID, followed by 2 digits
for the version number and 6 digits for the contig ID. For example, a genome
sequencing project might be assigned the WGS accession number AAAX00000000. The
first assembly version would then be AAAX01000000. The last six digits identify
each individual contig.
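
The accession layout just described can be checked mechanically. The following is a minimal sketch in Python, assuming exactly the 4 + 2 + 6 layout given in the text; the function name and the `is_project_level` flag are illustrative choices, not part of any NCBI tool.

```python
import re

# Layout taken from the text: 4 letters (project ID),
# 2 digits (assembly version), 6 digits (contig ID).
WGS_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession):
    """Split a WGS accession into project ID, version, and contig number."""
    match = WGS_ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS accession in the 4+2+6 layout: {accession!r}")
    project, version, contig = match.groups()
    return {
        "project": project,
        "version": int(version),
        "contig": int(contig),
        # Assumption for illustration: all-zero contig digits mean the
        # accession names the project or assembly, not a single contig.
        "is_project_level": int(contig) == 0,
    }


if __name__ == "__main__":
    for acc in ("AAAX00000000", "AAAX01000000", "AAAX01000042"):
        print(acc, parse_wgs_accession(acc))
```

Real WGS accessions issued today may use longer prefixes or more digits, so the pattern would need to be relaxed before use on current data.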
WGS submissions can be prepared with tbl2asn, a utility program distributed with
the Sequin software. Detailed information about the submission process is available
on the Whole Genome Shotgun Submissions web page.
Bulk submissions: EST, STS, and GSS
Expressed Sequence Tags (ESTs), Sequence Tagged Sites (STSs), and Genome Survey
Sequences (GSSs) are usually submitted to the gene bank in batches and are usually
part of a large sequencing project or of the sequencing of a particular genome.
These entries go through a streamlined submission process and are also processed
before they are entered into GenBank.
ESTs are generally short (<1 kb), single-pass cDNA sequences from a particular
tissue at a particular developmental stage. A common feature of ESTs is that very
little annotation is available for them.
STSs are short sequences that occur only once in the genome and whose position
(on the chromosome) is known. STSs can be amplified by PCR, which is why they are
commonly used for mapping.
GSSs are short sequences derived from genomic DNA, and the annotation available
for them is very limited. GSSs include single-pass GSSs, BAC ends, exon-trapped
genomic sequences, and Alu PCR sequences.
EST, STS, and GSS sequences are kept in their own divisions of GenBank rather than
being classified by organism. Within GenBank these are the databases dbEST, dbSTS,
and dbGSS.
Submitting data to dbEST, dbSTS, or dbGSS
Normally the submitter prepares files in a prescribed format that contain the
sequences to be submitted. The files are then sent by email to
batchsub@ncbi.nlm.nih.gov. If a file is too large for email, it can be transferred
by FTP through an FTP account. After screening and checking by GenBank, the
sequences are loaded into the corresponding databases and accession numbers are
returned to the submitter.
Further reading: http://www.ncbi.nlm.nih.gov/books/NBK21105/
5.5. Sequence submission tools
There are several ways to submit a sequence. Through a web interface, one can use
BankIt from NCBI, Webin from EMBL, or the DDBJ tool (Nucleotide Sequence
Submission). Offline submission is possible with NCBI's Sequin. Several of these
submission tools are introduced in this lecture.
5.5.1. Using the World Wide Web (WWW)
BankIt
BankIt is a web-form submission tool that is convenient and easy to use for a small
number of sequences. The accompanying information it asks for is usually not
extensive, but the form must be filled in completely. Vector contamination (sequence
of the vector used for cloning and sequencing that is mixed into the read) must be
removed by running BLAST against the VecScreen database. The completed submission
is saved in ASN.1 format, and a confirmation email shows that the submission has
been completed.

Figure 25. The BankIt sequence submission tool

Figure 26. The DDBJ sequence submission tool

The European gene bank EMBL provides the Webin tool at:
http://www.ebi.ac.uk/ena/about/submit_and_update

Figure 27. The sequence submission tool at EMBL

NCBI provides the Sequin tool at:
http://www.ncbi.nlm.nih.gov/Sequin/index.html

5.5.1. Information to prepare before submitting a sequence

Contact information
Personal details include name, email address, institutional address, and telephone
number.
Nature of the sequence
Is the sequence of genomic or mRNA origin? Database users want to know the
physical origin of the DNA fragment. For example, although a cDNA sequence is
determined from DNA (not RNA), the form of the molecule that exists in the cell is
mRNA. The same reasoning applies to genomic sequences of rRNA genes, where the
molecule that was sequenced is almost always genomic DNA. A submitted sequence
should represent a single molecule type. It cannot be a mixture of genomic DNA and
mRNA, because no such mixed sequence exists in a living organism.
Sequence accuracy
A submitted sequence must be highly accurate, and DNA should be read in both
directions. In addition, for DNA fragments cloned into a vector, the submitter must
check whether any vector sequence remains in the read. Several tools support this
check, including tools provided by NCBI.
Source organism
The organism from which the DNA sequence originates must be stated clearly. NCBI
provides a tool for looking up the taxonomic position (taxon) of an organism.
Citations
The more detailed the information supplied with a submission the better, including
papers that have been published or are planned for publication.
Coding sequences (coding sequence, CDS)
When a protein-coding gene is submitted, identifying the coding region is
mandatory. Today almost all amino acid sequences are obtained by translating a
nucleotide sequence. Translating nucleotides into amino acids is not difficult, but
note that the genetic code differs in some organisms.
Protein databases such as SWISS-PROT and PIR are now built largely from protein
sequences translated from DNA sequences (with the DNA sequence attached). Supplying
the protein sequence is therefore an essential part of a submission. When the
submitted sequence is derived from mRNA, indicating the ORF is also very important.
Several tools support this, for example ORF Finder from NCBI. The exons and introns
of a DNA fragment should also be indicated at submission.
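
To make the idea of locating an ORF and translating it concrete, here is a minimal Python sketch. It assumes the standard genetic code and scans only the forward strand for ATG...stop frames; it is not the algorithm used by NCBI ORF Finder, and all names are illustrative. For organisms with a different genetic code, `CODON_TABLE` would have to be replaced.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a nucleotide string codon by codon ('X' for unknown codons)."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(dna, min_codons=3):
    """Return (start, end, protein) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(dna[start:i])))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGGCTAAATGACC"))
```

A fuller version would also scan the reverse complement and accept alternative start codons.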
Other relevant information must also be supplied, including sequence length and
molecule type. Finally comes the most important part: pasting the sequence or giving
the path to the file that contains it.
The remaining steps are carried out in order, following the instructions in WebIn.
If the submission is accepted by EMBL, the submitter receives the necessary
information and the accession number by email at the registered address. Note that
direct submission through WebIn depends on Internet connectivity and access speed.
The information required by WebIn is complex and must be accurate, so to make sure
the submission succeeds, prepare all information about the sequence thoroughly
beforehand.

5.5.2. Example: submitting a sequence with WebIn

Webin from EMBL offers three options to the submitter: (i) submitting raw read
data, that is, sequences of DNA fragments or genes as individual reads (read data);
(ii) submitting assembled and annotated sequences (assembled sequence and/or
annotation); (iii) submitting genome assemblies (Figure 28).
The submission process is fairly simple. The first step is to register with EMBL
to open an account, which is done by contacting EMBL at datasubs@ebi.ac.uk. Once
the account exists, the submission steps are carried out in order, following the
on-screen instructions.

Figure 28. The SRA Webin submission page

5.5.3. Example: submitting a sequence with Sequin

a/ Introduction
Sequin is a stand-alone program developed by NCBI for submitting and updating
sequences in the GenBank, EMBL, and DDBJ databases. Sequin can handle long sequences
as well as sets of sequences. It allows the submitter to edit and update records and
to supply the annotation needed for submission, and it includes a number of other
functions. Details about Sequin are available at
http://www.ncbi.nlm.nih.gov/Sequin/ or by email: info@ncbi.nlm.nih.gov
b/ Basic structure of Sequin
Sequin consists of a series of forms with a simple, easy-to-use interface:
- Information about the submitting author
- Organism name and the sequence being submitted
- Strain name, gene name, and protein name
- Review of all the information prepared for submission
- Editing and annotating the sequence

c/ Before you begin:
Prepare the nucleotide and amino acid sequence data. Sequin normally reads
sequences in FASTA format, but it also accepts PHYLIP, NEXUS, MACAW, or FASTA+GAP.
For details of the file formats, see:
http://www.ncbi.nlm.nih.gov/Sequin/faq.html#Orgnameforphyl
The files must be plain text using ASCII characters.
d/ Submitting the sequence
The example used here is the genomic sequence from D. melanogaster that encodes
the two initiation factors 4E-I and 4E-II (GenBank accession number U54469).
After the sequence files have been prepared, start the Sequin program.
The first Sequin form appears as follows:

If anything is unclear, consult the help that comes with Sequin (Show Help), as
shown in the figure below.
To begin, click the Start New Submission button. The first submission page opens
as follows:
Submitting author form
This page asks the submitting author for the following information: affiliation,
name, and contact details.

Sequence format form

For the submission of one or several simple sequences, the data must be in FASTA
format; sets of aligned sequences can be supplied as PHYLIP, NEXUS, MACAW, or
FASTA+GAP. If the sequences are unrelated to one another, it is best to submit them
one at a time.
Organism and sequences form

At this step, the nucleotide and protein sequences must already be saved as plain
text files (for example with Notepad). To load them (Import nucleotide FASTA or
Import protein FASTA), simply give the path to the file and click the Next Page
button.
Reviewing the record before submission
After the steps above are complete, Sequin displays the record in standard GenBank
format. The result is saved to a file that is sent to the gene bank by email. After
the submission has been processed, the gene bank informs the submitter whether it
has been received and accepted. If it is accepted, the submitter is given an
accession number made up of letters and digits, for example U54469.

Chapter 5 summary
1. Thanks to technical advances, the nucleotide sequences of organisms can now be
determined very quickly, and the complete genome of an organism can be
sequenced in a few days. The modern sequencing techniques in use today are
called next-generation sequencing techniques (to distinguish them from the
Maxam-Gilbert technique and the automated Sanger method). Next-generation
methods include pyrosequencing, Illumina (Solexa) sequencing, and SOLiD
sequencing.
2. Assembly of short reads is based on sequence alignment. Today the assembly
process is supported by software and powerful computer systems.
3. Nucleotide sequences and whole genome sequences are submitted to gene banks
with dedicated submission tools. This work is highly significant: besides
publishing the results of the scientists involved, it helps build a worldwide
database of the genome sequences of organisms.

Chapter 5 review questions
1. Describe the principle of the Maxam-Gilbert sequencing method.
2. Describe the principle of the automated Sanger sequencing method. By which
principle and method was the human genome sequence determined?
3. Describe the principles of pyrosequencing, Illumina (Solexa) sequencing, and
SOLiD sequencing.
4. Using search tools, find out about other sequencing methods and state their
principles.
5. Explain the principle of sequence assembly. Give an example of a tool that
supports sequence assembly.
6. List the tools that support submitting sequences to a gene bank.
7. What is the significance of submitting sequences?
8. Use Sequin to prepare the submission of a sequence (any sequence) to a gene bank.

PART 3
ANALYSIS TOOLS
MINING AND PROCESSING BIOLOGICAL SEQUENCE DATA
CHAPTER 6. GENOME BROWSER

6.1. The genome browser concept


A genome browser is a graphical interface for displaying genome-related data from
a biological database. It provides basic information such as genome size, number of
chromosomes, and number of genes, as well as more specific information such as the
positions of genes, coding regions (CDS), STSs, and ESTs, given as physical
coordinates on the chromosome. Gene function and DNA sequences are annotated in
detail. There are also data for comparing genes and DNA regions between related
species. A genome browser lets researchers view and search the information contained
in genomes, look up gene function, and obtain predictions about proteins, expression,
regulation, and variation within the genome.
A genome browser differs from an ordinary database in that its annotation data are
drawn from many source databases. It also differs from those source databases in
how the data are presented: graphically, with genome coordinates along a horizontal
axis and annotations drawn as filled or colored blocks that mark the presence of
genes and other elements such as intergenic regions, introns, and polymorphic sites
(SNPs).
Genome browsers also provide secondary information, for example conserved regions
in the genome, related genes or gene families, and the corresponding proteins.
Recently, protein structure databases have been integrated into genome browsers as
well. Basic analysis tools for searching and for comparing sequences and biological
structures have been added to support the different aims of researchers. Genome
browsers have also been built for specific species, for example human, mouse, rice,
maize, soybean, and Arabidopsis. The three best-known genome browsers are Ensembl
Genomes, NCBI's Map Viewer, and the University of California Santa Cruz (UCSC)
Genome Browser. Each provides a graphical interface and features that help users
find information about genes and gene features such as exons, noncoding regions,
and their variants. There are also many other genome browsers dedicated to one or a
few genomes, for example rice (Rice genome browser, supported by NSF), Arabidopsis
thaliana (at PlantGDB), and maize (MaizeGDB).
6.2. Introduction to some important genome browsers

6.2.1. Ensembl
Ensembl is a joint project between EMBL-EBI and the Sanger Institute to develop a
software system that produces and maintains annotation for selected eukaryotic
genomes. Ensembl was originally funded by the Wellcome Trust. The website gives
free access to all data and software from the Ensembl project.
The Ensembl project provides genome databases for vertebrates and other eukaryotic
species and makes this information freely available online. In Ensembl, the human
genome currently contains about 3.2 billion base pairs, encoding about 20,000 to
25,000 genes. A genome browser does not only provide the genome sequence. Most
importantly, it provides the positions of, and relationships between, the genes that
have been annotated and located on the chromosomes. In the early days, when sequence
information was limited, scientists annotated and located genes manually, using data
from experiments, scientific journals, and databases. Because the annotation was
manual, the information was tightly controlled and reviewed by experts, so the data
were highly accurate. However, this process demands a great deal of time and
effort. As data accumulated and more and more genome sequences were read, manual
annotation could no longer keep up. This is why algorithms for automated genome
annotation were developed. In the Ensembl project, sequence data are fed into a
software pipeline written in Perl, which produces a set of predicted gene locations
and stores them in a MySQL database for later analysis and display. Ensembl makes
these data freely accessible and downloadable worldwide.

Figure 29. Ensembl genome browser

Over time the project has expanded to cover many species (including model animals
such as mouse, fruit fly, and zebrafish) as well as a wide range of genomic data,
including genetic variation and features that regulate gene expression. Since April
2009, a sister project, Ensembl Genomes, has extended the scope of Ensembl to
invertebrate metazoa, plants, fungi, bacteria, and protists, while the original
project continues to focus on vertebrates.
Genome data in Ensembl are currently divided into several groups. The most commonly
used group contains the genomes of human, mouse, and zebrafish. The primate group
contains 10 primates. There are also groups for birds, reptiles, amphibians, fungi,
and other organisms.

Figure 30. The human genome in Ensembl

6.2.2. UCSC
The University of California, Santa Cruz (UCSC) set up a resource called the UCSC
Genome Browser that holds the genome sequences of a large number of organisms,
including both vertebrates and invertebrates. These sequences are organized and
annotated in detail. The Browser is a graphical viewer that supports fast searching
of, and access to, the database at many levels. UCSC has recently expanded the
number of genomes in the database, which now covers hundreds of species.

Figure 31. UCSC Genome browser

The UCSC Genome Browser carries out the following tasks: aligning mRNA sequences,
marking repetitive DNA elements, predicting genes in the genome, presenting data on
expressed genes, presenting sequences associated with diseases (through the links
between genes and diseases), and mapping commercially available gene chips (for
example from Illumina and Agilent).
Normally the genome sequence is displayed horizontally, and graphical symbols
indicate the positions of mRNAs, genes, ESTs, and so on. Blocks of different colors
along the horizontal axis show the positions of sequences aligned from many
different data sources. The user can zoom in or out for convenient viewing. The more
detailed the annotation, the higher the resolution at which the sequence can be
displayed in the genome browser.
To find a particular gene or genomic region, the user can type a gene name (for
example BRCA1), an accession number for an RNA, the name of a chromosome band (for
example 20p13, band 13 on the short arm of chromosome 20), or a chromosomal position
(for example chr17:38,450,000-38,531,000 for a region around the BRCA1 gene).
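
A position string of this kind is easy to handle in a script. The sketch below, in Python, parses the `chrN:start-end` notation from the example above; the function name is illustrative, and the length calculation assumes that both coordinates are inclusive.

```python
import re

# chromosome name, then start and end with optional thousands separators
POSITION = re.compile(r"^(chr[\w.]+):([\d,]+)-([\d,]+)$")


def parse_position(text):
    """Parse 'chr17:38,450,000-38,531,000' into (chrom, start, end)."""
    match = POSITION.match(text.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized position string: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not be greater than end")
    return chrom, start, end


if __name__ == "__main__":
    chrom, start, end = parse_position("chr17:38,450,000-38,531,000")
    # Length under the assumption that both ends are inclusive.
    print(chrom, start, end, end - start + 1)
```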
Because the data are displayed graphically, the user can see detailed information
by hovering over or clicking on an item. The UCSC Genes track also provides links to
detailed information about the gene of interest in other data sources, such as
Online Mendelian Inheritance in Man (OMIM) and SwissProt. UCSC is designed to
display complex quantitative data and therefore needs to be fast. Because about 55
million GenBank RNAs have been aligned in advance to each of the sequenced genomes,
UCSC gives the user immediate access to the alignment of any RNA against the genome
of any species it hosts.
One feature that distinguishes UCSC from other genome browsers is its flexible,
continuously scalable display. A sequence of any size can be shown with its
annotation, from a single nucleotide up to a whole chromosome (for example human
chromosome 1, which is 245 Mb long). Researchers can display a gene, an exon, a
whole chromosome, thousands of genes, or a combination of several features. With
the familiar drag and drop action, the user can select any region of the genome and
zoom it to fill the screen.
Researchers can also use the genome browser to display their own data with the
Custom Tracks tool. This tool lets the user upload a file containing their own data
and view it at different levels of detail. The user can also use UCSC data to build
data subsets with the Table Browser tool (for example SNPs that change the amino
acid sequence of a protein) and display these subsets in the Browser as a Custom
Track.
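
As an illustration of what such an uploaded file can look like, the Python sketch below writes a small track in BED format, one of the plain-text formats commonly used for custom tracks. The choice of BED, the track name, and the example region are assumptions made for this sketch and do not come from the lecture. BED coordinates are 0-based and half-open, so 1-based inclusive positions are shifted by one at the start.

```python
def make_custom_track(name, description, features):
    """Build the text of a BED custom track.

    `features` holds (chrom, start, end, label) tuples with 1-based,
    inclusive coordinates; BED wants 0-based starts and exclusive ends.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical region of interest, reusing the BRCA1 window from the text.
    text = make_custom_track(
        "myRegions",
        "Regions of interest",
        [("chr17", 38450000, 38531000, "BRCA1_region")],
    )
    print(text, end="")
```

The resulting text can be saved to a file and supplied to the browser's custom track upload form.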
UCSC is more than a genome browser. It also hosts a set of genome analysis tools,
including a full-featured GUI for mining the information in the browser database
(Table Browser) and a fast alignment tool (BLAT) that is very effective for finding
sequences in a very large collection. The liftOver tool uses whole-genome alignments
to convert sequence coordinates from one assembly to another or between species. The
Genome Graphs tool lets the user view all chromosomes at once and display the
results of genome-wide association studies (GWAS). The Gene Sorter tool displays
genes grouped by criteria that are unrelated to genome position, such as expression
patterns across tissues.

6.2.3. NCBI Genomes and MapViewer

NCBI Map Viewer helps the user answer the following questions and requests:
- What is the size of the genome and the number of chromosomes?
- What is the physical distance between two genes?
- What is the physical position of a gene whose position on the genetic map is
known?
- What are the positions and the order of the genes on a chromosome?
- Where is a gene of interest in the genome of an organism, and which markers
flank it?
- Display the genes in a given region of a chromosome together with the
corresponding sequence data for that region.
- Display the region of a chromosome between any two points. Display both the
genetic map and the sequence map of that region and align them to each other
using the markers present on both maps.
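
Two of the questions above (the distance between two genes, and the genes within a region) reduce to simple interval arithmetic once gene coordinates are known. The following Python sketch uses made-up gene names and coordinates purely for illustration; it does not query Map Viewer.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def physical_distance(a, b):
    """Number of bases between two genes on the same chromosome (0 if they overlap)."""
    if a.chrom != b.chrom:
        raise ValueError("genes are on different chromosomes")
    first, second = sorted((a, b), key=lambda g: g.start)
    return max(0, second.start - first.end - 1)


def genes_in_region(genes, chrom, start, end):
    """Names of genes overlapping chrom:start-end, ordered by start position."""
    hits = [g for g in genes if g.chrom == chrom and g.end >= start and g.start <= end]
    return [g.name for g in sorted(hits, key=lambda g: g.start)]


if __name__ == "__main__":
    # Hypothetical genes and coordinates.
    genes = [
        Gene("geneB", "chr1", 5000, 7000),
        Gene("geneA", "chr1", 1000, 2000),
        Gene("geneC", "chr2", 1500, 2500),
    ]
    print(physical_distance(genes[0], genes[1]))
    print(genes_in_region(genes, "chr1", 1, 6000))
```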
NCBI Map Viewer displays data at 4 levels:
- Home Page: shows general information about an organism and summarizes the
general information sources available for it.
- Genome View: shows the complete genome as a map of the chromosome set and lets
the user search for information across the whole genome by keyword or gene
name. The search results are marked on the chromosome map.
- Map View: shows one or several maps for the selected chromosome, arranged
against a Master Map, and lets the user view regions of interest at different
resolutions.
- Sequence View: shows the sequence data for a given chromosome region and
presents the biological features of that region graphically.
Figure 31. Positions of the human cytochrome genes

Figure 32. Display levels for the human genome in NCBI Map Viewer


6.3. Characteristics and applications of genome browsers
Characteristics
A genome browser is a kind of integrated database in which genome information for
a number of species is organized graphically and linked to many other databases.
Put simply, a genome browser can be thought of as an anatomical map of the genome,
in which every part of the genome is located on the map together with the
biological information related to it. The highest resolution of the map is the
nucleotide sequence of each chromosome. The position of a gene or DNA fragment can
be displayed at different levels of detail, depending on what the user needs. In
addition, as in all genome browsers, information about DNA sequences or locus
positions is usually linked to annotation from databases such as EST and UniGene.
Applications of genome browsers
A genome browser lets the user view information about the genes of a species in
the context of the other genes on the chromosome. It also helps researchers compare
the position of one or more loci on the chromosome between species. A genome browser
helps answer questions such as: which species have been sequenced, what is the
status of the sequence assembly, how many chromosomes does a species have, how many
genes are predicted, where on the chromosome is a gene located and which genes are
its neighbors, which genes are closely related to it (paralogs and homologs), which
molecular markers are associated with the gene, and where is the corresponding
locus in closely related species that have been sequenced and annotated. Many other
kinds of data can also be displayed in a genome browser, for example the complete
SNP data (dbSNP) at NCBI, which have been mapped onto the genomes of human, mouse,
and other organisms.

Chapter 6 summary
1. A genome browser is a web-based viewer for searching and displaying information
about the genomes of organisms through a graphical interface. It provides
basic information including genome size, number of chromosomes, chromosome
maps, number of genes, and the positions of genes and absolute distances
between them. Detailed information about genes, gene function, and loci is
also annotated in detail through links to other databases.
2. The three genome browsers Ensembl, UCSC, and NCBI Map Viewer describe the
genomes of many species in full. Their databases currently hold genome
information for more than 1000 species.
3. Genome browsers provide a great deal of useful information to researchers, such
as the position of a gene in the genome, comparison of gene loci between
species, closely related genes, molecular markers associated with a gene, the
corresponding locus in closely related species, the function of a gene and its
protein product, and the related protein families.
Chapter 6 review questions

1. What is a genome browser? Which genome browsers are in common use today?
2. How does the UCSC Genome Browser help researchers? How many genomes can be
found in the UCSC Genome Browser?
3. How can the position of a gene in the genome be determined, and how can the
physical position of the gene (its locus) on the chromosome be displayed? Give
an example.
4. What is Gene Sorter? What is it used for?
5. What are the applications of the BLAT tool at UCSC?
6. List the tools provided by Ensembl and give examples.
7. Explore the genome browsers and explain how to download sequences and their
annotation.
8. What are the applications of the In silico PCR tool at UCSC? Give an example.
9. Explore the genome browsers and explain how to identify genes that share an
evolutionary origin with a given gene. Give an example.
10. Using genome browser tools, compare the locus position of a corresponding gene
in the human and mouse genomes.
11. Describe the differences between genome browsers and the 1000 Genomes
sequencing project.

CHAPTER 7
GETTING STARTED WITH TOOLS FOR ANALYZING BIOLOGICAL DATABASES

7.1. Getting started with the basic analysis tools

This chapter introduces the tools commonly used to analyze biological databases.
Because the aims of research and database analysis vary widely and depend on the
user, classifying the analysis tools is genuinely difficult. The basic analysis
tools are integrated into databases such as GenBank, EMBL, DDBJ, and many others.
Other analysis tools are integrated into separate websites. Table 1 summarizes the
basic tools and tool groups and how often they are used.
Table 1. Analysis tasks and their frequency of use

Analysis task                                                    Frequency of use (%)
Similar sequence searching                                       35
   Nucleic acid vs. nucleic acid                                 9
   Protein vs. protein                                           12
   Translated amino acid sequence                                2
   Noncoding DNA sequences                                       3
   Other sequences                                               9
Finding functional and conserved regions (domain, motif)         11
Finding and copying sequences                                    8.5
Multiple sequence alignment                                      7
Restriction map construction                                     6
Secondary and tertiary structure prediction                      4.5
DNA sequence analysis, including translation                     4.5
Primer design for PCR, hybridization                             4
Open reading frame (ORF) identification                          3.5
Literature searching                                             3
Phylogenetic analysis                                            3
Protein analysis (features, physical and chemical properties)    3
Sequence assembly                                                2.5
Gene expression studies                                          2.5
Miscellaneous tools                                              2.5
Total                                                            100

In this chapter the analysis tools are introduced in a simple way, based on their
most common applications. Because the tools are usually integrated into websites
and databases, they may change to some extent whenever those websites and databases
are updated.

7.1.1. Finding and copying sequences

Searching for sequences and downloading them to a local computer for various
purposes is one of the first things a researcher usually does. As described in the
previous parts, biological sequences are stored in databanks, most notably GenBank,
the European databank EMBL, and the DNA Data Bank of Japan (DDBJ).
To search for a sequence, the researcher needs a tool called a browser and then
enters the sequence name, gene name, gene product, or other information related to
the sequence. In the past, when the number of genes and proteins discovered and
registered in the databanks was still small, naming was relatively simple and easy
to manage. With the rapid development of sequencing techniques and of methods for
determining gene and protein function, however, a very large number of sequence
names has been created, including names for homologous sequences and homologous
genes. Managing these data becomes even harder when databases are built on
cross-links (cross database). One of the current difficulties in sequence searching
is therefore the standardization of names. The term ontology was introduced for the
field that studies how to standardize the names of biological sequences, genes, and
proteins.
Figure 33 shows the NCBI database interface. Depending on the kind of data being
sought, the user selects the appropriate database, types the name or other
information about the sequence, gene, or protein into the search box, and clicks
Search. The corresponding information is then returned. To search for DNA or protein
sequences, the user can choose the nucleotide, gene, protein, EST, or STS databases,
among others. Once NCBI has returned the results, the user can copy the sequence or
download it to the computer.

Figure 33. The NCBI interface and its databases

7.1.2. Tools for similar sequence searching

This is the most heavily used group of tools because of its practical value. In
essence, these tools find sequences in the databases that are similar to a given
sequence. The given sequence is one or more sequences that the researcher is
interested in and wants to learn more about. Each tool in this group has its own
features, but they all provide a box or a file path where the user can copy and
paste the sequence of interest (also called the query sequence), and they let the
user choose the database to be searched for similar sequences.
The typical tools in this group are BLAST and FASTA. They search quickly and return
the sequences in the database that are similar to the query sequence. In essence,
the search compares the query sequence with the sequences in the database. The
principle of sequence alignment is presented in the next chapter.
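
The idea of comparing a query with every database sequence can be illustrated with a toy ranking by shared short words (k-mers). This Python sketch is only a simplified stand-in for the word-matching step that fast search tools start from; it is not the BLAST or FASTA algorithm, and the sequence names and data are invented.

```python
def kmers(seq, k):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_kmers(query, database, k=3):
    """Rank database entries by the number of distinct k-mers shared with the query."""
    query_words = kmers(query.upper(), k)
    scores = [
        (name, len(query_words & kmers(seq.upper(), k)))
        for name, seq in database.items()
    ]
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Invented mini-database for illustration.
    database = {"s1": "ACGTACGT", "s2": "TTTTTTTT", "s3": "ACGTTTTT"}
    for name, score in rank_by_shared_kmers("ACGTAC", database):
        print(name, score)
```

Real tools go on to extend such word hits into alignments and attach a statistical significance to each match.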
Similar sequence searching matters both in research and in practical applications.
It rests on the premise that DNA or protein molecules with similar sequences have
similar structures and functions. For example, the structure of a protein is
determined by the order of its amino acids, and structure is closely tied to the
function of the molecule. If two or more proteins have identical or similar amino
acid sequences, their structures will also be identical or similar, which makes it
likely that they share the same biological function. On this basis, searching for
identical or similar sequences gives the researcher a great deal of information. A
researcher may, for example, have the nucleotide sequence of a gene but no
information about it: the function of the gene, the features, properties, and
structure of the protein it encodes, and the species in which the gene occurs.
Similar sequence searching also answers questions about evolution and supports the
analysis of mutations or polymorphisms among sequences from different species or
from the same species. Genome comparison likewise relies on comparing similar or
homologous sequences and helps clarify gene function, discover gene families, and
determine origins or evolutionary relationships. This group of tools is also useful
for designing PCR primers and probes for nucleic acid hybridization, and for finding
target sequences for RNAi.
BLAST (Basic Local Alignment Search Tool) and FASTA (FAST-All) are the two most
widely used families of similarity search tools today. Of the two, BLAST is used
more often and has many variants, including the specialized BLAST programs
(Specialized BLAST). The query sequence can be either a nucleotide or an amino acid
sequence. BLAST and FASTA are introduced in later sections. Many other tools support
this task as well, such as Sequence Similarity Search/SSS from EMBL or the
Proteomics tools from ExPASy. Some screenshots of the search tools in this group are
shown below.

Figure 34. The commonly used BLAST tools

Figure 35. The specialized BLAST tools

Figure 35. The Sequence Similarity Search/SSS tools from EMBL

Figure 36. The Proteomics tools from ExPASy

7.2. Finding functional and conserved regions

Nucleic acids and proteins are macromolecules built from monomers: nucleotides and
amino acids. The order of these monomers determines the features, properties, and
functions of the macromolecule. Not all nucleotides or amino acids in a nucleic acid
or protein contribute equally to its biological function. Only some residues, or a
particular region of the sequence, carry out that function. Another important
property is that molecules with similar structures and functions contain regions
that are identical or nearly identical in sequence. Molecules that share such
regions may therefore perform the same biological function or derive from a common
ancestor.
In a protein, the amino acid regions that form structural units (for example the
active site of an enzyme) are called domains. Sequence regions that are
characteristic of a protein family are called motifs. When proteins contain similar
regions or a specific arrangement of amino acids, these regions are called
patterns. Finding and identifying domains, motifs, and patterns is very important
for studying the structure, function, and evolutionary relationships of molecules.
Today, sequence comparison makes it possible to recognize, identify, or predict the
structure and function of macromolecules.
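
Searching a protein for a pattern can be sketched with regular expressions. The Python example below converts a pattern written in PROSITE-style syntax into a regular expression and reports every match, including overlapping ones. The PROSITE notation and the N-glycosylation pattern N-{P}-[ST]-{P} are brought in here as an illustration and are not discussed in the lecture; the test sequence is invented, and only the basic syntax elements (x, [..], {..}, repeat counts) are handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                # any residue
            core = "."
        elif core.startswith("{"):     # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        # '[ST]' and single letters are already valid regex fragments
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_pattern(protein, pattern):
    """Return (0-based position, matched text) for all matches, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_pattern("MNGTAANPSKNVSA", "N-{P}-[ST]-{P}"))
```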
The tools commonly used for this analysis include CD-Search, Cn3D, and CDART under
Tools for 3-D Structure Display and Similarity Searching, and the Conserved Domain
Database/CDD under Tools for Sequence Analysis, all part of the NCBI analysis tools
(http://www.ncbi.nlm.nih.gov/About/tools/). EMBL also provides a group of tools with
a similar purpose, Protein Functional Analysis/PFA.

7.2.1. Multiple sequence alignment

Multiple sequence alignment is one of the most widely used operations in applied
bioinformatics. In essence, sequence alignment means lining up the nucleotides of
nucleic acids, or the amino acids of proteins, in order to find identical sequences
or regions. Similar sequence searching therefore depends on sequence alignment. As
noted above, similar sequences have similar structures and may therefore perform
the same or similar functions. Two completely identical sequences can be regarded
as the same gene when they come from organisms that share an evolutionary origin.
Aligning two sequences with each other is called pairwise alignment. Multiple
sequence alignment means aligning 3 or more sequences.
Sequence alignment is one step in the search for similar and homologous sequences,
which is why it is regarded as the core problem of bioinformatics. From an
alignment, one can infer relationships between genes and gene families, or the
changes (the kinds of mutation) that have occurred in DNA or protein sequences
during evolution. Alignment relies on a number of different algorithms, so an
alignment result is meaningful only when it is combined with a statistical
evaluation. The shorter the sequences, the more likely they are to look similar by
chance, and vice versa. An alignment does not deliver an exact or "correct" result,
only the most plausible one (heuristic).
There are two kinds of alignment: local alignment and global alignment. Global
alignment lines up the nucleotides or amino acids of two molecules from beginning to
end in order to assess similarity over the full length of the sequences. Local
alignment, by contrast, lines them up only to detect regions that are similar. The
two kinds of alignment therefore serve different purposes. Global alignment reveals
the sequence changes among the compared sequences, that is, mutations (deletion,
insertion, substitution, segment deletion, inversion, duplication). For this reason
global alignment is usually applied to sequences that are highly similar in both
length and content; such sequences are usually closely related in evolutionary terms
(homology). Local alignment detects similar regions, so it supports the
identification of functional regions (domains), conserved regions, motifs and
patterns (see the definitions of motif and pattern above) in protein and RNA
molecules.
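The idea of local alignment can be sketched with a small dynamic-programming routine
in the style of the Smith-Waterman algorithm. This is only an illustration: the
scoring values (match = 2, mismatch = -1, gap = -2) and the function name are
assumptions chosen for the example, not parameters taken from any tool named in this
chapter.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Illustrative scoring only: linear gap penalty, no substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, local_a, local_b = smith_waterman("TTTTACGTACGTTTTT", "GGGGACGTACGGGGG")
```

In this toy case the two sequences differ everywhere except for a shared core, so
the local alignment reports only that core, which is the behaviour that makes local
alignment useful for finding domains and motifs.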
The most typical tools for multiple sequence alignment are the CLUSTAL family
provided by EMBL. CLUSTAL currently exists in several versions, such as CLUSTALX,
CLUSTALW and CLUSTAL OMEGA. The sequences to be aligned are normally placed together
in a single file in FASTA format.
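As a rough illustration of what "several sequences in one FASTA file" means, the
sketch below parses FASTA text into named records. The function name and the two
short example records are hypothetical and exist only for this example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is its name.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 hypothetical protein A
MKTAYIAK
QRQISFVK
>seq2 hypothetical protein B
MKTAHIAK
"""
records = parse_fasta(example)
```

Each record is a header line starting with ">" followed by one or more sequence
lines, which is the layout alignment tools such as CLUSTAL expect as input.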

Figure 37. The CLUSTALW2 tool at EMBL


Figure 38. Four protein sequences in FASTA format

7.2.2. Restriction map construction


A restriction map is a diagram that marks the recognition and cleavage sites of
restriction enzymes along a DNA sequence. Locating the recognition sites of
restriction enzymes in a DNA fragment is important in molecular biology research,
genetic engineering and their applications: to cut a DNA fragment, ligate it into a
vector or insert it into another DNA fragment, the exact positions must be known.
Many tools support restriction mapping. The simplest and most typical are NEBcutter
from New England Biolabs (http://tools.neb.com/NEBcutter2/), RestrictionMapper
(http://www.restrictionmapper.org/) and RESTRICTION ENDONUCLEASE DIGESTION
(http://www.molbiol-tools.ca/Restriction_endonuclease.htm).


Figure 39. The NEBcutter tool from New England Biolabs

Figure 40. The RestrictionMapper tool


To build a restriction map, the researcher pastes the DNA sequence into the input box
and either selects specific restriction enzymes or applies all restriction enzymes
available in the database. The input DNA can be treated as linear or circular,
depending on the researcher's purpose. The analysis output has the following form:

Figure 41. Result of a restriction map analysis with NEBcutter
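The core of what such tools do can be sketched in a few lines: search the sequence
for each recognition site and record where the enzyme cuts. The sketch below is a
simplified assumption-laden example, not how NEBcutter works internally. It covers
only three well-known enzymes whose sites are palindromic (so scanning one strand is
enough), and the function names are made up for this illustration.

```python
# Recognition site and cut offset on the top strand (bases left of the cut).
# EcoRI G^AATTC, BamHI G^GATCC, HindIII A^AGCTT.
SITES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def restriction_map(dna, enzymes=SITES, circular=False):
    """Return {enzyme: [cut positions]}; a position counts bases before the cut."""
    dna = dna.upper()
    n = len(dna)
    result = {}
    for name, (site, cut_offset) in enzymes.items():
        # For circular DNA, extend the text so sites spanning the origin are found.
        text = dna + dna[: len(site) - 1] if circular else dna
        cuts = []
        start = text.find(site)
        while start != -1:
            cuts.append((start + cut_offset) % n)
            start = text.find(site, start + 1)
        result[name] = sorted(cuts)
    return result


def linear_fragment_lengths(length, cuts):
    """Fragment lengths after cutting a linear molecule at the given positions."""
    bounds = [0] + sorted(set(cuts)) + [length]
    return [b - a for a, b in zip(bounds, bounds[1:])]


dna = "AAGAATTCGGGGATCCTT"
sites = restriction_map(dna)
all_cuts = [c for cuts in sites.values() for c in cuts]
fragments = linear_fragment_lengths(len(dna), all_cuts)
circular_sites = restriction_map("TTCAAAAGAA", circular=True)
```

The `circular` flag mirrors the linear/circular choice described above: on a
circular molecule a recognition site can span the point where the sequence was
opened for display.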

7.2.3. Predicting the secondary and tertiary structure of proteins


The structure of a protein is determined by the order of amino acids in its
polypeptide chain (primary structure). Predicting the secondary and tertiary
structure of a protein is important for studying function (function analysis) and
interactions between molecules (molecular interaction or protein docking).
Secondary structures form through interactions between amino acid residues within
particular regions of the molecule. The main secondary structure elements are the
alpha helix, the beta sheet and coils. Tertiary structure is the further coiling and
folding of those secondary structure elements, and it is the form in which a protein
displays its activity or function. Experimentally, structures are analysed with
methods such as X-ray crystallography (which requires the molecules to be
crystallized first), NMR (nuclear magnetic resonance) and CD (circular dichroism).
Folding is driven by interactions within and between molecules: molecules tend to
adopt the structure with the lowest free energy, that is, the most stable structure.
Prediction is therefore based on thermodynamic parameters, including free energy,
entropy and enthalpy, calculated from information about the properties of the amino
acids, their side chains and their capacity to interact.
Two approaches are currently in common use for predicting protein structure: (i)
protein threading, or fold recognition, and (ii) homology modeling, or comparative
modeling. Protein threading builds a model for a protein that has the same fold as
proteins of known fold and structure, even though the protein being modeled does not
share an evolutionary origin with those known proteins. Homology modeling, by
contrast, relies on sequence comparison to find homologous proteins (proteins that
are closely related in evolutionary terms), predicts the structure from them and
combines this with structure comparison. Note that within a group of proteins of
common evolutionary origin, structure tends to be more conserved than sequence. Even
so, when a comparison of two protein sequences shows very low similarity, the
structure cannot be predicted this way. As a rule, at least about 40% sequence
similarity is needed before structures can be predicted or compared.
Developing algorithms for structure prediction remains a major challenge for
bioinformaticians. However the algorithms are developed, their results still have to
be verified experimentally.

Figure 42. Comparing the structures of protein molecules

7.2.4. Nucleic acid sequence analysis


DNA sequence analysis aims to extract the information contained in a sequence. That
information varies widely with the sequence and the researcher's goal. The simplest
tasks are determining the length (the number of nucleotides) or the nucleotide
composition of the molecule. The researcher may also want to identify the coding
sequence or the open reading frames (ORFs). Other operations include translating the
DNA in all six reading frames, building a restriction map, searching for sequences
similar to a given DNA sequence, selecting primer sequences, or designing probes for
DNA and RNA hybridization techniques. All of these require analysis software. A vast
number of tools support this work, both online tools and software installed on a
local computer.
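The simplest of these analyses (length, base composition, and the reverse complement
needed to read the opposite strand) can be sketched as follows. The function names
are illustrative, and the sketch assumes an unambiguous DNA alphabet of A, C, G and T.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def base_composition(dna):
    """Count each of the four bases in the sequence."""
    counts = Counter(dna.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(dna):
    """Fraction of G and C among the A/C/G/T bases (0.0 for an empty sequence)."""
    comp = base_composition(dna)
    total = sum(comp.values())
    return (comp["G"] + comp["C"]) / total if total else 0.0


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


seq = "ATGCGC"
length = len(seq)
composition = base_composition(seq)
gc = gc_content(seq)
rc = reverse_complement(seq)
```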
For large DNA sequences, such as a whole chromosome, the analysis becomes far more
complex. Typical requirements include determining how many genes the molecule
contains; the genes encoding mRNA, tRNA, rRNA and other RNAs; intron and exon
sequences; non-coding sequences; information about each gene; promoter sequences;
transcription termination regions, and so on. These operations amount to genome
annotation.
For RNA sequences, the requirements include structural analysis (tRNA, rRNA), the
potential to form RNA secondary structures, and the study of miRNAs and siRNAs in
post-transcriptional regulation of gene expression.
The number of software packages and online tools for DNA sequence analysis may now
run to thousands and keeps growing. Most basic bioinformatics tools for research are
provided free of charge. Some specialized packages, or packages that integrate many
analysis tools, must be purchased. Analysis tools and software usually accompany the
large databases, for example the tool sets of NCBI, EMBL, ExPASy, PDB or Biology
WorkBench (SDSC/San Diego Supercomputer Center). Typical commercial packages that
serve many research purposes include DNAStar-Lasergene, Vector NTI from Invitrogen
and the PREMIER Biosoft products.

7.2.5. Designing PCR primers and nucleic acid hybridization probes


PCR has become an indispensable technique in molecular biology and genetic
engineering, with more applications than can be listed. Performing PCR requires a
primer pair. Many programs help select primer pairs for a PCR reaction. Most simply,
the researcher can use NCBI's Primer-BLAST or the Primer3 WWW primer tool, or
commercial software such as Beacon Designer and SYBR Green PCR primers from PREMIER
Biosoft, or DNAStar-Lasergene.


Figure 43. The Primer-BLAST interface


For PCR primers, the important parameters are length, melting temperature, annealing
temperature, GC content, the GC clamp (G or C bases at the 3' end) and secondary
structure (hairpin, self-dimer, cross-dimer). Different programs use different
formulas and algorithms, so the reported values may differ, though not by much. In
addition to consulting the parameters suggested by the primer manufacturer, the
researcher must optimize the PCR under the conditions of his or her own laboratory.
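A few of these primer parameters can be estimated directly from the sequence. The
sketch below uses the simple "2 + 4" rule of thumb (2 °C per A/T and 4 °C per G/C),
which is only a rough estimate for short oligonucleotides; dedicated tools typically
rely on more detailed thermodynamic models, which is one reason their values differ.
The function name and the example primer are hypothetical.

```python
def primer_summary(primer):
    """Return basic properties of a primer given 5'->3' as a string of A/C/G/T."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return {
        "length": len(primer),
        "gc_fraction": gc / len(primer),
        # Rule-of-thumb melting temperature in degrees Celsius (rough estimate).
        "tm_rule_of_thumb": 2 * at + 4 * gc,
        # GC clamp here means: the last base at the 3' end is G or C.
        "gc_clamp": primer[-1] in "GC",
    }


summary = primer_summary("ATGCATGCATGCATGCATGC")
```

Checks for hairpins and primer dimers need a comparison of the primer against its
own reverse complement and against the partner primer, and are left out of this
sketch.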
A probe is essentially a DNA or RNA fragment, typically 100 to 1000 bp long, used to
detect the presence of a nucleotide sequence (the target sequence). The
single-stranded probe hybridizes with DNA (Southern blotting) or RNA (Northern
blotting) immobilized on a membrane, or within tissue (in situ). Companies now
supply probes together with the corresponding analysis software, for example
Scorpion probes, Molecular Beacon probes, TaqMan probes, LNA (Locked Nucleic Acid)
probes and Cycling Probe Technology (CPT).

7.2.6. Identifying open reading frames
Identifying open reading frames matters for gene detection and gene prediction. An
open reading frame (ORF) is defined as a stretch of sequence that begins with the
start codon AUG, continues with an uninterrupted series of coding triplets and ends
with one of the three stop codons (UAA, UAG or UGA). The number of nucleotides in an
ORF is always a multiple of 3. Note that any given DNA sequence has six reading
frames: three on the plus (+) strand and three on the minus (-) strand. Each reading
frame may contain no ORF, one ORF or several ORFs.
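The definition above translates almost directly into code. The sketch below scans
all six reading frames of a DNA sequence (so the codons are written with T instead of
U) and reports every stretch that starts with ATG and ends at the first in-frame stop
codon. It is a minimal illustration with made-up function names, not a
reimplementation of NCBI's ORF finder. Coordinates for the minus strand refer to the
reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Return (start, end) pairs (0-based, end exclusive) for ORFs in one frame."""
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon opens the ORF
        elif start is not None and codon in STOP_CODONS:
            orfs.append((start, i + 3))    # stop codon is included in the ORF
            start = None
    return orfs


def find_orfs(dna, min_len=0):
    """Return (strand, frame, start, end, sequence) for ORFs in all six frames."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame):
                if end - start >= min_len:
                    results.append((strand, frame + 1, start, end, seq[start:end]))
    return results


orfs = find_orfs("CCATGAAATAGGG")
```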
Many tools help find the open reading frames of a DNA sequence. The most widely used
online tool is ORF finder, provided by NCBI.

Figure 44. Identifying open reading frames with ORF Finder

7.2.7. Finding scientific articles
Searching for scientific articles is an essential task for any researcher. Articles
and studies published in journals are organized into databases from which researchers
can find and download them, either free of charge or for a fee. Many databases index
scientific journals across different fields. PubMed, hosted within the NCBI
databases in the United States, covers journals in the life sciences, biomedicine
and related disciplines. PubMed now holds tens of millions of article records (see
the PubMed database section) and links to thousands of journals. Large universities,
institutes and research centers usually purchase accounts or portals that allow
access to and download of scientific articles. In Vietnam, users can also purchase
accounts for online journal platforms such as ScienceDirect and SpringerLink in order
to download articles. Within NCBI, PMC is the database that holds articles available
for free download.

7.2.8. Sequence assembly
Sequencing has become routine, and the cost of sequencing a whole genome keeps
falling. The hard problem is assembling the individual DNA reads into a complete
genome. The principle of assembly is simple: it relies on overlaps between DNA
fragments that share part of their sequence. When a genome (its chromosomes) is cut
at random, the result is a collection of fragments cut at random positions. Once the
short fragments have been sequenced, they must be joined back together by stacking
them on one another (overlapping) to find the regions they have in common.
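The overlap principle can be illustrated with a toy greedy assembler: repeatedly
find the two reads with the longest suffix-prefix overlap and merge them. This is a
deliberately simplified sketch under strong assumptions (error-free reads, a single
strand, no repeats); real assemblers use far more elaborate methods. The function
names and the three example reads are made up for this illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by longest overlap; returns the remaining contigs."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the contigs cannot be joined further
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ATGCGTAC", "GTACCTTA", "CTTAGGAT"])
```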

Figure 45. The principle of sequence assembly

7.2.9. Analysing evolutionary relationships


Evolution is the foundation of modern biology; it unites all fields of biology under
one theoretical framework. Evolution is not a difficult concept, yet few people,
mainly biologists, understand it adequately. A common misconception is that species
can be arranged on an evolutionary ladder running from bacteria through lower and
higher animals and ending with humans.
Evolution is the change in the gene pool of a population over time. It is described
at two levels: macroevolution and microevolution. Macroevolution involves large
changes that lead to the formation of new species and takes place above the species
level. Microevolution consists of changes in allele frequencies within the
populations of a species. In bioinformatics, the study of evolutionary relationships
relies mainly on analysing and comparing biological sequences and whole genomes.
Comparisons of the DNA sequences encoding ribosomal components, cytochrome c,
mitochondrial genes and the gene for ribulose-1,5-bisphosphate carboxylase/oxygenase
(RuBisCO) are now widely used to identify and classify organisms and to assign them
to taxonomic units (taxa).
Comparisons of the sequences or structures of macromolecules show that DNA, RNA or
protein molecules with similar sequences have similar or identical structures and
perform the same function. During evolution, changes in biological sequences can
arise at random, either from the organism itself or under the influence of mutagens,
and they occur throughout the genome of every individual. Under the pressure of
environmental conditions, these changes bear directly on the organism's ability to
adapt and survive. The process shifts allele frequencies in the population and lays
the foundation for the formation of new species. From the evolutionary point of
view, new species arise from the ancestral species most closely related to them.
Comparing genome sequences, or selected genes, therefore helps determine
evolutionary relationships and the position of an organism in the classification
system.
Multiple sequence alignment is the main tool for assessing changes in DNA and
protein sequences, and all phylogenetic analysis software builds on alignment.
Widely used programs include Mega5, ClustalX combined with Treeview, and Phylip
(University of Washington,
http://evolution.genetics.washington.edu/phylip/software.html). Methods for building
phylogenetic trees fall into two groups: distance-based methods and character-based
methods. The first group includes UPGMA, Neighbor-Joining (NJ), Weighted
Neighbor-Joining (Weighbor), Fitch-Margoliash (FM) and Minimum Evolution (ME). The
second group includes Maximum Parsimony (MP) and Maximum Likelihood (ML).
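Distance-based methods all start from a table of pairwise distances computed from an
alignment. The sketch below computes the simplest such measure, the proportion of
differing sites (often called the p-distance), for a small made-up alignment. Using
the p-distance and skipping gapped columns are assumptions made for this example;
tree-building programs offer several corrected distance models.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Pairwise distances for a dict {name: aligned sequence}."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(list(aligned), 2)
    }


aligned = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACTA"}
distances = distance_matrix(aligned)
closest_pair = min(distances, key=distances.get)
```

A clustering method such as UPGMA would begin by joining the closest pair and then
recompute distances to the new cluster.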


7.2.10. Protein analysis


Like DNA sequence analysis, protein analysis covers many operations with different
purposes. A protein also has a sequence, its amino acid sequence. In addition,
because a protein's structure is tied to its function, protein analysis is
considerably more complex.
Protein analysis includes determining molecular mass (size expressed in Daltons),
physical characteristics, chemical properties, and amino acid composition and
proportions. Alignment also applies to proteins, but instead of the 4 nucleotides of
DNA a protein is built from at least 20 different amino acids, so the algorithms and
scoring used to analyse the chains differ as well.
Determining structure is one of the hardest tasks in protein research.
Crystallizing a protein under experimental conditions is often complicated and
time-consuming. Software that compares folding models or compares sequences is
therefore widely used to predict the structure of protein molecules.
Unlike DNA, proteins function through interactions with other proteins and other
kinds of molecules. Modeling these interactions is useful for identifying
enzyme-substrate pairs, studying enzyme inhibitors, studying the structure of the
active site and studying antigen-antibody interactions. Similarly, protein docking
refers to the interaction between proteins. Simulating how proteins interact with
other molecules is of great value for explaining disease mechanisms and developing
new drugs.

7.2.11. Studying gene expression


SAGE (Serial Analysis of Gene Expression) is a very effective technique for analysing
gene expression. Molecular biologists use it to study the set of mRNAs in a sample of
interest in the form of short tags, each corresponding to a fragment that represents
one transcript. SAGE was developed by Victor Velculescu at the Johns Hopkins
University cancer research center and published in 1995. Gene and mRNA tag data are
now stored in SAGE databases. To increase accuracy, improved variants of the
technique generate longer tags. These data are very useful for detecting and
identifying genes.
Other important databases for gene expression studies include the EST database at
NCBI, ArrayExpress at EBI/EMBL and the Stanford Microarray Database at Stanford
University. More recently the exome has been described as the set of exons present
in mRNA molecules. Unlike the transcriptome, which is the set of transcripts (the
EST database being part of the transcriptome), the exome comprises only the gene
segments that remain in mRNA after the introns have been removed. These databases
play an important role in mining gene expression data and in developing probes for
Northern blot hybridization and microarray applications.


7.3. Groups of analysis tools

7.3.1. NCBI analysis tools


The NCBI databases and tools can be accessed at:
http://www.ncbi.nlm.nih.gov/About/tools/. The tools usually accompany the databases
and fall into the following groups:
- Literature Databases: databases and tools for searching, accessing and looking up
information on books (Books), journals (Journals), terminology (MeSH), OMIM, OMIA,
PubMed and PMC (PubMed Central).
- Entrez Database: a system that searches across many databases through the links
between them.
- Nucleotide Database: the basic sequence databases and the tools needed for almost
all studies, including GenBank, EST, GSS, HomoloGene, HTG, SNPs, RefSeq, STS, UniSTS
and UniGene.
- Genome-Specific Resources: tools for accessing the genomes of more than 3,2000
organisms (both completed genomes and those still being assembled and annotated).
- Tools for Data mining: many tools for searching information (Entrez), analysing
biological sequences with the BLAST family (see the next chapter), exploring the
classification system (Taxonomy) and submitting sequences (Sequin, BankIt).
- Tools for Sequence analysis: a collection of sequence analysis tools including the
BLAST family, analysis of conserved regions/structures (Conserved Domain
Database/CDD), identification of STS sequences within a DNA sequence (e-PCR), open
reading frame search (ORF finder), analysis and identification of peptide fragments
in mass spectrometry data (Open Mass Spectrometry Search Algorithm), and screening
for vector sequence in DNA to be analysed or submitted (VecScreen).
- Tools for 3-D structure Display and Similar searching: tools for analysing and
comparing the three-dimensional structures of biological macromolecules, mainly
proteins and nucleic acids. They identify conserved domains (CD-Search), display and
compare three-dimensional structures (Cn3D), display the functional regions (domains)
of proteins that share similar domain architecture (CDART), and search for and
compare three-dimensional protein structures by comparing the positions of residues
or groups of amino acids (VAST Search).
- Maps: tools for displaying and analysing genetic and physical maps. These include
NCBI Mapviewer (covering hundreds of genomes of vertebrates, invertebrates,
protozoa, plants and fungi), Human Map (the human genetic and physical maps), Model
Maker (builds mRNA sequences from genome sequence, identifies exons by aligning mRNA
and EST sequences, and checks open reading frames and coding regions/CDS), OMIM Gene
Map (positions of genes on the genetic map based on results published in scientific
articles and on mapping methods) and OMIM Morbid Map (a list of genetic diseases
linked to genes and the positions of those genes on the genetic map).
- Collaborative Cancer Research: tools and databases for analysing cancer genes and
their activity and regulation.
- FTP Download: download access to sequence data, genomes, maps, taxonomy data and
other supporting tools and analysis software.
- Resource Statistics: statistics on the databases, such as the number of sequences
in GenBank, the assembly status of genomes, and the data most often searched and
analysed.

Figure 33. NCBI tools

7.3.2. EMBL tools


The EMBL-EBI databases and tools can be accessed at:
http://www.ebi.ac.uk/Tools/webservices/. The tools usually accompany the databases
and fall into the following groups:
- Protein functional analysis (Protein Functional Analysis/PFA),
- Sequence similarity search (Sequence Similarity Search/SSS),
- Multiple sequence alignment (Multiple Sequence Alignment/MSA),
- Phylogenetic analysis (Phylogeny),
- Pairwise sequence alignment (Pairwise Sequence Alignment/PSA),
- Sequence format conversion (Sequence Format Conversion),
- Calculation of protein and DNA properties from sequence (Sequence Statistics),
- Translation (Sequence Translation),
- Structural analysis (Structural Analysis).
Protein functional analysis (Protein Functional Analysis/PFA)

This group of tools compares protein sequences and identifies motifs, functional
regions (domains), patterns and features shared between proteins. The tools can also
predict the topology of transmembrane proteins and signal peptides from their amino
acid sequences.


Sequence similarity search (Sequence Similarity Search/SSS)

These tool groups let the researcher align DNA, RNA and protein sequences (pairwise
or multiple alignment).

Multiple sequence alignment (Multiple Sequence Alignment/MSA)

Phylogenetic analysis (Phylogeny)

Pairwise sequence alignment (Pairwise Sequence Alignment/PSA)

Sequence format conversion (Sequence Format Conversion)

Sequence statistics (Sequence Statistics)

Sequence translation (Sequence Translation)

Structural analysis (Structural Analysis)

Figure 34. EMBL-EBI services

7.3.3. ExPASy tools


ExPASy provides databases and analysis software for proteomics, genomics, phylogeny,
systems biology, population genetics, transcriptomics and biophysics. Address:
http://www.expasy.org/


Figure 35. The ExPASy home page


The ExPASy tool groups are:
1. Proteomics: amino acid sequence analysis and protein identification; analysis of
two-dimensional electrophoresis and mass spectrometry data; determination of protein
function and properties; analysis of protein families and patterns;
post-translational modification; protein structure analysis and protein interaction
studies; alignment and similarity search.
2. Genomics: alignment, search for similar and homologous sequences, sequence
characterization and annotation.
3. Structural bioinformatics: amino acid sequence analysis, modeling and prediction
of molecular structure.
4. Systems biology: identification and description of metabolic pathways, metabolic
networks and regulation of gene expression, including at the genome level.
5. Phylogeny/evolution: analysis of evolutionary relationships (orthology) between
genomes, gene families, miRNA gene families and protein families; codon-bias
analysis (codon usage frequency).
6. Population genetics: software for population genetic analysis, detection of
natural selection and simulation of genome data related to evolution.
7. Transcriptomics: tools for comparing gene expression patterns, regulatory sites,
RNA- and DNA-binding proteins, prediction of miRNA targets, detection of coding
regions and information on the exome.
8. Biophysics: building, studying, comparing and displaying models of proteins with
homologous structures.
9. Imaging: software for simulating, building and displaying molecular structures
and interactions between molecules.
10. IT infrastructure: supporting tools for bioinformatics.
11. Drug design: support for analysing kinetic parameters, simulating interactions
between molecules, and tools for drug development.


7.3.4. Other tool groups


Beyond the tools above, many other analysis tools are integrated into various web
sites. One example is the Biology WorkBench tool set. Users only need to register a
free account and can then use the tools integrated in the site.

Figure 37. Tools for identifying and characterizing proteins

Summary of chapter 7
1. Analysing biological sequences and molecular structures requires supporting tools
or software. The analysis tools cover: (i) searching for similar sequences, (ii)
identifying functional and conserved regions, (iii) sequence alignment, (iv)
building restriction maps, (v) analysing the physical and chemical properties of
proteins and predicting secondary structure, tertiary structure and protein
interactions, (vi) analysing evolutionary relationships, (vii) comparing genomes and
finding genes in genomes.
2. Alignment is one step in the search for similar and homologous sequences and is
regarded as a core problem of bioinformatics. From an alignment one can determine
the information carried by a DNA or protein sequence and find relationships between
genes, gene families or functional regions. BLAST is a fast and effective tool for
local-alignment-based search and analysis of homologous sequences. Several BLAST
variants exist for different purposes.
3. The ExPASy databases and tools support proteomics research, including analysis of
physical and chemical properties and prediction of structure and protein
interactions. Other tools, for identifying open reading frames, building restriction
maps, analysing evolutionary relationships and so on, also support researchers.

Review questions for chapter 7
1. What are biological database analysis tools? How many groups of analysis tools
are there? Give examples of a few tools from each group and their specific uses.
2. Which databases and tools are used to find and copy a gene sequence? What is the
FASTA format of a biological sequence?
3. What are similar sequences and homologous sequences? Which tool group searches
for such sequences, and what is this search used for?
4. Describe the BLAST tool family and its applications.
5. Which tools make up the specialized BLAST group? Give examples.
6. Describe the EMBL tool groups. For each group, choose one typical tool and give
an example of its use.
7. What is ExPASy? List its tool groups and their applications.
8. What is sequence alignment? Which tools perform multiple sequence alignment, and
what is it used for?
9. What is the restriction map of a DNA fragment? Which tools build restriction
maps? What are restriction maps used for?
10. Why is it necessary to predict protein structure? Which approaches are used to
predict it? Which tools support secondary structure prediction for proteins?
11. Which tools support the design of primers or of probes for nucleic acid
hybridization techniques? Choose one tool and discuss its use in designing primers
and probes.
12. What is an open reading frame? Why do open reading frames need to be
identified? Which tools support this analysis?


CHAPTER 8

GETTING STARTED WITH BIOLOGICAL DATA ANALYSIS


8.1. Finding data in the databases
Finding data in the databases is something every researcher has to do. Biological
data are very diverse (see the database section). The simplest search is for
scientific articles (see chapter 3), followed by gene sequences, amino acid
sequences, chromosome sequences, taxonomic information (taxon), genomes, and the
two-dimensional and three-dimensional structures of biological molecules.

8.1.1. Sequence data


Biological sequences comprise nucleic acid sequences (DNA, RNA) and amino acid
sequences (proteins). Data on the genomes of organisms, genes, ESTs, STSs, sequences
under assembly and so on are stored in the databases as the output of research
(laboratories, sequencing projects). Every sequence deposited in a database is given
a name and an access code. To find a biological sequence one therefore needs its
name (or information about the sequence) or its accession number.
In the past, the naming of a gene or protein was often inconsistent and was largely
left to the person submitting the sequence, while accession numbers were assigned by
the database curators. As more and more biological data were produced and sequences
had to be described or annotated (for example in genome annotation), consistent use
of terminology became important. To make searching and managing sequences easier,
agreeing on a single name for a gene or protein has been a standing concern. In
bioinformatics this effort to standardize names and descriptive terms is referred to
as ontology.
Searches for biological sequences rely largely on names and on information
associated with the sequence. That information is used as keywords for searching the
relevant databases.

8.1.2. Structure data


Structure data comprise the three-dimensional structures of macromolecules, mainly
proteins and RNA. Searching and analysing structure data involves predicting and
comparing folding models, local folds, and the formation of motif or domain
structures. It also covers information on molecular interactions and on the
relationship between structure and function, drawn both from experimental results
and from computer-simulated structural models.
ExPASy provides a structural analysis group (structural bioinformatics) made up of
databases and analysis tools. The databases include SWISS-MODEL Repository and
Protein Model Portal. The tools cover structure homology-modeling, docking (protein
ligand docking server), prediction of coiled regions in proteins, display of 3-D
protein structures (Swiss PDBviewer) and information on protein structures (Protein
Model Portal).
Structures of proteins and of other biological macromolecules, including DNA, RNA and
even polysaccharides, are stored in the PDB database
(http://www.rcsb.org/pdb/home/home.do). Structure data are also held at NCBI
(http://www.ncbi.nlm.nih.gov/Structure/index.shtml).

Figure 38. ExPASy databases and tools for protein structure analysis

Figure 39. The PDB database of biological macromolecule structures


Figure 49. The NCBI database of biological macromolecule structures


Structure data can be searched by the name of the macromolecule. Data on sequence,
structure and function are cross-linked to the corresponding databases, and each
record carries information from the NMR or X-ray analysis. For example, searching
for the structure of proteinase K at NCBI returns the result shown in Figure 50.

Figure 50. Search result from the NCBI protein structure database



8.1.3. Other data


Besides sequence and structure data, other kinds of data commonly used in
bioinformatics include books, scientific articles, taxonomy data (taxon), genotype
and phenotype data (dbGaP), genetic disease data and chemical compounds. All of these
can be retrieved from the large databases, mainly those of NCBI. Searches are based
on the name of the data item or on keywords. As with sequence and structure data,
these other data are linked to the corresponding databases.

8.2. Sequence analysis


8.2.1. Sequence comparison
Sequence comparison plays the most important role in bioinformatic analysis. It is
the first step in analysing the structure and function of new sequences. As more and
more new biological sequences are discovered, comparison becomes ever more important
for working out their function and their evolutionary relationship to known
sequences in the databases, especially for protein sequences. The foundation of
sequence comparison is alignment: the process by which sequences are compared
character by character to find regions of matching characters. Pairwise alignment
lines up two sequences and underlies the search for similar sequences (similarity).
Multiple alignment is the basis for detecting functional regions (domains) or
special regions (motifs/patterns) and for analysing mutations and evolutionary
relationships.
Concept

Comparison is the process of finding what two or more sequences have in common and
where they differ. The common features are stretches in which the nucleotides or
amino acids appear in the same order. In bioinformatics, comparing sequences means
arranging them against one another in particular ways so that their shared features
become visible. This arrangement is called sequence alignment.
The general aim of pairwise alignment is to find the best matching order of
characters (nucleotides or amino acids) between two sequences. To achieve this, one
sequence has to be adjusted so that the region of matching is as large as possible.
The adjustment consists of inserting empty positions (gaps) into the chain so that
as many characters as possible can be paired.
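Inserting gaps so that as many characters as possible pair up is exactly what a
global alignment algorithm does. The sketch below follows the classic
dynamic-programming scheme usually attributed to Needleman and Wunsch. The scoring
values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions
chosen for illustration; real tools use substitution matrices and more refined gap
penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # a prefix of a aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap          # a prefix of b aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # pair a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, aln_a, aln_b = needleman_wunsch("GATTACA", "GATACA")
```

For this pair a single gap in the shorter sequence lets the remaining six characters
match. Several equally good placements of that gap can exist, and the traceback
simply returns one of them.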
The evolutionary basis of sequence comparison
Evolution begins at the molecular level, in DNA and proteins. Changes in the
nucleotide sequence of DNA occur at random, lead to changes in the amino acid
sequence of proteins and, in turn, to changes in phenotype. Natural selection acts
continuously, and its consequence is that the best-suited genotypes (fitness) adapt
and multiply over the following generations, while unsuited genotypes are gradually
eliminated. Even as mutations accumulate and lineages diverge over time, certain
parts of a sequence can persist as signals that allow a common ancestor to be
recognized. Sequence regions that play a decisive role in the structure and function
of a molecule are retained by natural selection, whereas other regions, which may
contribute little to function, mutate at a higher rate. Although in theory a
mutation can occur at random at any position, some regions are more prone to change
than others. Moreover, genotypes carrying mutations that seriously impair the
survival of the individual are eliminated by selection. The genotypes that persist
today are therefore usually those that adapt well to many changing conditions. For
example, the amino acid residues that form the active site of an enzyme family tend
to be conserved because they are responsible for catalysis. Comparing sequences by
alignment thus makes it possible to identify conserved regions and variable regions.
Sequences with a high degree of similarity are likely to be closely related in
evolutionary terms; sequences with little similarity are likely to be only distantly
related. The positions at which sequences differ reflect nucleotide or amino acid
changes during evolution, such as substitution of a nucleotide pair or the deletion
or insertion of one or more nucleotides.
Analysing a group of similar sequences can reveal an evolutionary relationship: the
sequences may belong to the same family or share a common ancestor. If the structure
or function of one sequence in the group is known, the others are likely to share
those features to some degree. This is the basis for predicting the structure and
function of an unknown sequence from database information by sequence comparison,
and it is why alignment is used to predict the structure and function of new,
uncharacterized sequences. Sequence comparison can also reveal the role of
individual amino acids in the function of a protein. For instance, substituting or
deleting one or a few amino acids in certain regions of a protein changes its
structure and function. Such findings contribute greatly to research on the directed
modification of molecules. For example, when the amino acid sequences of several
starch-hydrolysing enzymes were compared, a single amino acid substitution at a
particular position was found to increase the enzyme's thermostability. On that
basis, researchers can create new directed mutants to raise the thermostability of
enzymes for industrial use.
Some terms related to sequence comparison
Homology, similarity and identity
Sequence homology and sequence similarity.
An important concept in sequence analysis is homology. When two sequences descend
from a common ancestor they are said to be evolutionarily related, that is,
homologous. A related but quite distinct concept is similarity: the percentage of
aligned characters that match or share physical and biochemical properties such as
size, charge and hydrophobicity. The two terms must be kept apart because they are
often used interchangeably, which causes confusion. Homologous sequences are
sequences that share ancestry, or that are inferred to share ancestry from an
analysis of their similarity. Similarity can be quantified: one can say that two
sequences are 40% similar. One cannot say that two sequences are 40% homologous;
one can only conclude that they are homologous or not homologous. In the same way,
one might say that two people look 60% alike, but not that they are 60% related.
Overall, if two sequences are similar enough, one can infer that they are
evolutionarily related or share a common ancestor. The question is how similar is
"enough". The answer is not always clear-cut; it depends on the type of sequence and
on its length. Nucleotide sequences are built from only four characters (A, T, G and
C), so even two unrelated sequences have roughly a 25% chance of being identical at
any given position. Protein sequences are built from 20 different amino acids, so
two unrelated sequences have about a 5% chance of matching at each position. If gaps
(-) are allowed in the alignment, the chance level of identity rises to 10-20%.
Sequence length is therefore an important factor. The shorter the sequences, the
greater the chance that they match; the longer the sequences, the lower the chance
of a match arising at random. It is, for example, much easier to find a word or
phrase in a book than a whole long sentence. This means that the shorter the
sequences being analysed, the higher the cutoff must be set.
Chng hn xc nh mi quan h tng ng ca hai protein, nu hai trnh t
c cn trn ton b chiu di 100 amino acid, mt mc ging nhau 30% hoc cao
hn c th coi l tin cy kt lun chng c mi quan h tin ha gn gi hay hai
trnh t ny l tng ng. Tuy nhin cng cn phi nhn mnh rng, gi tr phn trm
ging nhau ch cung cp c s nht nh xc nh mi quan h tng ng ch
khng phi l quy tc tuyt i xc nh mi quan h ny, c bit l nhng trnh t
nm trong vng mp m (twighlight). Trong nhng trng hp nht nh cn c
nhng phn tch thng k nh gi mc tin cy ca qu trnh cn trnh t. Di y
l biu m t kh nng mi lin h gia trnh t tng ng v trnh t ging nhau.

Figure 51. Relationship between similarity level and sequence length

Three zones arise when protein sequences are aligned. Two proteins can be considered homologous when their percent similarity falls in the safe zone. Similarity below the safe zone but above 20% falls in the twilight zone, where the conclusion that two sequences are homologous is uncertain. Below 20% is the midnight zone, where homology cannot be concluded from the sequences.

Sequence similarity and sequence identity

Two further terms used in sequence comparison are sequence similarity and sequence identity. For nucleotide sequences the two terms are essentially synonymous. For protein sequences, however, they differ considerably. In a protein alignment, sequence identity is the percentage of aligned positions at which the two sequences have the same amino acid residue. Sequence similarity is the percentage of aligned residues that have the same or similar physicochemical properties and can substitute for one another.
There are two ways to calculate sequence similarity and sequence identity. One uses the full lengths of both sequences; the other normalizes by the length of the shorter sequence.
Method 1 uses the formula:
$S = \frac{2 L_s}{L_a + L_b} \times 100$
where $S$ is the percent similarity, $L_s$ is the number of aligned residues that are identical or similar, and $L_a$ and $L_b$ are the lengths of the two sequences.
Percent identity ($I$) is calculated with the analogous formula:
$I = \frac{2 L_i}{L_a + L_b} \times 100$
where $L_i$ is the number of identical aligned residues.
Method 2 computes the percentage of identical (or similar) residues relative to the full length of the shorter sequence:
$I(S) = \frac{L_{i(s)}}{L_a} \times 100$
where $L_a$ is here the length of the shorter sequence.
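The two calculation methods above can be sketched in a few lines of Python. This is only an illustration: the grouping of "similar" amino acids in `SIMILAR_GROUPS` and the function names are assumptions made for the example, not definitions taken from the text, and the input is assumed to be two already-aligned sequences of equal length with "-" for gaps.

```python
# Hypothetical grouping of residues with similar physicochemical properties
# (an assumption for illustration; real tools derive this from a scoring matrix).
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def are_similar(a, b):
    """True if two residues are identical or fall in the same assumed group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def percent_scores(aligned_a, aligned_b):
    """Compute identity and similarity by both methods for an aligned pair."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Only columns without a gap can count as identical or similar.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    li = sum(x == y for x, y in pairs)             # identical residues (Li)
    ls = sum(are_similar(x, y) for x, y in pairs)  # identical or similar (Ls)
    la = len(aligned_a.replace("-", ""))           # ungapped length of sequence a
    lb = len(aligned_b.replace("-", ""))           # ungapped length of sequence b
    shorter = min(la, lb)
    return {
        "identity_method1": 2 * li / (la + lb) * 100,
        "similarity_method1": 2 * ls / (la + lb) * 100,
        "identity_method2": li / shorter * 100,
        "similarity_method2": ls / shorter * 100,
    }


if __name__ == "__main__":
    for name, value in percent_scores("MKLV-DE", "MRLIADE").items():
        print(f"{name}: {value:.1f}%")
```

Method 2 always gives values at least as high as method 1, because it divides by the shorter length only; this is worth keeping in mind when comparing percentages reported by different tools.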
Significance of sequence comparison
Finding identical or similar regions in DNA, RNA, or protein molecules has many practical applications. From a biological and evolutionary point of view, DNA, RNA, or protein molecules with identical or similar sequences are likely to have identical or similar structures and biological functions, to be closely related evolutionarily, or to derive from a common origin. It is hard to list every application of sequence comparison. Within the scope of these lecture notes, sequence alignment is used to:
- Find sequences that are identical or similar to a given query sequence
- Detect shared structures in the compared sequences: motifs, domains, patterns
- Find functional regions in uncharacterized sequences
- Determine evolutionary relationships between sequences
- Analyze the role and effect of amino acids and nucleotides in mutations
- Detect SNPs
- Support sequence assembly

8.2.2. Analysis of open reading frames and coding sequences

Any sufficiently long DNA sequence has six reading frames. Each reading frame may contain one, several, or no open reading frames (ORFs). An ORF is a stretch of DNA whose length is a multiple of 3, beginning with a start codon (ATG in DNA, AUG in RNA), followed by consecutive triplet codons, and ending with one of the stop codons. Note that the genetic code can differ between groups of organisms. Usually the standard code is used, with AUG as the start codon and three stop codons: UAA, UAG, and UGA. A coding sequence is in essence an ORF, but the term is usually understood as a sequence that codes for a protein. Thus not every ORF is a coding sequence. Identifying ORFs and coding sequences helps detect the presence of genes in a stretch of DNA or RNA and predict the amino acid sequence a gene encodes. Many tools support ORF analysis; NCBI provides ORF finder, which is effective and user-friendly (Figure 44).
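The definition of an ORF given above (ATG start, in-frame codons, one of the three stop codons, scanned in all six frames) can be sketched as follows. This is a minimal illustration under the standard genetic code, not the algorithm used by NCBI ORF finder; the function names and the `min_codons` parameter are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard code, written as DNA
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Scan all six reading frames for ATG...stop stretches.

    Returns tuples (strand, frame, start, end, sequence); coordinates are
    0-based and refer to the strand that was scanned.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            # Walk the frame codon by codon.
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None                   # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGGG"):
        print(orf)
```

An ORF found this way is only a candidate: as the text notes, not every ORF is a protein-coding sequence, so real gene finders add length thresholds and further evidence.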

8.2.3. Searching for promoters and gene regulatory regions

Identifying promoters is one of the important tasks in genome annotation. In prokaryotes such as bacteria, identifying promoters is relatively simple. Promoter regions in these organisms usually have three features: (i) the transcription start site, also called position +1; (ii) a TATA-like box (the Pribnow box, consensus TATAAT) located about 10 bp upstream of the start site (position -10); and (iii) the sequence TTGACA around position -35.
In eukaryotes, identifying promoters is far more complex because of the diversity of the sequences and of their distances from the start site (+1). Promoters recognized by RNA polymerase II usually contain a TATA element (TATAAA) located at about -30 to -40 (on average -35). The promoter alone cannot drive transcription efficiently; it needs nearby sequences called promoter-proximal elements, which are usually found in the region -100 to -200 bp relative to the start site. These cis-acting elements include CCAAT boxes and a GC-rich region located upstream of the CCAAT box.
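The prokaryotic promoter features described above lend themselves to a simple pattern scan. The sketch below looks for a -35 box followed by a -10 box at a plausible distance. It is only an illustration: the -10 consensus TATAAT, the spacer range of 16 to 19 bp, and the mismatch tolerance are assumptions chosen for the example, and real promoter finders use statistical models rather than fixed strings.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_promoters(dna, box35="TTGACA", box10="TATAAT",
                   spacers=range(16, 20), max_mismatch=1):
    """Return (pos35, pos10, spacer) for each candidate promoter.

    pos35 and pos10 are 0-based start positions of the two boxes.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - len(box35) + 1):
        if mismatches(dna[i:i + len(box35)], box35) > max_mismatch:
            continue
        # Try each allowed spacer length between the two boxes.
        for gap in spacers:
            j = i + len(box35) + gap
            site = dna[j:j + len(box10)]
            if len(site) == len(box10) and mismatches(site, box10) <= max_mismatch:
                hits.append((i, j, gap))
    return hits


if __name__ == "__main__":
    example = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGGGGA"
    print(scan_promoters(example))
```

Loosening `max_mismatch` increases sensitivity but quickly produces false positives, which mirrors the trade-off that dedicated promoter prediction tools have to manage.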

Figure 55. Tools for finding promoters and TFs

In addition to cis elements, the genome contains many DNA regions that are recognized and bound by proteins that regulate transcription. Most of these proteins are transcription factors (TFs), which act in trans. CCAAT and GC boxes are usually recognized by DNA-binding proteins. Enhancer regions are likewise recognized and bound by certain proteins during transcription; such sequences are called UAS (upstream activating sequences). For example, the GCN4 protein recognizes UAS elements containing the sequence ATGACTCAT. By detecting these sequences, the positions of eukaryotic promoters can be determined. A promoter database has also been developed by SIB (Swiss Institute of Bioinformatics); it contains more than 200,000 promoter sequences from fruit fly, mouse, and human (http://epd.vital-it.ch/).

Figure 56. Tools for finding promoters and TFs

In parallel with the construction of promoter sequence databases, many programs and software packages that support promoter detection have been developed by different research groups. For example, NCBI provides the Finding the Promoter tool (http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/DNA/dna21b.html), and NIH provides WWW Promoter Scan (http://www-bimas.cit.nih.gov/molbio/proscan/).

To study the regulation of gene expression, accurate and complete information about transcriptional regulatory elements is needed. The Transcriptional Regulatory Element Database (TRED) was built to meet the growing demand for information on transcriptional regulatory elements (both cis and trans).

Figure 58. Tools for finding promoters and TFs

8.2.4. Functional motif searching in proteins

To study protein function, one needs to identify the amino acid regions that form catalytic centers, recognition sites, or sites of interaction between proteins, between a protein and its substrate, or between a protein and DNA. Given the abundance of amino acid sequence data, comparing the sequences of proteins with similar functions makes it possible to identify their functional regions. As databases of known functional regions grow, simply searching or scanning a protein sequence for a particular region can be enough to predict its function.
Tools for analyzing protein functional regions include:
InterProScan (http://www.ebi.ac.uk/Tools/pfa/iprscan/)
Motif search (http://www.genome.jp/tools/motif/)
Conserved domain search at NCBI (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml)

Figure 59. Tools for detecting domains and motifs

8.2.5. Predicting and simulating protein interactions

Protein-protein interaction is the formation of contacts between regions of two or more proteins as a result of electrostatic forces or biochemical events. The study of protein interactions combines bioinformatics and structural research to identify and classify molecular interactions between pairs or groups of proteins. Understanding protein interactions is important for studying cell signaling pathways, modeling the structure of protein complexes, and understanding biochemical processes. Studying protein structure and interaction can reveal motifs, predict function, trace evolutionary history and conserved sequences, and identify the amino acid residues that play important roles in the sequence. It also helps clarify post-translational modifications (phosphorylation, acylation, glycosylation, and ubiquitination) and determine the transport and localization of proteins in the cell. Mutations in the amino acid sequence can change the structure of a protein and its ability to interact with other molecules. For example, protein interaction studies can identify which amino acid changes alter catalytic properties, substrate binding, or allosteric sites, so that enzymes can be engineered or catalytic reactions controlled as desired. In immunology, protein interaction is the basis for assessing the reactivity of antibodies with antigens and for predicting specific reactions or cross-reactions.
Experimentally, physical interactions between protein pairs can be determined by many techniques, such as protein-fragment complementation assays (PCA), affinity purification/mass spectrometry, protein microarrays, and fluorescence resonance energy transfer (FRET). In bioinformatics, predicting protein-protein interactions requires databases built from experimental results. For example, experimental results show that proteins with similar structures, determined by certain amino acid regions, tend to interact with proteins of corresponding structures. Comparing nucleotide or amino acid sequences reveals conserved regions, motifs, domains, or the amino acids that play the main role in forming the interactions. On this basis, the structure of a protein of interest and its ability to interact with other molecules can be predicted.
To date, many methods have been applied in bioinformatics to predict protein interactions. One group is based on evolutionary relationships: it identifies protein families with similar structural models across many different organisms. This approach relies on the co-evolution of proteins and on orthologous structures among closely related species. If a metabolic pathway, or part of one, involves interacting proteins, then different species that use the same pathway will show the same or similar interactions between those proteins. Evolutionary data on proteins can therefore be used to infer their ability to interact. When the genomes of many species are compared, two proteins that consistently appear or disappear together are likely to be linked. Conserved sequence or structural regions detected by evolutionary analysis also allow interactions between proteins to be predicted. In phylogenetic trees, ligands and receptors tend to co-evolve along similar patterns rather than at random. This method uses the phylogenetic trees of protein pairs to decide whether an interaction is likely. BLAST and multiple sequence alignment tools (such as Clustal) are commonly used for this purpose. In addition, the structure of a query protein can be predicted and compared with known proteins, on the basis of sequence homology, to predict interactions between query proteins. This approach not only identifies protein interactions but also suggests a structural model of the interaction. However, its accuracy depends heavily on the available data about the proteins and their interactions. Experimental data on interacting protein complexes are still very limited, so predicted interaction models should be used only for reference and initial screening.

Summary of chapter 8
Biological data include sequence data, structural data, and other types of data such as articles, books, genotype and phenotype data, chemical compounds, and metabolic pathways. These data are stored in gene banks, NCBI, PDB, ExPASy, and similar resources. Sequence analysis consists of comparison operations carried out through sequence alignment. Assessing the similarity and homology of sequences is important for analyzing structure, function, and evolutionary relationships. Sequence comparison makes it possible to detect mutations, SNPs, conserved regions, and so on. Protein sequence analysis and the prediction of secondary and tertiary structure are supported by the ExPASy tools and the Protein Data Bank (PDB). Structure comparison supports the prediction of function and of interactions between molecules. ORF analysis supports the identification of coding regions and genes. Searching for protein functional regions (motifs, patterns, domains) is supported by many tools, such as Conserved Domain and Protein Classification at NCBI and Motif Scan. Identifying promoter sequences and the sequences involved in regulating gene expression is a basis for identifying genes in a genome.
Review questions for chapter 8
1. How can you find the nucleotide sequence of a gene, or the amino acid sequence of a protein of interest, in a database?
2. What is the accession number of a gene or protein? How do you obtain it?
3. What is an ontology? Why are ontologies important?
4. Which databases contain protein structure data? Give examples.
5. What is sequence comparison? Why is it needed?
6. Describe the evolutionary basis of sequence comparison and give examples.
7. Distinguish between the concepts of homology, similarity, and identity.
8. What is an open reading frame? On what basis is an open reading frame identified? What is the significance of ORF analysis?
9. What is a promoter? How can the promoter sequence that controls a gene in a eukaryotic organism be identified?
10. What is a coding sequence (CDS)? How are open reading frames and CDSs related?
11. On what basis are transcriptional regulatory sequences (TF binding sites) identified? What is the significance of identifying them?
12. What is a functional region in a protein molecule? On what basis can the functional regions of a protein be identified?
13. Analyze and explain the close relationship between protein structure and function.
14. What is molecular interaction? On what basis can molecular interactions be predicted or simulated?

CHAPTER 9
SEQUENCE ALIGNMENT AND ITS PRINCIPLES
9.1. Introduction to sequence alignment
The goal of pairwise alignment is to find the arrangement that maximizes the matching of nucleotides or amino acids between two sequences. Suppose the two sequences to be compared are GAATTCAG and GGATCGA. Many alignments are possible. In case (A) there are 5 identical nucleotides (matches), 1 mismatch, and 2 gaps. Case (B) likewise has 5 matches, 1 mismatch, and 2 gaps. Case (C) has 5 matches, 1 mismatch, and 1 gap in each sequence. Case (D) is similar, but the gaps are in different positions. Which of these cases is the correct alignment? There is no right or wrong answer; one can only pick the result that best fits a given criterion.

There are two types of alignment: global alignment and local alignment. Global alignment assumes that the two sequences are similar along their whole length, so the comparison covers the entire sequences. The alignment runs from the beginning to the end of both sequences to find the arrangement with the highest overall similarity. This method is applied to closely related sequences of roughly equal length. For distantly related sequences, or sequences of unequal length, it may not produce an optimal result. Local alignment does not assume that the two sequences are similar along their whole length. It finds only the regions with the highest similarity between the two sequences and aligns those regions, without regard to the rest of the sequences. This approach can be applied to sequences that are not closely related or that differ in length, with the aim of finding conserved regions, functional regions (domains), or patterns in DNA or protein sequences.

Figure 52. Global alignment and local alignment


The algorithms for global and local alignment are similar and differ only in the optimization strategy used when aligning matching characters. Both are based on one of three methods: the dot matrix method, dynamic programming, and the word (k-tuple) method.
9.2. Principles of sequence alignment
a) Dot matrix
The most basic alignment method is the dot matrix method, also called the dot plot method. It compares two sequences graphically using a two-dimensional matrix. The two sequences are written along the vertical and horizontal axes of the matrix. The comparison scans each character (nucleotide or amino acid) of one sequence for matches with the characters of the other.
When a match is found, a dot is placed at the corresponding position in the plot; otherwise the position is left blank. Where the two sequences share similar regions, many dots line up to form contiguous diagonal lines that represent the aligned regions. Breaks in a diagonal line indicate insertions or deletions. Parallel diagonal lines in the matrix indicate repeated regions (Figure 53). The main diagonal line shows the alignment; diagonal lines above or below it represent repeats in one sequence or the other.

Figure 53. Example of comparing two sequences using the dot matrix method


A problem that arises when large sequences are compared with the dot matrix method is high noise. In most dot plots, dots appear all over the matrix and obscure the true alignment diagonal. For DNA sequences the problem arises because there are only four characters, so each character has a high chance of matching somewhere in the other sequence. To reduce noise, instead of scanning with a single character, a filtering technique is applied that uses a window of fixed length covering a short stretch of characters. With filtering, the window slides along the two sequences to compare all stretches of that length. A dot is placed only when a stretch of window length in one sequence matches the corresponding stretch in the other sequence exactly. This method reduces noise effectively. The window is also called a tuple; its size can be adjusted so that a short matching region or pattern is marked by a single dot on the plot. However, if the window is too long, the sensitivity of the alignment decreases.
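The window filter described above can be sketched as follows, using the two example sequences from section 9.1. The function names and the text rendering are assumptions made for illustration; programs such as Dottup implement the same idea with graphical output.

```python
def dot_matrix(seq_a, seq_b, window=1):
    """Return the set of (i, j) positions where the window-length words
    starting at i in seq_a and j in seq_b match exactly."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        word = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            if seq_b[j:j + window] == word:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a down the side, seq_b along the top."""
    lines = ["  " + " ".join(seq_b)]
    for i, ch in enumerate(seq_a):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "GAATTCAG", "GGATCGA"
    for w in (1, 2):
        dots = dot_matrix(a, b, window=w)
        print(f"window={w}: {len(dots)} dots")
        print(render(a, b, dots))
```

With a window of 1 the plot is crowded with chance matches; with a window of 2 only a handful of dots remain, and with a window of 3 none survive for these short sequences, which illustrates the loss of sensitivity mentioned in the text.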
There are many variations of the dot matrix method. For example, a sequence can be aligned with itself to identify repeated regions. In this case there is a main diagonal representing the perfect match of every character with itself. If the sequence contains repeats, short parallel diagonal lines are observed above and below the main diagonal. DNA sequences capable of self-complementary pairing (inverted repeats), which can form hairpin structures, can also be detected easily with the dot matrix method. In this case the DNA sequence is compared with its own reverse complement, and the parallel diagonal lines represent the inverted repeats. For protein sequence comparison, a weighting scheme is used to account for the similarity in physical and biochemical properties of amino acid residues.
The dot matrix method gives a direct, visual picture of the relationship between two sequences. It helps the researcher readily identify the regions of greatest similarity. Another important advantage is that repeated regions can be identified from the parallel diagonal lines of the same size along the vertical and horizontal axes of the matrix. The method is very useful for identifying repeated regions on chromosomes and for comparing the order of conserved genes between two closely related genomes. It also supports the identification of secondary structures through the self-complementary pairing of a sequence.
One limitation of the dot matrix method is that, when long sequences with many insertions or deletions are aligned by joining nearby diagonal lines, the result depends on the subjective judgment of the user. Another limitation lies in how the dots in the matrix are observed and analyzed. Assessing the quality of the matrix to find the true diagonal (the alignment) is sometimes very difficult, especially for long sequences. The method is also limited to aligning two sequences; for multiple sequences it becomes very complex and impractical.
Some programs that compare sequences using the dot matrix method:
- Dotmatcher (bioweb.pasteur.fr/seqanal/interfaces/dotmatcher.html) and Dottup (bioweb.pasteur.fr/seqanal/interfaces/dottup.html) are two EMBOSS programs that can be used online.
- Dothelix (www.genebee.msu.su/services/dhm/advanced.html) is a dot matrix program for both DNA and protein sequences.
- MatrixPlot (www.cbs.dtu.dk/services/MatrixPlot/) is a more sophisticated dot matrix program for aligning amino acid sequences.
b) Dynamic programming
Dynamic programming is a method that finds the optimal alignment by matching two sequences across all possible pairings of their characters. It is basically similar to the dot matrix method in that it also builds a two-dimensional matrix. However, it finds alignments quantitatively, by converting the dot matrix into a scoring matrix that accounts for matches and mismatches between the two sequences. The arrangement that yields the highest score is the best possible alignment.
Dynamic programming works by building a two-dimensional matrix whose axes are the two sequences being compared. Character pairs are scored according to a given scoring matrix. Scores are computed one row at a time: the first row of one sequence is scanned across the full length of the other sequence, then the second row is scanned, and so on. The scores for each row are computed on the basis of the scores obtained in the previous row. The best score is entered into the bottom-right corner of an intermediate matrix. This process is repeated until all cells are filled. Scores thus accumulate along the diagonal from the top-left corner to the bottom-right corner. Once the scores have been accumulated in the matrix, the next step is to find the path that represents the optimal alignment, by tracing back through the matrix from the bottom-right corner to the starting point in the top-left corner. The best path is the one with the highest score. If two or more paths give the same score, one of them is chosen arbitrarily as the optimal alignment. The path may also move horizontally or vertically at certain points, which corresponds to introducing a gap in one of the two sequences.
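The fill-and-traceback procedure described above corresponds to the classic Needleman-Wunsch global alignment algorithm. The sketch below is a minimal version with assumed scores (match +1, mismatch -1, and a simple per-position gap penalty of -2); these values and the function name are illustrative choices, not values given in the text. Ties are resolved in a fixed order (diagonal first) rather than at random.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill the matrix row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("GAATTCAG", "GGATCGA")
    print(s)
    print(x)
    print(y)
```

Changing the three score parameters changes which of several candidate alignments is reported as optimal, which is the sense in which the "best" alignment depends on the chosen criterion.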


Gap penalties
Optimizing an alignment usually involves introducing gaps, which correspond to insertion or deletion mutations. Because insertions and deletions are relatively rare in natural evolution compared with substitutions, the introduction of gaps must be considered carefully. Algorithmically this is not trivial, because the gaps must reflect the insertion and deletion of nucleotides during evolution.
Assigning penalty values is somewhat arbitrary, because no evolutionary theory gives the cost of an insertion or deletion. If the penalty is too low, gaps become so numerous that even unrelated sequences can be matched with high similarity and similar scores. If the penalty is too high, gaps become too hard to introduce and a reasonable alignment cannot be obtained.
Many empirical studies on globular proteins have tested different penalty values in order to develop the most suitable alignment settings. These values are usually set as defaults in most alignment programs. Another important factor is the position of a gap within a run of consecutive gaps, as in the case of a segment deletion. The first gap position introduced is called the opening gap (or introducing gap), the following positions are called extending gaps, and the last position is called the closing gap. Clearly, adding a gap position after an opening gap is easier than introducing the first one. The first gap position should therefore be penalized more heavily than the subsequent ones. For example, one may use the scheme -12/-1, in which the first gap position is penalized 12 points and each subsequent position 1 point. The total gap penalty $W$ is computed as $W = \gamma + \delta \times (k - 1)$, where $k$ is the length of the gap, $\gamma$ is the gap opening (introducing) penalty, and $\delta$ is the gap extension penalty.
Other penalty schemes can also be applied, such as the constant gap penalty, in which every gap position receives the same penalty regardless of where it lies. The penalty values can also be changed to suit the purpose of the user.
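The two penalty schemes can be compared directly with a small calculation. The sketch below uses the -12/-1 example from the text as default values; the function names and the per-position value of 2 in the usage example are assumptions made for illustration.

```python
def affine_gap_penalty(k, opening=12, extension=1):
    """Total penalty W = opening + extension * (k - 1) for a gap of length k."""
    if k <= 0:
        return 0
    return opening + extension * (k - 1)


def uniform_gap_penalty(k, per_position=2):
    """Every gap position costs the same, wherever it lies in the gap."""
    return per_position * max(k, 0)


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, affine_gap_penalty(k), uniform_gap_penalty(k))
```

Under the affine scheme one gap of length 5 costs 16, whereas five separate single-position gaps cost 60, so the alignment is pushed toward a few long gaps rather than many short ones.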
c) Word methods
Also called the k-tuple method, this is a heuristic approach: it does not guarantee an optimal alignment, but it is much more efficient than dynamic programming. It is especially useful for long sequences or large numbers of sequences. It is best known for its use in the database search tools of the FASTA and BLAST families. Word methods take a series of short, non-overlapping words from the query sequence and match them against the sequences in the database. The relative positions of a word in the two sequences are subtracted to obtain an offset; if several different words give the same offset, this indicates an aligned region. Only the regions identified in this way are processed further with more rigorous alignment parameters. Other sequences are discarded immediately, which greatly shortens the analysis time.
In FASTA, the user can set the value k, the word length; adjusting k matters when searching for short sequences. With a small k the search is slower but more sensitive, so short related sequences can be found. The BLAST family also provides parameters that optimize the search for different kinds of query, for example when searching for relatively distantly related sequences.
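The offset idea behind word methods can be sketched as follows. This is a simplified illustration, not the FASTA or BLAST implementation: it uses overlapping words for simplicity, exact matches only, and hypothetical function names. Word hits are counted per diagonal, where the diagonal is the offset between the subject position and the query position.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every k-letter word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k):
    """Count exact word hits per diagonal (subject position - query position)."""
    index = index_words(subject, k)
    diagonals = defaultdict(int)
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            diagonals[s - q] += 1
    return dict(diagonals)


if __name__ == "__main__":
    hits = diagonal_hits("ACGTAC", "TTACGTACGG", k=3)
    # The diagonal with the most hits marks the region worth aligning in detail.
    best = max(hits, key=hits.get)
    print(hits, "best offset:", best)
```

A smaller k yields more hits per diagonal and finds weaker similarities, but the index grows and more candidate regions must be examined, which is the speed and sensitivity trade-off described in the text.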
9.3. Multiple sequence alignment and its principles
a) Multiple sequence alignment
Multiple sequence alignment is an extension of pairwise sequence comparison. It aligns three or more sequences. Multiple sequence alignment is commonly used to detect conserved regions in a group of sequences suspected of being evolutionarily related. These conserved regions may be motifs associated with the catalytic site of an enzyme, a substrate binding site, a regulatory site, and so on. Multiple sequence alignment also supports the construction of evolutionary relationships (phylogenetic trees). The algorithms used in multiple sequence alignment include dynamic programming, progressive methods, iterative methods, and motif finding.

Figure 54. Multiple sequence alignment


b) Principles of multiple sequence alignment
Dynamic programming
In theory, dynamic programming can be applied to any number of sequences. However, because of limits on processing time, computing power, and memory cost, it is rarely used for more than 3 or 4 sequences. The method requires building an n-dimensional matrix, where n is the number of sequences in the query set. It first performs pairwise comparisons between the sequences, and then fills in the "alignment space" by considering possible matches and gaps at intermediate positions. Although this technique is computationally expensive, it guarantees a globally optimal solution, which is useful when only a few sequences are analyzed and high accuracy is required. One method that reduces the computational cost, based on the "sum of pairs" objective function, is provided by the MSA software package.
Progressive methods
This approach is hierarchical, sequential, or tree-based. It first aligns the most similar sequences and then progressively adds sequences or groups of decreasing similarity, until all query sequences have been merged into the alignment.
The initial tree describes the relationships among the sequences on the basis of pairwise comparisons, which may use heuristic methods similar to FASTA. Progressive alignment results depend on the choice of the "most related" sequences, and the alignment is then extended to less similar pairs of sequences. Most progressive methods weight the sequences in the query set according to their relatedness, which reduces the chance of a poor choice of initial sequences and thus improves the reliability of the alignment.
Programs of the Clustal family use this method to compare multiple sequences, build phylogenetic trees, and support protein structure prediction. T-Coffee is a slower but more accurate program that also uses the progressive method.
Iterative methods
This approach was developed to address the weak point of progressive methods, namely their heavy dependence on the accuracy of the initial pairwise alignment. It optimizes an objective function based on a chosen alignment scoring method: starting from an initial global alignment, it realigns subsets of the member sequences (the sequences in the query set). The realigned subsets are then themselves aligned to produce the next round of alignment. The final result is the set of sequences with the best alignment.
Motif finding
This is a form of alignment based on global multiple sequence alignment that aims to extract information about motifs and conserved regions. The algorithm first builds a global multiple sequence alignment; the highly conserved regions are then isolated and used to construct a set of profile matrices. These matrices are then used to search other sequences and to check the frequency of the motifs. When the initial data set contains only a few sequences, or only very closely related sequences, pseudocounts are added to normalize the distribution of characters in the motif.
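The role of pseudocounts can be seen in a small profile matrix built from a handful of motif instances. The sketch below assumes DNA motifs, a uniform background frequency, and a pseudocount of 1 per character; these choices and the function names are illustrative assumptions rather than the settings of any particular program.

```python
import math


def build_pssm(motif_instances, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds profile (one dict per column) from aligned motif instances."""
    n = len(motif_instances)
    length = len(motif_instances[0])
    background = 1.0 / len(alphabet)         # assumed uniform background
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in motif_instances]
        scores = {}
        for ch in alphabet:
            # Pseudocounts keep unseen characters from getting zero frequency,
            # which would give a log-odds score of minus infinity.
            freq = (column.count(ch) + pseudocount) / (n + pseudocount * len(alphabet))
            scores[ch] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_site(pssm, site):
    """Sum the column scores for a candidate site of the same length."""
    return sum(col[ch] for col, ch in zip(pssm, site))


if __name__ == "__main__":
    instances = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
    pssm = build_pssm(instances)
    for site in ("TATAAT", "TACGAT", "GGGGGG"):
        print(site, round(score_site(pssm, site), 2))
```

Without pseudocounts, a site such as TACGAT, which combines two variants never seen together, could still be scored, but any site containing a character absent from a column would be rejected outright even though only four instances were observed.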
9.4. Tools for searching for homologous sequences
BLAST and FASTA are two software tools used to compare biological sequences (DNA and protein). FASTA appeared in the 1980s to meet the need to compare and search for similar genes. BLAST stands for Basic Local Alignment Search Tool. The name FASTA is read "FAST-A" and stands for "FAST-All", because it works with both nucleotide and protein sequences, whereas the earlier FASTP program handled only protein sequences. Both BLAST and FASTA allow very fast sequence comparison against any genome data.
a) FASTA

The program was first written in 1985 as FASTP, which was designed to search for similar protein sequences, and it was later modified to search DNA sequences as well. The current FASTA package contains programs for protein:protein, DNA:DNA, protein:translated DNA (all frames), and DNA:protein comparisons, as well as peptide searches. The FASTA package also provides SSEARCH, an implementation of the Smith-Waterman algorithm. FASTA is less widely used than BLAST.
b) BLAST tools

Basic BLAST

Nucleotide blast: searches a nucleotide database for sequences similar to a nucleotide query. Three algorithms are available for different purposes: blastn, megablast, and discontiguous megablast.
Protein blast: searches a protein database for sequences similar to a protein query. Three algorithms are available for different purposes: blastp, psi-blast, and phi-blast.
Blastx: compares a DNA query with a protein database by translating the query in all 6 reading frames and comparing each translation with the protein database.
Tblastn: compares a protein query with a DNA database translated in all 6 reading frames. In other words, the DNA database is translated into amino acid sequences in all 6 frames and then compared with the protein query.
Tblastx: compares the proteins encoded by a DNA query with the proteins encoded by a nucleotide database. The number of comparisons is very large, because the DNA query yields 6 protein sequences and the nucleotide database is also translated in 6 reading frames. This gives 36 combinations in total, so the method is demanding in time and computing resources.
BLAST variants

BLAST 2: also called Advanced BLAST. It can produce gapped alignments.
PSI-BLAST (Position-Specific Iterated BLAST): searches databases to detect sequences with distant evolutionary relationships. PSI-BLAST first runs an ordinary BLAST search and selects the highest-scoring sequences. Selected positions in these sequences are used to build a position-specific scoring matrix (PSSM), which acts as a profile of the important positions of conserved amino acids in a motif. Using this profile, PSI-BLAST continues to search for proteins with similar motifs, so it is often used to detect proteins with more distant structural or functional relationships that ordinary BLAST cannot find.
WU-BLAST: Washington University BLAST (WU-BLAST) version 2.0 is software for identifying and searching for genes and proteins using fast, sensitive, and specific searches against protein and DNA databases. WU-BLAST 2.0 builds on WU-BLAST 1.4 (which is similar to NCBI BLAST version 1.4). It is a gapped BLAST with its own statistical tools, and is known as a search program with high sensitivity, speed, accuracy, and reliability that is competitive with comparable programs.
PHI-BLAST (Pattern-Hit Initiated BLAST): searches protein sequences by combining pattern matching with local alignment to reduce the probability of false positives.
RPS-BLAST: Reverse Position-Specific BLAST (RPS-BLAST) finds conserved domains in proteins with much higher sensitivity than BLAST. It compares a protein sequence with a database of position-specific scoring matrices (PSSMs).
Gapped BLAST: allows gaps in the aligned sequences. This program is about 3 times faster than the original ungapped BLAST.

c) Choosing a BLAST program

Choosing a BLAST program for a nucleotide query

Length | Database | Purpose | Program
20 bp or longer (28 bp or longer for megablast) | Nucleotide | Identify the query sequence | discontiguous megablast, megablast, or blastn
20 bp or longer | Nucleotide | Find sequences similar to the query sequence | discontiguous megablast or blastn
20 bp or longer | Nucleotide | Find sequences similar to sequences in the Trace archive | Trace megablast or Trace discontiguous megablast
20 bp or longer | Nucleotide | Find proteins similar to the translated query in a translated database | Translated BLAST (tblastx)
20 bp or longer | Peptide | Find proteins similar to the query in a protein database | Translated BLAST (blastx)
7 to 20 bp | Nucleotide | Find primer binding sites or short contiguous motifs | Search for short, nearly exact matches

Megablast:

This tool determines whether an unknown sequence is already present in the database. Of the three tools with similar functions (megablast, discontiguous megablast, and blastn), megablast is designed specifically for long sequences and for finding highly similar sequences. In addition to the cutoff set through the expect value, the program lets the user adjust the percent identity required between the sequences found and the query. Megablast also accepts multiple query sequences in a single search.
Discontiguous megablast:
This program finds more divergent sequences. Like the other BLAST programs, it cuts
the query into short sequences called words. The program looks for exact matches to
these query words, called word hits, and then extends the hits in several steps to
produce a final alignment that may contain gaps. The larger the initial word size
(or simply word size), the narrower the search, so that only very similar sequences
are returned, and vice versa. Megablast uses a default word size of 28 and
discontiguous megablast a word size of 11. Blastn also defaults to 11, but its word
size can be lowered to a minimum of 7.
Blastn: slower than megablast but more sensitive, because it accepts smaller word
sizes. The results range from sequences that are highly similar to the query down
to sequences with progressively lower similarity.
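
The effect of word size on seeding can be illustrated with a minimal Python sketch. The function names `query_words` and `exact_word_hits` and the toy sequences are assumptions made for illustration only. Real BLAST programs use optimized lookup tables and go on to extend each hit into a gapped alignment, which this sketch does not attempt.

```python
def query_words(query, word_size):
    """Return (offset, word) for every overlapping word in the query."""
    return [(i, query[i:i + word_size])
            for i in range(len(query) - word_size + 1)]


def exact_word_hits(query, subject, word_size):
    """Return (query_offset, subject_offset) pairs of exact word matches."""
    # Index every word of the subject by its start position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    hits = []
    for i, word in query_words(query, word_size):
        for j in index.get(word, []):
            hits.append((i, j))
    return hits


# Toy sequences (hypothetical): the subject differs from the query in the middle.
query = "ACGTACGTGA"
subject = "TTACGTACCTGA"

hits_w7 = exact_word_hits(query, subject, 7)  # large word size: stricter seeding
hits_w4 = exact_word_hits(query, subject, 4)  # small word size: more seeds
print(len(hits_w7), len(hits_w4))
```

With the larger word size no seed survives the mismatch, while the smaller word size still finds several seeds, which is why lowering the word size makes a search more sensitive and slower.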


Choosing a BLAST program for protein query sequences

Length | Database | Purpose | Program
-------|----------|---------|--------
15 amino acids or longer | Peptide | Identify the query sequence or find protein sequences similar to the query | Standard Protein BLAST (blastp)
15 amino acids or longer | Peptide | Find members of a protein family or build a position-specific scoring matrix | PSI-BLAST
15 amino acids or longer | Peptide | Find proteins similar to the query around a given pattern | PHI-BLAST
15 amino acids or longer | Peptide | Find the domains present in the query | CD-search (RPS-BLAST)
15 amino acids or longer | Peptide | Find conserved domains in the query and identify other proteins with similar domain architectures | Conserved Domain Architecture Retrieval Tool (CDART)
15 amino acids or longer | Nucleotide | Find similar proteins in a translated nucleotide database | Translated BLAST (tblastn)
5 to 15 amino acids | Peptide | Find motifs | Search for short, nearly exact matches

d) BLAST with short query sequences


This is a very useful tool for checking PCR primers or for finding short sequences
that match a particular region of a genome. It is also used to identify target
sequences for RNAi. Short queries (shorter than 20 bases) usually return no hits
when BLAST is run with its default settings. The reason is that the significance
threshold, which is controlled by the expect value, is set too stringently. To make
such searches work, both the word size and the expect value of standard BLAST have
to be adjusted to values suited to short sequences. The table below summarizes the
blastn settings for searches with short queries. NCBI also provides a tool that
adjusts these parameters automatically.
Program | Word size | DUST filter setting | Expect value
--------|-----------|---------------------|-------------
Standard blastn | 11 | On | 10
Search for short, nearly exact matches | 7 | Off | 1000
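
As a small illustration, the choice between the two rows of the table can be written as a lookup. The helper name `blastn_short_query_settings` and the dictionary keys are hypothetical and are not part of any BLAST interface. The 20-base limit is taken from the text above.

```python
def blastn_short_query_settings(query_length, short_limit=20):
    """Pick blastn parameters from the table above (hypothetical helper)."""
    if query_length < 7:
        raise ValueError("query is shorter than the smallest word size (7)")
    if query_length < short_limit:
        # Row "Search for short, nearly exact matches"
        return {"word_size": 7, "dust_filter": False, "expect": 1000}
    # Row "Standard blastn"
    return {"word_size": 11, "dust_filter": True, "expect": 10}


primer_settings = blastn_short_query_settings(18)   # e.g. an 18-base primer
gene_settings = blastn_short_query_settings(900)    # e.g. a 900-base gene fragment
print(primer_settings)
print(gene_settings)
```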

Because BLAST produces local alignments and automatically searches both strands,
there is no need to reverse-complement the reverse primer before concatenating the
primers or running the search. The same approach applies to searches with short
peptide fragments: both the word size and the expect value are adjusted. In
addition, PAM30 should be used instead of BLOSUM62.
Settings for standard blastp and for the search for short, nearly exact matches

Program | Word size | SEG filter | Expect value | Score matrix
--------|-----------|------------|--------------|-------------
Standard Protein BLAST | 3 | On | 10 | BLOSUM62
Search for short, nearly exact matches | 2 | Off | 20000 | PAM30

Differences between BLAST and FASTA


BLAST is much faster than FASTA and, for closely related sequences, also more
accurate. For highly similar sequences BLAST is extremely accurate, whereas for
sequences with low similarity FASTA has the advantage. BLAST offers the user many
options for changing the search parameters, whereas FASTA generally offers fewer.
Because of these advantages, BLAST is now the tool most users choose.


CHAPTER 10. ANALYSIS OF EVOLUTIONARY RELATIONSHIPS


A phylogenetic tree is a diagram that describes the evolutionary relationships
among biological entities. These relationships are analyzed on the premise that all
species descend from a common ancestor. Each entity in the tree is called an
operational taxonomic unit (OTU). An OTU may be a species, a gene, or a genome.
Evolutionary relationships are inferred by analyzing the similarities and
differences among the OTUs, and the way they are reconstructed depends on the type
of data and the method of analysis. The data may be morphological characters,
distribution, physiological and biochemical traits, nucleotide or amino acid
sequences, or the homology of DNA, RNA, and protein structures.
To build a phylogenetic tree, the researcher has to answer the following questions:
1. Which data are used to build the tree?
2. Given the available data, which method of analysis should be chosen?
3. Which evolutionary model should be used?
4. How can the accuracy of the resulting tree be checked or measured?
10.1. Concepts
Consider a simple example. Suppose that at some point in evolution an initial cell
divides into two cells, and these cells keep dividing for 500 successive rounds,
producing a very large population. During the divisions each cell accumulates its
own random changes. Now take 10 cells at random from the population and compare
them with one another. Two cells (1 and 2) that share a recent common ancestor
(group A) will be more similar to each other than two cells whose common ancestor
is distant (group B).

Evolution, of course, does not follow such a simple model. Evolutionary events
differ not only in extent but also in rate. For this reason, applying a single
evolutionary model to every object under analysis will not accurately reflect the
events that each of them has undergone.

Building a tree that reflects the true evolutionary history of the taxa is very
difficult, because evolutionary events occur at random and nobody witnessed them.
A further question is whether the biological sequence data (DNA, RNA, protein) used
in the analysis fully and accurately reflect the evolutionary relationships of the
taxa.
When biological sequences are used to build a tree, the first question is which
sequence is suitable. In principle any sequence can be used: a gene, part of a
coding region, or even introns. Non-coding regions such as promoters, intergenic
spacers, and other elements can also be used. No single sequence suits every
purpose. For example, the genes encoding the small subunit ribosomal RNA (ss-rRNA)
are often used to study microbial evolution, because this molecule is highly
conserved among species while different regions of the gene evolve at different
rates. The ss-rRNA gene is also used to distinguish bacteria from archaea and to
identify unknown bacterial species.
Although widely used, rRNA genes have several limitations. In some thermophilic
bacteria, for example, the rRNA genes have a higher G+C content than other genes,
so it is very hard to obtain an accurate tree when they are analyzed together with
other bacteria. When studying organisms that grow at different temperatures,
researchers therefore often choose a gene other than rRNA. Another limitation is
that rRNA genes usually evolve more slowly than some protein-coding genes. They are
therefore unsuitable for analyzing close relationships, such as those among species
within a genus. In such cases protein-coding genes are more appropriate, and if the
coding regions do not vary enough to reveal polymorphism, introns, pseudogenes, or
intergenic regions can be used instead. Researchers thus need to choose a type of
sequence that can reveal the differences or polymorphisms among the taxa.
The rate of sequence change is not the only criterion for choosing a gene or a DNA
region in the genome. It is also important that the chosen sequences are closely
tied to the evolutionary question under study. Genes present in many copies need
careful consideration, because gene duplication events do not necessarily coincide
with the origin of new species.
The analysis of homologous sequences is the foundation of tree building. Homology
between sequences is inferred by comparing their similarity. Similarity alone,
however, is not enough to conclude that sequences share a common origin, because
similarity can arise by chance, unrelated to common descent. Similar sequences may
reflect homology, but they may also result from convergent or parallel evolution,
which is called analogy. How, then, can homology be distinguished from analogy? One
approach is to set a similarity threshold and to regard sequences as homologous
only when their similarity is above it. Another approach is to take each of the
similar sequences, find its nearest ancestor or common ancestors, and then compare
these common ancestors with one another.

For example, the wings of bats and birds look alike, yet bats and birds did not
inherit them from a winged common ancestor. The resemblance is the result of
convergent evolution: because both groups fly, some of the organs needed for
flight have come to share a similar structure. Note that homology is an inference
drawn from the comparison of sequence similarity. Homology has no measurable
quantity, whereas sequence similarity can be measured, for example as percent
identity (%). It is therefore incorrect to say that two individuals are 40%
homologous; one can only say that their sequences are 40% similar.
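
Since similarity is the measurable quantity, a minimal sketch of how percent identity could be computed from two aligned sequences may help. The function name `percent_identity` and the toy sequences are assumptions; columns containing a gap ("-") are skipped here, which is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of compared (gap-free) columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # a gap column is not counted as a comparison
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# Toy alignment (hypothetical): 10 columns, 2 mismatches.
identity = percent_identity("ACGTACGTAC", "ACGTTCGTAA")
print(identity)
```

The number returned describes similarity only. Whether the two sequences are homologous remains a separate inference.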
A DNA segment or genomic region chosen for phylogenetic analysis must be
characteristic of the species and must evolve fast enough to distinguish the
individuals under analysis. It should not evolve so fast, however, that the
sequences no longer reflect the relationships among distantly related species. To
date, the DNA encoding ribosomal RNA has been regarded as the most suitable genomic
region for phylogenetic analysis. Besides the 16S rRNA gene, other regions are also
used, such as 23S, ITS1, ITS2, cytochrome c oxidase subunit I (COI), Cyt b,
ribulose 1,5-bisphosphate (RuBP) carboxylase/oxygenase (RubisCO), and the RNA
polymerase β-subunit gene (rpoB). In many cases several genes need to be combined
in the analysis.
Another difficulty in phylogenetic analysis is choosing an appropriate method and
evolutionary model. Many algorithms and criteria exist for building trees, but the
two most widely used approaches are distance-based methods and character-based
methods.
10.2. Data used to build phylogenetic trees
Before biological sequence data (nucleotide, amino acid) became available,
morphological data were commonly used to study the evolutionary relationships among
organisms. The rapid development of sequencing techniques and genome projects has
produced an enormous number of biological sequences. Alongside morphological data,
sequence data have gradually come to dominate evolutionary studies. At the level of
microevolution, sequence analysis is appropriate because it reflects the smallest
changes (allele frequencies) in a population. These changes can arise from
mutation, genetic drift, gene flow, recombination, and natural selection. As for
the definition of a species, taxonomy treats a species as a group of individuals
with distinctive characters that distinguish it from other groups. A classification
built this way does not properly reflect the evolutionary process. Defining a
species as the group of individuals that are genetically most similar, that is,
most similar in genome sequence, reflects evolution more accurately, because the
smallest sequence changes can be detected even when the phenotype has not changed.
This chapter deals mainly with the use of biological sequences to analyze
evolutionary relationships. The sequences may come from the researcher's own work
or from sequence databases. Two factors bear directly on tree building: the type of
input data and the algorithm. The input can be either character data or distance
data.
Character data: produced by aligning biological sequences (nucleotide or amino acid).

Figure 58. Character data from an amino acid sequence alignment

Distance data: a multiple sequence alignment is converted into distance values
between the aligned sequences.

Figure 58. Computing a distance matrix from a multiple sequence alignment


The advantage of distance data is that a tree can be computed and drawn quickly.
The disadvantage is that information may be lost when character data are converted
into distance data. The multiple alignment is first turned into a matrix by an
algorithm, and the distances between the compared sequences are then derived from
this matrix. Because alignment itself relies on heuristic algorithms, which return
a plausible result rather than the provably best one, distance data are suitable
only for certain kinds of sequences and depend on sequence quality.
There are many ways to compute distances from an alignment, including:
(1) Uncorrected Distance, (2) Jukes-Cantor Distance, (3) Tajima-Nei Distance,
(4) Kimura Two-Parameter Distance, (5) Tamura Distance, (6) Jin-Nei Gamma Distance,
(7) Kimura Protein Distance.
a) For DNA sequences the following formula can be used:
D = mismatch / align length
where mismatch is the number of aligned nucleotide pairs that differ, and align
length is the length of the alignment (measured on the longest sequence). In this
calculation gap positions do not contribute to the distance.
b) For proteins the Kimura formula can be used:
Distance = -ln(1 - D - 0.2 D^2)
where D = 1 - S, and S = number of exact matches / number of aligned positions.
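
The two formulas above can be sketched in a few lines of Python. The function names and the toy sequences are assumptions for illustration. Following the text, gap columns ("-") add no mismatch in the DNA formula, and only gap-free columns are counted as aligned positions in the protein formula.

```python
import math


def uncorrected_distance(aligned_a, aligned_b):
    """D = mismatch / align length; gap columns contribute no mismatch."""
    mismatches = sum(1 for x, y in zip(aligned_a, aligned_b)
                     if x != "-" and y != "-" and x != y)
    return mismatches / len(aligned_a)


def kimura_protein_distance(aligned_a, aligned_b):
    """Distance = -ln(1 - D - 0.2 * D**2), with D = 1 - S."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    s = sum(1 for x, y in pairs if x == y) / len(pairs)
    d = 1.0 - s
    inner = 1.0 - d - 0.2 * d * d
    if inner <= 0:
        raise ValueError("sequences too divergent for this formula")
    return -math.log(inner)


# Toy alignments (hypothetical): 10 columns with 2 differences each.
dna_d = uncorrected_distance("ACGTACGTAC", "ACGTTCGTAA")
prot_d = kimura_protein_distance("MKVLAAGIVG", "MKVLSAGIVA")
print(round(dna_d, 3), round(prot_d, 4))
```

The corrected protein distance is slightly larger than the observed fraction of differences, because the formula allows for positions that have changed more than once.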
10.3. Methods for building phylogenetic trees
There are two approaches to building phylogenetic trees: distance-based methods and
character-based methods.

10.2.1. Distance-based methods


In this approach the sequences are first compared pairwise, and each comparison
yields a distance value. The two sequences with the smallest distance are merged
into a group (cluster). This group is then compared with the other groups and the
distances are recalculated. The groups with the smallest distance are again merged
into a cluster, and the calculation continues until all sequences have been placed
in groups. The end result is a tree that represents the relationships among the
sequences under study. The most widely used methods in this group are UPGMA
(Unweighted Pair-Group Method using Arithmetic averages) and NJ (Neighbour Joining).
UPGMA
This is the simplest tree-building method and is based on a clustering algorithm.
Its limitation is the assumption that the evolutionary rate is the same in all the
sequences analyzed. This assumption holds only when the sequences are highly
similar and share a recent common ancestor. For sequences separated by large
evolutionary distances, the resulting tree has low accuracy. On the other hand, the
method produces a rooted tree very quickly.
The following is an example of the UPGMA method. Sequences A, B, C, and D are
aligned pairwise, and the distance values are calculated and entered in a table.
OTU   A     B     C     D
A     0
B     0.1   0
C     0.3   0.5   0
D     0.6   0.6   0.66  0

The pairs A:A, B:B, C:C, and D:D have a distance of 0, so the table can be
rewritten without them. In this table the distance between A and B is the smallest
(0.1), so A and B are joined into a pair.

A point is placed midway between A and B. It represents the node at which A and B
diverged, that is, their most recent common ancestor. Let d be the distance from
this midpoint to A and to B; then d = dAB/2 = 0.1/2 = 0.05.
A and B are then merged into the cluster AB. The distances from C to AB and from D
to AB are calculated as:
d(AB)C = (dAC + dBC)/2 = (0.3 + 0.5)/2 = 0.4
d(AB)D = (dAD + dBD)/2 = (0.6 + 0.6)/2 = 0.6

Since d(AB)C is smaller than both d(AB)D and dCD (0.66), sequence C is grouped with
AB to form the cluster ABC.
The value d(AB)C/2 = 0.4/2 = 0.2 is the evolutionary distance from their common
ancestor to AB and to C.

The distance between D and ABC is
d(ABC)D = (dAD + dBD + dCD)/3 = (0.6 + 0.6 + 0.66)/3 = 0.62.

The value d(ABC)D/2 = 0.62/2 = 0.31 is the evolutionary distance from their common
ancestor to ABC and to D.
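
The worked example can be reproduced with a short Python sketch of the clustering loop. The function name `upgma` and the way clusters are labeled (by concatenating OTU names) are assumptions made for illustration. The distances are those of the A, B, C, D example above.

```python
def upgma(distances):
    """Cluster OTUs by UPGMA; return a list of (left, right, node_height)."""
    d = {frozenset(pair): value for pair, value in distances.items()}
    sizes = {}
    for pair in d:
        for name in pair:
            sizes[name] = 1  # every OTU starts as a cluster of size 1
    merges = []
    while len(sizes) > 1:
        closest = min(d, key=d.get)          # pair with the smallest distance
        a, b = sorted(closest)
        merged = a + b
        merges.append((a, b, d[closest] / 2))  # node sits at half the distance
        for other in sizes:
            if other in (a, b):
                continue
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            # Average over all OTUs in the merged cluster (size-weighted).
            d[frozenset((merged, other))] = (
                (sizes[a] * d_a + sizes[b] * d_b) / (sizes[a] + sizes[b]))
        del d[closest]
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return merges


example = {("A", "B"): 0.1, ("A", "C"): 0.3, ("B", "C"): 0.5,
           ("A", "D"): 0.6, ("B", "D"): 0.6, ("C", "D"): 0.66}
merges = upgma(example)
for left, right, height in merges:
    print(left, right, round(height, 2))
```

The three merges correspond to the node heights 0.05, 0.2, and 0.31 derived by hand above.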


Because the distance calculation depends on the method used, the same data can give
different results with different algorithms. In practice the distance values have
to be corrected with a model such as the Jukes-Cantor model, the Kimura
two-parameter model, or the Kimura three-parameter model.
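
As an illustration of such corrections, the sketch below applies the standard Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), and the Kimura two-parameter formula, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where p is the observed fraction of differing sites, P the fraction of transitions, and Q the fraction of transversions. The function names and toy sequences are assumptions, and gap columns are simply ignored.

```python
import math

PURINES = {"A", "G"}


def jukes_cantor(p):
    """Jukes-Cantor distance from the observed fraction of differences p."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_two_parameter(aligned_a, aligned_b):
    """Kimura two-parameter distance from two aligned DNA sequences."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    transitions = 0
    transversions = 0
    for x, y in pairs:
        if x == y:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    p_ts = transitions / len(pairs)
    q_tv = transversions / len(pairs)
    return -0.5 * math.log((1.0 - 2.0 * p_ts - q_tv)
                           * math.sqrt(1.0 - 2.0 * q_tv))


jc = jukes_cantor(0.1)
# Toy alignment (hypothetical): one transition (A/G) and one transversion (C/A).
k2p = kimura_two_parameter("ACGTACGTAC", "GCGTACGTAA")
print(round(jc, 4), round(k2p, 4))
```

Both corrected values exceed the observed fraction of differences, reflecting substitutions that hit the same site more than once.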

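NJ (Neighbour Joining)
In essence, the NJ method is similar to UPGMA. NJ applies the minimum-evolution
principle: at each step it joins the pair of OTUs that keeps the total branch
length of the tree as small as possible. The name refers to this joining of
neighboring OTUs. The advantage of the method is that it does not assume the same
evolutionary rate on all branches of the tree.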


Figure 59. The grouping principle of NJ


A: an unresolved star tree. B: OTUs 1 and 2 are joined at node X, which creates a
branch between X and Y. C: next, OTU 3, which lies close to Y, is separated from
the remaining OTUs, and a further branch is created at a new node Z. The process
continues until only a single OTU remains to be joined.
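
The text does not give the NJ formulas, so the sketch below uses the commonly cited formulation as an assumption: for n OTUs with row sums r(i), the pair minimizing Q(i,j) = (n - 2) d(i,j) - r(i) - r(j) is joined, and the branch from i to the new node has length d(i,j)/2 + (r(i) - r(j)) / (2(n - 2)). The function name and the five-OTU distance matrix are hypothetical illustration data, and only the first joining step is shown.

```python
def nj_first_join(names, matrix):
    """Return (otu_i, otu_j, q_value, branch_i, branch_j) for the first NJ step."""
    n = len(names)
    row_sums = [sum(row) for row in matrix]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * matrix[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    d_ij = matrix[i][j]
    # The two branches to the new node may differ in length (no clock assumed).
    branch_i = d_ij / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    branch_j = d_ij - branch_i
    return names[i], names[j], q, branch_i, branch_j


names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
first_join = nj_first_join(names, matrix)
print(first_join)
```

In this example the joined pair is not the pair with the smallest raw distance (d and e), and the two new branches have unequal lengths, which is where NJ departs from UPGMA.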

10.2.2. Character-based methods


In this approach the algorithm generates all possible trees for the input sequences
and then searches among them for the tree that best satisfies a given criterion.
The drawback is the time required, because the number of possible trees grows very
rapidly as the number of sequences increases. Whereas a distance-based calculation
takes only a few seconds, a character-based analysis may take minutes or longer,
depending on the number and length of the sequences. The most widely used methods
in this group are MP (Maximum Parsimony) and ML (Maximum Likelihood). Their common
advantage is that they infer the common ancestors (the sequences at the internal
nodes of the tree) and lose no information, because they work directly on the
multiple sequence alignment.
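
To show how fast the number of trees grows, the sketch below counts strictly bifurcating trees with the standard double-factorial formulas: (2n - 5)!! unrooted trees and (2n - 3)!! rooted trees for n sequences. The function names are assumptions made for illustration.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences: (2n - 5)!!"""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    total = 1
    for k in range(3, 2 * n - 4, 2):  # 3, 5, ..., 2n - 5
        total *= k
    return total


def count_rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences: (2n - 3)!!"""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    total = 1
    for k in range(3, 2 * n - 2, 2):  # 3, 5, ..., 2n - 3
        total *= k
    return total


for n in (4, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Ten sequences already give about two million unrooted trees, which is why practical programs search the tree space heuristically rather than enumerating every tree.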
The MP (Maximum Parsimony) method
MP starts from a multiple sequence alignment. The number of sequences should not be
too large, and the sequences must be highly similar (both in length and in degree
of divergence). After alignment, the informative sites are selected for analysis.
An informative site is not a conserved position: it must show at least two
different characters, each of which occurs in at least two of the aligned
sequences. Next, the best tree topology is determined for each informative site.
This involves proposing different topologies, evaluating them, and selecting the
one or more trees that require the fewest changes (parsimony).
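
The selection of informative sites can be sketched as follows. The function name `informative_sites` and the four toy sequences are assumptions made for illustration. A column is reported when at least two different characters each occur in at least two sequences, and gaps are ignored.

```python
from collections import Counter


def informative_sites(alignment):
    """Return the 0-based columns that are parsimony-informative."""
    sites = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        # Characters that are shared by at least two sequences.
        shared = [char for char, k in counts.items() if k >= 2]
        if len(shared) >= 2:
            sites.append(col)
    return sites


# Toy alignment (hypothetical) of four sequences.
alignment = [
    "AAGAGTGCA",
    "AGCCGTGCG",
    "AGATATCCA",
    "AGAGATCCG",
]
print(informative_sites(alignment))
```

Invariant columns and columns in which a character occurs in only one sequence are skipped, because they require the same number of changes on every topology and so cannot favor one tree over another.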

Figure 60. Informative sequence regions


MP does not give branch lengths; it gives only the branching order. For DNA
sequences, MP analysis can be carried out with PAUP, MOLPHY, Phylo_win, or programs
of the Phylip package such as DNAPars and DNAPenny. For protein sequences, PAUP,
MOLPHY, Phylo_win, or the protein parsimony program of the Phylip package
(ProtPars) can be used.
ML (Maximum Likelihood)
This method uses a probabilistic model to find the tree. It analyzes every position
of the multiple sequence alignment and therefore demands substantial computing
resources. The program first assumes a particular evolutionary model. It then
analyzes the sequences and, using the parameters of this model, calculates a
substitution matrix in order to find the tree with the highest probability given
the data and the assumed model. The method is generally more accurate than the
other methods.
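
The likelihood principle can be illustrated on the smallest possible problem: two sequences joined by a single branch. The sketch below assumes the Jukes-Cantor (JC69) model, which the text does not prescribe, and finds the branch length that maximizes the likelihood by a simple grid search. The function names and the site counts are hypothetical. Real ML programs optimize the topology and all branch lengths together.

```python
import math


def jc69_log_likelihood(n_sites, n_diff, t):
    """Log-likelihood of branch length t for two sequences under JC69."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # Each site also carries the base frequency 1/4 of the first sequence.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def ml_branch_length(n_sites, n_diff, step=0.0001, t_max=2.0):
    """Grid search for the branch length with the highest likelihood."""
    best_t = None
    best_ll = -math.inf
    for k in range(1, int(t_max / step) + 1):
        t = k * step
        ll = jc69_log_likelihood(n_sites, n_diff, t)
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t


# Hypothetical data: 1000 aligned sites, 100 of which differ.
t_hat = ml_branch_length(1000, 100)
print(round(t_hat, 4))
```

For this two-sequence case the maximum coincides with the analytical Jukes-Cantor distance, -(3/4) ln(1 - 4p/3) with p = 0.1, which serves as a check on the sketch.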
Advantages:
- Can build accurate trees from distantly related sequences (sequences that have
evolved for a long time since their common ancestor).
- Shows lower variance than other methods.
- The assumptions of the evolutionary model can be adjusted.
- Can be applied to very short sequences, which is not possible with other methods
such as distance methods or MP.
- Evaluates many alternative tree topologies using rigorous statistical methods.
- Uses all the information contained in the sequences.
Disadvantages:
- Requires a powerful computer and long analysis times.
- The result depends on the evolutionary model used.
The following diagram gives a general guide to choosing a phylogenetic analysis
method.

10.3. Choosing an evolutionary model


As noted above, distance-based methods often reflect the true evolutionary distance
inaccurately and incompletely, so the distances have to be corrected when these
methods are used. Choosing an evolutionary model that suits all the objects under
analysis is essential. Choosing a model is not simple, however, so researchers
usually try a particular model first and then select the tree that fits the data
best.
Compared with UPGMA, NJ has the advantage that it does not impose a single
molecular clock; in other words, it does not assume that all branches evolve at the
same rate.
10.4. Evaluating the phylogenetic tree
One of the most common ways to check the reliability of a tree is to use the
bootstrap or the jackknife. The bootstrap value indicates how strongly the data
support a given branch under the chosen evolutionary model and tree-building
method.
For sequences that have diverged to some degree, the optimal alignment is often
hard to determine, especially when many nucleotide positions have undergone
multiple substitutions. After alignment, regions of the aligned sequences are
selected at random for analysis (Figure 61).

Figure 61. Randomly selected sequence regions


During random selection, a given region may be chosen more than once. The selected
regions are then analyzed again to produce different trees. The process is repeated
many times (from 100 to 10,000 times); 100 replicates are usually enough for a
statistical assessment. A bootstrap value above 70% of the replicates is considered
to correspond to a statistical confidence above 95%. The bootstrap value gives no
information about the extent or nature of the tree; rather, it helps the researcher
decide whether to keep or change the evolutionary model or the method of analysis.
The jackknife is another way to assess the reliability of a tree. Its principle is
similar to the bootstrap, except that each region can be selected at most once.
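
The resampling step can be sketched as follows. All names and the toy alignment are assumptions made for illustration. To keep the example self-contained, tree building is replaced by a deliberately crude stand-in, `closest_pair`, which only reports the two most similar sequences. In a real analysis each replicate would be passed to the chosen tree-building method, and the support of a branch would be the fraction of replicate trees that contain it.

```python
import random


def bootstrap_columns(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_cols = len(alignment[0])
    picked = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in picked) for seq in alignment]


def jackknife_columns(alignment, rng, keep_fraction=0.5):
    """Draw a subset of columns without replacement (each at most once)."""
    n_cols = len(alignment[0])
    picked = sorted(rng.sample(range(n_cols), int(n_cols * keep_fraction)))
    return ["".join(seq[c] for c in picked) for seq in alignment]


def closest_pair(alignment):
    """Stand-in for tree building: indices of the two most similar sequences."""
    best = None
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            mismatches = sum(1 for x, y in zip(alignment[i], alignment[j])
                             if x != y)
            if best is None or mismatches < best[0]:
                best = (mismatches, i, j)
    return best[1], best[2]


def bootstrap_support(alignment, pair, replicates=100, seed=1):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(replicates)
               if closest_pair(bootstrap_columns(alignment, rng)) == pair)
    return hits / replicates


alignment = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGAACGTACGT",
    "TCGAACCTACGTTCGTAGGT",
    "TCGAACCTTCGTTCGAAGGA",
]
support = bootstrap_support(alignment, (0, 1))
print(support)
```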
10.5. Rooting the phylogenetic tree
An important step in building a phylogenetic tree is determining its root. Two
methods are commonly used: midpoint rooting and rooting with an outgroup.
In the first method, as its name suggests, the root is placed at the midpoint of
the path between the two most distant OTUs, here the path from A to E. The method
is appropriate if the evolutionary rates of the OTUs are equal or comparable. It is
also useful when the tree is fairly balanced, with groups of closely related
branches separated by one long branch. It becomes inaccurate, however, when the
rates of change differ greatly among the OTUs. Placing the midpoint is also
difficult in a region that contains many short branches. In such cases an outgroup
can be used for rooting.
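
Midpoint rooting can be sketched on a small unrooted tree stored as a list of branches. The function names, the internal node names X, Y, Z, and the branch lengths are hypothetical illustration data, chosen so that A and E are the most distant OTUs as in the description above.

```python
def build_tree(edges):
    """Turn (node, node, length) triples into a symmetric adjacency dict."""
    tree = {}
    for u, v, length in edges:
        tree.setdefault(u, {})[v] = length
        tree.setdefault(v, {})[u] = length
    return tree


def path_between(tree, start, goal):
    """Return the unique path of nodes from start to goal in the tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in tree[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")


def path_length(tree, path):
    return sum(tree[u][v] for u, v in zip(path, path[1:]))


def midpoint_root(tree, leaves):
    """Return (node_u, node_v, offset): the root lies on branch u-v, offset from u."""
    longest = max((path_between(tree, a, b)
                   for i, a in enumerate(leaves) for b in leaves[i + 1:]),
                  key=lambda p: path_length(tree, p))
    half = path_length(tree, longest) / 2
    walked = 0.0
    for u, v in zip(longest, longest[1:]):
        if walked + tree[u][v] >= half:
            return u, v, half - walked
        walked += tree[u][v]


tree = build_tree([("A", "X", 4), ("B", "X", 1), ("X", "Y", 1), ("C", "Y", 1),
                   ("Y", "Z", 2), ("D", "Z", 1), ("E", "Z", 5)])
root = midpoint_root(tree, ["A", "B", "C", "D", "E"])
print(root)
```

Here the path from A to E has length 12, so the root is placed 6 units from either end, which falls on the branch between Y and Z.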


Figure 62. Rooting by placing the midpoint


The second method is more accurate but cannot always be applied, because it
requires an outgroup: a taxon that branched off before the last common ancestor of
all the other taxa under study (the ingroup). Unlike midpoint rooting, this method
requires choosing an OTU that is distant enough to fall outside all the OTUs of the
ingroup, yet close enough to be sure that it shares a common ancestor with them.
In Figure 62, for example, the kangaroo is used as the outgroup because it is a
mammal that diverged from the ancestor of all placental mammals. In this case
midpoint rooting gives the wrong result, because the rodent branch (mouse, rat, and
their ancestor) has evolved faster than the other OTUs of the tree, possibly
because of its relatively short generation time.

Figure 62. Comparison of the two methods for rooting a phylogenetic tree


There are also cases in which an outgroup cannot be used. To date, for example, the
branching order of archaea, bacteria, and eukaryotes has not been established; it
is only accepted that these three major groups diverged from a common ancestor
(Figure 63). No outgroup is suitable here, because any outgroup would constitute a
fourth branch of life. Since no fourth branch exists, a root cannot be chosen for
the tree of archaea, bacteria, and eukaryotes.

Figure 63. The three domains of life

Summary of chapter 10
A phylogenetic tree is a diagram that describes the evolutionary relationships
among organisms. Depending on the type of data and the method of analysis, a tree
can be built from morphological data, distribution, physiological and biochemical
traits, or nucleotide and amino acid sequences. A tree should show the evolutionary
relationships and branching order of the operational taxonomic units (OTUs) and the
amount of change or the evolutionary time. Morphological, distributional, and
physiological or biochemical data often do not fully reflect the extent of
evolution. Today biological sequence data are widely used to assess and build
phylogenetic trees.
When building a tree from sequence data, attention must be paid to the choice of
sequence, the evolutionary model, the molecular clock, and the method used to test
the accuracy of the tree. Depending on how similar or divergent the sequences are,
either a distance-based or a character-based method can be chosen. UPGMA and NJ are
the usual methods of the first group; the second group includes ML and MP. A tree
can be rooted by midpoint rooting or with an outgroup. An outgroup is usually more
effective, but it must be closely related to all the taxa analyzed and yet separate
from them as a distinct group. On this basis the taxa can be placed in their
correct evolutionary positions.
In theory any biological sequence can be used in phylogenetic analysis. In most
cases, however, a species-level tree inferred from a few gene sequences will depart
from the true species tree. To ensure accuracy, the DNA or protein sequence must be
characteristic of the species and must evolve at a rate suited to all the objects
analyzed together. The sequences commonly used today include ribosomal RNA genes
(16S, 18S, 5.8S), ITS, COI, Cyt, RuBP, and RNA polymerase.
Review questions for chapter 10
1. What is a phylogenetic tree? Why are phylogenetic trees built?
2. What is an operational taxonomic unit (OTU) in a phylogenetic tree?
3. Which types of data can be used to build a phylogenetic tree? Discuss the
advantages and disadvantages of each type.
4. How many methods are there for building phylogenetic trees? Discuss the
advantages and disadvantages of each method.
5. What is the molecular clock? What is the significance of applying a molecular
clock in phylogenetic analysis?
6. Why must the same evolutionary model be applied when building a phylogenetic
tree?
7. What is an outgroup? State the characteristics of an outgroup. Why is an
outgroup needed in phylogenetic analysis?
8. What is the evolutionary rate of a biological sequence? Why must sequences with
comparable evolutionary rates be chosen when building a phylogenetic tree?
9. What are a gene tree and a species tree? When can a gene tree be used to infer
the species tree?
10. What is the bootstrap? On what basis is the accuracy of a phylogenetic tree
tested?
