Markovian Structures Liu1999

Markovian Structures in Biological Sequence Alignments
Author(s): Jun S. Liu, Andrew F. Neuwald and Charles E. Lawrence

Source: Journal of the American Statistical Association, Vol. 94, No. 445 (Mar., 1999), pp. 1-15
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2669673 .
Accessed: 14/06/2014 07:52
Markovian Structures in Biological Sequence Alignments
Jun S. LIU, Andrew F. NEUWALD, and Charles E. LAWRENCE
The alignment of multiple homologous biopolymer sequences is crucial in research on protein modeling and engineering, molecular evolution, and prediction in terms of both gene function and gene product structure. In this article we provide a coherent view of the two recent models used for multiple sequence alignment (the hidden Markov model (HMM) and the block-based motif model) to develop a set of new algorithms that have both the sensitivity of the block-based model and the flexibility of the HMM. In particular, we decompose the standard HMM into two components: the insertion component, which is captured by the so-called "propagation model," and the deletion component, which is described by a deletion vector. Such a decomposition serves as a basis for rational compromise between biological specificity and model flexibility. Furthermore, we introduce a Bayesian model selection criterion that, in combination with the propagation model, genetic algorithm, and other computational aspects, forms the core of PROBE, a multiple alignment and database search methodology. The application of our method to a GTPase family of protein sequences yields an alignment that is confirmed by comparison with known tertiary structures.

KEY WORDS: DNA sequence; Evolution; Gibbs sampler; GTPase; Hidden Markov model; MAP criterion; Model selection; Protein sequence; Sequence comparisons.
Jun S. Liu is Assistant Professor, Department of Statistics, Stanford University, Stanford, CA 94305. Andrew F. Neuwald is Assistant Investigator, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724. Charles E. Lawrence is Chief of Biometrics Lab, Wadsworth Center for Laboratories and Research, New York State Department of Health, Albany, NY 12201. This work was supported in part by Department of Energy grant DE-FG02-96ER62266, National Institutes of Health grant R01 HG01257-01, National Science Foundation grants DMS-9404344 and DMS-9501570, and the Stanford Terman Fellowship. The authors are grateful to Lee Ann McCue and Ye Ding for proofreading the manuscript and to the editor, associate editor, and two referees for their many valuable suggestions.

© 1999 American Statistical Association. Journal of the American Statistical Association, March 1999, Vol. 94, No. 445, Applications and Case Studies.

1. INTRODUCTION

All of the hereditary information of an individual organism is contained in its genome, which comprises sequences of the four DNA bases (nucleotides), A, T, C, and G. Proteins, chains of 20 different amino acid residues, are the action molecules of life and are "spelled" (coded) by segments of the genome, called genes. The universal genetic code is used to translate triplets of DNA bases, called codons, to the 20-letter alphabet of proteins (Campbell 1995). For example, codons CCA and CCG are both translated into the amino acid proline (abbreviated as Pro or P). The biotechnology revolution and many genome sequencing projects have resulted in large and rapidly growing databases of DNA sequences. A rapidly growing database of protein sequences has been derived from the DNA sequences using the universal genetic code. Both are available over the Internet (e.g., http://www.ncbi.nlm.nih.gov).

Because DNA and proteins are unbranched heteropolymers, they can be characterized by sequences of letters representing the monomers that form them. Accordingly, the data in these databases are sequences of letters using p-letter (p = 4 for DNA, p = 20 for proteins) alphabets without punctuation or space characters. Table 1 shows typical protein sequences of two GTPases, whose structure and sequence comparison is provided in Section 6. Can we tell if, and if so how, they are related?

Computational molecular biology, which emerged about 20 years ago, focuses mainly on the analysis of such data. Recently this field has been the subject of great interest in the biotechnology and pharmaceutical industries (Marshall 1996; Taubes 1996).

1.1 Bioinformatics and Sequence Alignment

The most important contribution of computational biology has been in the development of methods for extracting information from the biopolymer sequence databases via sequence comparison, characterization, and classification, tasks that interest many statisticians. Sequence alignment methodology is central to all of these methods.

It is commonly believed that today's biopolymer sequences evolved from ancestral sequences through mutation and selection. Evolutionary theory holds that stochastic mutational events may alter the genome of an individual and that these changes may be passed to progeny. The likelihood that a given mutation is maintained through generations is determined by its contributions to the fitness of the progeny via a stochastic process called natural selection. At the molecular level, the effect of a mutation on the structure and/or function of a gene product determines the mutation's contribution to the organism's fitness. Therefore, sequence comparison methods help reveal information about biopolymer structure and function, as well as the biological process of molecular evolution.

To aid in the understanding of sequence alignment, let us consider an intentionally oversimplified example. Imagine writing the sentence

Many of our friends love statistics jokes

and asking three children to copy it. You might then obtain three "noisy copies":

Mamy of yous fryers need longer spokes
Mony of your own stripeded lovers are nicest
Monkeys of ours friendleys have stinking smokes.

By showing these "noisy" copies to your friends and asking them to guess what you originally wrote, you

may make an entertaining game. By comparing the noisy sentences, your guests may be able to identify "essential" parts of the original sentence that have been conserved even though the children's transcriptions contain errors. There are not only typographical errors and misspellings but also inserted or deleted letters and entire words. The following table shows an alignment of the noisy copies:

Mamy of(y)ous fryers (need)long (s)pokes
Momy of(y)iur (owns)triped(ed) love(rs are) nices(t)
Monk(eys)of ovr(s) friend(leys) have (stinking s)mokes

Here the letters in parentheses are noisy insertions or deletions from one or a few sentences. From the alignment, one may guess several words, including "many," "of," "our," "love," and even perhaps "jokes" and "friends." On the other hand, some words in the original sentence (e.g., "statistics") have been deleted in the copied sentences. Some sentences have words that are not present in any of the other sentences. Nevertheless, game players may be able to infer the main theme of the original sentence. More generally, an alignment problem involves three interrelated tasks: (a) identification of the models (e.g., parameters for letter frequencies) at aligned (conserved) positions, (b) word alignment, and (c) determination of the extent to which common features are conserved in the sentences.

This simple example illustrates a rough approximation to biological reality and the possibility of obtaining important information by comparing related biopolymer sequences. However, the biopolymer alignment problem is much more complicated than the foregoing game. As shown in Table 1, biopolymer sequences lack known rules of grammar, have only a small vocabulary of known "words," contain no blank or punctuation characters, and are unpredictable in many ways. More seriously, biological sequences available for analysis do not at all evolve down independent pathways from a single progenitor, as in the foregoing game. Rather they evolve through generations of progeny. This process can be represented by an evolutionary tree that is rarely observable.

Some methods that incorporate the evolutionary process to align pairs of sequences have been described (Allison, Wallace, and Yee 1992; Bishop and Thompson 1986; Thorne, Kishino, and Felsenstein 1991, 1992). More recently, Zhu, Liu, and Lawrence (1998) proposed a Bayesian alignment procedure that produces the posterior distribution of the evolutionary distances between pairs of sequences. For multiple sequences, however, the inferences on alignments and on phylogenetic trees are interdependent, and each has been shown to be NP-hard. Because of this inherent computational complexity and other reasons, efforts to simultaneously address both problems (i.e., tree alignment methods) have been limited, and attention has been focused on solving the problems separately. In multiple alignment, the focus of this article, two heuristic approaches, weighting and purging, have been used to address sequence correlations induced by the evolutionary process. In various weighting methods, similar sequences are down-weighted to account for their evolutionary closeness. Alternatively, the purging method that we use here removes closely related sequences to achieve an approximate independence for those remaining sequences.

Databases of biopolymers that have been experimentally shown to have related structures [including SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/), MMDB (http://www.ncbi.nlm.nih.gov/Structure/), and DALI (http://www.embl-heidelberg.de/dali/dali.html)] and functions [such as Swiss-Prot (http://expasy.hcuge.ch/sprot/sprot-top.html)] provide a basis for examining the utility of methods aimed at predicting these characteristics. In contrast, insufficient data are available for a similar examination of the methods for inferring molecular evolution. Although theory for simultaneously addressing evolution and multiple sequence alignment is important and much needed, we find that our methods, based on the assumption of independence after purging, often work well even when the data show substantial departure from such an assumption (Liu, Neuwald, and Lawrence 1995; Neuwald, Liu, Lipman, and Lawrence 1997; Qu and Lawrence 1998). Independently, Henikoff, Henikoff, Alford, and Pietrokovski (1995) have shown that these methods work well in conjunction with heuristic weighting procedures.

Table 1.

H-Ras P21 Protein (Protein Data Base Accession: 121P)
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQ
YMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIP
YIETSAKTRQGVEDAFYTLVREIRQH

Elongation Factor Tu (EF-TU, Swiss Prot Accession: P02990)
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEY
DTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDM
VDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALEGDAEWEAKILELAGFLDSYIPEPERAIDKP
FLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLR
GIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEM
VMPGDNIKMVVTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

1.2 Traditional Approach and New Statistical Models

In the traditional routine for comparing two sequences, a heuristic criterion for the goodness of the alignment is selected and fixed, an efficient algorithm is designed to optimize such a criterion, and finally large deviation theory is applied to assess the statistical significance of such alignments. Popular methods for comparing a pair of sequences have been given by Needleman and Wunsch (1970) and Smith and Waterman (1981), and the methods for searching the database to find a sequence related to the query

sequence were developed by Altschul, Gish, Miller, Myers, and Lipman (1990) and Pearson and Lipman (1988). The first statistical approach to sequence alignment was given by Bishop and Thompson (1986) (see Karlin and Brendel 1992 and Waterman 1995 for more references).

These pairwise comparison methods have helped in many recent biological discoveries. For example, application of a pairwise sequence alignment method played a key role in the identification and characterization of a recently discovered human cancer gene (Bronner et al. 1994). However, for aligning multiple biopolymer sequences, the pairwise comparison methods have limitations in efficiency and accuracy that are particularly pronounced when sequences have many typographical errors, misspellings, insertions, and deletions; that is, when the sequences are subtly related. The rapid growth of the sequence databases has also begun to reduce the utility of these pairwise methods. Specifically, after adjusting for the large number of multiple comparisons, the comparison scores obtained by chance from random sequences are creeping into the range of the comparison scores for truly related sequences (Claverie 1996; Henikoff and Henikoff 1991).

Two statistical models for multiple alignment have recently been developed: the block-motif model, which describes conserved regions in protein or DNA sequences as ungapped blocks (Lawrence and Reilly 1990; Lawrence et al. 1993; Liu 1994; Liu et al. 1995; Neuwald, Liu, and Lawrence 1995), and the hidden Markov model (HMM), which treats the observed sequences as though they were generated by a hypothetical ancestral model via mutation (Baldi, Chauvin, McClure, and Hunkapiller 1994; Eddy 1995; Krogh, Brown, Mian, Sjolander, and Haussler 1994). By using a model similar to the HMM to describe how two sequences relate to each other, Allison and Wallace (1994) presented useful algorithms for conducting multiple alignment considering information on evolution (i.e., assuming that the evolutionary tree is known). Important common features of these methods are that they use explicit statistical models and they treat the multiple alignment problem as a statistical inference problem. These statistical algorithms, which are reviewed and analyzed in Section 2, have addressed the alignment tasks (a) and (b) mentioned in Section 1.1. However, despite its critical importance, the model selection problem inherent in multiple sequence alignment, task (c), has received scant attention (Lawrence et al. 1993). To redress this, an approximate Bayesian model selection procedure is described in Section 4. This procedure, in combination with an improved alignment algorithm (described in Section 3), is a key feature of PROBE, a multiple alignment and database search methodology (Neuwald et al. 1997).

1.3 Modeling the Sequence Alignment

Current biopolymer sequences are believed to have arisen from a common ancestral DNA sequence through evolution. This evolutionary process consists of two types of events: point mutations and recombinations. Like typographical errors, point mutations change the identity of the bases in the sequence, whereas recombinations yield insertions and deletions that misalign the sequences (Lawrence and Reilly 1996). Over time, these mutational events produce numerous families of related protein or DNA sequences that may be responsible for several different but related functions. In addition, proteins that perform the same function in different species may evolve substantially. The analyses of these sequence data hinge on aligning the sequences to discover relationships. Because of biological constraints of life, natural selection eliminates mutations in those portions of biopolymers that play key roles in structure or function. These constraints provide alignment information even when the sequences are evolutionarily distant (Liu et al. 1995).

Four types of recombination events are possible:

* Segments of the gene may be deleted.
* Extra segments may be inserted.
* Segments may be duplicated.
* Segments may be transposed (i.e., some segments of DNA cut from their original locations and inserted into sites at new locations).

Within genes and adjoining regions, insertions and deletions are the most common events, and transpositions and duplications are less common. Because duplications can be dealt with by using methods of Liu et al. (1995) and Neuwald et al. (1995), and because transpositions are very rare, we assume throughout that the latter two recombination events can be safely ignored. Under this assumption, a powerful recursive relationship comes to bear. This recursive relationship forms the basis of popular dynamic programming algorithms for the alignment of a pair of sequences and is also key to the HMM. Clever algorithms with favorable time and space complexity for solving the combinatoric optimization problem associated with pairwise sequence alignments have been developed (Needleman and Wunsch 1970; Smith and Waterman 1981).

These alignment algorithms provide for flexibility in alignment by permitting insertions or deletions between all residues of a sequence. However, a large number of parameters and an associated loss of sensitivity is the price paid for maintaining this flexibility. The inclusion of gap penalties (i.e., penalizing random insertions) into the alignment objective function ameliorated this problem somewhat. Except in cases where there is substantial prior knowledge about the number of gaps and the size of the conserved blocks, the HMM still lacks sensitivity when the number of the sequences under analysis is small and/or the sequences are only subtly related. In fact, when there are fewer than 30 sequences to be aligned, traditional pairwise comparison-based methods tend to outperform the existing HMM algorithms (Sonnhammer, Eddy, and Durbin 1996). A further difficulty with the HMM approach is that there are no well-founded criteria for determining the alignment model (e.g., how many model positions to be included and what penalty terms to use) for a particular problem and no clear account of uncertainty associated with the particular alignment resulting from the algorithm.

Block motif-based Gibbs sampling strategies have the ability to align subtly related sequences even when the number of sequences available for analysis is limited (Lawrence et al. 1993; Neuwald et al. 1995). They achieve this added sensitivity through two basic characteristics of functionally related proteins:

* Point mutations and recombinations tend to be limited in functionally or structurally conserved regions of distantly related proteins (Liu et al. 1995). To capitalize on this observation, these strategies limit the alignment to ungapped blocks (called block motifs) of the sequences, and in so doing greatly reduce the number of free alignment variables.
* Structural and functional constraints are particularly strong on a limited number of key residue positions, which accordingly are more conserved.

A separate sampling step, fragmentation, enables the algorithm to focus on those more conserved positions (Liu et al. 1995). This step further reduces the number of free parameters and removes the need to specify widths of the block motifs. However, these block motif-based algorithms lose sensitivity and are slow to converge when the alignment contains more than three or four motifs. Furthermore, because there are no well-founded criteria for selecting the number of motifs and the number of conserved positions, these values must be specified by the user.

In this article we present ideas to combine HMM with the block motif-based approaches and to address the shortcomings of both methods. In Section 2 we briefly review the single block-motif model and the associated Gibbs sampling algorithm and provide an analysis of the standard HMM for sequence alignment. In Section 3 we describe a model designed to capture the spirit of both approaches, which also contains a generalization of the model for flexible incorporation of deletions. In Section 4 we propose a Bayesian procedure for selecting the number of alignment variables and the number of residue frequency terms to be included in the model. Finally, in Section 5 we describe the application of these methods to enzymes that hydrolyze guanine triphosphate (GTPases).

2. TWO MODELS FOR MULTIPLE ALIGNMENT

For vectors $v = (v_1, \ldots, v_p)^T$ and $\theta = (\theta_1, \ldots, \theta_p)^T$, we use the following notations throughout the rest of the article:

$$|v| = |v_1| + \cdots + |v_p|,$$
$$v + \theta = (v_1 + \theta_1, \ldots, v_p + \theta_p)^T,$$
$$v/\theta = (v_1/\theta_1, \ldots, v_p/\theta_p)^T,$$
$$\theta^v = \theta_1^{v_1} \cdots \theta_p^{v_p},$$
$$\Gamma(v) = \Gamma(v_1) \cdots \Gamma(v_p),$$

and

$$v! = v_1! \cdots v_p! \quad \text{(if the } v_i \text{ are integers)}.$$

The notation $R$ or $R_k$ denotes a single observed biopolymer sequence, where $R = (r_1, \ldots, r_n)$ and $R_k = (r_{k,1}, \ldots, r_{k,n_k})$ with r's as residues. $\mathbf{R}$ denotes a collection of multiple sequences, $R_1, \ldots, R_K$, each written as a row vector. So we can write $\mathbf{R} = (R_1, \ldots, R_K)^T$.

The counting function $h(\cdot)$, whose argument is a set of residues, counts how many of each residue there are in a set of residues (or base pairs). For example, if $R$ is a protein sequence, then $h(R)$ returns a vector of length 20 with counts of each type of amino acid in $R$. Symbols $\theta_0$, $\theta_j$, and $\Theta$ represent the model parameters for the underlying probability laws (e.g., multinomial or product multinomial distributions) to generate every residue in the sequence. An ungapped segment in a biopolymer sequence that is believed to be conserved across functionally or structurally related sequences is termed a motif element, or simply an element. The word "motif" is used to describe the residue frequency pattern for these motif elements among multiple sequences. Mathematically, a motif is determined by its underlying product multinomial model (Liu et al. 1995).

2.1 The Block-Motif Model and Its Alignment

A characteristic view of this approach is that certain segments of the biopolymers (i.e., subsequences critical to the biopolymers' structure or function) tend to be conserved against mutations. Because this conservation protects against both point mutations and recombinations, sequence conservation in distantly related biopolymers presents itself in the form of sets of ungapped subsequences (blocks). To capture this basic biological concept, we use a simple stochastic model, the block-motif model, as the probabilistic mechanism to generate the set of homologous biopolymer sequences. This model's graphical representation is as follows:

[Figure: sequence k drawn as a line containing a shaded motif block of width w.]

With this model, each sequence $R_k$ ($k = 1, \ldots, K$) contains only one occurrence (called an element) of a single motif, as illustrated by the shaded block, with a starting position $a_k$. For the set of $K$ sequences, $\mathbf{R}$, we let $A = \{a_1, \ldots, a_K\}$, in which $1 \le a_k \le n_k - w + 1$. We call $A$ the alignment variable for $\mathbf{R}$. Furthermore, we use $A_{[-k]}$ to denote all of $A$ but $a_k$, use $A + i = \{a_1 + i, \ldots, a_K + i\}$ to denote the set of $i$-shifted positions of $A$, and use $\{A\} = \{a_k + j - 1 : k = 1, \ldots, K,\ j = 1, \ldots, w\}$ to represent the set of residue indices occupied by the motif elements with alignment variable $A$. For any set $C$ of indices, $\mathbf{R}_C$ represents the collection of the residues indexed by elements of $C$. For example, given any alignment variable $A$, we have $\mathbf{R}_{\{A\}} = \{r_{k,a_k+j-1} : j = 1, \ldots, w;\ k = 1, \ldots, K\}$.

Residues out of the conserved motif element are treated as iid observations from a common multinomial distribution, called the background model, with $p$ (equals 20 for proteins and 4 for DNA or RNA) categories, which can be represented by the probability vector $\theta_0 = (\theta_{1,0}, \ldots, \theta_{p,0})^T$, where $\theta_{1,0} + \cdots + \theta_{p,0} = 1$ and $\theta_{i,0} > 0$ for all $i$.
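
To make the notation above concrete, the following is a minimal Python sketch of the counting function h(.) and of the index set {A} for a single motif of width w. It assumes 1-based starting positions a_k, as in the text; the function names (`h`, `motif_column_counts`, `background_counts`) and the toy DNA sequences are illustrative assumptions and are not taken from the PROBE software.

```python
from collections import Counter

DNA = "ACGT"  # p = 4; a 20-letter string would be used for proteins


def h(residues, alphabet=DNA):
    """Counting function h(.): counts of each alphabet letter, in alphabet order."""
    counts = Counter(residues)
    return [counts.get(letter, 0) for letter in alphabet]


def check_alignment(sequences, starts, width):
    """Enforce 1 <= a_k <= n_k - w + 1 for every sequence."""
    for seq, a in zip(sequences, starts):
        if not 1 <= a <= len(seq) - width + 1:
            raise ValueError(f"start {a} invalid for length {len(seq)}, width {width}")


def motif_column_counts(sequences, starts, width, alphabet=DNA):
    """Return h(R_{A+j-1}) for j = 1..w: one count vector per motif column."""
    check_alignment(sequences, starts, width)
    return [
        h([seq[a - 1 + j] for seq, a in zip(sequences, starts)], alphabet)
        for j in range(width)
    ]


def background_counts(sequences, starts, width, alphabet=DNA):
    """Return h(R_{{A}^c}): counts over all residues outside the motif elements."""
    check_alignment(sequences, starts, width)
    outside = []
    for seq, a in zip(sequences, starts):
        outside.extend(seq[: a - 1])          # residues before the element
        outside.extend(seq[a - 1 + width:])   # residues after the element
    return h(outside, alphabet)


# Toy data (hypothetical): the element "ACG" is planted in each sequence.
toy_sequences = ["TTACGT", "ACGAAA", "GGACGC"]
toy_starts = [3, 1, 3]   # alignment variable A, 1-based
toy_width = 3

columns = motif_column_counts(toy_sequences, toy_starts, toy_width)
background = background_counts(toy_sequences, toy_starts, toy_width)
```

With these toy inputs, each motif column is perfectly conserved, and the motif and background counts together account for all 18 residues. Keeping positions 1-based in the interface and converting to 0-based indices only at the point of slicing is a design choice made here so that the code reads the same way as the constraint on a_k in the text.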

Residues within the motif element are modeled by a product multinomial distribution $PM(\Theta)$ (Liu et al. 1995), where $\Theta = (\theta_1, \ldots, \theta_w)$ and $\theta_j = (\theta_{1j}, \ldots, \theta_{pj})^T$, with $|\theta_j| = \sum_{i=1}^{p} \theta_{ij} = 1$. In other words, the residue in the $j$th position in a motif element is independently generated from the multinomial distribution $\theta_j$. Therefore, a total of $w + 1$ parameter vectors of dimension $p - 1$ are required to fully describe the data.

As discussed by Liu et al. (1995), although these block-motif models are insufficient to characterize a biopolymer (e.g., bases in DNA sequences are known to have serial correlations and G-C rich regions), they do describe the effect of sequence conservation among homologous sequences. The challenging alignment problem corresponds to simultaneously finding the locations of the motif elements and characterizing the residue frequencies in the motifs. As an introduction to our general methodology, we review the single block-motif model and its alignment as treated by Liu et al. (1995).

For any given $A$, we can write the complete-data likelihood function as

$$\pi(\mathbf{R} \mid \theta_0, \Theta, A) \propto \theta_0^{h(\mathbf{R}_{\{A\}^c})} \prod_{j=1}^{w} \theta_j^{h(\mathbf{R}_{A+j-1})} = \theta_0^{h(\mathbf{R})} \prod_{j=1}^{w} \left(\frac{\theta_j}{\theta_0}\right)^{h(\mathbf{R}_{A+j-1})}.$$

Now let the prior for $\theta_0$ be a Dirichlet distribution $D(\alpha)$, where $\alpha = (\alpha_1, \ldots, \alpha_p)$; let the prior for $\Theta$ be a product Dirichlet $D(B)$, with $B = (\beta_j)$, $j = 1, \ldots, w$, and $\beta_j = (\beta_{ij},\ i = 1, \ldots, p)$; and let $A$ be uniform a priori. Then, as was partly derived by Liu (1994) and implemented by Lawrence et al. (1993), we have an explicit form for the conditional posterior predictive distribution $\pi(a_k \mid A_{[-k]}, \mathbf{R})$ by integrating out the parameters $\theta_0$ and $\Theta$, which can be well approximated by

$$\pi(a_k = i \mid A_{[-k]}, \mathbf{R}) \propto \prod_{j=1}^{w} \frac{\hat\theta_{r_{k,i+j-1},\,j}}{\hat\theta_{r_{k,i+j-1},\,0}}, \qquad (1)$$

where the $\hat\theta$ are the posterior means of the $\theta$, given the observed sequence data $\mathbf{R}$ and the current alignment $A_{[-k]}$. This approximate conditional distribution is used in a Gibbs sampling algorithm to do the local alignment and can be applied recursively to align multiple motifs (Lawrence et al. 1993).

2.2 Hidden Markov Model for Sequence Alignment

The HMM, initially introduced in the late 1960s, is a powerful statistical modeling tool that has been widely applied in such areas as signal processing, speech recognition, and time series analysis (Rabiner 1989). The method was first applied to model biological sequences by Churchill (1989) and recently has become very popular in multiple sequence alignment (Baldi et al. 1994; Krogh et al. 1994; Lazareva and Churchill 1997).

The basic form of an HMM can be written as

[Equations (2) and (3) are missing from the extracted text.]

where $f_t$ and $g_t$ are probability distributions (known up to some estimable parameters), the $y_t$ are observations, and the $h_t$ form a (possibly time-inhomogeneous) Markov chain and are unobservable (i.e., hidden). The dynamic linear model (West and Harrison 1989), or the so-called "state-space" model in time series analysis, is a special case of this model.

In the evolution of protein sequences, segment transpositions are rare. Thus, although the sequences become misaligned via insertions and deletions, conserved residues remain in order. By using this characteristic, the HMM not only captures an important feature of protein evolution, but also results in an effective algorithm.

The HMM structure for multiple sequence alignment treats the sequences to be aligned as iid observations from a probabilistic mechanism (i.e., HMM model) that perturbs a hypothetical common "ancestral" model sequence (called "model"), denoted as $M = (M_1, \ldots, M_L)$. Here each $M_l$ is regarded as an abstract residue and is represented by a probability vector $\theta_l$ of length $p$ (4 for DNA sequences and 20 for proteins). When generating biological sequences, the types of perturbations allowed, which are not observable (and thus hidden), are point mutations, insertions, and deletions. Figure 1 illustrates such a model with $L = 3$: A residue in an observed sequence is generated either by some $M_l$ or by an insertion, which is modeled by a probability vector $\theta_0$ of length $p$. (More details on how to generate an observed sequence from $M$ can be found in Baldi et al. 1994 and Krogh et al. 1994.)

Figure 1. The HMM Architecture for Generating Biopolymer Sequences With L = 3 Model Positions. From each state, it can go to the next model position, an insertion, or a deletion.

One can also think of the process of generating a sequence, say $R = (r_1, \ldots, r_n)$, from $M$ as choosing a path through an $(n + 1) \times (L + 1)$ table starting from the upper left corner and ending in the lower right corner (Fig. 2). The columns for this table are denoted by $M_0, \ldots, M_L$, which correspond to a void starting position and $L$ model positions. The rows, $r_0, \ldots, r_n$, correspond to a void starting residue and the $n$ observed sequence residues. The moves that are allowed in this table are of three types: horizontal to the right, vertical down, and diagonal down to the right.

At any position $(i, j)$ of this table, the next step allowed is to (a) position $(i, j + 1)$, which implies that a deletion of model position $M_{j+1}$ has occurred; (b) position $(i + 1, j)$,

which implies that an insertion has occurred; and finally, (c) position $(i + 1, j + 1)$, which means that only a point mutation is allowed. Thus the path depicted by the solid arrows in Figure 2 corresponds to

$$M_0 \to I_0 \to I_0 \to I_0 \to D_1 \to M_2 \to M_3 \to I_3 \to \text{END}.$$

Extra constraints are usually needed to make such paths unique.

Figure 2. The Table-Path Illustration of the HMM. The "ancestral" model sequence is assumed to have four positions, and the observed sequence R is seven residues long. The path in solid arrows presents one particular way of generating the observed sequence R from the model M.

Although the architecture for generating observations described in Figure 1 is easily understood, the meaning of the hidden states for this HMM is more subtle. One may naturally think of the model positions $M_l$ as the hidden states. However, the true hidden states are the allowable paths just described that traverse the $(n + 1) \times (L + 1)$ table and generate from $M$ the observed sequence $R$. More precisely, we let [...] be the hidden states that generate the observed sequence $R$. We now formally define $h_t$ so that the HMM for sequence alignment conforms to the standard form of (2) and (3).

Consider residue $r_1$. Because it must be produced by an insertion or a match state, the path to produce $r_1$ must be one of the following two types: an insertion after $k$ deletions (i.e., $D_1 \cdots D_k I_k$, where $k$ can be 0, 1, ...) or a model position after $k$ deletions (i.e., $D_1 \cdots D_k M_{k+1}$, [...]). Therefore, the hidden state $h_1$ can be abbreviated as $(J_1, \delta_1)$, where $J_1$ is 0 or 1, indicating whether $r_1$ is generated from a model position or an insertion, and $\delta_1$ records the number of deletions that have occurred. If $\delta_1 = L$, then the only choice for $J_1$ is 1, an insertion. Clearly, $h_1$ takes $2L + 1$ possible values for $\delta_1$ ranging from 0 to $L$.

In general, $h_t$ can be written as $(J_t, \delta_t)$, where $J_t$ indicates whether it is an insertion or a match and $\delta_t$ records the total number of deletions that have occurred. With $h_t = (J_t, \delta_t)$ for residue $r_t$, the next hidden state $h_{t+1} = (J_{t+1}, \delta_{t+1})$ for residue $r_{t+1}$ can be one of two types: $J_{t+1} = 0$ and $\delta_{t+1} = \delta_t, \delta_t + 1, \ldots, L$, or $J_{t+1} = 1$ and $\delta_{t+1} = \delta_t + 1, \ldots, L$. For example, the path indicated by solid arrows in Figure 2 represents the following hidden state coding for the sequence:

[Table: the values of $(J_t, \delta_t)$ for residues $r_1, \ldots, r_6$; the entries are garbled in the extracted text.]

The transition probabilities between the $h_t$'s can be explicitly written down using the parameters encoded in the HMM architecture of Figure 1 (Krogh et al. 1994).

3. TOWARD A UNIFICATION: PROPAGATION MODEL

Many alignment problems involve multiple motifs. Although the single block-motif method of Section 2.1 can be applied iteratively in this case, its failure to capture collinear ordering of the motifs makes the method computationally inefficient when more than a few motifs are present (Lawrence et al. 1993; Neuwald et al. 1995). In contrast, the HMM explicitly capitalizes on collinearity to develop efficient recursive algorithms. These models require large numbers of free parameters, however. Specifically, $2LK$ degrees of freedom are associated with the trinomial (i.e., insertion, deletion, and match) alignment parameters, and $n(p - 1)$ degrees of freedom are associated with residue frequency multinomial distributions, where $L$ is the number of model positions, $n$ is the average sequence length, $K$ is the number of sequences, and $p$ is the size of the alphabet. This large parameter space can lead to a lack of sensitivity.

To redress these complementary limitations, we describe a Markovian propagation model that takes the form of a block-motif HMM, but with substantially fewer free parameters. Briefly, in this approach the conserved region among multiple sequences is modeled as a fixed number of ungapped and collinear blocks (multiple motifs) with flexible gaps between them. Residues not assigned to any motif are modeled by a common background multinomial model.

As has been stated by Krogh et al. (1994), the block-motif model can be regarded as a special HMM that allows no insertions or deletions within a motif. The propagation model can also be viewed as a special HMM with a flexible gap (insertion) distribution. However, the application of a general HMM leaves four major issues to be addressed: determining the sizes of the blocks (Sec. 3.3), determining the number of blocks (Sec. 4), determining the number of conserved columns (Sec. 4), and ensuring efficient computation (Secs. 3.2 and 5.3). Furthermore, the block-motif viewpoint gives us a new look at the modeling of biological sequences

and establishes strong connections with mixture modeling and statistical classification methods (Liu et al. 1995).

3.1 The Propagation Model

We begin in a manner similar to HMM by assuming that there are L conserved model positions for each sequence to be aligned, with the only difference being that each model position represents a block of residues with width $w_l$. Intuitively, we can imagine that L motif elements propagate along a sequence. Insertions are reflected by gaps between adjacent motif elements. No deletions are allowed at this point; this issue is addressed in Section 3.4.

Let $A = (A_{\cdot 1}, \ldots, A_{\cdot L}) = (a_{k,l})_{K \times L}$ be a matrix with $a_{k,l}$ indicating the starting position of the lth motif element in sequence k. (Schematic: sequence k, with motif elements starting at positions $a_{k,1}, a_{k,2}, \ldots, a_{k,L}$.)

These alignment variables are unobservable. Let vector $A_{\cdot l} = (a_{1,l}, \ldots, a_{K,l})'$ indicate the starting positions of the lth motif in all the sequences, and let $\Theta^{(l)} = (\theta_1^{(l)}, \ldots, \theta_{w_l}^{(l)})$ denote the parameter vector for the product-multinomial model of the lth motif, where $w_l$ is the width of the lth element. We write $W = w_1 + \cdots + w_L$. The likelihood can be written as

$$\pi(R \mid A, \theta_0, \Theta) \propto \theta_0^{h(R_{\{A\}^c})} \prod_{l=1}^{L} \prod_{j=1}^{w_l} \bigl(\theta_j^{(l)}\bigr)^{h(R_{A_{\cdot l}+j-1})} = \theta_0^{h(R)} \prod_{l=1}^{L} \prod_{j=1}^{w_l} \left(\frac{\theta_j^{(l)}}{\theta_0}\right)^{h(R_{A_{\cdot l}+j-1})}, \qquad (4)$$

where $A_{\cdot l} + j - 1 = (a_{1,l} + j - 1, \ldots, a_{K,l} + j - 1)^T$. Using the same reasoning as in Section 2.1 for the single-motif case, we can integrate over the $\theta$'s to simplify computation. If $w_l = 1$ for all l, then the $A_{\cdot l}$ correspond to those positions such that $J_t = 1$ in Section 2.2, and the model is equivalent to an HMM with no deletions.

We assume a priori that $\theta_0 \sim D(\alpha)$ and $\Theta^{(l)} \sim D(B^{(l)})$ and are independent of A for $l = 1, \ldots, L$, where $B^{(l)} = (\beta_1^{(l)}, \ldots, \beta_{w_l}^{(l)})$. Then the likelihood function of A, with $\theta$'s integrated out, is

$$\pi(R \mid A) \propto \Gamma\{h(R_{\{A\}^c}) + \alpha\} \prod_{l=1}^{L} \prod_{j=1}^{w_l} \Gamma\{h(R_{A_{\cdot l}+j-1}) + \beta_j^{(l)}\}.$$

Let $A_{k\cdot} = (a_{k,1}, \ldots, a_{k,L})$ denote the alignment vector for the kth sequence. Then a Markovian structure for the a priori distribution of $A_{k\cdot}$, conditional on the sequence length $n_k$, can be introduced as

$$\pi_k(A_{k\cdot}) \propto \prod_{l=1}^{L-1} q_l(a_{k,l}, a_{k,l+1}), \qquad (5)$$

where $q_l(x, y) \ge 0$ can be viewed as a penalty function. Jointly, we set the prior of A as $\pi(A) = \prod_{k=1}^{K} \pi_k(A_{k\cdot})$.

Based on the constraint that the motifs not overlap, $q_l(x, y)$ must be 0 whenever $y - x < w_l$. But it can take many forms, such as $1/(y - x)$, $\exp(-c|x - y|)$, $\exp\{-c(x - y)^2\}$, and so on, when $y - x \ge w_l$. The exponential form is most commonly used in alignment literature for its nice mathematical properties that give rise to a fast algorithm. However, an exponential gap penalty may not be suitable for aligning subtle motifs. If we elect not to penalize gaps, then we can set $q_l(x, y) = 1$, for $y - x \ge w_l$, so that

$$\pi_k(A_{k\cdot}) \propto \begin{cases} 1 & \text{if } a_{k,l+1} - a_{k,l} \ge w_l \;\; \forall\, l, \\ 0 & \text{otherwise,} \end{cases}$$

where $a_{k,L+1} = n_k + 1$. This prior induces only a collinearity and nonoverlapping constraint. Because gaps between subtle motifs vary greatly, we feel that this "no penalty" prior is most suitable for our tasks. When the number of motifs is to be determined from the data, this penalty issue becomes more subtle, and we defer it to Section 5.

In the propagation model, we treat the number of sequences K, sequence length $n_k$, number of motifs L, and the motif width $w_l$ as fixed constants instead of random variables. Because formula (5) is given only up to proportionality, to get the actual distribution we need to compute the normalizing constant by summing over all possible values of $1 \le a_{k,1} < \cdots < a_{k,L} \le n_k$ in (5). Although this step is not necessary at present, a similar summation is required for analyzing the posterior distribution of $A_{k\cdot}$. We provide a recursive algorithm for this computation in the next section.

3.2 Forward-Backward Recursion for Predictive Updating

Consider a particular sequence $R_k$. Defining $q_L(x, y) \equiv 1$ and using an argument similar to that of Liu et al. (1995), we can write the crucial conditional predictive distribution for $A_{k\cdot}$ as

$$\pi(A_{k\cdot} = (i_1, \ldots, i_L) \mid R, A_{[-k]}) \propto \prod_{l=1}^{L} q_l(i_l, i_{l+1}) \prod_{j=1}^{w_l} \left(\frac{\hat\theta_j^{(l)}}{\hat\theta_0}\right)^{h(r_{k,i_l+j-1})}, \qquad (6)$$

where $A_{[-k]} = A \setminus A_{k\cdot}$ and the $\hat\theta_j$ are defined as in (1).

The conditional distribution (6) forms the basis of our predictive updating version of the Gibbs sampler (Liu 1994). The algorithm proceeds in two steps: randomly (or systematically) choosing a sequence k, and updating its motif element positions $A_{k\cdot}$ by a draw from distribution (6). To draw $A_{k\cdot}$ from (6), we need to propagate information forward along the sequence and then sample backward.

Let $Q^{(0)} = (1)_{n_k \times n_k}$ and $Q^{(l)} = (q_l(i, j))_{n_k \times n_k}$, where $q_l(i, j)$ are the same as those in (5), and let $u^{(l)} = (u_1^{(l)}, \ldots, u_{n_k}^{(l)})$, where

$$u_i^{(l)} = \prod_{j=1}^{w_l} \left(\frac{\hat\theta_j^{(l)}}{\hat\theta_0}\right)^{h(r_{k,i+j-1})} \quad \text{for } i = 1, \ldots, n_k - w_l + 1,$$
and $u_i^{(l)} = 0$ for $i > n_k - w_l + 1$. Thus $u^{(l)}$ is proportional to the marginal distribution of $a_{k,l}$ without considering other motifs, with given motif parameters at $\hat\theta_0$ and $\hat\Theta^{(l)}$. Formula (6) can be rewritten as

$$p(A_{k\cdot} = (i_1, \ldots, i_L)) \equiv \pi(A_{k\cdot} = (i_1, \ldots, i_L) \mid R, A_{[-k]}) \propto \prod_{l=1}^{L} q_l(i_l, i_{l+1})\, u_{i_l}^{(l)}.$$

To sample from $p(A_{k\cdot})$, we first need to compute the normalizing constant

$$G_0 = \sum_{i_1 < i_2 < \cdots < i_L} \prod_{l=1}^{L} q_l(i_l, i_{l+1})\, u_{i_l}^{(l)}.$$

Let $v^{(0)}(j) = 1$, and for $j = 1, \ldots, n_k$ (the length of the sequence) let

$$v^{(m)}(j) = \sum_{i_1 < \cdots < i_m = j} u_{i_m}^{(m)} \prod_{l=1}^{m-1} q_l(i_l, i_{l+1})\, u_{i_l}^{(l)}, \qquad j = 1, \ldots, n_k; \; m = 1, \ldots, L.$$

Let $v^{(m)} = (v^{(m)}(1), \ldots, v^{(m)}(n_k))$. Then the following recursive relationship holds:

$$v^{(m+1)}(j) = \sum_{k < j} v^{(m)}(k)\, q_m(k, j)\, u_j^{(m+1)},$$

which can be written in a matrix form as

$$v^{(m+1)} = \{v^{(m)} Q^{(m)}\} * u^{(m+1)} \quad \text{for } m = 0, \ldots, L - 1, \qquad (7)$$

where the operation $*$ is defined, for all $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$, as

$$u * v = v * u = (u_1 v_1, \ldots, u_n v_n);$$

and for a matrix $S = (s_{ij})_{n \times n}$,

$$u * S = S * u = (s_{ij} u_j)_{n \times n}$$

and

$$u^T * S = S * u^T = (s_{ij} u_i)_{n \times n}.$$

Finally, we have $G_0 = \sum_{j=1}^{n_k} v^{(L)}(j)$. Hence the marginal distribution of $a_{k,L}$ is $v^{(L)}/G_0$. Furthermore, we can easily derive the conditional distribution of $a_{k,l-1}$ with given $a_{k,l}$.

Thus the random sampling of $A_{k\cdot}$ can proceed recursively as follows: first, draw $a_{k,L}$ from the distribution $v^{(L)}/G_0$; then for given $x_l$, where $x_l$ is a row vector with 0s as entries except for a 1 at the starting position $a_{k,l}$ of the lth motif, draw $a_{k,l-1}$ from the probability vector proportional to $p^{(l-1)} = \{v^{(l-1)}\}^T * Q^{(l-1)} x_l^T$, and denote it by $x_{l-1}$.

The computation required by this forward-backward procedure is $O(n_k^2)$ for a general spacing penalty function $q_l(a, b)$, where $n_k$ is the length of the sequence. But when the spacing function is memoryless (i.e., $q_l(a, b)$ does not depend on l and is exponential in $|b - a|$ or constant), the amount of computation is reduced to $O(n_k)$ (Neuwald et al. 1997).

3.3 Fragmentation and Weighting

In the previous section, the conserved part in each protein sequence was described as a series of L ungapped blocks, each with known length $w_l$. But the exact value of $w_l$ is rarely known; at best, biologists may have some a priori knowledge on its range. In addition, not all of the positions within a block motif are equally important for protein structure and function. For example, some of the residue positions in a motif may be critical to an enzyme's catalytic function, and thus residue types at these positions are highly conserved. On the other hand, there is little conservation of residue types in motif positions that serve only as geometric place holders. Liu et al. (1995) and Neuwald et al. (1995) exploited this characteristic by introducing the fragmentation model in addition to the block-motif characterization. This model allows the aligned columns of a motif to hop stochastically within a neighborhood of their current alignment positions with probability proportional to the degree of conservation in each column as compared to background. We now extend that approach to our propagation model to permit column hopping between motifs, thus removing the requirement of having to specify the number of conserved columns $w_l$ allocated to each block motif.

The key to the fragmentation model is the concept of potential width, $W_1, \ldots, W_L$, with $W_l \ge w_l$, and fragmentation indicator $\Delta_l = (\delta_{l,1}, \ldots, \delta_{l,W_l})$, with $\delta_{l,j} = 1$ or 0 indicating whether the position is regarded as part of the motif or the background. Thus $W_l$ is the potential span of motif l, and $\Delta_l$ is a vector indicating which of the $W_l$ potential positions should be included in the model for motif l. We further let $\Delta = (\Delta_1, \ldots, \Delta_L)$. A graphical representation is shown in Figure 3.

Liu et al. (1995) required that $|\Delta_l| = w_l$. We require only that the total number of columns be constant; that is,

$$\sum_{l=1}^{L} |\Delta_l| = W = \sum_{l=1}^{L} w_l.$$
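The forward-backward recursion of Section 3.2 can be illustrated with a small sketch. It is not the PROBE implementation: the function names, the 0-based positions, and the toy weights are assumptions made for illustration. The forward pass accumulates the sums $v^{(m)}$ as in (7), and the backward pass draws the last motif start first and then each earlier start given the one to its right.

```python
import random


def forward_weights(u, q):
    """Forward pass: v[l][j] sums the weight of all placements of motifs 0..l
    in which motif l starts at position j (positions are 0-based here).

    u[l][i]    -- weight of motif l starting at i (0 where the motif cannot fit)
    q(l, i, j) -- spacing weight between start i of motif l and start j of motif l+1
    """
    n = len(u[0])
    v = [list(u[0])]  # for the first motif, v is just u
    for l in range(1, len(u)):
        prev = v[-1]
        v.append([
            u[l][j] * sum(prev[i] * q(l - 1, i, j) for i in range(j))
            for j in range(n)
        ])
    return v


def sample_starts(u, q, rng):
    """Backward pass: draw the last start from v^(L), then each earlier start
    conditional on the start already drawn to its right."""
    v = forward_weights(u, q)
    n = len(u[0])
    starts = [0] * len(u)
    starts[-1] = rng.choices(range(n), weights=v[-1])[0]
    for l in range(len(u) - 2, -1, -1):
        nxt = starts[l + 1]
        weights = [v[l][i] * q(l, i, nxt) for i in range(n)]
        starts[l] = rng.choices(range(n), weights=weights)[0]
    return starts


# Toy example (assumed values): two motifs of widths 2 and 3 in a sequence of
# length 8, flat motif weights, and the "no penalty" spacing function.
widths = [2, 3]
n = 8
u = [[1.0 if i + w <= n else 0.0 for i in range(n)] for w in widths]


def no_penalty(l, i, j):
    return 1.0 if j - i >= widths[l] else 0.0


G0 = sum(forward_weights(u, no_penalty)[-1])  # normalizing constant
starts = sample_starts(u, no_penalty, random.Random(0))
```

With flat weights, `G0` simply counts the admissible placements, and every draw from `sample_starts` is collinear and nonoverlapping. The nested sum makes this sketch quadratic in the sequence length per motif, matching the general-penalty case rather than the faster memoryless case.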
Figure 3. Graphical Representation of the Fragmented Propagation Model. White spaces inside each motif element indicate that those positions of the motif elements are excluded from the motif model. The excluded positions for the lth motif are indexed by zeros in $\Delta_l$. The $\Delta_l$ are the same for all sequences.
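As a small illustration of the fragmentation indicators, the sketch below encodes each $\Delta_l$ as a list of 0s and 1s and recovers the sequence positions that a motif element actually models. The function names and the example indicators are hypothetical and chosen only for illustration.

```python
def selected_positions(start, delta):
    """Sequence positions modeled by a motif element that starts at `start`,
    keeping only the offsets whose fragmentation indicator is 1."""
    return [start + offset for offset, keep in enumerate(delta) if keep]


def total_columns(deltas):
    """Total number of conserved columns, i.e. the sum of |Delta_l| over motifs."""
    return sum(sum(delta) for delta in deltas)


# Two motifs with potential widths 4 and 5 (assumed example values).
deltas = [[1, 0, 1, 1], [1, 1, 0, 0, 1]]
positions = selected_positions(10, deltas[0])
W = total_columns(deltas)
```

Because the indicators are shared by all sequences, moving a column between motifs changes two of the lists but leaves `total_columns` unchanged, which is the constraint the fragmented model imposes.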

To accommodate this new feature, we rewrite the parameter vectors as one big matrix,

$$\Theta = (\Theta^{(1)}, \ldots, \Theta^{(L)}) = (\theta_1, \ldots, \theta_W),$$

where, in the old notation, $\theta_{w_1 + \cdots + w_{l-1} + j} = \theta_j^{(l)}$.

Hence the likelihood of the model with these added structures can be written as

$$\pi(R \mid A, \Delta_1, \ldots, \Delta_L, \theta_0, \Theta) \propto \theta_0^{h(R_{\{A,\Delta\}^c})} \prod_{w=1}^{W} \theta_w^{h(R_{A\Delta(w)})},$$

where $A\Delta(w)$ denotes the wth overall model position indicated by $\Delta$. Similarly to Liu et al. (1995), we consider the prior distribution for $\Delta$ as inversely proportional to the total number of possible realizations of $\Delta$ for given total spans, $J(\Delta_l) = \max\{w : \delta_{l,w} = 1\} - \min\{w : \delta_{l,w} = 1\}$, for $l = 1, \ldots, L$; that is,

$$\pi(\Delta) \propto \prod_{l=1}^{L} \binom{J(\Delta_l) - 1}{|\Delta_l| - 2}^{-1}.$$

Note that there are $\binom{J(\Delta_l) - 1}{|\Delta_l| - 2}$ ways of assigning 0s and 1s for positions within the span of length $J(\Delta_l)$. Techniques for treating this new feature via Gibbs sampling are essentially the same as those of Liu et al. (1995) and are omitted.

An important remaining problem is the choice of W, the total number of conserved columns, and L, the total number of conserved blocks. In Section 4 we provide a Bayesian maximum a posteriori (MAP) criterion for choosing these parameters.

3.4 Block-Motif Model With Deletions

Although the propagation model provides a way to combine the spirit of the block-based model and the gap-based HMM, it cannot handle deletion events easily. In this section we show that the deletion issue can be addressed by using a flexible indicator vector.

Suppose that there are L conserved collinear motifs, each with a fixed width $w_l$, in every sequence. For a particular sequence R, the alignment variable $A = (a_1, \ldots, a_L)$ represents the starting positions of these L conserved segments. The previously described propagation model assumes that all of the blocks must appear in every sequence to be aligned, permitting no deletions of any block in any of the sequences.

To account for deletions, we introduce a binary vector $D = (d_1, \ldots, d_L)$ for the sequence R, where $d_l = 0$ indicates that the lth block has been deleted and $d_l = 1$ indicates otherwise. Therefore, each sequence R is associated with an alignment vector A and a deletion vector D, neither of which is observed. When $d_l = 1$, $a_l$ indicates the location for the lth motif element. When $d_l = 0$, however, the value of $a_l$ is not meaningful. A and D can be treated as missing data and approached by an EM algorithm. Alternatively, we can give prior distributions to A, D, and $\Theta$ and use a Bayesian approach with computation completed by Gibbs sampling.

By giving different prior distributions to (A, D), we can obtain different desirable effects. For example, the simplest prior distribution for (A, D) is

$$\pi(A, D) = \pi_1(A)\, \pi_2(D),$$

where $\pi_1(A)$ is the same as $\pi(A)$ in Section 3.1 and $\pi_2(D) = \prod_{l=1}^{L} p_l^{d_l} (1 - p_l)^{1 - d_l}$; that is, the $d_l$ are mutually independent and $P(d_l = 1) = p_l$. A Markovian model for D is sometimes more desirable and can be characterized by

$$\pi_2(D) = f_1(d_1) \prod_{l=2}^{L} f_l(d_{l-1}, d_l),$$

where $f_l(d_{l-1}, d_l)$ is the transition function from $d_{l-1}$ to $d_l$. We can also model (A, D) jointly with a Markovian structure:

$$P(a_l, d_l \mid S_{l-1}) = \pi(a_l, d_l \mid a_{l-1}, d_{l-1}),$$

where $S_{l-1} = \sigma\{(a_j, d_j), 1 \le j \le l - 1\}$, the $\sigma$-field generated by all previous a's and d's.

If the width $w_l$ of each block is set to 1, then this deletion-propagation model is very similar to an HMM. In particular, $a_l$ then corresponds to the sequence position of $M_l$, the lth model position in an HMM (see Sec. 2.2). A deletion of $a_l$ indicated by $d_l = 0$ corresponds to a deletion of $M_l$. In a future work, we show that the deletion model generalizes the HMM of Krogh et al. (1994) and provide computational strategies for implementing the model.

4. MODEL SELECTION: AN APPROXIMATE BAYESIAN APPROACH

Two unresolved model selection issues remain: the number of motif elements (i.e., the number of gaps) and the total number of conserved positions in all motifs. The fragmentation model of Section 3.3 can be applied to allocate these positions into all motif elements.

The difficulty of model selection has long been appreciated by statisticians. Among the many solutions that have been proposed, the most popular are the Akaike information criterion (AIC), Bayes information criterion (BIC), and Mallows's $C_p$. Although these have proven effective for a class of problems, they have serious limitations in others, such as the sequence alignment problems (Lawrence et al. 1993). Model selection methods based on the Bayes factors (or model likelihoods) have proven useful in many Bayesian analyses, and the recent development of Markov chain Monte Carlo (MCMC) methods enables such methods to be carried out for very complicated and realistic models (see Kass and Raftery 1995 for a recent review). Other interesting Bayesian approaches to model critique have been pursued by Box (1980), Gelman, Meng, and Stern (1996), and Rubin (1984). Of the many different methods, it seems that the one based on the Bayes factor (i.e., the posterior density of the observed data) provides a good starting point for our problem.

As in previous sections, we let A denote the alignment vector (which consists of L motif blocks), R denote the observed sequence data, and $\Theta$ denote the model parameter for a particular model under consideration. Following Box

(1980), we assess model adequacy by the model likelihood p(R), which can be computed as

$$p(R) = \int p(R \mid \Theta, A)\, p(\Theta, A)\, d\Theta\, dA = \sum_{A} p(R \mid A)\, p(A),$$

where p(A) is the prior distribution for the alignment variable. Here we assume that $\Theta$ can be at least approximately integrated out. In many practical situations, as in our alignment problem, computation of p(R) unfortunately is infeasible, and some Monte Carlo or numerical approximations are necessary.

To simplify the computation involved, we introduce the MAP criterion for model selection, which chooses a model to maximize

$$\log\mathrm{MAP} = \log\{p(R \mid \hat{A})\} + \log\{p(\hat{A})\}, \qquad (8)$$

where $\hat{A}$ is the posterior mode of A under that model.

If the likelihood function $p(R \mid A)$ for the alignment variable is very much concentrated at its maximum, then we have the approximation $p(R) \approx p(R \mid \hat{A})\, p(\hat{A})$. Bounds on this approximation can be obtained as follows. Because $p(R) = p(R, A)/p(A \mid R)$, we have

$$\log p(R) = \log p(R \mid A) + \log p(A) - \log p(A \mid R).$$

Upper and lower bounds for $\log p(R)$ based on logMAP are

$$\log\mathrm{MAP} \le \log p(R) \le \log\mathrm{MAP} - E_{A \mid R}\{\log p(A \mid R)\}. \qquad (9)$$

Furthermore, by the information inequality (Cover and Thomas 1991) that for any nondegenerate distribution q(A),

$$E_{A \mid R}\{\log p(A \mid R)\} \ge E_{A \mid R}\{\log q(A)\},$$

the second inequality of (9) can be replaced by $\log p(R) \le \log\mathrm{MAP} - E_{A \mid R}\{\log q(A)\}$ and can be estimated using Monte Carlo samples. Thus the logMAP is closely related to the Bayes factor. Our experience shows that the logMAP criterion works quite well for multiple alignment problems.

In using the MAP criterion for model selection, one must provide a prior probability for each model. Although the propagation model permits flexible gap penalties, we have found the following no-gap penalty model to be highly effective. Specifically, we assume that the number of blocks L observed in a sequence is a priori equally likely to take any value in a range of possible numbers, say from $l_0 + 1$ to $L_0$. Hence $P(L = l) = 1/(L_0 - l_0)$ for any $l_0 < l \le L_0$. This implicitly introduces a constraint on the possible number of gaps. We further assume that all alignments with L = l motifs are equally likely. Therefore, the prior probability of observing a particular configuration of L = l motif elements in a sequence is inversely proportional to the total number of such configurations. The total number of such configurations can be computed using a recursive formula similar to (7). Because the number of all possible alignments of an L-motif model grows super-exponentially with L, the assumption that all models are equally likely is substantially different from the assumption that all alignments are equally likely.

Monitoring logMAP in MCMC sampling is done efficiently by recursive updating. More precisely, we calculate $\log\{p(R \mid A^{(0)})\}$ for a starting alignment $A^{(0)}$ under an initial model and then, for any further iteration, say $A^{(1)}$, compute the increment

$$\log\{p(R \mid A^{(1)})\} - \log\{p(R \mid A^{(0)})\},$$

which is easily done because our sampling algorithm is composed solely of small local moves.

A heuristic support of the MAP criterion stems from special characteristics of the alignment problem. The posterior alignment distribution contains numerous "chance" local modes that emerge as artifacts of the alignment model rather than of biology (Lawrence and Reilly 1996). Accordingly, the inference of biological interest often focuses on a small subset of the alignment ensemble. This subset can be distinguished from chance modes only if its members are concentrated around the global mode that an alignment algorithm can detect.

As demonstrated by many of our novel biological findings (Neuwald et al. 1997), it appears that the MAP criterion works quite well. A study by Neuwald et al. (1999) showed that the MAP criterion is conservative compared with the p value and Bayes factor approaches in the sense of preferring simpler models. The method performs satisfactorily for a simulation example. Qu and Lawrence (1999) showed that some modification of this criterion is required for effective prediction of structural alignments in the molecular modelling database (MMDB) (URL: http://www.ncbi.nih.gov/structure).

5. EXAMPLE AND DISCUSSION

Cells are very resourceful in their use of materials. For example, the basic building blocks of nucleic acids, ribonucleotide and deoxyribonucleotide triphosphates, are used in a number of cellular processes in addition to their role in RNA and DNA synthesis. One of the most important of these, adenosine triphosphate (ATP), is the universal "currency" for chemical energy in all organisms. ATP provides the power for most of the cells' endergonic (energy absorbing) processes. A limited number of important endergonic processes are powered by guanosine triphosphate (GTP), however. Reactions involving GTP are the focus of this application. Energy is released when GTP is broken down to guanosine diphosphate (GDP) through hydrolysis of its terminal phosphate bond as follows:

$$\mathrm{GTP} + \mathrm{H_2O} \to \mathrm{GDP} + \mathrm{P_i} + \mathrm{H^+},$$

where $\mathrm{P_i}$ is the phosphate ion. This energy releasing (exergonic) reaction is coupled to the reaction that requires energy (endergonic reaction) and is catalyzed by a GTPase enzyme. Several cellular processes utilize these coupled reactions. In this section we provide a detailed sequence analysis of the GTPases using the methodology described in previous sections.
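The no-gap-penalty prior described in Section 4 can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the function names are hypothetical, the number of blocks is taken to be uniform on $l_0 + 1, \ldots, L_0$, and each sequence contributes the reciprocal of its number of collinear, nonoverlapping placements, counted by a simple recursion over positions.

```python
import math


def count_configurations(n, widths):
    """Number of ways to place blocks of the given widths, in order and without
    overlap, in a sequence of length n."""
    ways = [1] * (n + 1)  # ways[p]: placements of the blocks so far within the first p residues
    for w in widths:
        placed = [0] * (n + 1)
        for p in range(1, n + 1):
            # either the current block ends before residue p, or it ends exactly at p
            placed[p] = placed[p - 1] + (ways[p - w] if p >= w else 0)
        ways = placed
    return ways[n]


def log_prior(seq_lengths, widths, l0, L0):
    """Log prior of one particular alignment with len(widths) blocks.
    Assumes every sequence is long enough to hold all blocks."""
    L = len(widths)
    if not l0 < L <= L0:
        return float("-inf")
    log_p = -math.log(L0 - l0)  # uniform prior on the number of blocks
    for n in seq_lengths:
        log_p -= math.log(count_configurations(n, widths))  # uniform over placements
    return log_p


n_configs = count_configurations(8, [2, 3])
value = log_prior([8, 8], [2, 3], l0=0, L0=4)
```

For fixed widths the count has the closed form $\binom{n - W + L}{L}$, which the recursion reproduces and which makes the rapid growth of the alignment space with L easy to see.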

5.1 The Dataset

Neuwald et al. (1997) examined the utility of PROBE, which is designed to identify protein families contained in the protein databases, find the conserved motifs, and align the family members. One of the families identified was a set of 1,338 GTPases. When the PURGE algorithm (details in Neuwald et al. 1995; available via anonymous ftp at ncbi.nlm.nih.gov), which computes the similarity score for every pair of sequences using a BLOSUM62 scoring matrix and removes close homologs (those with a BLOSUM62 score > 150), was applied to this set of sequences, a dataset of 46 sequences was obtained. For validation purposes, we added to this dataset two distantly related GTPases whose structures had been determined by X-ray crystallography. The sequences of these two proteins are given in Table 1. Because these two sequences are not significantly related, as measured by BLAST, but share common substructures (MMDB), they serve as good internal positive controls. Four sequences out of 46 in the previous dataset were related to the two added sequences. After these four related sequences were removed, the final dataset contained 44 sequences, with no pair having a BLOSUM62 score > 150.

5.2 Prior Specification

Throughout our applications of propagation and PROBE, the priors were set in a manner that is uninformative with respect to the alignment of any specific protein or family of proteins. Specifically, the priors on the $\Theta$'s in (4) were set in accordance with Lawrence et al. (1993); the vectors $\alpha$ and $\beta^{(l)}$ in the product Dirichlet were assigned equal values, with $\alpha_k \propto n_k$ and $\alpha_1 + \cdots + \alpha_p = 0.1N$, where $n_k$ is the total number of residues of type k in the entire dataset and $N = n_1 + \cdots + n_p$. The prior distribution for the alignment variable A, as given in Section 4, has been used throughout our work. It should be noted that in our prior specification the gap penalty function $q_l(x, y)$ was taken to be constant, which means that no explicit penalty was given to the length of gaps. Instead, the prior distribution of the number of gaps was uniform over a specified range, and, conditioned on this number, all arrangements of gaps were equally likely.

5.3 The Implementation of Propagation

The propagation algorithm and the MAP model selection criteria have been incorporated into software (PROBE) for the identification of protein families. A variation in implementing the propagation is to use a genetic algorithm to improve the mode-finding ability (i.e., find the MAP) of the Gibbs sampler. The algorithm consists of the following main steps (see Neuwald et al. 1997 for more details):

1. Create an initial population of M multiple alignments by repeating the following three steps M times:
a. Randomly draw the number of blocks (L) and the total number of columns (W) from a given distribution.
b. Align the purged sequences by using the Gibbs sampling algorithm derived from the propagation model (Sec. 3).
Seq#  NCBI ID      DB-Access.    Start  Element 1                  Gap 1      Element 2     Gap 2
 1)   gi|493746    pdb-121P      (4)    YKLVVVGAGGVGKSALTIQLIQNHF  (29-52)    LDILDTAGQEEY  (65-68)
 2)   gi|229900    pdb-1ETU      (13)   VNVGTIGHVDHGKTTLTAAITTVLA  (38-76)    YAHVDCPGHADY  (89-92)
 3)   gi|1077890   pir-S57091    (4)    STIICIGMAGSGKTTFMQRLNSHLR  (29-101)   NCIIDTPGQIEC  (114-125)
 4)   gi|141353    sp-P17103     (62)   ATVALVGFPSVGKSSLINAMTNADS  (87-109)   IQLLDVPGLIEG  (122-132)
 5)   gi|129021    sp-P20964     (159)  ADVGLVGFPSVGKSTLLSVVSSAKP  (184-207)  FVMADLPGLIEG  (220-230)
 6)   gi|434759    trem-Q15029   (130)  RNVTLCGHLHHGKTCFVDCLIEQTH  (155-199)  FNIMDTPGHVNF  (212-215)
 7)   gi|1204225   sp-Q10251     (485)  PICCILGHVDTGKTKLLDNLRRSNV  (510-552)  LLIIDTPGHESF  (563-566)
 8)   gi|68956     pir-RGECGT    (9)    GFIAIVGRPNVGKSTLLNKLLGQKI  (34-57)    AIYVDTPGLHME  (70-82)
 9)   gi|1174907   sp-P42871     (13)   TRIGIGGPVGSGKTAIIEVITPILI  (38-72)    LGVETGACPHTA  (85-120)
10)   gi|462264    sp-P25519     (198)  PTVSLVGYTNAGKSTLFNRITEARV  (223-238)  IDVADVGETVLA  (251-270)
Fragmentation: .... *

Seq#  Element 3                 Gap 3      Element 4    Gap 4      Element 5                Last
 1)   DQYMRTGEGFLCVFAINNTKSFED  (93-109)   PMVLVGNKCDL  (121-140)  YIETSAKTRQGVEDAFYTLVREI  (163)
 2)   ITGAAQMDGAILVVAATDGPMPQT  (117-129)  YIIVFLNKCDM  (141-263)  KLLDEGRAGENVGVLLRGIKREE  (286)
 3)   SFASSFPTVIAYIVDTPRNSSPTT  (150-166)  PMIVVFNKTDV  (178-235)  VVGVSSFTGDGFDEFMQCVDKKV  (256)
 4)   LSVIRGADLVIFVLSAFEIEQYDR  (157-236)  PSLVTVNKVDL  (248-269)  AIFISAAEEKGLDVLKERMWRAL  (292)
 5)   LRHIERTRVIVHVIDMSGLEGRDP  (255-275)  PQIIVANKMDM  (287-305)  VFPISAVTREGLRELLFEVANQL  (328)
 6)   TAGLRISDGVVLFIDAAEGVMLNT  (240-251)  AVTVCINKIDR  (263-358)  KAPTSSSQRSFVEFILEPLYKIL  (381)
 7)   SRGTSLCNIAILVIDIMHGLEPQT  (591-602)  PFVVALNKVDR  (614-672)  LVPTSAQSGEGVPDLVALLISLT  (695)
 8)   SSSIGDVELVIFVVEGTRWTPDDE  (107-117)  PVILAVNKVDN  (129-151)  IVPISAETGLNVDTIAAIVRKHL  (174)
 9)   TFSPALADFYIYVIDVAEGEKIPR  (145-152)  ADILVINKIDL  (164-186)  YILTNCKTGQGIEELVDMIMRDF  (209)
10)   LQETRQATLLLHVIDAADVRVQEN  (295-310)  PTLLVMNKIDM  (322-338)  RVWLSAQTGAGIPQLFQALTERL  (361)
Frag. *.***.************- ***** ******.* * *-

Figure 4. Aligned Motif Elements. The alignment of 10 of the 44 GTPase sequences mentioned in the text. Columns are as follows: NCBI sequence ID; protein database and corresponding sequence accession number; starting residue number of the first element; five aligned motif elements in the 10 sequences with the residue numbers of the intervening subsequences (gap) in parentheses; and number of the last residue of the last element. Starred columns are those selected by fragmentation (Liu et al. 1995). NCBI sequence ID numbers of the 44 sequences in the alignment are as follows: gi|493746, gi|229900, gi|1302162, gi|585780, gi|549796, gi|1154901, gi|601848, gi|559421, gi|631679, gi|1072199, gi|1171566, gi|729139, gi|731641, gi|1072255, gi|1340115, gi|861254, gi|466271, gi|585177, gi|1086887, gi|479657, gi|544493, gi|730928, gi|1085447, gi|131887, gi|1050856, gi|94524, gi|731284, gi|1079402, gi|600886, gi|124210, gi|1175159, gi|466991, gi|1051305, gi|558296, gi|544478, gi|1078133, gi|1077890, gi|141353, gi|129021, gi|434759, gi|1204225, gi|68956, gi|1174907, and gi|462264.
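The residue numbering in Figure 4 can be checked arithmetically: each element begins right after the preceding gap, and each gap begins right after the preceding element. The sketch below runs this check on the first row (121P); the function name is hypothetical, and the data are copied from the figure.

```python
def numbering_is_consistent(start, elements, gaps, last):
    """Check that element lengths and gap ranges tile the sequence without
    holes, from the first element's start to the last residue."""
    position = start
    for index, element in enumerate(elements):
        end = position + len(element) - 1
        if index < len(gaps):
            gap_start, gap_end = gaps[index]
            if gap_start != end + 1:
                return False
            position = gap_end + 1
        elif end != last:
            return False
    return True


# Row 1 of Figure 4 (gi|493746, pdb-121P).
elements = [
    "YKLVVVGAGGVGKSALTIQLIQNHF",
    "LDILDTAGQEEY",
    "DQYMRTGEGFLCVFAINNTKSFED",
    "PMVLVGNKCDL",
    "YIETSAKTRQGVEDAFYTLVREI",
]
gaps = [(29, 52), (65, 68), (93, 109), (121, 140)]
row_ok = numbering_is_consistent(4, elements, gaps, 163)
potential_width = sum(len(e) for e in elements)
```

The five elements span 95 potential positions in total, of which the fragmentation step retains the 78 conserved columns reported in the text.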

c. Save the copy of the alignment when it stabilizes.

2. Iteratively apply the following genetic algorithm-type steps:
a. Randomly choose two alignments from the population and determine the possible recombinants derived from the two alignments. (A recombinant alignment is composed of the nonoverlapping collinear blocks resulting from the two original alignments.)
b. Select the best one based on the MAP criteria (sampling proportional to fitness can be done as well) and add it to the population.
c. Remove the "least fit" alignment from the population.
d. Occasionally introduce new variants into the population by repeating steps 2(a)-(c).

As shown in Figure 4, at convergence the propagation algorithm aligned the 44 GTPase sequences, and the MAP model selection criteria identified five motifs with a total of 78 conserved positions.

As shown in Figure 5, there was considerable variation in the degree of conservation at different positions in the alignment. Nearly all of the most highly conserved positions play key roles in binding the substrate (GTP) or the product (GDP) or have important structural roles. As shown in Figure 7, nearly all of the most highly conserved positions interact directly with either GTP or GDP. For example, the conserved lysine (K), a positively charged amino acid, at position 13 of motif 1 interacts with a negatively charged phosphate of GTP/GDP (see Fig. 7). In addition, there are a number of conserved glycines (G), which allow the protein backbone to bend sharply.

5.4 Structure Prediction

Enzymes are proteins that catalyze chemical reactions. Their efficiencies in accelerating chemical reactions are usually several orders of magnitude beyond the best man-made catalysts. An enzyme achieves this efficiency by folding into a precise three-dimensional structure that binds the compound (the substrate) to be chemically converted. Two proteins with very different primary amino acid sequences can efficiently catalyze the same reaction. In such cases the structures of the two enzymes will typically be similar in the regions that bind the substrate. This suggests that predicting a protein's structure from its sequence is extremely difficult; not surprisingly, it is one of the grand challenges in biology.

Structural prediction based on sequence alignment has proven to be the most successful method for addressing this grand challenge. However, good predictions have been limited to proteins whose sequences are closely related. As sequences become more distant, improper alignments play a major role in the breakdown of these predictions. The method illustrated in this article is especially suitable for aligning distantly related sequences and helps improve structural predictions based on multiple alignment. However, approximations inherent in all multiple sequence models, including ours, demand that such predictions be validated by experimentally derived controls. Accordingly, we have incorporated a pair of distantly related sequences (1ETU and 121P) with known structures that have been shown to be similar by the VAST procedure and reported with a structural superposition in MMDB. These protein sequences and their X-ray structures provide useful data to
Figure 5. Sequence Logos. Posterior distribution of $\Theta$ presented as a sequence logo (Schneider and Stephens 1990) for motifs 1-5. The
heightHf') of thejth positionof motifI is computedas H(') - Z r{f$0rlog2(06$)>, where0JQ)- ( 0__r ranges over 20 amino acids). Accordingly,
positions in the alignment that are highly conserved are tall. The height of the letter r is $\theta_{r,j} \times H_j$. Confidence limits are delineated at one standard deviation in $H_j$.
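The genetic-algorithm steps 2(a)-(c) of Section 5.3 can be sketched generically. This is not PROBE's implementation: the `fitness` and `recombine` callables stand in for the logMAP score and the block-recombination rule, and the toy versions below are hypothetical placeholders used only to make the loop runnable.

```python
import random


def genetic_step(population, fitness, recombine, rng):
    """One pass of steps 2(a)-(c): recombine two random members, add the best
    recombinant, then drop the least fit member of the population."""
    parent_a, parent_b = rng.sample(population, 2)
    candidates = recombine(parent_a, parent_b)
    if candidates:
        population.append(max(candidates, key=fitness))
        population.remove(min(population, key=fitness))
    return population


# Toy stand-ins (hypothetical): an "alignment" is a tuple of block scores,
# fitness is their sum, and recombination mixes blocks from the two parents.
def toy_fitness(alignment):
    return sum(alignment)


def toy_recombine(a, b):
    return [tuple(max(x, y) for x, y in zip(a, b)),
            tuple(min(x, y) for x, y in zip(a, b))]


population = [(1, 5, 2), (4, 1, 3), (2, 2, 2), (0, 6, 1)]
best_before = max(map(toy_fitness, population))
rng = random.Random(1)
for _ in range(10):
    genetic_step(population, toy_fitness, toy_recombine, rng)
best_after = max(map(toy_fitness, population))
```

Because one member is added and one removed in each pass, the population size stays fixed and the best score in the population never decreases. Sampling the recombinant in proportion to fitness, as the text allows, would replace the `max` call.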

121P        (1)      MTEYKLVVVGAGGVGKSALTIQLIQNHF  (29-50)    CLLDILDTAGQEEY
1ETU PROBE  (12)     ...VNVGTIGHVDHGKTTLTAAITTVLA  (37-75)    ..YAHVDCPGHADY
1ETU VAST   (9)      KPHVNVGTIGHVDHGKTTLTAAITTV..  (35-73)    RHYAHVDCPGHA..

121P        (#)      SAMRDQYMRTGEGFLCVFAINNTKSFED  (93-108)   VPMVLVGNKCDL  (120)
1ETU PROBE  (88-91)  ....ITGAAQMDGAILVVAATDGPMPQT  (116-128)  .YIIVFLNKCDM  (139)
1ETU VAST   (86-87)  VKNMITGAAQMDGAILVVAATDG.....  (111-127)  PYIIVFLNKCDM  (139)
Figure 6. Comparison Between the Sequence Alignment (Motif Predictions) of 1ETU and 121P, Produced by PROBE, and Their Structural Alignment Based on the Crystallography Data, Produced by VAST. The numbers in the parentheses are the ending position of the previous motif and the starting position of the following motif. The sign (#) means that there is no gap between two consecutive motifs. The dots represent those positions that are missed by either PROBE or VAST. Although the PROBE and VAST alignments of 1ETU with 121P were produced independently, they are presented in adjacent rows to facilitate comparison. Qu and Lawrence (1999) have shown that reliable structural predictions based on PROBE require cross-validation of PROBE alignments. The purpose of this cross-validation is to ensure that the protein of interest does not bias the alignment in its own favor. This validation is accomplished by removing the sequence of the protein of interest, here 1ETU, and those that are even marginally similar to it from the multiple alignment. With these removed, a test of the hypothesis that the elements of the protein of interest, individually and collectively, are drawn from the motif model based on the remaining sequences is performed using SCAN (Neuwald et al. 1997). Proteins or elements found to be insignificant (p > .05) are not included in the prediction. Here element 5 of 1ETU fails this test, and thus it is not included in the foregoing prediction. Qu and Lawrence (1999) also showed that PROBE typically aligns only those residues that are in the vicinity of a ligand. These residues are usually about half of all those that can be structurally superimposed by VAST. Here VAST aligns 137 residues, whereas PROBE aligns only 78. As shown, 63 of these 78 residues agree with VAST.
examine the validity of the GTPase alignment represented in Figure 4.

Conditioned on the alignment of the 44 sequences in Figure 4, we decided to predict the structure of 1ETU, the target protein, based on the structure of 121P, the parent protein. Qu and Lawrence (1999) have reported methods of using the results of PROBE for such predictions, and Qu, Martin, and Lawrence (1998) have predicted the structure of glutamate decarboxylase using these methods. As described further in the legend of Figure 6, the backbone structure of the target can be predicted from that of the parent if criteria on the strength of the motif model and significance of the motif elements of the target are met. The first four motif elements of 1ETU passed this screening. Figure 6 shows a comparison of the structural alignment produced by VAST with the sequence alignment of PROBE for these four motifs. As shown in Figure 7, the four motifs of 1ETU structurally superpose well with those of 121P, and are in substantial agreement with predictions for these regions produced by VAST. Furthermore, the four motifs form major components of the GTP binding pocket.

For comparison, we also examined an alignment of these 44 sequences produced by CLUSTAL W (Thompson, Higgins, and Gibson 1994) with the structural information and found no agreement between sequence and structural alignments.

Figure 7. Structural superposition of the motif elements of 1ETU and 121P, shown as ribbons, with the GDP bound to 1ETU and the highly conserved residues presented as ball-and-stick figures. [...]

5.5 Discussion

In this article we have demonstrated a new efficient method for identifying subtle patterns conserved among [...] biological sequences. [...] As we noted in Section 1, our statistical model has several limitations, and further analyses, either theoretical or empirical, [...]

Another limitation is that every residue is treated independently given
the alignment information. Although much of our experience has shown that
the method developed in this article and that of Lawrence et al. (1993)
can tolerate datasets with substantial deviation from the independence
assumptions, a systematic analysis on robustness of these related methods
is desirable. It will also be a great advance if efficient algorithms can
be developed to simultaneously infer phylogeny and multiple alignment,
with both uncertainties taken into account.

By revealing conserved patterns, one can infer structural or functional
characteristics, such as the structural motifs predicted in Section 5.4,
of all members of the family of aligned sequences based on experimental
evidence that may be available for only a few members. Qu and Lawrence
(1999) showed how to refine such structural inferences. Several such
characterizations have led to useful discoveries, as reported by Neuwald
et al. (1997). In addition to characterizing specific protein families, a
major goal for this and related methods is the characterization of the
protein "universe" (Green et al. 1993).

[Received January 1997. Revised July 1998.]

REFERENCES

Allison, L., and Wallace, C. S. (1994), "The Posterior Probability Distribution of Alignments and Its Application to Parameter Estimation of Evolutionary Trees and to Optimization of Multiple Alignments," Journal of Molecular Evolution, 39, 418-430.
Allison, L., Wallace, C. S., and Yee, C. N. (1992), "Minimum Message Length Encoding Evolutionary Trees and Multiple Alignment," Proceedings of 25th Hawaii International Conference on System Science, 1, 663-674.
Altschul, S. F., Gish, W., Miller, M., Myers, E. W., and Lipman, D. J. (1990), "Basic Local Alignment Search Tool," Journal of Molecular Biology, 215, 403-410.
Baldi, P., Chauvin, Y., McClure, M., and Hunkapiller, T. (1994), "Hidden Markov Models of Biological Primary Sequence Information," Proceedings of the National Academy of Science, 91, 1059-1063.
Bishop, M. J., and Thompson, E. A. (1986), "Maximum Likelihood Alignment of DNA Sequences," Journal of Molecular Biology, 190, 159-165.
Box, G. E. P. (1980), "Sampling and Bayes Inference in Scientific Modeling and Robustness," Journal of the Royal Statistical Society, Ser. A, 143, 383-430.
Bronner, C. E., Baker, S. M., Morrison, P. T., Warren, G., Smith, L. G., Lescoe, M. K., Kane, M., Earabino, C., Lipford, J., Lindblom, A., Tannergard, P., Bollag, R. J., Godwin, A. R., Ward, D. C., Nordenskjöld, M., Fishel, R., Kolodner, R., and Liskay, R. M. (1994), "Mutation in the DNA Mismatch Repair Gene Homologue hMLH1 Is Associated With Hereditary Non-Polyposis Colon Cancer," Nature, 368, 258-261.
Campbell, M. K. (1995), Biochemistry (2nd ed.), New York: Saunders College Publishing.
Churchill, G. A. (1989), "Stochastic Models for Heterogeneous DNA Sequences," Bulletin of Mathematical Biology, 51, 79-94.
Claverie, J. M. (1996), "Effective Large-Scale Sequences Similarity Searches," Methods in Enzymology, 266, 212-227.
Cover, T. M., and Thomas, J. A. (1991), Elements of Information Theory, New York: Wiley.
Eddy, S. R. (1995), "Multiple Alignment Using Hidden Markov Models," Intelligent Systems for Molecular Biology, 3, 114-120.
Gelman, A., Meng, X. L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-796.
Green, P., Lipman, D., Hillier, L., Waterston, R., States, D., and Claverie, J. M. (1993), "Ancient Conserved Regions in New Gene-Sequences and the Protein Databases," Science, 259, 1711-1715.
Henikoff, S., and Henikoff, J. G. (1991), "Automated Assembly of Protein Blocks for Database Searching," Nucleic Acids Research, 19, 6565-6572.
Henikoff, S., Henikoff, J. G., Alford, W. J., and Pietrokovski, S. (1995), "Automated Construction and Graphical Presentation of Protein Blocks From Unaligned Sequences," Gene, 163 (2), GC17-GC26.
Karlin, S., and Brendel, V. (1992), "Chance and Statistical Significance in Protein and DNA Sequence Analysis," Science, 257, 39-49.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 377-395.
Krogh, A., Brown, M., Mian, S., Sjolander, K., and Haussler, D. (1994), "Protein Modeling Using Hidden Markov Models," Journal of Molecular Biology, 235, 1501-1531.
Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993), "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment," Science, 262, 208-214.
Lawrence, C. E., and Reilly, A. A. (1990), "An Expectation Maximization Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences," PROTEINS: Structure, Function, and Genetics, 7, 41-51.
Lawrence, C. E., and Reilly, A. A. (1992), "Likelihood Inferences for Permuted Data With Application to Gene Regulation," Journal of the American Statistical Association, 91, 76-85.
Lazareva, B., and Churchill, G. A. (1997), "Bayesian Restoration of a Hidden Markov Chain With Applications to DNA Sequencing," unpublished manuscript submitted to Journal of the American Statistical Association.
Liu, J. S. (1994), "The Collapsed Gibbs Sampler in Bayesian Computations With Applications to a Gene Regulation Problem," Journal of the American Statistical Association, 89, 958-966.
Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995), "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies," Journal of the American Statistical Association, 90, 1156-1170.
Marshall, E. (1996), "Hot Property: Biologists Who Compute," Science, 272, 1730-1732.
Needleman, S. B., and Wunsch, C. D. (1970), "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," Journal of Molecular Biology, 48, 443-453.
Neuwald, A. F., Liu, J. S., and Lawrence, C. E. (1995), "Gibbs Motif Sampling: Detection of Bacterial Outer Membrane Protein Repeats," Protein Science, 4, 1618-1632.
Neuwald, A. F., Liu, J. S., Lipman, D. J., and Lawrence, C. E. (1997), "Extracting Protein Alignment Models From the Sequence Database," Nucleic Acids Research, 25(9), 1665-1677.
Pearson, W. R., and Lipman, D. J. (1988), "Improved Tools for Biological Sequence Comparison," Proceedings of the National Academy of Science, 85, 2444-2448.
Qu, K., and Lawrence, C. E. (1999), "Extended Homology Prediction for Motif Structure by Multiple Sequence Alignment," unpublished manuscript submitted to Modeling and Scientific Computing.
Qu, K., Martin, D. L., and Lawrence, C. E. (1998), "Motifs and Structural Fold of the Cofactor Binding Site of Human Glutamate Decarboxylase," Protein Science, 7, 1092-1105.
Rabiner, L. R. (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 77, 257-286.
Rubin, D. B. (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.
Schneider, T. D., and Stephens, R. M. (1990), "Sequence Logos: A New Way To Display Consensus Sequences," Nucleic Acids Research, 20, 6097-6100.
Smith, T. F., and Waterman, M. S. (1981), "Identification of Common Molecular Subsequences," Journal of Molecular Biology, 147, 195-197.
Sonnhammer, E. L. L., Eddy, S. R., and Durbin, R. (1997), "Pfam: A Comprehensive Database of Protein Domain Families Based on Seed Alignments," Proteins, 28, 405-420.
Taubes, G. (1996), "Software Matchmakers Help Make Sense of Sequences," Science, 273, 588-590.
Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994), "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties, and Weight Matrix Choice," Nucleic Acids Research, 22, 4673-4680.
Thorne, J. L., Kishino, H., and Felsenstein, J. (1991), "An Evolutionary Model for Maximum Likelihood Alignment of DNA Sequences," Journal of Molecular Evolution, 33, 114-124.
Thorne, J. L., Kishino, H., and Felsenstein, J. (1992), "Inching Toward Reality: An Improved Likelihood Model of Sequence Evolution," Journal of Molecular Evolution, 34, 3-16.
Waterman, M. S. (1995), Introduction to Computational Biology, New York: Chapman and Hall.
West, M., and Harrison, J. (1989), Bayesian Forecasting and Dynamic Models, New York: Wiley.
Zhu, J., Liu, J. S., and Lawrence, C. E. (1998), "Bayesian Adaptive Sequence Alignment Algorithms," Bioinformatics, 14, 25-39.

