American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journal
of the American Statistical Association.
may make an entertaining game. By comparing the noisy sentences, your guests may be able to
identify "essential" parts of the original sentence that have been conserved even though the
children's transcriptions contain errors. There are not only typographical errors and
misspellings but also inserted or deleted letters and entire words. The following table shows
an alignment of the noisy copies:
VDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALEGDAEWEAKILELAGFLDSYIPEPERAIDKP
FLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLR
GIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEM
VMPGDNIKMVVTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

applied to assess the statistical significance of such alignments. Popular methods for
comparing a pair of sequences have been given by Needleman and Wunsch (1970) and Smith and
Waterman (1981), and the methods for searching the database to find a sequence related to
the query
[Figure 2: table-path diagram; image not reproduced.]

Figure 2. The Table-Path Illustration of the HMM. The "ancestral" model sequence is assumed
to have four positions, and the observed sequence R is seven residues long. The path in solid
arrows presents one particular way of generating the observed sequence R from the model M.

which implies that an insertion has occurred; and finally, (c) position $(i + 1, j + 1)$,
which means that only a point mutation is allowed. Thus the path depicted by the solid arrows
in Figure 2 corresponds to

\[ M_0 \to I_0 \to I_0 \to D_1 \to \cdots \to M_2 \to \cdots \to M_3 \to I_3 \to \mathrm{END}. \]

Extra constraints are usually needed to make such paths unique.

Although the architecture for generating observations described in Figure 1 is easily
understood, the meaning of the hidden states for this HMM is more subtle. One may naturally
think of the model positions $M_i$ as the hidden states. However, the true hidden states are
the allowable paths just described that traverse the $(n + 1) \times (L + 1)$ table and
generate from M the observed sequence R. More precisely, we let [...] be the hidden states
that generate the observed sequence R. We now formally define $h_t$ so that the HMM for
sequence alignment conforms to the standard form of (2) and (3). [...]

Consider residue $r_1$. Because it must be produced by an insertion or a match state, the
path to produce $r_1$ must be one of the following two types: an insertion after k deletions
(i.e., $D_1 \cdots D_k I_k$, where k can be 0, 1, ...) or a model position after k deletions
(i.e., $D_1 \cdots D_k M_{k+1}$,

The transition probabilities between the $h_t$'s can be explicitly written down using the
parameters encoded in the HMM architecture of Figure 1 (Krogh et al. 1994).

3. TOWARD A UNIFICATION: PROPAGATION MODEL

Many alignment problems involve multiple motifs. Although the single block-motif method of
Section 2.1 can be applied iteratively in this case, its failure to capture collinear
ordering of the motifs makes the method computationally inefficient when more than a few
motifs are present (Lawrence et al. 1993; Neuwald et al. 1995). In contrast, the HMM
explicitly capitalizes on collinearity to develop efficient recursive algorithms. These
models require large numbers of free parameters, however. Specifically, 2LK degrees of
freedom are associated with the trinomial (i.e., insertion, deletion, and match) alignment
parameters, and n(p - 1) degrees of freedom are associated with residue frequency multinomial
distributions, where L is the number of model positions, n is the average sequence length, K
is the number of sequences, and p is the size of the alphabet. This large parameter space can
lead to a lack of sensitivity.

To redress these complementary limitations, we describe a Markovian propagation model that
takes the form of a block-motif HMM, but with substantially fewer free parameters. Briefly,
in this approach the conserved region among multiple sequences is modeled as a fixed number
of ungapped and collinear blocks (multiple motifs) with flexible gaps between them. Residues
not assigned to any motif are modeled by a common background multinomial model.

As has been stated by Krogh et al. (1994), the block-motif model can be regarded as a special
HMM that allows no insertions or deletions within a motif. The propagation model can also be
viewed as a special HMM with a flexible gap (insertion) distribution. However, the
application of a general HMM leaves four major issues to be addressed: determining the sizes
of the blocks (Sec. 3.3), determining the number of blocks (Sec. 4), determining the number
of conserved columns (Sec. 4), and ensuring efficient computation (Secs. 3.2 and 5.3).
Furthermore, the block-motif viewpoint gives us a new look at the modeling of biological
sequences
Intuitively, we can imagine that L motif elements propagate along a sequence. Insertions are
reflected by gaps between adjacent motif elements. No deletions are allowed at this point;
this issue is addressed in Section 3.4.

Let $A = (A_1, \ldots, A_L) = (a_{k,l})_{K \times L}$ be a matrix with $a_{k,l}$ indicating
the starting position of the lth motif element in sequence k.

[Diagram: sequence k with the motif start positions $a_{k,1}, a_{k,2}, \ldots, a_{k,L}$ marked.]

These alignment variables are unobservable. Let vector
$A_{\cdot l} = (a_{1,l}, \ldots, a_{K,l})'$ indicate the starting positions of the lth motif
in all the sequences, and let $\Theta_l = (\theta_1^{(l)}, \ldots, \theta_{w_l}^{(l)})$
denote the parameter vector for the product-multinomial model of the lth motif, where $w_l$
is the width of the lth element. We write $W = w_1 + \cdots + w_L$. The likelihood can be
written as

\[ \pi(R \mid A, \theta_0, \Theta) \propto \theta_0^{h(R_{\{A\}^c})}
   \prod_{l=1}^{L} \prod_{j=1}^{w_l} \{\theta_j^{(l)}\}^{h(R_{A_{\cdot l} + j - 1})} \]

we can set $q_l(x, y) = 1$ for $y - x \ge w_l$, so that

\[ \pi(A_{k\cdot}) \propto \begin{cases}
     1 & \text{if } a_{k,l+1} - a_{k,l} \ge w_l \ \ \forall l, \\
     0 & \text{otherwise,}
   \end{cases} \]

where $a_{k,L+1} \equiv n_k + 1$. This prior induces only a collinearity and nonoverlapping
constraint. Because gaps between subtle motifs vary greatly, we feel that this "no penalty"
prior is most suitable for our tasks. When the number of motifs is to be determined from the
data, this penalty issue becomes more subtle, and we defer it to Section 5.

In the propagation model, we treat the number of sequences K, sequence length $n_k$, number
of motifs L, and the motif width $w_l$ as fixed constants instead of random variables.
Because formula (5) is given only up to proportionality, to get the actual distribution we
need to compute the normalizing constant by summing over all possible values of
$1 \le a_{k,1} < \cdots < a_{k,L} \le n_k$ in (5). Although this step is not necessary at
present, a similar summation is required for analyzing the posterior distribution of
$A_{k\cdot}$. We
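The "no penalty" prior and its normalizing constant can be illustrated with a small brute-force sketch. This is not the authors' code: the function names `prior_indicator` and `normalizing_constant` are hypothetical, positions are taken to be 1-based with $a_{k,L+1} = n_k + 1$, and the spacing condition is assumed to read $a_{k,l+1} - a_{k,l} \ge w_l$. The enumeration is exponential in L and is meant only for tiny inputs.

```python
from itertools import combinations


def prior_indicator(starts, widths, n_k):
    """Unnormalized 'no penalty' prior for one sequence (assumed form).

    starts : 1-based start positions (a_{k,1}, ..., a_{k,L})
    widths : motif widths (w_1, ..., w_L)
    n_k    : sequence length; a_{k,L+1} is taken to be n_k + 1
    Returns 1 if the motif elements are collinear and non-overlapping, else 0.
    """
    bounds = list(starts) + [n_k + 1]
    ok = all(bounds[l + 1] - bounds[l] >= widths[l] for l in range(len(widths)))
    return int(ok)


def normalizing_constant(widths, n_k):
    """Sum the indicator over all 1 <= a_1 < ... < a_L <= n_k (brute force)."""
    L = len(widths)
    return sum(
        prior_indicator(starts, widths, n_k)
        for starts in combinations(range(1, n_k + 1), L)
    )
```

Under these assumptions the sum simply counts the admissible placements, which is the binomial coefficient $\binom{n_k - W + L}{L}$; for two motifs of widths 2 and 3 in a sequence of length 6 it equals 3.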
and $u_i^{(l)} = 0$ for $i > n_k - w_l + 1$. Thus $u^{(l)}$ is proportional to the marginal
distribution of $a_{k,l}$ without considering other motifs, with given motif parameters at
$\theta_0$ and $\Theta_l$. Formula (6) can be rewritten as

\[ p(A_{k\cdot} = (i_1, \ldots, i_L)) \equiv
   \pi(A_{k\cdot} = (i_1, \ldots, i_L) \mid R, A_{[k]})
   \propto \prod_{l=1}^{L} q_l(i_l, i_{l+1})\, u_{i_l}^{(l)}. \]

To sample from $p(A_{k\cdot})$, we first need to compute the normalizing constant

\[ G_0 = \sum_{i_1 < i_2 < \cdots < i_L} \prod_{l=1}^{L} q_l(i_l, i_{l+1})\, u_{i_l}^{(l)}. \]

Let $v^{(0)}(j) = 1$, and for $j = 1, \ldots, n_k$ (the length of the sequence) let

\[ v^{(m)}(j) = \sum_{i_1 < \cdots < i_m = j} \prod_{l=1}^{m} q_l(i_l, i_{l+1})\, u_{i_l}^{(l)},
   \qquad j = 1, \ldots, n_k; \quad m = 1, \ldots, L. \]

Let $v^{(m)} = (v^{(m)}(1), \ldots, v^{(m)}(n_k))$. Then the following recursive relationship
holds:

\[ v^{(m+1)}(j) = \sum_{k<j} v^{(m)}(k)\, q_{m+1}(k, j)\, u_j^{(m+1)}, \]

which can be written in a matrix form as

\[ v^{(m+1)} = \{v^{(m)} Q^{(m)}\} * u^{(m+1)} \quad \text{for } m = 0, \ldots, L-1, \qquad (7) \]

where operation $*$ is defined as for all $u = (u_1, \ldots, u_n)$ and
$v = (v_1, \ldots, v_n)$,

\[ u * v = v * u = (u_1 v_1, \ldots, u_n v_n); \]

and for a matrix $S = (s_{ij})_{n \times n}$,

\[ u * S = S * u = (s_{ij} u_j)_{n \times n} \]

and

\[ u^T * S = S * u^T = (s_{ij} u_i)_{n \times n}. \]

Finally, we have $G_0 = \sum_{j=1}^{n_k} v^{(L)}(j)$. Hence the marginal distribution of
$a_{k,L}$ is $v^{(L)}/G_0$. Furthermore, we can easily derive the conditional distribution of
$a_{k,l-1}$ with given $a_{k,l}$.

Thus the random sampling of $A_{k\cdot}$ can proceed recursively as follows: first, draw
$a_{k,L}$ from the distribution $v^{(L)}/G_0$; then for given $x_l$, where $x_l$ is a row
vector with 0s as entries except for a 1 at the starting position $a_{k,l}$ of the lth motif,
draw $a_{k,l-1}$ from the probability vector proportional to
$p^{(l-1)} = \{v^{(l-1)}\}^T * Q^{(l-1)} x_l^T$, and denote it by $x_{l-1}$.

The computation required by this forward-backward procedure is O(nm) for a general spacing
penalty function $q_l(a, b)$, where n is the length of the sequence. But when the spacing
function is memoryless (i.e., $q_l(a, b)$ does not depend on l and is exponential in $b - a$
or constant), the amount of computation is reduced to $O(n_k)$ (Neuwald et al. 1997).

3.3 Fragmentation and Weighting

In the previous section, the conserved part in each protein sequence was described as a
sequel of L ungapped blocks each with known length $w_l$. But the exact value of $w_l$ is
rarely known; at best, biologists may have some a priori knowledge on its range. In addition,
not all of the positions within a block motif are equally important for protein structure and
function. For example, some of the residue positions in a motif may be critical to an
enzyme's catalytic function, and thus residue types at these positions are highly conserved.
On the other hand, there is little conservation of residue types in motif positions that
serve only as geometric place holders. Liu et al. (1995) and Neuwald et al. (1995) exploited
this characteristic by introducing the fragmentation model in addition to the block-motif
characterization. This model allows the aligned columns of a motif to hop stochastically
within a neighborhood of their current alignment positions with probability proportional to
the degree of conservation in each column as compared to background. We now extend that
approach to our propagation model to permit column hopping between motifs, thus removing the
requirement of having to specify the number of conserved columns $w_l$ allocated to each
block motif.

The key to the fragmentation model is the concept of potential width, $W_1, \ldots, W_L$,
with $W_l \ge w_l$, and fragmentation indicator
$\Delta_l = (\delta_{l,1}, \ldots, \delta_{l,W_l})$, with $\delta_{l,j} = 1$ or 0 indicating
whether the position is regarded as part of the motif or the background. Thus $W_l$ is the
potential span of motif l, and $\Delta_l$ is a vector indicating which of the $W_l$ potential
positions should be included in the model for motif l. We further let
$\Delta = (\Delta_1, \ldots, \Delta_L)$. A graphical representation is shown in Figure 3.

Liu et al. (1995) required that $|\Delta_l| = w_l$. We require only that the total number of
columns be constant; that is,

\[ \sum_{l=1}^{L} |\Delta_l| = W = \sum_{l=1}^{L} w_l. \]
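The difference between the per-motif constraint of Liu et al. (1995) and the relaxed total-column constraint can be made concrete with a short sketch. It is illustrative only: the helper names are hypothetical, each $\Delta_l$ is represented as a 0/1 list whose length is the potential width $W_l$, $|\Delta_l|$ is taken to be the number of ones, and the condition $W_l \ge w_l$ is an assumed reading of the text.

```python
def per_motif_ok(deltas, widths):
    """Per-motif constraint: each motif keeps exactly w_l columns."""
    return all(sum(d) == w for d, w in zip(deltas, widths))


def total_columns_ok(deltas, widths):
    """Relaxed constraint: only the total number of columns is fixed."""
    spans_ok = all(len(d) >= w for d, w in zip(deltas, widths))  # W_l >= w_l
    return spans_ok and sum(sum(d) for d in deltas) == sum(widths)


def selected_columns(start, delta):
    """Sequence positions modeled by a motif that starts at `start`."""
    return [start + j for j, d in enumerate(delta) if d == 1]


# Two motifs with nominal widths 3 and 2 (W = 5). One column has "hopped"
# from the second motif to the first, so the first now uses 4 columns.
widths = [3, 2]
deltas = [[1, 0, 1, 1, 1], [1, 0, 0]]
```

With these inputs `per_motif_ok(deltas, widths)` is false while `total_columns_ok(deltas, widths)` is true, which is the extra freedom that column hopping between motifs requires.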
[Figure 3 (image not reproduced); panel labels: $\Delta_1$, $\Delta_2$, ..., $\Delta_L$.]
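The forward recursion (7) and the backward draws of Section 3.2 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation. Positions are 0-based; `u[l][j]` is assumed to hold the weight of the (l+1)th motif starting at position j, already set to zero wherever the motif would run past the end of the sequence; and `q(l, a, b)` is a hypothetical spacing function for the gap between one motif starting at `a` and the next starting at `b`. The names `no_penalty`, `forward`, and `sample_starts` are illustrative, and the indexing of `q` is a convention chosen here for self-consistency.

```python
import random


def no_penalty(widths):
    """Spacing function for the 'no penalty' prior (assumed form)."""
    def q(l, a, b):
        return 1.0 if b - a >= widths[l] else 0.0
    return q


def forward(u, q):
    """Forward pass: v[m][j] sums the weights of all admissible
    placements of motifs 0..m in which motif m starts at position j."""
    L, n = len(u), len(u[0])
    v = [list(u[0])]
    for m in range(1, L):
        prev = v[-1]
        v.append([
            u[m][j] * sum(prev[k] * q(m - 1, k, j) for k in range(j))
            for j in range(n)
        ])
    return v


def sample_starts(u, q, rng=random):
    """Draw one set of start positions; returns (starts, G0)."""
    v = forward(u, q)
    L, n = len(u), len(u[0])
    G0 = sum(v[-1])  # normalizing constant
    starts = [0] * L
    # Last motif: marginal distribution proportional to v[L-1].
    starts[-1] = rng.choices(range(n), weights=v[-1])[0]
    # Earlier motifs: conditional on the start of the following motif.
    for l in range(L - 2, -1, -1):
        nxt = starts[l + 1]
        w = [v[l][k] * q(l, k, nxt) if k < nxt else 0.0 for k in range(n)]
        starts[l] = rng.choices(range(n), weights=w)[0]
    return starts, G0
```

Written this way, each forward step costs on the order of $n^2$ operations for a general spacing function; the sketch does not implement the faster memoryless case mentioned in the text.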
Figure 4. Aligned Motif Elements. The alignment of 10 of the 44 GTPase sequences mentioned in
the text. Columns are as follows: NCBI sequence ID; protein database and corresponding
sequence accession number; starting residue number of the first element; five aligned motif
elements in the 10 sequences with the residue numbers of the intervening subsequences (gap)
in parentheses; and number of the last residue of the last element. Starred columns are those
selected by fragmentation (Liu et al. 1995). NCBI sequence ID numbers of the 44 sequences in
the alignment are as follows: gi/493746, gi/229900, gi/1302162, gi/585780, gi/549796,
gi/1154901, gi/601848, gi/559421, gi/631679, gi/1072199, gi/1171566, gi/729139, gi/731641,
gi/1072255, gi/1340115, gi/861254, gi/466271, gi/585177, gi/1086887, gi/479657, gi/544493,
gi/730928, gi/1085447, gi/131887, gi/1050856, gi/94524, gi/731284, gi/1079402, gi/600886,
gi/124210, gi/1175159, gi/466991, gi/1051305, gi/558296, gi/544478, gi/1078133, gi/1077890,
gi/141353, gi/129021, gi/434759, gi/1204225, gi/68956, gi/1174907, and gi/462264.
[Figure 5 (image not reproduced); panel titles: Motif 1, Motif 2, Motif 3, Motif 4, ...]

Figure 5. Sequence Logos. Posterior distribution of $\Theta$ presented as a sequence logo
(Schneider and Stephens 1990) for motifs 1-5. The height $H_j^{(l)}$ of the jth position of
motif l is computed as [formula illegible in the source], where
$\theta_j^{(l)} = (\theta_{r,j}^{(l)}$; r ranges over 20 amino acids). Accordingly, positions
in the alignment that are highly conserved are tall. The height of the letter r is
$\theta_{r,j} \times H_j$. Confidence limits are delineated at one standard deviation in
$H_j$.