
Edit Distance

ADS NTUA
4 2007

Overview

What is edit distance? Why do we care?
A few interesting properties
Some algorithms from the past for finding exact edit distance
More (complicated) algorithms from today for approximate edit distance
Embedding edit distance into l1

What is that?

Suppose we have two strings x, y, e.g.
    x = kitten
    y = sitting
and we want to transform x into y.
We use edit operations: 1. insertions, 2. deletions, 3. substitutions

What is that?

A closer look:
    kitten
    sitting
1st step: kitten → sitten (substitution)
2nd step: sitten → sittin (substitution)
3rd step: sittin → sitting (insertion)

What is that?

Can we do better?
The answer here is no (obviously).
What about:
    x = darladidirladada
    y = marmelladara
Tough!

Why do we care?

A lot of applications depend on the similarity of two strings.
Computational Biology:
    ATGCATACGATCGATT
    TGCAATGGCTTAGCTA
Animal species from the same family are bound to have more similar DNA.
What about evolutionary biology?

Why do we care?

Searching for keywords on the web: by "mtallica" we usually mean "metallica".

Definitions

We care about bit-strings only: x, y ∈ {0,1}^n
For i ≤ j ≤ n we denote the substring of x between positions i and j by x[i..j]
By x_i we denote the i-th bit of x
Associate edit operations with string positions:
    deleting x_i → i
    substituting x_i → i
    inserting y → the position of y after the insertion
An alignment of x, y is a sequence of edit operations transforming x into y

Definitions

The length of an alignment is the number of its edit operations
An optimal alignment is one that uses a minimum number of edit operations
The Edit Distance of two strings x, y is the length of their optimal alignment: ED(x,y)
    e.g. ED(kitten, sitting) = 3
The Hamming Distance of two equal-length strings x, y is the number of positions at which the corresponding symbols differ (x_i ≠ y_i)
    e.g. HD(kitten, sittin) = 2
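
A quick Python sketch of the Hamming distance, straight from the definition:

    def hamming(x: str, y: str) -> int:
        assert len(x) == len(y)
        return sum(a != b for a, b in zip(x, y))

    # hamming("kitten", "sittin") == 2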

Some interesting properties

1. Triangle Inequality: for any three strings x, y, z of arbitrary lengths:
       ED(x,y) ≤ ED(x,z) + ED(z,y)

2. Splitting Inequality: let x, y be strings of lengths n and m respectively. For all i, j:
       ED(x,y) ≤ ED(x[1..i], y[1..j]) + ED(x[i+1..n], y[j+1..m])

Some interesting properties

3. Some simple upper and lower bounds.
   Let x, y be strings of lengths n, m (n ≤ m). Then:
       ED(x,y) ≤ m
       ED(x,y) ≥ m − n
       ED(x,y) = 0 iff x = y
       if m = n, ED(x,y) ≤ HD(x,y)
       ED(x,y) ≥ the number of characters (not counting duplicates) found in x but not in y

Some interesting properties

4. Induced alignment property. Fix an alignment σ of x and y and define:
       ins(i..j) = number of insertions in the interval [i..j]
       del(i..j) = number of deletions in the interval [i..j]
       sub(i..j) = number of substitutions in the interval [i..j]
   Caution: i, j refer to the initial positions of the string (see the string as an indexed array)
       sh(i..j) = ins(i..j) − del(i..j)
   Intuitively, sh(i..j) is the shift induced on x[i..j]
   Define sh(i) = sh(1..i) and sh(0) = 0
   ed(i..j) is the induced alignment of σ, i.e. the subsequence of edit operations of σ that fall in [i..j] (we also write ed(i..j) for its length)

Some interesting properties

Consider the strings x = savvato and y = eviva, and three possible alignments, in which y is written with gaps (dashes) as:

    savvato     savvato     savvato
    eviva--     -eviva--    e-viva--

For example, ed(4..5) = 1 when we insert the i at position 4.

Some interesting properties

Induced alignment property:
For any alignment σ of x and y, for all i ≤ j,
    ed(i..j) ≥ ED(x[i..j], y[i+sh(i-1)..j+sh(j)])

For the three alignments above: (sh(3), sh(5)) = (0,0), (-1,0), (-1,0), so
x[4..5] = va and y[4+sh(3)..5+sh(5)] = va, iva, iva respectively, and in each case
    ed(4..5) ≥ ED(x[4..5], y[4+sh(3)..5+sh(5)]) ≥ 0

A Dynamic Programming Algorithm

Levenshtein ~1965 (Levenshtein distance)
induced alignment property → principle of optimality (not exactly)
use dynamic programming to solve the problem (quite similar to the subset sum problem)
Key ideas:
    input: strings s, t with lengths n, m respectively
    use an (n+1)×(m+1) matrix d
    the invariant maintained throughout the algorithm is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations
    use a traceback algorithm to find the optimal alignment

A Dynamic Programming Algorithm

int LevenshteinDistance(char s[1..n], char t[1..m])
    int d[0..n, 0..m]
    for i from 0 to n  d[i, 0] := i
    for j from 1 to m  d[0, j] := j
    for i from 1 to n
        for j from 1 to m
            if s[i] = t[j] then cost := 0 else cost := 1
            d[i, j] := minimum(d[i-1, j] + 1,       // deletion
                               d[i, j-1] + 1,       // insertion
                               d[i-1, j-1] + cost)  // substitution
    return d[n, m]
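
The same algorithm as runnable Python, a direct translation of the pseudocode above (0-based indexing replaces the 1-based pseudocode):

    def levenshtein(s: str, t: str) -> int:
        n, m = len(s), len(t)
        # d[i][j] = minimum number of edits turning s[:i] into t[:j]
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i          # delete all of s[:i]
        for j in range(m + 1):
            d[0][j] = j          # insert all of t[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[n][m]

    # levenshtein("kitten", "sitting") == 3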

A Dynamic Programming Algorithm

In the DP table of an example (figure not shown):
    branches correspond to different optimal alignments
    the length of the optimal alignment is 5
The algorithm runs in O(n²) time and it can be improved so as to use O(n) space
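
A minimal sketch of the O(n)-space variant mentioned above: only the previous row of the DP table is kept.

    def levenshtein_two_rows(s: str, t: str) -> int:
        prev = list(range(len(t) + 1))           # row i-1 of the DP table
        for i, cs in enumerate(s, start=1):
            curr = [i]                           # d[i][0] = i
            for j, ct in enumerate(t, start=1):
                cost = 0 if cs == ct else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    # levenshtein_two_rows("kitten", "sitting") == 3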

A Dynamic Programming Algorithm

The above algorithm returns the exact value of the edit distance of two strings.
There exists a (minor) improvement of the algorithm so that it runs in time O(n²/log n), by Masek and Paterson (1980): only of theoretical interest.
O(n²) is too much! We want linear time.
Even developing subquadratic time algorithms for approximating edit distance within a modest factor has proved quite challenging.

Approximating edit distance efficiently

Suppose we know that our strings are either too similar or too dissimilar, namely that their edit distance is either at most k or greater than l(n,k). Then we can develop a gap algorithm that decides which of the two holds.
We are going to present three linear algorithms (Bar-Yossef, Jayram et al., FOCS 2004):
    The first one has l = O((kn)^(2/3)) and works for all strings
    The second one has l = O(k²) and works only for strings that are non-repetitive
    The third one is an n^(3/7)-approximation algorithm, which improves to n^(1/3) if the strings are non-repetitive

Approximating edit distance efficiently

Why do we care about an efficient gap algorithm?
1. Such algorithms yield approximation algorithms that are as efficient, with the approximation ratio directly correlated with the gap
2. They are useful on their own: there exist problems where two strings can be either too similar or too dissimilar

Approximating edit distance efficiently

Our model for the first two algorithms is the sketching model:
    a two-party public-coin simultaneous-messages communication complexity protocol
    parties: Alice, Bob and the Referee
    goal: to jointly compute f: A×B→C, where Alice has the input a ∈ A and Bob has the input b ∈ B
    Alice uses her input a and the shared random coins to compute a sketch sA(a) and then sends it to the referee; Bob does the same and sends a sketch sB(b)
    the referee uses the sketches and the shared coins to compute the value of the function f(a,b), with a small (constant) error probability
    main measure of complexity: the size of the sketches
    it is usually desirable that Alice and Bob are efficient too

Approximating edit distance efficiently

Why the sketching model?
Because there already exist sketching algorithms for the Hamming Distance.
Idea 1: map the Edit Distance into some form of Hamming Distance and then use the known results for the Hamming Distance
Idea 2: if two strings share a lot of identical substrings at nearby positions, they cannot differ too much

Gap algorithm for arbitrary strings

Technique:
1. Map each string to the multiset of all its (overlapping) substrings. Annotate each substring with an encoding of its position in the original string.
2. Take the characteristic vectors of the two multisets and run a gap algorithm for their Hamming Distance.
3. Map the results for HD into results for ED.

Gap algorithm for arbitrary strings

1st step:
    two inputs: the size of the input n and the gap parameter k
    define a suitable substring length B(n,k) and a suitable window parameter D(n,k)
    map x → Tx and y → Ty. These sets consist of pairs (σ, i), where σ is a length-B substring and i is the encoding of the substring's position

Gap algorithm for arbitrary strings

Encoding method: round down the starting position of the substring to the nearest multiple of D(n,k)
Example:
    Let x = savvato, B = 2 and D = 3. Then
    Tx = {(sa,0),(av,0),(vv,1),(va,1),(at,1),(to,2)}
Formally:
    Tx = {(x[i..i+B-1], i div D) | i = 1,…,n-B+1}
    B = n^(2/3)/(2k^(1/3)) and D = n/B
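
A small Python sketch of this encoding, using 1-based starting positions so that it reproduces the savvato example above; B and D are taken here as given parameters rather than computed from the formulas:

    from collections import Counter

    def encode(x: str, B: int, D: int) -> Counter:
        # multiset of (length-B substring, starting position rounded down to a multiple of D)
        T = Counter()
        for i in range(1, len(x) - B + 2):       # 1-based starting positions
            T[(x[i - 1:i - 1 + B], i // D)] += 1
        return T

    # encode("savvato", B=2, D=3) gives
    # {(sa,0),(av,0),(vv,1),(va,1),(at,1),(to,2)}, each with multiplicity 1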

Gap algorithm for arbitrary strings

2nd step:
Take the characteristic vectors u, v of the multisets Tx and Ty respectively. To do that, impose a (lexicographical) order on the elements of Tx ∪ Ty.
Example:
    If Tx = {(σ1, i1), (σ2, i2), (σ1, i3)} and Ty = {(σ2, i1), (σ1, i3), (σ4, i5)} then
    Tx ∪ Ty = {(σ1, i1), (σ2, i2), (σ1, i3), (σ2, i1), (σ4, i5)}
    order: (σ1, i1), (σ1, i3), (σ2, i1), (σ2, i2), (σ4, i5)
    Then u = 11010 and v = 01101
But we do not want to sort things (it costs). So instead of the union, consider the set of all pairs (σj, ij).
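
Since HD(u,v) is exactly the size of the symmetric difference of Tx and Ty, it can be computed without sorting at all (here exactly, with no sketches); a minimal Python sketch, with the σ's and i's stood in by strings:

    from collections import Counter

    def hd_of_characteristic_vectors(Tx: Counter, Ty: Counter) -> int:
        # HD(u, v) = number of coordinates where exactly one of the two
        # characteristic vectors is 1 = size of the symmetric difference
        return sum((Tx - Ty).values()) + sum((Ty - Tx).values())

    # the example above:
    Tx = Counter({("s1", "i1"): 1, ("s2", "i2"): 1, ("s1", "i3"): 1})
    Ty = Counter({("s2", "i1"): 1, ("s1", "i3"): 1, ("s4", "i5"): 1})
    # u = 11010, v = 01101  =>  HD(u, v) = 4
    assert hd_of_characteristic_vectors(Tx, Ty) == 4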

Gap algorithm for arbitrary strings

a pair (σ, i) ∈ Tx ∩ Ty is indicative of substrings of x and y that match (i.e. they are identical in terms of contents and appear at nearby positions in x and y) and corresponds to a j such that uj = vj
a pair (σ, i) ∈ (Tx \ Ty) ∪ (Ty \ Tx) is indicative of substrings that do not match and corresponds to a j such that uj ≠ vj
We now have to estimate the Hamming Distance between u and v

Gap algorithm for arbitrary strings

The Hamming Distance can be approximated using constant-size sketches, as shown by Kushilevitz, Ostrovsky, Rabani (2000)
Problem: the Hamming Distance instance our problem is reduced to is exponentially long (the set of all pairs (σj, ij)). As a result the time to produce the sketches is prohibitive (Alice and Bob are very slow)
Solution: a new, improved sketching method which
    1. produces constant-size sketches
    2. runs in time proportional to the HD of the instances
    3. solves the k vs (1+ε)k gap HD problem for any ε > 0
Notice that HD(u,v) ≤ O(n): there are no more than 2n distinct substrings

Gap algorithm for arbitrary strings

3rd step:
We tune the sketching algorithm for HD to determine whether HD(u,v) ≤ 4kB or HD(u,v) ≥ 8kB, with constant probability of error. The referee, upon receiving the sketches from Alice and Bob, decides that ED(x,y) ≤ k if he finds HD(u,v) ≤ 4kB, and that ED(x,y) ≥ 13(kn)^(2/3) if he finds HD(u,v) ≥ 8kB.
The algorithm's correctness follows from:
    Lemma 1: If ED(x,y) ≤ k then HD(u,v) ≤ 4kB
    Lemma 2: If ED(x,y) ≥ 13(kn)^(2/3) then HD(u,v) ≥ 8kB

Gap algorithm for arbitrary strings

set map(string x)
    set T := empty
    for i = 1 to n-B+1
        T := T ∪ {(x[i..i+B-1], i div D)}
    return T

int algorithm_1(string x, string y, int k)
    int B := n^(2/3)/(2k^(1/3))
    int D := n/B
    set Tx := map(x)
    set Ty := map(y)
    (u,v) := characteristic_vectors(Tx, Ty)
    if HD(u,v) ≤ 4kB then return (ED(x,y) ≤ k)
    if HD(u,v) ≥ 8kB then return (ED(x,y) ≥ 13(kn)^(2/3))
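
For concreteness, a rough Python sketch of the whole gap test. It computes HD(u,v) exactly instead of via constant-size sketches, and rounds B and D to integers; both simplifications are illustrative assumptions, not the paper's procedure.

    from collections import Counter

    def encode(x: str, B: int, D: int) -> Counter:
        T = Counter()
        for i in range(1, len(x) - B + 2):
            T[(x[i - 1:i - 1 + B], i // D)] += 1
        return T

    def gap_algorithm_1(x: str, y: str, k: int) -> str:
        n = len(x)
        B = max(1, round(n ** (2 / 3) / (2 * k ** (1 / 3))))
        D = max(1, n // B)
        Tx, Ty = encode(x, B, D), encode(y, B, D)
        hd = sum((Tx - Ty).values()) + sum((Ty - Tx).values())   # HD(u, v)
        if hd <= 4 * k * B:
            return "ED(x,y) <= k"
        if hd >= 8 * k * B:
            return "ED(x,y) >= 13*(kn)^(2/3)"
        return "input outside the promised gap"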

Gap algorithm for arbitrary strings

Remarks
    In the above algorithm Alice and Bob run the procedure map and the referee runs the procedure HD, which is the sketching algorithm for the Hamming Distance
    Notice that Alice and Bob do not use their random coins. Only the referee uses randomness and has bounded error probability
    The algorithm solves the k vs Ω((kn)^(2/3)) gap problem for all k ≤ n^(1/2), in quasi-linear time, using sketches of size O(1)
    The idea of independently encoding the position of each substring, though simple, has problems. Consider:
        x = paramana, y = mana, with B = 2, D = 3

Gap algorithm for nonrepetitive strings

We saw that encoding the substrings' positions independently, by choosing an appropriate integer D, can lead to problems: in fact we fail to identify many matches even in the presence of just one edit operation (consider the case of the strings x and 1x).
We overcome this handicap by resorting to a method where the encodings are correlated.

Gap algorithm for nonrepetitive strings

Idea: we scan the input from left to right, trying to find anchor substrings, i.e. identical substrings that occur in x and y at very nearby positions. All we need to change is the encoding of the position: we now map each substring to the region between successive anchors.
Example:
    x = ma ro ... as, y = *ma****ro***as**
    the substrings starting in a region between successive anchors have the same encoding

Gap algorithm for nonrepetitive strings

Why does this idea work?
    Remember that we are dealing with a gap algorithm. Hence, if the input strings are very similar, we expect that a sufficiently short substring, chosen from a sufficiently long window, is unlikely to contain edit operations and thus has to be matched with a corresponding substring of y in the same window.
And how do Alice and Bob choose identical substrings?
    Alice and Bob cannot communicate with each other, so they pick the anchors randomly. In fact they use some (shared) random permutations (remember the shared coins).

Gap algorithm for nonrepetitive strings

Isn't this too good to be true?
Yes, it is: in order for the random permutations to ensure that anchors are detected with high probability, we must demand that the input strings are non-repetitive.
A string is called (t,l)-non-repetitive if for any window of size l, the l substrings of length t which start inside the window are distinct.

Gap algorithm for nonrepetitive strings

set map(string x, randomness r, int t)   // Alice & Bob now use their coins r
    set T := empty
    int c := 1
    int i := 1
    table of anchors a
    table of regions reg
    // slide a window of length W along x, starting at c
    while c + 2W - 1 ≤ n
        for all length-t substrings starting in the interval [c+W .. c+2W-1]
            produce a Karp-Rabin fingerprint using r
        pick a random permutation of the fingerprints using r
        a[i] := the substring whose fingerprint is minimal under the permutation
        i := i + 1
        c := the first position after the anchor
    for j = 1 to i-1
        reg[j] := the substring starting after the last char of a[j-1] up to the last char of a[j]   // for j = 1: from the start of x
        T := T ∪ {(reg[j], j)}
    return T
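
A loose Python sketch of the anchor idea (not the algorithm from the paper): a shared, seeded hash stands in for the Karp-Rabin fingerprints plus random permutation, so both parties pick the same anchors; within each window the minimizing substring becomes the anchor, and each region between successive anchors gets one encoding.

    import hashlib

    def shared_rank(s: str, seed: str) -> int:
        # stand-in for "Karp-Rabin fingerprint + shared random permutation":
        # a deterministic rank both Alice and Bob can evaluate from shared randomness
        return int(hashlib.sha256((seed + s).encode()).hexdigest(), 16)

    def anchor_map(x: str, t: int, W: int, seed: str) -> set:
        T, region_start, region, c = set(), 0, 0, 0
        while c + 2 * W <= len(x):
            # candidate anchors: all length-t substrings starting in [c+W, c+2W-1]
            starts = range(c + W, c + 2 * W)
            best = min(starts, key=lambda i: shared_rank(x[i:i + t], seed))
            # everything from the previous cut up to the end of the anchor is one region
            T.add((x[region_start:best + t], region))
            region_start = best + t
            region += 1
            c = best + t          # continue scanning right after the anchor
        T.add((x[region_start:], region))   # trailing region
        return T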

Gap algorithm for nonrepetitive strings

int algorithm_2(string x, string y, int k, int t)
    // t is the non-repetition parameter
    randomness r
    set Tx := map(x, r, t)
    set Ty := map(y, r, t)
    (u,v) := characteristic_vectors(Tx, Ty)
    if HD(u,v) ≤ 3k then return (ED(x,y) ≤ k)
    if HD(u,v) ≥ 6k then return (ED(x,y) ≥ Ω((tk)²))

Gap algorithm for nonrepetitive strings

The algorithm's correctness follows from:
    Lemma 1: If ED(x,y) ≤ k then Pr[HD(u,v) ≤ 3k] ≥ 5/6
    Lemma 2: If HD(u,v) ≤ 6k then ED(x,y) ≤ O((tk)²)

For any 1 ≤ t < n, the algorithm solves the k vs Ω((tk)²) gap problem for all k ≤ (n/t)^(1/2), for all (t,tk)-non-repetitive strings, in polynomial time, using sketches of size O(1).

Approximation algorithm

Key idea: use of graphs
Given two strings x, y the edit graph GE is a representation of ED(x,y) by means of a directed graph.
    The vertices correspond to the edit distances of x[1..i] and y[1..j] for all i, j ≤ n.
    An edge between vertices corresponds to a single edit operation transforming one substring into the other.

Approximation algorithm

We define the graph G(B) as a (lossy) compression of GE: each vertex corresponds to a pair (i,s), where i = jB for j = 0..n/B and s = -k..k. The bigger the parameter B is, the lossier the compression.
Each vertex is closely related to the edit distance of x[1..i] and y[1..i+s] (s denotes the amount by which we shift y with respect to x).
We have two types of edges:
    a. a-type edges from (i,s) to (i,s') where |s-s'| = 1
    b. b-type edges from (i-B,s) to (i,s) with weight w(e)
w(a-type edges) = 1
w(b-type edges) depends on the approximation factor c

Approximation algorithm

In GE the weight of the shortest source-sink path corresponds to the optimal alignment of x, y.
Theorem 1: Given two strings x, y and their corresponding graph G(B), let the shortest path P from (0,0) to (n,0) have weight T. Then ED(x,y) ≤ T, and T ≤ (2c+2)k if ED(x,y) ≤ k, where c is the approximation factor affiliated with the b-type edge weights.

Approximation algorithm

Only problem: finding the shortest path.
Dijkstra is too slow (i.e. not linear).
If we could figure out the weights of all b-type edges for a given i simultaneously, perhaps we could solve the problem.
But computing the b-type edges is the same as finding the approximate edit distances between x[i+1..i+B] and every substring of y[i+1-k..i+B+k] of length B.

Approximation algorithm

c(p,t)-edit pattern matching problem: given a pattern string P of length p and a text string T of length t ≥ p, produce numbers d1, d2, …, d(t-p+1) such that
    di/c ≤ ED(P, T[i..i+p-1]) ≤ di for all i
Theorem 2: Suppose there is an algorithm that can solve the c(p,t)-edit pattern matching problem in time TIME(p,t). Let x, y be two strings and G(B) their corresponding graph. Set p = B, t = B+2k and c = c(p,t). Then the shortest path in G(B) can be used to solve the k vs (2c+2)k edit distance gap problem in time O((k + TIME(p,t))·n/B)
(i.e. the shortest path can be efficiently computed)
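
For concreteness, a naive exact Python solver for the edit pattern matching problem (the c = 1 case); it could serve as the epm_algorithm subroutine of the next slide, but it runs in roughly O(p²·(t-p+1)) time, far more than what Theorem 2 assumes.

    def edit_distance(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        return prev[-1]

    def naive_epm(P: str, T: str) -> list:
        # d[i] = ED(P, T[i..i+p-1]) exactly, i.e. the c = 1 case of the
        # c(p,t)-edit pattern matching problem
        p = len(P)
        return [edit_distance(P, T[i:i + p]) for i in range(len(T) - p + 1)]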

Approximation algorithm

graph make_graph(string x, string y, int k, int B)
    // bigger B → faster algorithm, bigger gap
    vertices V := empty
    edges EA, EB := empty
    for j = 0 to n div B          // vertices
        i := j*B
        for s = -k to k
            V := V ∪ {(i,s)}
    for j = 0 to n div B          // a-type edges: (source, sink, weight)
        i := j*B
        for s = -k to k
            EA := EA ∪ {((i,s),(i,s+1),1)}
            EA := EA ∪ {((i,s),(i,s-1),1)}
    for j = 1 to n div B          // b-type edges
        i := j*B
        for s = -k to k
            EB := EB ∪ {((i-B,s),(i,s),w)}
    return (V, EA ∪ EB)

Approximation algorithm

(d1, d2, …, d(t-p+1)) epm_algorithm(string P, string T)
    // length(T) ≥ length(P)
    // returns approximate ED(P,S) for all S: length(P)-substrings of T

int fix_weights(string x, string y, int B, int k, graph G, int n)   // why n as input?
    int d[-k..k]                   // table of b-type edge weights
    d := epm_algorithm(x[n-B+1..n], y[n-B+1-k..n+k])
    if n-B = 0 return d
    for s = -k to k
        EB := EB ∪ {((n-B,s),(n,s),d[s])}
    G := update(G, EB)
    return compose(d, fix_weights(x,y,B,k,G,n-B))   // what is compose?

Let T(i,s) denote the shortest path from (0,0) to (i,s)
Notice that fix_weights is recursive: in each call T(i,s) is simultaneously (fast!) computed for all s. The idea is to compute the shortest path T(n,0) by recursively computing the shortest paths T(i,s). T(i,s) uses the result T(i-B,s).
No more than n/B recursive calls of fix_weights are needed

Approximation algorithm
int shortest_path(string x, string y, int k, int B)
    graph G := make_graph(x,y,k,B)
    int T := fix_weights(x,y,B,k,G,n)
    return T

Correctness of the gap algorithm (follows from Theorem 1):
    1. If ED ≤ k then T ≤ (2c+2)k
    2. If ED ≥ (2c+2)k then T ≥ ED ≥ (2c+2)k
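
A Python sketch of the layered shortest-path computation over G(B): vertices (i,s) are processed block by block, b-type edge weights come from an EPM oracle (for example the naive_epm above), and within each layer the a-type edges (weight 1 between adjacent shifts) are relaxed by two sweeps. The relaxation scheme, the function names and the assumption that B divides n are illustrative choices, not the paper's exact procedure.

    def shortest_path_GB(x: str, y: str, k: int, B: int, epm) -> float:
        n = len(x)                              # assumed to be a multiple of B
        shifts = range(-k, k + 1)
        INF = float("inf")
        # layer i = 0: only a-type edges out of the source (0, 0)
        T = {s: abs(s) for s in shifts}
        for i in range(B, n + 1, B):
            block = x[i - B:i]
            # window of y containing every candidate length-B substring for this block
            lo, hi = max(0, i - B - k), min(len(y), i + k)
            d = epm(block, y[lo:hi])            # approximate ED of block vs. each window start
            new = {}
            for s in shifts:
                start = i - B + s               # 0-based start of y[i-B+1+s .. i+s]
                idx = start - lo
                w = d[idx] if 0 <= idx < len(d) else INF   # b-type edge weight
                new[s] = T[s] + w
            # relax a-type edges (weight 1 between neighbouring shifts) with two sweeps
            for s in range(-k + 1, k + 1):
                new[s] = min(new[s], new[s - 1] + 1)
            for s in range(k - 1, -k - 1, -1):
                new[s] = min(new[s], new[s + 1] + 1)
            T = new
        return T[0]      # weight of the shortest path from (0,0) to (n,0)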

Approximation algorithm

Some results which use algorithms for the edit pattern matching problem:
    1. quasi-linear time algorithm for k vs O(k²)
    2. quasi-linear time algorithm for k vs O(k^(7/4))
    3. quasi-linear time algorithm for k vs O(k^(3/2)) for (k, O(k^(1/2)))-non-repetitive strings
    4. 2 & 3 imply approximation algorithms with factors n^(3/7) and n^(1/3) respectively

Embedding edit distance into l1

Question: can {0,1}^d endowed with edit distance be embedded into l1 (R^d endowed with the L1 metric) with low distortion?
Known results up to 2005:
    1. Edit distance cannot be embedded into l1 with distortion less than 3/2
    2. Edit distance can be trivially embedded into l1 with distortion d
    3. Edit distance can be embedded into l1 with distortion 2^(O(sqrt(log d · log log d))) (Ostrovsky, Rabani, STOC 2005)
Why do we care?
    1. Because embeddings allow us to use efficient algorithms of one metric space on instances of another metric space
    2. Because embeddings help us learn more about the structure of a metric space

Embedding edit distance into l1

Let x be a string. Then for any integer s we denote by shifts(x,s) the set
    {x[1..|x|-s+1], x[2..|x|-s+2], …, x[s..|x|]}
which consists of the s substrings of x (each of length |x|-s+1) obtained by shifting a window across x.
Theorem:
There exists a universal constant c > 0 such that for every integer d > 0 there exists an embedding f: ({0,1}^d, ED) → l1 with distortion at most 2^(c·sqrt(log d · log log d)).

Embedding edit distance into l1

Key idea of the embedding:
For sufficiently small d the distortion is indeed that small. So it suffices to break the string down into 2^(sqrt(log d · log log d)) substrings of approximately the same length, the blocks, and to recursively embed into l1 some metric spaces of lower dimension. The spaces embedded must be chosen in such a way that the concatenation of their scaled embeddings results in the embedding of the original string.

Embedding edit distance into l1

Choosing the metric in the recursive step:
    Let x1, x2, …, xn be the blocks.
    For all i consider the sets shifts(xi, s), where s ranges over the non-negative powers of log d that are below the block length.
    Now define the distance between the sets shifts(xi, s) and shifts(xj, s) to be the minimum cost perfect matching, where the cost of an edge between two elements is their edit distance.
    This defines a metric that satisfies the above requirements. Ideally we want to embed it into l1.
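
A brute-force Python sketch of this matching distance: the cost of the cheapest perfect matching between two equal-size sets of strings, with edit distance as the edge cost (exponential in the set size, purely illustrative).

    from itertools import permutations

    def edit_distance(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        return prev[-1]

    def matching_distance(A: list, B: list) -> int:
        # minimum-cost perfect matching between A and B, edge cost = edit distance
        assert len(A) == len(B)
        return min(sum(edit_distance(a, B[i]) for a, i in zip(A, perm))
                   for perm in permutations(range(len(B))))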

Embedding edit distance into l1

However, a good embedding for the metric on the sets shifts(xi, s) is too strong an inductive hypothesis.
Therefore we inductively embed the strings in shifts(xi, s) into l1 and redefine the edge costs for the minimum cost perfect matching to be the l1 distances of the embedded strings.
This inductive embedding is not one of low distortion, but it also does not need to be one. It just needs to satisfy a weaker property (omitted here) which guarantees the upper bound on the total distortion.

The End
