
Edit Distance

ADS NTUA
4 2007

Overview

What is edit distance? Why do we care?
A few interesting properties
Some algorithms from the past for finding exact edit distance
More (complicated) algorithms from today for approximate edit distance
Embedding edit distance into l1

What is that?

Suppose we have two strings x, y, e.g.
    x = kitten
    y = sitting
and we want to transform x into y.
We use edit operations: 1. insertions, 2. deletions, 3. substitutions

What is that?

A closer look:
    kitten
    sitting
1st step: kitten → sitten (substitution)
2nd step: sitten → sittin (substitution)
3rd step: sittin → sitting (insertion)

What is that?

Can we do better?
The answer here is no (obviously).
What about:
    x = darladidirladada
    y = marmelladara
Tough!

Why do we care?

A lot of applications depend on the similarity of two strings.
Computational Biology:
    ATGCATACGATCGATT
    TGCAATGGCTTAGCTA
Animal species from the same family are bound to have more similar DNA.
What about evolutionary biology?

Why do we care?

Searching for keywords on the web: by "mtallica" we usually mean "metallica".

Definitions

We care about bit-strings only: x, y ∈ {0,1}^n
For i ≤ j ≤ n we denote the substring of x between positions i and j by x[i..j]
By x_i we denote the i-th bit of x
Associate edit operations with string positions:
    deleting x_i → i
    substituting x_i → i
    inserting y → the position of y after the insertion
An alignment of x, y is a sequence of edit operations transforming x into y

Definitions

The length of an alignment is the number of its edit operations
An optimal alignment is one that uses a minimum number of edit operations
The Edit Distance of two strings x, y is the length of their optimal alignment: ED(x,y)
    e.g. ED(kitten, sitting) = 3
The Hamming Distance of two equal-length strings x, y is the number of positions at which the corresponding symbols differ (x_i ≠ y_i)
    e.g. HD(kitten, sittin) = 2
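
A quick Python sketch of the Hamming distance, straight from the definition:

    def hamming(x: str, y: str) -> int:
        assert len(x) == len(y)
        return sum(a != b for a, b in zip(x, y))

    # hamming("kitten", "sittin") == 2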

Some interesting properties

1. Triangle Inequality: for any three strings x, y, z of arbitrary lengths:
       ED(x,y) ≤ ED(x,z) + ED(z,y)

2. Splitting Inequality: let x, y be strings of lengths n and m respectively. For all i, j:
       ED(x,y) ≤ ED(x[1..i], y[1..j]) + ED(x[i+1..n], y[j+1..m])

Some interesting properties

3. Some simple upper and lower bounds.
   Let x, y be strings of lengths n, m (n ≤ m). Then:
       ED(x,y) ≤ m
       ED(x,y) ≥ m − n
       ED(x,y) = 0 iff x = y
       if m = n, ED(x,y) ≤ HD(x,y)
       ED(x,y) ≥ the number of characters (not counting duplicates) found in x but not in y

Some interesting properties

4. Induced alignment property. Fix an alignment σ of x and y and define:
       ins(i..j) = number of insertions in the interval [i..j]
       del(i..j) = number of deletions in the interval [i..j]
       sub(i..j) = number of substitutions in the interval [i..j]
   Caution: i, j refer to the initial positions of the string (see the string as an indexed array)
       sh(i..j) = ins(i..j) − del(i..j)
   Intuitively, sh(i..j) is the shift induced on x[i..j]
   Define sh(i) = sh(1..i) and sh(0) = 0
   ed(i..j) is the induced alignment of σ, i.e. the subsequence of edit operations of σ that fall in [i..j] (we also write ed(i..j) for its length)

Some interesting properties

Consider the strings x = savvato and y = eviva, and three possible alignments, in which y is written with gaps (dashes) as:

    savvato     savvato     savvato
    eviva--     -eviva--    e-viva--

For example, ed(4..5) = 1 when we insert the i at position 4.

Some interesting properties

Induced alignment property:
For any alignment σ of x and y, for all i ≤ j,
    ed(i..j) ≥ ED(x[i..j], y[i+sh(i-1)..j+sh(j)])

For the three alignments above: (sh(3), sh(5)) = (0,0), (-1,0), (-1,0), so
x[4..5] = va and y[4+sh(3)..5+sh(5)] = va, iva, iva respectively, and in each case
    ed(4..5) ≥ ED(x[4..5], y[4+sh(3)..5+sh(5)]) ≥ 0

A Dynamic Programming Algorithm

Levenshtein ~1965 (Levenshtein distance)
induced alignment property → principle of optimality (not exactly)
use dynamic programming to solve the problem (quite similar to the subset sum problem)
Key ideas:
    input: strings s, t with lengths n, m respectively
    use an (n+1)×(m+1) matrix d
    the invariant maintained throughout the algorithm is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations
    use a traceback algorithm to find the optimal alignment

A Dynamic Programming Algorithm

int LevenshteinDistance(char s[1..n], char t[1..m])
    int d[0..n, 0..m]
    for i from 0 to n  d[i, 0] := i
    for j from 1 to m  d[0, j] := j
    for i from 1 to n
        for j from 1 to m
            if s[i] = t[j] then cost := 0 else cost := 1
            d[i, j] := minimum(d[i-1, j] + 1,       // deletion
                               d[i, j-1] + 1,       // insertion
                               d[i-1, j-1] + cost)  // substitution
    return d[n, m]
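
The same algorithm as runnable Python, a direct translation of the pseudocode above (0-based indexing replaces the 1-based pseudocode):

    def levenshtein(s: str, t: str) -> int:
        n, m = len(s), len(t)
        # d[i][j] = minimum number of edits turning s[:i] into t[:j]
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i          # delete all of s[:i]
        for j in range(m + 1):
            d[0][j] = j          # insert all of t[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[n][m]

    # levenshtein("kitten", "sitting") == 3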

A Dynamic Programming Algorithm

In the DP table of an example (figure not shown):
    branches correspond to different optimal alignments
    the length of the optimal alignment is 5
The algorithm runs in O(n²) time and it can be improved so as to use O(n) space
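
A minimal sketch of the O(n)-space variant mentioned above: only the previous row of the DP table is kept.

    def levenshtein_two_rows(s: str, t: str) -> int:
        prev = list(range(len(t) + 1))           # row i-1 of the DP table
        for i, cs in enumerate(s, start=1):
            curr = [i]                           # d[i][0] = i
            for j, ct in enumerate(t, start=1):
                cost = 0 if cs == ct else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    # levenshtein_two_rows("kitten", "sitting") == 3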

A Dynamic Programming Algorithm

The above algorithm returns the exact value of the edit distance of two strings.
There exists a (minor) improvement of the algorithm so that it runs in time O(n²/log n), by Masek and Paterson (1980): only of theoretical interest.
O(n²) is too much! We want linear time.
Even developing subquadratic time algorithms for approximating edit distance within a modest factor has proved quite challenging.

Approximating edit distance efficiently

Suppose we know that our strings are either too similar or too dissimilar, namely that their edit distance is either at most k or greater than l(n,k). Then we can develop a gap algorithm that decides which of the two holds.
We are going to present three linear algorithms (Bar-Yossef, Jayram et al., FOCS 2004):
    The first one has l = O((kn)^(2/3)) and works for all strings
    The second one has l = O(k²) and works only for strings that are non-repetitive
    The third one is an n^(3/7)-approximation algorithm, which improves to n^(1/3) if the strings are non-repetitive

Approximating edit distance efficiently

Why do we care about an efficient gap algorithm?
1. Such algorithms yield approximation algorithms that are as efficient, with the approximation ratio directly correlated with the gap
2. They are useful on their own: there exist problems where two strings can be either too similar or too dissimilar

Approximating edit distance efficiently

Our model for the first two algorithms is the sketching model:
    a two-party public-coin simultaneous-messages communication complexity protocol
    parties: Alice, Bob and the Referee
    goal: to jointly compute f: A×B→C, where Alice has the input a ∈ A and Bob has the input b ∈ B
    Alice uses her input a and the shared random coins to compute a sketch sA(a) and then sends it to the referee; Bob does the same and sends a sketch sB(b)
    the referee uses the sketches and the shared coins to compute the value of the function f(a,b), with a small (constant) error probability
    main measure of complexity: the size of the sketches
    it is usually desirable that Alice and Bob are efficient too

Approximating edit distance efficiently

Why the sketching model?
Because there already exist sketching algorithms for the Hamming Distance.
Idea 1: map the Edit Distance into some form of Hamming Distance and then use the known results for the Hamming Distance
Idea 2: if two strings share a lot of identical substrings at nearby positions, they cannot differ too much

Gap algorithm for arbitrary strings

Technique:
1. Map each string to the multiset of all its (overlapping) substrings. Annotate each substring with an encoding of its position in the original string.
2. Take the characteristic vectors of the two multisets and run a gap algorithm for their Hamming Distance.
3. Map the results for HD into results for ED.

Gap algorithm for arbitrary strings

1st step:
    two inputs: the size of the input n and the gap parameter k
    define a suitable substring length B(n,k) and a suitable window parameter D(n,k)
    map x → Tx and y → Ty. These sets consist of pairs (σ, i), where σ is a length-B substring and i is the encoding of the substring's position

Gap algorithm for arbitrary strings

Encoding method: round down the starting position of the substring to the nearest multiple of D(n,k)
Example:
    Let x = savvato, B = 2 and D = 3. Then
    Tx = {(sa,0),(av,0),(vv,1),(va,1),(at,1),(to,2)}
Formally:
    Tx = {(x[i..i+B-1], i div D) | i = 1,…,n-B+1}
    B = n^(2/3)/(2k^(1/3)) and D = n/B
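
A small Python sketch of this encoding, using 1-based starting positions so that it reproduces the savvato example above; B and D are taken here as given parameters rather than computed from the formulas:

    from collections import Counter

    def encode(x: str, B: int, D: int) -> Counter:
        # multiset of (length-B substring, starting position rounded down to a multiple of D)
        T = Counter()
        for i in range(1, len(x) - B + 2):       # 1-based starting positions
            T[(x[i - 1:i - 1 + B], i // D)] += 1
        return T

    # encode("savvato", B=2, D=3) gives
    # {(sa,0),(av,0),(vv,1),(va,1),(at,1),(to,2)}, each with multiplicity 1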

Gap algorithm for arbitrary strings

2nd step:
Take the characteristic vectors u, v of the multisets Tx and Ty respectively. To do that, impose a (lexicographical) order on the elements of Tx ∪ Ty.
Example:
    If Tx = {(σ1, i1), (σ2, i2), (σ1, i3)} and Ty = {(σ2, i1), (σ1, i3), (σ4, i5)} then
    Tx ∪ Ty = {(σ1, i1), (σ2, i2), (σ1, i3), (σ2, i1), (σ4, i5)}
    order: (σ1, i1), (σ1, i3), (σ2, i1), (σ2, i2), (σ4, i5)
    Then u = 11010 and v = 01101
But we do not want to sort things (it costs). So instead of the union, consider the set of all pairs (σj, ij).
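
Since HD(u,v) is exactly the size of the symmetric difference of Tx and Ty, it can be computed without sorting at all (here exactly, with no sketches); a minimal Python sketch, with the σ's and i's stood in by strings:

    from collections import Counter

    def hd_of_characteristic_vectors(Tx: Counter, Ty: Counter) -> int:
        # HD(u, v) = number of coordinates where exactly one of the two
        # characteristic vectors is 1 = size of the symmetric difference
        return sum((Tx - Ty).values()) + sum((Ty - Tx).values())

    # the example above:
    Tx = Counter({("s1", "i1"): 1, ("s2", "i2"): 1, ("s1", "i3"): 1})
    Ty = Counter({("s2", "i1"): 1, ("s1", "i3"): 1, ("s4", "i5"): 1})
    # u = 11010, v = 01101  =>  HD(u, v) = 4
    assert hd_of_characteristic_vectors(Tx, Ty) == 4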

Gap algorithm for arbitrary strings

a pair (σ, i) ∈ Tx ∩ Ty is indicative of substrings of x and y that match (i.e. they are identical in terms of contents and appear at nearby positions in x and y) and corresponds to a j such that uj = vj
a pair (σ, i) ∈ (Tx \ Ty) ∪ (Ty \ Tx) is indicative of substrings that do not match and corresponds to a j such that uj ≠ vj
We now have to estimate the Hamming Distance between u and v

Gap algorithm for arbitrary strings

The Hamming Distance can be approximated using constant-size sketches, as shown by Kushilevitz, Ostrovsky, Rabani (2000)
Problem: the Hamming Distance instance our problem is reduced to is exponentially long (the set of all pairs (σj, ij)). As a result the time to produce the sketches is prohibitive (Alice and Bob are very slow)
Solution: a new, improved sketching method which
    1. produces constant-size sketches
    2. runs in time proportional to the HD of the instances
    3. solves the k vs (1+ε)k gap HD problem for any ε > 0
Notice that HD(u,v) ≤ O(n): there are no more than 2n distinct substrings

Gap algorithm for arbitrary strings

3rd step:
We tune the sketching algorithm for HD to determine whether HD(u,v) ≤ 4kB or HD(u,v) ≥ 8kB, with constant probability of error. The referee, upon receiving the sketches from Alice and Bob, decides that ED(x,y) ≤ k if he finds HD(u,v) ≤ 4kB, and that ED(x,y) ≥ 13(kn)^(2/3) if he finds HD(u,v) ≥ 8kB.
The algorithm's correctness follows from:
    Lemma 1: If ED(x,y) ≤ k then HD(u,v) ≤ 4kB
    Lemma 2: If ED(x,y) ≥ 13(kn)^(2/3) then HD(u,v) ≥ 8kB

Gap algorithm for arbitrary strings

set map(string x)
    set T := empty
    for i = 1 to n-B+1
        T := T ∪ {(x[i..i+B-1], i div D)}
    return T

int algorithm_1(string x, string y, int k)
    int B := n^(2/3)/(2k^(1/3))
    int D := n/B
    set Tx := map(x)
    set Ty := map(y)
    (u,v) := characteristic_vectors(Tx, Ty)
    if HD(u,v) ≤ 4kB then return (ED(x,y) ≤ k)
    if HD(u,v) ≥ 8kB then return (ED(x,y) ≥ 13(kn)^(2/3))
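
For concreteness, a rough Python sketch of the whole gap test. It computes HD(u,v) exactly instead of via constant-size sketches, and rounds B and D to integers; both simplifications are illustrative assumptions, not the paper's procedure.

    from collections import Counter

    def encode(x: str, B: int, D: int) -> Counter:
        T = Counter()
        for i in range(1, len(x) - B + 2):
            T[(x[i - 1:i - 1 + B], i // D)] += 1
        return T

    def gap_algorithm_1(x: str, y: str, k: int) -> str:
        n = len(x)
        B = max(1, round(n ** (2 / 3) / (2 * k ** (1 / 3))))
        D = max(1, n // B)
        Tx, Ty = encode(x, B, D), encode(y, B, D)
        hd = sum((Tx - Ty).values()) + sum((Ty - Tx).values())   # HD(u, v)
        if hd <= 4 * k * B:
            return "ED(x,y) <= k"
        if hd >= 8 * k * B:
            return "ED(x,y) >= 13*(kn)^(2/3)"
        return "input outside the promised gap"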

Gap algorithm for arbitrary strings

Remarks
    In the above algorithm Alice and Bob run the procedure map and the referee runs the procedure HD, which is the sketching algorithm for the Hamming Distance
    Notice that Alice and Bob do not use their random coins. Only the referee uses randomness and has bounded error probability
    The algorithm solves the k vs Ω((kn)^(2/3)) gap problem for all k ≤ n^(1/2), in quasi-linear time, using sketches of size O(1)
    The idea of independently encoding the position of each substring, though simple, has problems. Consider:
        x = paramana, y = mana, with B = 2, D = 3

Gap algorithm for nonrepetitive strings

We saw that encoding the substrings' positions independently, by choosing an appropriate integer D, can lead to problems: in fact we fail to identify many matches even in the presence of just one edit operation (consider the case of the strings x and 1x).
We overcome this handicap by resorting to a method where the encodings are correlated.

Gap algorithm for nonrepetitive strings

Idea: we scan the input from left to right, trying to find anchor substrings, i.e. identical substrings that occur in x and y at very nearby positions. All we need to change is the encoding of the position: we now map each substring to the region between successive anchors.
Example:
    x = ma ro ... as, y = *ma****ro***as**
    the substrings starting in a region between successive anchors have the same encoding

Gap algorithm for nonrepetitive strings

Why does this idea work?
    Remember that we are dealing with a gap algorithm. Hence, if the input strings are very similar, we expect that a sufficiently short substring, chosen from a sufficiently long window, is unlikely to contain edit operations and thus has to be matched with a corresponding substring of y in the same window.
And how do Alice and Bob choose identical substrings?
    Alice and Bob cannot communicate with each other, so they pick the anchors randomly. In fact they use some (shared) random permutations (remember the shared coins).

Gap algorithm for nonrepetitive strings

Isn't this too good to be true?
Yes, it is: in order for the random permutations to ensure that anchors are detected with high probability, we must demand that the input strings are non-repetitive.
A string is called (t,l)-non-repetitive if for any window of size l, the l substrings of length t which start inside the window are distinct.

Gap algorithm for nonrepetitive strings

set map(string x, randomness r, int t)   // Alice & Bob now use their coins r
    set T := empty
    int c := 1
    int i := 1
    table of anchors a
    table of regions reg
    // slide a window of length W along x, starting at c
    while c + 2W - 1 ≤ n
        for all length-t substrings starting in the interval [c+W .. c+2W-1]
            produce a Karp-Rabin fingerprint using r
        pick a random permutation of the fingerprints using r
        a[i] := the substring whose fingerprint is minimal under the permutation
        i := i + 1
        c := the first position after the anchor
    for j = 1 to i-1
        reg[j] := the substring starting after the last char of a[j-1] up to the last char of a[j]   // for j = 1: from the start of x
        T := T ∪ {(reg[j], j)}
    return T
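
A loose Python sketch of the anchor idea (not the algorithm from the paper): a shared, seeded hash stands in for the Karp-Rabin fingerprints plus random permutation, so both parties pick the same anchors; within each window the minimizing substring becomes the anchor, and each region between successive anchors gets one encoding.

    import hashlib

    def shared_rank(s: str, seed: str) -> int:
        # stand-in for "Karp-Rabin fingerprint + shared random permutation":
        # a deterministic rank both Alice and Bob can evaluate from shared randomness
        return int(hashlib.sha256((seed + s).encode()).hexdigest(), 16)

    def anchor_map(x: str, t: int, W: int, seed: str) -> set:
        T, region_start, region, c = set(), 0, 0, 0
        while c + 2 * W <= len(x):
            # candidate anchors: all length-t substrings starting in [c+W, c+2W-1]
            starts = range(c + W, c + 2 * W)
            best = min(starts, key=lambda i: shared_rank(x[i:i + t], seed))
            # everything from the previous cut up to the end of the anchor is one region
            T.add((x[region_start:best + t], region))
            region_start = best + t
            region += 1
            c = best + t          # continue scanning right after the anchor
        T.add((x[region_start:], region))   # trailing region
        return T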

Gap algorithm for nonrepetitive strings

int algorithm_2(string x, string y, int k, int t)
    // t is the non-repetition parameter
    randomness r
    set Tx := map(x, r, t)
    set Ty := map(y, r, t)
    (u,v) := characteristic_vectors(Tx, Ty)
    if HD(u,v) ≤ 3k then return (ED(x,y) ≤ k)
    if HD(u,v) ≥ 6k then return (ED(x,y) ≥ Ω((tk)²))

Gap algorithm for nonrepetitive strings

The algorithm's correctness follows from:
    Lemma 1: If ED(x,y) ≤ k then Pr[HD(u,v) ≤ 3k] ≥ 5/6
    Lemma 2: If HD(u,v) ≤ 6k then ED(x,y) ≤ O((tk)²)

For any 1 ≤ t < n, the algorithm solves the k vs Ω((tk)²) gap problem for all k ≤ (n/t)^(1/2), for all (t,tk)-non-repetitive strings, in polynomial time, using sketches of size O(1).

Approximation algorithm

Key idea: use of graphs
Given two strings x, y the edit graph GE is a representation of ED(x,y) by means of a directed graph.
    The vertices correspond to the edit distances of x[1..i] and y[1..j] for all i, j ≤ n.
    An edge between vertices corresponds to a single edit operation transforming one substring into the other.

Approximation algorithm

We define the graph G(B) as a (lossy) compression of GE: each vertex corresponds to a pair (i,s), where i = jB for j = 0..n/B and s = -k..k. The bigger the parameter B is, the lossier the compression.
Each vertex is closely related to the edit distance of x[1..i] and y[1..i+s] (s denotes the amount by which we shift y with respect to x).
We have two types of edges:
    a. a-type edges from (i,s) to (i,s') where |s-s'| = 1
    b. b-type edges from (i-B,s) to (i,s) with weight w(e)
w(a-type edges) = 1
w(b-type edges) depends on the approximation factor c

Approximation algorithm

In GE the weight of the shortest source-sink path corresponds to the optimal alignment of x, y.
Theorem 1: Given two strings x, y and their corresponding graph G(B), let the shortest path P from (0,0) to (n,0) have weight T. Then ED(x,y) ≤ T, and T ≤ (2c+2)k if ED(x,y) ≤ k, where c is the approximation factor affiliated with the b-type edge weights.

Approximation algorithm

Only problem: finding the shortest path.
Dijkstra is too slow (i.e. not linear).
If we could figure out the weights of all b-type edges for a given i simultaneously, perhaps we could solve the problem.
But computing the b-type edges is the same as finding the approximate edit distances between x[i+1..i+B] and every substring of y[i+1-k..i+B+k] of length B.

Approximation algorithm

c(p,t)-edit pattern matching problem: given a pattern string P of length p and a text string T of length t ≥ p, produce numbers d1, d2, …, d(t-p+1) such that
    di/c ≤ ED(P, T[i..i+p-1]) ≤ di for all i
Theorem 2: Suppose there is an algorithm that can solve the c(p,t)-edit pattern matching problem in time TIME(p,t). Let x, y be two strings and G(B) their corresponding graph. Set p = B, t = B+2k and c = c(p,t). Then the shortest path in G(B) can be used to solve the k vs (2c+2)k edit distance gap problem in time O((k + TIME(p,t))·n/B)
(i.e. the shortest path can be efficiently computed)
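
For concreteness, a naive exact Python solver for the edit pattern matching problem (the c = 1 case); it could serve as the epm_algorithm subroutine of the next slide, but it runs in roughly O(p²·(t-p+1)) time, far more than what Theorem 2 assumes.

    def edit_distance(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        return prev[-1]

    def naive_epm(P: str, T: str) -> list:
        # d[i] = ED(P, T[i..i+p-1]) exactly, i.e. the c = 1 case of the
        # c(p,t)-edit pattern matching problem
        p = len(P)
        return [edit_distance(P, T[i:i + p]) for i in range(len(T) - p + 1)]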

Approximation algorithm

graph make_graph(string x, string y, int k, int B)
    // bigger B → faster algorithm, bigger gap
    vertices V := empty
    edges EA, EB := empty
    for j = 0 to n div B          // vertices
        i := j*B
        for s = -k to k
            V := V ∪ {(i,s)}
    for j = 0 to n div B          // a-type edges: (source, sink, weight)
        i := j*B
        for s = -k to k
            EA := EA ∪ {((i,s),(i,s+1),1)}
            EA := EA ∪ {((i,s),(i,s-1),1)}
    for j = 1 to n div B          // b-type edges
        i := j*B
        for s = -k to k
            EB := EB ∪ {((i-B,s),(i,s),w)}
    return (V, EA ∪ EB)

Approximation algorithm

(d1, d2, …, d(t-p+1)) epm_algorithm(string P, string T)
    // length(T) ≥ length(P)
    // returns approximate ED(P,S) for all S: length(P)-substrings of T

int fix_weights(string x, string y, int B, int k, graph G, int n)   // why n as input?
    int d[-k..k]                   // table of b-type edge weights
    d := epm_algorithm(x[n-B+1..n], y[n-B+1-k..n+k])
    if n-B = 0 return d
    for s = -k to k
        EB := EB ∪ {((n-B,s),(n,s),d[s])}
    G := update(G, EB)
    return compose(d, fix_weights(x,y,B,k,G,n-B))   // what is compose?

Let T(i,s) denote the shortest path from (0,0) to (i,s)
Notice that fix_weights is recursive: in each call T(i,s) is simultaneously (fast!) computed for all s. The idea is to compute the shortest path T(n,0) by recursively computing the shortest paths T(i,s). T(i,s) uses the result T(i-B,s).
No more than n/B recursive calls of fix_weights are needed

Approximation algorithm
int shortest_path(string x, string y, int k, int B)
    graph G := make_graph(x,y,k,B)
    int T := fix_weights(x,y,B,k,G,n)
    return T

Correctness of the gap algorithm (follows from Theorem 1):
    1. If ED ≤ k then T ≤ (2c+2)k
    2. If ED ≥ (2c+2)k then T ≥ ED ≥ (2c+2)k
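
A Python sketch of the layered shortest-path computation over G(B): vertices (i,s) are processed block by block, b-type edge weights come from an EPM oracle (for example the naive_epm above), and within each layer the a-type edges (weight 1 between adjacent shifts) are relaxed by two sweeps. The relaxation scheme, the function names and the assumption that B divides n are illustrative choices, not the paper's exact procedure.

    def shortest_path_GB(x: str, y: str, k: int, B: int, epm) -> float:
        n = len(x)                              # assumed to be a multiple of B
        shifts = range(-k, k + 1)
        INF = float("inf")
        # layer i = 0: only a-type edges out of the source (0, 0)
        T = {s: abs(s) for s in shifts}
        for i in range(B, n + 1, B):
            block = x[i - B:i]
            # window of y containing every candidate length-B substring for this block
            lo, hi = max(0, i - B - k), min(len(y), i + k)
            d = epm(block, y[lo:hi])            # approximate ED of block vs. each window start
            new = {}
            for s in shifts:
                start = i - B + s               # 0-based start of y[i-B+1+s .. i+s]
                idx = start - lo
                w = d[idx] if 0 <= idx < len(d) else INF   # b-type edge weight
                new[s] = T[s] + w
            # relax a-type edges (weight 1 between neighbouring shifts) with two sweeps
            for s in range(-k + 1, k + 1):
                new[s] = min(new[s], new[s - 1] + 1)
            for s in range(k - 1, -k - 1, -1):
                new[s] = min(new[s], new[s + 1] + 1)
            T = new
        return T[0]      # weight of the shortest path from (0,0) to (n,0)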

Approximation algorithm

Some results which use algorithms for the edit pattern matching problem:
    1. quasi-linear time algorithm for k vs O(k²)
    2. quasi-linear time algorithm for k vs O(k^(7/4))
    3. quasi-linear time algorithm for k vs O(k^(3/2)) for (k, O(k^(1/2)))-non-repetitive strings
    4. 2 & 3 imply approximation algorithms with factors n^(3/7) and n^(1/3) respectively

Embedding edit distance into l1

Question: can {0,1}^d endowed with edit distance be embedded into l1 (R^d endowed with the L1 metric) with low distortion?
Known results up to 2005:
    1. Edit distance cannot be embedded into l1 with distortion less than 3/2
    2. Edit distance can be trivially embedded into l1 with distortion d
    3. Edit distance can be embedded into l1 with distortion 2^(O(sqrt(log d · log log d))) (Ostrovsky, Rabani, STOC 2005)
Why do we care?
    1. Because embeddings allow us to use efficient algorithms of one metric space on instances of another metric space
    2. Because embeddings help us learn more about the structure of a metric space

Embedding edit distance into l1

Let x be a string. Then for any integer s we denote by shifts(x,s) the set
    {x[1..|x|-s+1], x[2..|x|-s+2], …, x[s..|x|]}
which consists of the s substrings of x (each of length |x|-s+1) obtained by shifting a window across x.
Theorem:
There exists a universal constant c > 0 such that for every integer d > 0 there exists an embedding f: ({0,1}^d, ED) → l1 with distortion at most 2^(c·sqrt(log d · log log d)).

Embedding edit distance into l1

Key idea of the embedding:
For sufficiently small d the distortion is indeed that small. So it suffices to break the string down into 2^(sqrt(log d · log log d)) substrings of approximately the same length, the blocks, and to recursively embed into l1 some metric spaces of lower dimension. The spaces embedded must be chosen in such a way that the concatenation of their scaled embeddings results in the embedding of the original string.

Embedding edit distance into l1

Choosing the metric in the recursive step:
    Let x1, x2, …, xn be the blocks.
    For all i consider the sets shifts(xi, s), where s ranges over the non-negative powers of log d that are below the block length.
    Now define the distance between the sets shifts(xi, s) and shifts(xj, s) to be the minimum cost perfect matching, where the cost of an edge between two elements is their edit distance.
    This defines a metric that satisfies the above requirements. Ideally we want to embed it into l1.
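
A brute-force Python sketch of this matching distance: the cost of the cheapest perfect matching between two equal-size sets of strings, with edit distance as the edge cost (exponential in the set size, purely illustrative).

    from itertools import permutations

    def edit_distance(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        return prev[-1]

    def matching_distance(A: list, B: list) -> int:
        # minimum-cost perfect matching between A and B, edge cost = edit distance
        assert len(A) == len(B)
        return min(sum(edit_distance(a, B[i]) for a, i in zip(A, perm))
                   for perm in permutations(range(len(B))))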

Embedding edit distance into l1

However, a good embedding for the metric on the sets shifts(xi, s) is too strong an inductive hypothesis.
Therefore we inductively embed the strings in shifts(xi, s) into l1 and redefine the edge costs for the minimum cost perfect matching to be the l1 distances of the embedded strings.
This inductive embedding is not one of low distortion, but it also does not need to be one. It just needs to satisfy a weaker property (omitted here) which guarantees the upper bound on the total distortion.

The End
