ADS NTUA
4 2007
Overview
What
What is that?
Suppose we have two strings x and y,
e.g. x = kitten
y = sitting
and we want to transform x into y.
We use edit operations:
1. insertions
2. deletions
3. substitutions
What is that?
A closer look:
kitten
sitting
1st step: kitten → sitten (substitution)
2nd step: sitten → sittin (substitution)
3rd step: sittin → sitting (insertion)
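The three steps above can be replayed mechanically. The following is a minimal sketch; the helper names `substitute`, `insert` and `delete` are illustrative and do not come from the slides, and indices are 0-based.

```python
def substitute(s, i, ch):
    """Replace the character at 0-based index i with ch."""
    return s[:i] + ch + s[i + 1:]

def insert(s, i, ch):
    """Insert ch before 0-based index i (i == len(s) appends)."""
    return s[:i] + ch + s[i:]

def delete(s, i):
    """Remove the character at 0-based index i."""
    return s[:i] + s[i + 1:]

step1 = substitute("kitten", 0, "s")  # kitten -> sitten
step2 = substitute(step1, 4, "i")     # sitten -> sittin
step3 = insert(step2, 6, "g")         # sittin -> sitting
print(step1, step2, step3)            # sitten sittin sitting
```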
What is that?
Can we do better?
The answer here is no (obviously): three operations are optimal
What about:
x = darladidirladada
y = marmelladara
Tough
Why do we care?
A lot of applications depend on the similarity of two
strings
Computational Biology:
ATGCATACGATCGATT
TGCAATGGCTTAGCTA
Animal species from the same
family are bound to have
more similar DNA
What about evolutionary biology?
Definitions
We
Definitions
Length
2.
Alignments of x = savvato with y = eviva:
savvato    savvato     savvato
eviva--    -eviva--    e-viva--
ed(4..5) = ED(x[4..5], y[4+sh(3)..5+sh(5)])
Example: x[4..5] = va and y[4+sh(3)..5+sh(5)] = va, so ed(4..5) = ED(va, va) = 0
[Figure: the substrings va and iva marked on the alignments]
correspond to different optimal alignments
the length of the optimal alignment is 5
the algorithm runs in O(n^2) time and it can be improved so as to use O(n) space
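The quadratic-time, linear-space computation mentioned above can be sketched as follows. This is the standard textbook dynamic program with unit costs, keeping only two rows of the table; the function name `edit_distance` is an illustrative choice, and this variant returns only the distance, not an alignment.

```python
def edit_distance(x, y):
    """Unit-cost edit distance in O(len(x) * len(y)) time and O(min) space."""
    if len(y) > len(x):
        x, y = y, x  # keep the shorter string along the row
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("savvato", "eviva"))   # 5
```

Only the previous row is needed to fill the current one, which is where the O(n) space bound comes from.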
2.
Our model for the first two algorithms is the sketching model:
a two-party public-coin simultaneous-messages communication
complexity protocol
parties: Alice, Bob and the Referee
goal: to jointly compute f: A × B → C, when Alice has the input
a ∈ A and Bob has the input b ∈ B
Alice uses her input a and the shared random coins to compute a
sketch sA(a) and then sends it to the referee. Bob does the
same and sends a sketch sB(b)
the referee uses the sketches and the shared coins to
compute the value of the function f(a,b), with a small
(constant) error probability
main measure of complexity: the size of the sketches
it is usually desirable that Alice and Bob are efficient too
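To make the roles in this model concrete, here is a toy sketch of the protocol for a much simpler function than edit distance: equality of two bit strings of the same length. It is not one of the algorithms in these slides. The names `sketch` and `referee`, the seed standing in for the public coins, and the sketch size of 20 bits are all assumptions made for illustration.

```python
import random

def sketch(bits, shared_seed, size=20):
    """Alice's or Bob's side: `size` random parities of the input bits.

    The shared seed plays the role of the public coins, so both parties
    draw exactly the same random masks.
    """
    rng = random.Random(shared_seed)
    out = []
    for _ in range(size):
        mask = [rng.getrandbits(1) for _ in bits]
        out.append(sum(b & m for b, m in zip(bits, mask)) % 2)
    return out

def referee(sketch_a, sketch_b):
    """Referee's side: declare the inputs equal iff the sketches agree."""
    return sketch_a == sketch_b

a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 0, 1, 1, 0, 1, 1, 0]
seed = 2007  # public coins, known to Alice, Bob and the referee
print(referee(sketch(a, seed), sketch(a, seed)))  # equal inputs always agree
print(referee(sketch(a, seed), sketch(b, seed)))  # unequal inputs agree with probability 2**-20
```

The sketch size does not depend on the input length, which is the property the model measures.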
2.
3.
Lemma 1:
If ED(x,y) ≤ k then Pr[HD(u,v) ≤ 3k] ≥ 5/6
Lemma 2:
If HD(u,v) ≤ 6kB then ED(x,y) ≤ O((tk)^2)
For
Approximation algorithm
We define the graph G(B) as a (lossy) compression of
G_E: each vertex corresponds to a pair (i,s), where i = jB
for j = 0..n/B and s = -k..k. The larger the parameter B,
the lossier the compression.
Each vertex is closely related to the edit distance of
x[1..i] and y[1..i+s] (s denotes the amount by which we
shift y with respect to x).
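We have two types of edges:
a. between (i,s) and (i,s+1), and between (i,s) and (i,s-1): the shift changes by one
b. between (i-B,s) and (i,s): we advance by one block of length B at the same shift
w(a-type edges) = 1,
w(b-type edges) depends on the approximation factor c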
Approximation algorithm
In
Approximation algorithm
Only
Approximation algorithm
graph make_graph(string x, string y, int k, int B)
    // bigger B → faster algorithm, bigger gap
    n = length(x)
    vertices V = empty
    edges EA = empty, EB = empty
    for j = 0 to n div B
        // vertices
        i = j*B
        for s = -k to k
            V = V ∪ {(i,s)}
    for j = 0 to n div B
        // a-type edges (weight 1), only between existing vertices
        i = j*B
        for s = -k to k
            if s < k then EA = EA ∪ {((i,s),(i,s+1),1)}
            if s > -k then EA = EA ∪ {((i,s),(i,s-1),1)}
    for j = 1 to n div B
        // b-type edges (weight w depends on the approximation factor c)
        i = j*B
        for s = -k to k
            EB = EB ∪ {((i-B,s),(i,s),w)}
    return (V, EA ∪ EB)
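The construction above can be turned into a small runnable sketch. Several things here are assumptions and not part of the slides: the function takes the length n directly instead of the strings, edges are stored in a dictionary keyed by (source, target), a-type edges are added only when both endpoints exist, and the b-type weight `w` is left as a placeholder because the slides only say that it depends on c.

```python
def make_graph(n, k, B, w=None):
    """Build G(B) for strings of length n.

    Returns (vertices, edges), where edges maps (u, v) to a weight.
    The b-type weight w is a placeholder to be filled in later.
    """
    blocks = range(0, n // B + 1)
    shifts = range(-k, k + 1)
    vertices = [(j * B, s) for j in blocks for s in shifts]
    edges = {}
    for j in blocks:
        i = j * B
        for s in shifts:
            if s < k:   # a-type edge: shift grows by one
                edges[((i, s), (i, s + 1))] = 1
            if s > -k:  # a-type edge: shift shrinks by one
                edges[((i, s), (i, s - 1))] = 1
            if j >= 1:  # b-type edge: advance one block at the same shift
                edges[((i - B, s), (i, s))] = w
    return vertices, edges

V, E = make_graph(n=12, k=2, B=4)
print(len(V), len(E))  # 20 47
```

With n = 12, k = 2 and B = 4 there are 4 block positions and 5 shifts, so 20 vertices, 32 a-type edges and 15 b-type edges.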
Approximation algorithm
(d1, d2, ..., d(t-p+1)) epm_algorithm(string P, string T)
    // t = length(T) ≥ p = length(P)
    // returns approximate ED(P,S) for every substring S of T with length(S) = length(P)
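For comparison, the exact values that epm_algorithm approximates can be computed by brute force. The sketch below is a reference baseline only, not the approximate algorithm from the slides; the names `edit_distance` and `exact_substring_distances` are illustrative, and the running time is far worse than what an approximation algorithm would aim for.

```python
def edit_distance(x, y):
    """Unit-cost edit distance via the two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def exact_substring_distances(P, T):
    """Exact ED(P, S) for every length-len(P) substring S of T, left to right."""
    p = len(P)
    return [edit_distance(P, T[i:i + p]) for i in range(len(T) - p + 1)]

print(exact_substring_distances("ab", "abba"))  # [0, 1, 2]
```

The output list has length(T) - length(P) + 1 entries, matching the indices d1 up to d(t-p+1) in the signature above.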
Approximation algorithm
int shortest_path(string x, string y, int k, int B)
    n = length(x)
    graph G = make_graph(x,y,k,B)
    int T = fix_weights(x,y,k,B,G,n)
    return T
If ED ≤ k then T ≤ (2c+2)k
If ED > (2c+2)k then T ≥ ED > (2c+2)k
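Read as a gap test, the two guarantees suggest comparing T against the threshold (2c+2)k. The sketch below assumes the guarantees read "ED ≤ k implies T ≤ (2c+2)k" and "ED > (2c+2)k implies T > (2c+2)k", and that the input satisfies the promise that one of the two cases holds. The function name `decide_gap` is hypothetical.

```python
def decide_gap(T, k, c):
    """Return True for the 'ED <= k' side of the gap, False for 'ED > (2c+2)k'.

    Only meaningful under the promise that one of the two cases holds.
    """
    threshold = (2 * c + 2) * k
    return T <= threshold

print(decide_gap(T=10, k=2, c=1.5))  # True: 10 <= (2*1.5 + 2) * 2 = 10
print(decide_gap(T=11, k=2, c=1.5))  # False: 11 > 10
```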
Approximation algorithm
4.
The End