
Applications of Legendre-Fenchel transformation to computer vision problems

Ankur Handa, Richard A. Newcombe, Adrien Angeli, Andrew J. Davison
Department of Computing,
Huxley Building,
180 Queen's Gate,
SW7 2AZ, London
{ahanda,r.newcombe08,a.angeli,ajd}@imperial.ac.uk,
WWW home page: http://www.doc.ic.ac.uk/ahanda

Abstract. We aim to provide a brief background on the Legendre-Fenchel transformation, whose applications have become increasingly popular in computer vision. A general motivation is followed by standard examples. We then review its use in solving various standard computer vision problems, e.g. image denoising, optical flow and image deconvolution.

Keywords: Legendre-Fenchel transformation, Convex functions, Optimisation

1 Legendre-Fenchel transformation

The Legendre-Fenchel (LF) transformation of a continuous but not necessarily differentiable function f : \mathbb{R} \rightarrow \mathbb{R} \cup \{\infty\} is defined as

f^*(p) = \sup_{x \in \mathbb{R}} \{ px - f(x) \}     (1)

Geometrically, this means we are interested in finding a point x on the function f(x) such that the line of slope p passing through (x, f(x)) has maximum intercept on the y-axis. This happens at the point on the curve whose tangent has slope p, i.e.

p = f'(x)     (2)

The vector definition can be written as

f^*(\mathbf{p}) = \sup_{\mathbf{x} \in \mathbb{R}^n} \{ \mathbf{x}^T \mathbf{p} - f(\mathbf{x}) \}     (3)

2 Motivation: Why do we need it?

2.1 Duality

Duality is the principle of looking at a function or a problem from two different perspectives, namely the primal and the dual form. When working in optimisation theory, a deep understanding of a given function is required. For instance, one would like to know whether a given function is linear, or whether it is well behaved in a given domain, to name a few. Transformations are one way of mapping the function to another space where better and easier ways of understanding the function emerge. Take for instance the Fourier transform and the Laplace transform. The Legendre-Fenchel transform is one such transform: it maps the (x, f(x)) space to the space of slopes and conjugates, that is (p, f*(p)). However, while the Fourier transform consists of an integration with a kernel, the Legendre-Fenchel transform uses the supremum as the transformation procedure. Under the assumption that the transformation is reversible, one form is the dual of the other. This is expressed as

(x, f(x)) \longleftrightarrow (p, f^*(p))     (4)

Here p is the slope and f*(p) is called the convex conjugate of the function f(x). A conjugate allows one to build a dual problem which may be easier to solve than the primal problem. The Legendre-Fenchel conjugate is always convex.

Fig. 1. Image Courtesy: Wikipedia.org

3 How to use the Duality

There are two ways of viewing a curve or a surface: either as a locus of points or as an envelope of tangents [1]. Now, let us imagine that we want to use tangents to represent a function f(x) instead of points. A tangent is parameterised by two variables, namely the slope p and the intercept it cuts on the negative y-axis (using the negative y-axis for the intercept is purely a matter of choice), which is denoted by c. We provide two ways to solve for the intercept and the slope, and arrive at the same result.


3.1 Motivation I

Let us denote the equation of a line having slope p and intercept c by

y = px - c     (5)

Now imagine this line touches the function f(x) at x; then we can equate the two and write

px - c = f(x)     (6)

Also suppose that the function f(x) is convex and is x^2, a parabola (as an example). Then we can solve the quadratic equation

x^2 - px + c = 0     (7)

This quadratic equation has two roots

x = \frac{p \pm \sqrt{p^2 - 4c}}{2}     (8)

But we know that this line should touch this convex function only once (if the function were non-convex, the line could touch the function at two points), because we want to use this line as a tangent to represent the function in the dual space. Therefore, the roots of this equation should be equal, which is to say that the discriminant of the above quadratic equation should be zero, i.e. p^2 - 4c = 0, which gives us

f^*(p) = c = \frac{p^2}{4}     (9)

This is nothing but our Legendre-Fenchel transformed convex conjugate, and

p = 2x     (10)

is the slope. That is

(x, f(x)) \longleftrightarrow (p, f^*(p))     (11)

(x, x^2) \longleftrightarrow \left( p, \frac{p^2}{4} \right)     (12)

3.2 Motivation II

We know that for

y = px - c     (13)

to be a tangent to f(x) at x, it must be that

p = f'(x)     (14)

Again, if the function is

f(x) = x^2     (15)

we can take the first-order derivative to obtain the slope, which is

p = f'(x) = 2x     (16)

x = f'^{-1}(p) = \frac{p}{2}     (17)

Replacing y we get

y = f(x)     (18)

y = f(f'^{-1}(p))     (19)

and substituting x we get

y = f\left( \frac{p}{2} \right) = \frac{p^2}{4}     (20)

and therefore we can solve for c as

c = p f'^{-1}(p) - f(f'^{-1}(p))     (21)

c = p \cdot \frac{p}{2} - \frac{p^2}{4}     (22)

c = \frac{p^2}{4}     (23)

4 How about functions which are not differentiable everywhere?

Let us now turn our attention to the cases where a function may not be differentiable at a given point x*, where it has a value f(x*). In this case we can rewrite our equation as

p x^* - c = f(x^*)     (24)

which means

c = p x^* - f(x^*)     (25)

is only a linear function of p. So the duality of a non-differentiable point at x* induces a linear function of slope in the dual space, and p can take any value in [f'(x^*_-), f'(x^*_+)]. This is because the slope of the tangent at a point in the very small vicinity of x*, to its left, denoted by x^*_-, is f'(x^*_-), while to its right, denoted by x^*_+, it is f'(x^*_+). At the point of non-differentiability of the function f(x) we can draw many tangents, with slopes ranging over [f'(x^*_-), f'(x^*_+)]. This interval is defined as the subdifferential, explained briefly in the next section.

Fig. 2. At the point x* we can draw as many tangents as we want, with slopes ranging over [-1, 1]; they form the subgradients of the curve at x*.

Therefore, the point of non-differentiability in the primal space can be readily explained in the dual space with a continuously varying slope in the range [f'(x^*_-), f'(x^*_+)] defining a linear function c(p) of this slope. Therefore, even if the primal space is non-differentiable, the dual space is differentiable.
4.1 Subdifferential and Subgradient

In calculus we are, most of the time, interested in minimising or maximising a function f. The point x̂ which minimises the function is referred to as the critical point, the minimiser of the function, i.e. \nabla f(x̂) = 0. Convex functions belong to the class of functions which have a global minimiser. However, there are certain convex functions which are not differentiable everywhere, so one cannot compute the gradient. A notorious example is f(x) = |x|, a convex function which is not differentiable at x = 0. Instead, one defines the subdifferential. The subdifferential \partial f(x) is formalised as

\partial f(x) = \{ y \in \mathbb{R}^n : \langle y, x' - x \rangle \leq f(x') - f(x), \ \forall x' \in \mathbb{R}^n \}     (26)

Fig. 3. An illustration of duality where lines are mapped to points while points are mapped to lines in the dual space.

In simpler terms, the subdifferential is the set of slopes of lines through (x, f(x)) that always touch or remain below the graph of the function. For the same notorious |x| function, the differential is not defined at x = 0 because x/|x| is not defined at x = 0, while the subdifferential at x = 0 is the closed interval [-1, 1], because we can always draw a line with a slope in [-1, 1] which stays below the function. The subdifferential at any point x < 0 is the singleton set {-1}, while the subdifferential at any point x > 0 is the singleton {1}. Members of the subdifferential are called subgradients.
4.2 Proof: the Legendre-Fenchel conjugate is always convex

f^*(z) = \sup_{x \in \mathbb{R}^n} \{ x^T z - f(x) \}     (27)

In order to prove that the function is convex, we need to prove that for a given 0 \leq \lambda \leq 1 the function obeys Jensen's inequality, i.e.

f^*(\lambda z_1 + (1-\lambda) z_2) \leq \lambda f^*(z_1) + (1-\lambda) f^*(z_2)     (28)

Let us expand the left hand side of the inequality to be proved:

f^*(\lambda z_1 + (1-\lambda) z_2) = \sup_{x \in \mathbb{R}^n} \{ x^T (\lambda z_1 + (1-\lambda) z_2) - f(x) \}     (29)

We can rewrite f(x) as

f(x) = \lambda f(x) + (1-\lambda) f(x)     (30)

and replacing it in the equation above yields a new expression, which is

f^*(\lambda z_1 + (1-\lambda) z_2) = \sup_{x \in \mathbb{R}^n} \{ \lambda (x^T z_1 - f(x)) + (1-\lambda)(x^T z_2 - f(x)) \}     (31)

Writing p_1 = x^T z_1 - f(x) and p_2 = x^T z_2 - f(x) for brevity, we know that

\sup_{x \in \mathbb{R}^n} \{ \lambda p_1 + (1-\lambda) p_2 \} \leq \sup_{x \in \mathbb{R}^n} \{ \lambda p_1 \} + \sup_{x \in \mathbb{R}^n} \{ (1-\lambda) p_2 \}     (32)

This is the property of the supremum which states that

\sup \{ x + y \} \leq \sup \{ x \} + \sup \{ y \}     (33)

Therefore, we can substitute

\sup_{x \in \mathbb{R}^n} \{ \lambda (x^T z_1 - f(x)) \} = \lambda f^*(z_1)     (34)

and

\sup_{x \in \mathbb{R}^n} \{ (1-\lambda)(x^T z_2 - f(x)) \} = (1-\lambda) f^*(z_2)     (35)

We arrive at our desired result, which is

f^*(\lambda z_1 + (1-\lambda) z_2) \leq \lambda f^*(z_1) + (1-\lambda) f^*(z_2)     (36)

Therefore, f^*(z) is always convex, irrespective of whether the function f(x) is convex or not.

5 Examples

5.1 Example 1: Norm function (non-differentiable at zero)

f(y) = \|y\|     (37)

f^*(z) = \sup_{y \in \mathbb{R}^n} \{ y^T z - \|y\| \}     (38)

By the Cauchy-Schwarz inequality we can also write

\|y\| = \max_{\|b\| \leq 1} y^T b     (39)

Now, we know that the maximum value of y^T b is \|y\|, so it is trivial to see that

\max_{\|b\| \leq 1} \{ y^T b - \|y\| \} = 0 \quad \forall y \in \mathbb{R}^n     (40)

Therefore, we can write the conjugate as

f^*(z) = \begin{cases} 0 & \text{if } \|z\| \leq 1 \\ \infty & \text{otherwise} \end{cases}

The fact that the conjugate is \infty when \|z\| > 1 can be easily explained using Figure 4.


Fig. 4. The image explains the process involved when fitting a tangent with slope p = 2 (i.e. z = 2). Any line with slope outside [-1, 1] has to intersect the y-axis at -\infty to be tangent to the function.

5.2 Example 2: Parabola

f(y) = y^2     (41)

The LF transform of the function f(y) is defined as

f^*(z) = \sup_{y \in \mathbb{R}} \{ yz - f(y) \}     (42)

The supremum is attained when

\partial_y (yz - y^2) = 0     (43)

z - 2y = 0     (44)

z = 2y     (45)

Substituting this value of y in the above function f^*(z) we get

f^*(z) = z \left( \frac{z}{2} \right) - \left( \frac{z}{2} \right)^2 = \frac{1}{4} z^2     (46)
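As a concrete illustration (not part of the original derivation), the following small Python sketch approximates the supremum in Eqn. 1 on a finite grid and checks Example 2 numerically; the helper name lf_conjugate and the grid sizes are our own choices.

    import numpy as np

    # Minimal numerical check of the Legendre-Fenchel transform (Eqn. 1):
    # f*(z) = sup_y { y*z - f(y) }, approximated on a finite grid of y values.
    def lf_conjugate(f, ys, zs):
        # For each slope z, take the maximum of y*z - f(y) over the sampled y.
        return np.array([np.max(ys * z - f(ys)) for z in zs])

    ys = np.linspace(-10.0, 10.0, 20001)   # sampling grid for the primal variable
    zs = np.linspace(-3.0, 3.0, 7)         # slopes at which to evaluate the conjugate

    f = lambda y: y ** 2                   # the parabola of Example 2
    numeric = lf_conjugate(f, ys, zs)
    analytic = zs ** 2 / 4.0               # expected conjugate z^2 / 4 (Eqn. 46)

    print(np.allclose(numeric, analytic, atol=1e-3))   # True up to grid resolution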

5.3 Example 3: A general quadratic n-D curve

f(y) = \frac{1}{2} y^T A y     (47)

Let us assume that A is symmetric. The LF transform of the function f(y) for an n-dimensional vector y is defined as

f^*(z) = \sup_{y \in \mathbb{R}^n} \{ y^T z - f(y) \}     (48)

Fig. 5. The plot shows the parabola y^2 and its conjugate, which is also a parabola, \frac{1}{4} z^2.

The supremum is attained when

\nabla_y \left( y^T z - \frac{1}{2} y^T A y \right) = 0     (49)

z - \frac{1}{2} (A + A^T) y = 0     (50)

z - A y = 0 \qquad [\because A \text{ is symmetric}]     (51)

y = A^{-1} z     (52)

Substituting the value of y in the above function f^*(z) we get

f^*(z) = (A^{-1} z)^T z - \frac{1}{2} (A^{-1} z)^T A (A^{-1} z)     (53)

= z^T A^{-1} z - \frac{1}{2} z^T A^{-1} A A^{-1} z     (54)

= z^T A^{-1} z - \frac{1}{2} z^T A^{-1} z     (55)

= \frac{1}{2} z^T A^{-1} z     (56)

5.4 Example 4: l_p Norms

f(y) = \frac{1}{p} \|y\|^p, \qquad 1 < p < \infty     (57)

Again writing the LF transform as

f^*(z) = \sup_{y \in \mathbb{R}^n} \left\{ y^T z - \frac{1}{p} \|y\|^p \right\}     (58)

\nabla_y \left( y^T z - \frac{1}{p} \|y\|^p \right) = 0     (59)

z - \|y\|^{p-1} \frac{y}{\|y\|} = 0     (60)

z = \|y\|^{p-2} \, y     (61)

\|z\| = \|y\|^{p-1}     (62)

\|y\| = \|z\|^{\frac{1}{p-1}}     (63)

y = \frac{z}{\|z\|^{\frac{p-2}{p-1}}}     (64)

Substituting this value of y into the function gives

f^*(z) = \frac{z^T z}{\|z\|^{\frac{p-2}{p-1}}} - \frac{1}{p} \|z\|^{\frac{p}{p-1}}     (65)

= \frac{\|z\|^2}{\|z\|^{\frac{p-2}{p-1}}} - \frac{1}{p} \|z\|^{\frac{p}{p-1}}     (66)

= \|z\|^{2 - \frac{p-2}{p-1}} - \frac{1}{p} \|z\|^{\frac{p}{p-1}}     (67)

= \|z\|^{\frac{2(p-1)-(p-2)}{p-1}} - \frac{1}{p} \|z\|^{\frac{p}{p-1}}     (68)

= \|z\|^{\frac{2p-2-p+2}{p-1}} - \frac{1}{p} \|z\|^{\frac{p}{p-1}}     (69)

= \|z\|^{\frac{p}{p-1}} - \frac{1}{p} \|z\|^{\frac{p}{p-1}}     (70)

= \left( 1 - \frac{1}{p} \right) \|z\|^{\frac{p}{p-1}}     (71)

= \left( 1 - \frac{1}{p} \right) \|z\|^{\frac{1}{1 - \frac{1}{p}}}     (72)

= \frac{1}{\left( \frac{1}{1 - \frac{1}{p}} \right)} \|z\|^{\frac{1}{1 - \frac{1}{p}}}     (73)

Let us call

q = \frac{1}{1 - \frac{1}{p}}     (76)

Therefore, we obtain

f^*(z) = \frac{1}{q} \|z\|^q     (77)

\frac{1}{p} + \frac{1}{q} = 1     (78)

5.5 Example 5: Exponential function

Fig. 6. The plot shows the function e^x and its conjugate, which is z(\ln z - 1).

f(y) = e^y     (79)

f^*(z) = \sup_{y \in \mathbb{R}} \{ yz - e^y \}     (80)

If z < 0: \sup_{y \in \mathbb{R}} \{ yz - e^y \} is unbounded, so f^*(z) = \infty.

If z > 0: \sup_{y \in \mathbb{R}} \{ yz - e^y \} is bounded and can be computed from

\partial_y (yz - e^y) = 0     (82)

z - e^y = 0     (83)

y = \ln z     (84)

Substituting the value of y in the above function f^*(z) we get

f^*(z) = z \ln z - z = z(\ln z - 1)     (85)

If z = 0: \sup_{y \in \mathbb{R}} \{ yz - e^y \} = \sup_{y \in \mathbb{R}} \{ -e^y \} = 0.

5.6 Example 6: Negative logarithm

f(y) = -\log y     (86)

f^*(z) = \sup_{y \in \mathbb{R}} \{ yz - (-\log y) \}     (87)

Fig. 7. The plot shows the function -\log y and its conjugate, which is -1 - \log(-z).

\partial_y (yz + \log y) = 0     (88)

z + \frac{1}{y} = 0     (89)

y = -\frac{1}{z}     (90)

Substituting this value back into the equation we get

f^*(z) = -\frac{1}{z} z + \log(-1/z)     (91)

f^*(z) = -1 - \log(-z)     (92)

This is only valid if z < 0.


6 Summary of noteworthy points

- The Legendre-Fenchel transform always yields convex functions.
- Points of the function f are transformed into slopes of f*, and slopes of f are transformed into points of f*.
- The Legendre-Fenchel transform is more general than the Legendre transform because it is also applicable to non-convex as well as non-differentiable functions. The Legendre-Fenchel transform reduces to the Legendre transform for convex differentiable functions.

7 Applications to computer vision

Many problems in computer vision can be expressed in the form of energy minimisation [7]. A general class of the functions representing these problems can be written as

\min_{x \in X} \{ F(Kx) + G(x) \}     (93)

where F and G are proper convex functions and K \in \mathbb{R}^{n \times m}. Usually, F(Kx) corresponds to a regularisation term of the form \|Kx\| and G(x) corresponds to the data term. The dual form can be easily derived by replacing F(Kx) with its convex conjugate, that is

\min_{x \in X} \max_{y \in Y} \{ \langle Kx, y \rangle - F^*(y) + G(x) \}     (94)

because F is a convex function and, by definition of the Legendre-Fenchel transformation,

F(Kx) = \max_{y \in Y} \{ \langle Kx, y \rangle - F^*(y) \}     (95)

We know that the dot product is commutative, so we can rewrite

\langle Kx, y \rangle = \langle x, K^T y \rangle     (96)

and, in case the dot product is defined on a Hermitian space, we can write it as

\langle Kx, y \rangle = \langle x, K^* y \rangle     (97)

where K^* is the adjoint of K, which is more general. Going back to Eqn. 94, the equation now becomes

\min_{x \in X} \max_{y \in Y} \{ \langle x, K^* y \rangle - F^*(y) + G(x) \}     (98)

Now, by definition,

\min_{x \in X} \{ \langle x, K^* y \rangle + G(x) \} = -G^*(-K^* y)     (99)

because

\max_{x \in X} \{ \langle x, -K^* y \rangle - G(x) \} = G^*(-K^* y)     (100)

-\min_{x \in X} \{ -\langle x, -K^* y \rangle + G(x) \} = G^*(-K^* y)     (101)

-\min_{x \in X} \{ \langle x, K^* y \rangle + G(x) \} = G^*(-K^* y)     (102)

\min_{x \in X} \{ \langle x, K^* y \rangle + G(x) \} = -G^*(-K^* y)     (103)

Under weak assumptions in convex analysis, min and max can be switched in Eqn. 98, and the dual problem then becomes

\max_{y \in Y} \{ -G^*(-K^* y) - F^*(y) \}     (104)

The Primal-Dual Gap is then defined as

\min_{x \in X} \{ F(Kx) + G(x) \} - \max_{y \in Y} \{ -G^*(-K^* y) - F^*(y) \}     (105)

For the primal-dual algorithm to be applicable, one should be able to compute the proximal mapping of F and G, defined as

\text{Prox}_{\tau F}(x) = \arg\min_{y} \frac{1}{2\tau} \|x - y\|^2 + F(y)     (106)

Therefore, one can formulate the minimisation steps as

y^{n+1} = \text{Prox}_{\sigma F^*}(y^n + \sigma K \bar{x}^n)     (107)

x^{n+1} = \text{Prox}_{\tau G}(x^n - \tau K^* y^{n+1})     (108)

\bar{x}^{n+1} = x^{n+1} + \theta (x^{n+1} - x^n)     (109)

Note that being able to compute the proximal mapping of F is equivalent to being able to compute the proximal mapping of F^*, due to Moreau's identity:

x = \text{Prox}_{\tau F}(x) + \tau \, \text{Prox}_{F^*/\tau}(x/\tau)     (110)

It can be shown that if 0 \leq \theta \leq 1 and \sigma \tau \|K\|^2 < 1, then x^n converges to the minimiser of the original energy function.

Note: the standard way of writing the update equations is via the proximity operator, but for the problems below it reduces to pointwise gradient ascent/descent with a projection onto the unit ball, expressed via max(1, |p|) in the subsequent equations. This projection comes from the indicator function.
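Before specialising to particular models, the following Python sketch (our own, not part of the original text) spells out the generic iteration of Eqns. 107-109 when the operator, its adjoint and the two proximal maps are supplied as functions; the names primal_dual, prox_Fstar, prox_G, K and Kt are placeholders chosen here.

    import numpy as np

    def primal_dual(K, Kt, prox_Fstar, prox_G, x0, y0, sigma, tau, theta=1.0, iters=100):
        """Generic primal-dual iteration (Eqns. 107-109), assuming sigma*tau*||K||^2 < 1.

        K, Kt              : linear operator and its adjoint, given as callables
        prox_Fstar, prox_G : proximal operators of sigma*F* and tau*G, given as callables
        """
        x, y = x0.copy(), y0.copy()
        x_bar = x.copy()
        for _ in range(iters):
            y = prox_Fstar(y + sigma * K(x_bar))        # dual ascent step, Eqn. 107
            x_new = prox_G(x - tau * Kt(y))             # primal descent step, Eqn. 108
            x_bar = x_new + theta * (x_new - x)         # over-relaxation, Eqn. 109
            x = x_new
        return x, y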
7.1 Preliminaries

Given that a and x are column vectors, the dot product \langle a, x \rangle can be written as

\langle a, x \rangle = a^T x     (111)

Then we can write the associated derivatives with respect to x as

\frac{\partial (a^T x)}{\partial x} = \frac{\partial (x^T a)}{\partial x} = a     (112)
We will represent a 2-D image matrix as a vector in which the elements are arranged in lexicographical order:

A_{m,n} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix} \longrightarrow \begin{pmatrix} a_{1,1} \\ a_{1,2} \\ \vdots \\ a_{1,n} \\ \vdots \\ a_{m,1} \\ \vdots \\ a_{m,n} \end{pmatrix}     (113)

The gradient of the stacked image is obtained by applying a sparse matrix of partial-derivative operators to this vector:

\nabla A = \begin{pmatrix} \partial_x & 0 & \cdots & 0 \\ 0 & \partial_x & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \partial_x \\ \partial_y & 0 & \cdots & 0 \\ 0 & \partial_y & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \partial_y \end{pmatrix} \begin{pmatrix} a_{1,1} \\ a_{1,2} \\ \vdots \\ a_{m,n} \end{pmatrix} = \begin{pmatrix} \partial_x a_{1,1} \\ \partial_x a_{1,2} \\ \vdots \\ \partial_x a_{m,n} \\ \partial_y a_{1,1} \\ \partial_y a_{1,2} \\ \vdots \\ \partial_y a_{m,n} \end{pmatrix}     (114)

The divergence of a vector field whose x and y components (a^x, a^y) are stacked in the same vectorial fashion is obtained with the corresponding matrix of partial-derivative operators:

\text{div} \, A = \begin{pmatrix} \partial_x & 0 & \cdots & 0 & \partial_y & 0 & \cdots & 0 \\ 0 & \partial_x & \cdots & 0 & 0 & \partial_y & \cdots & 0 \\ \vdots & & \ddots & \vdots & \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \partial_x & 0 & 0 & \cdots & \partial_y \end{pmatrix} \begin{pmatrix} a^x_{1,1} \\ a^x_{1,2} \\ \vdots \\ a^x_{m,n} \\ a^y_{1,1} \\ a^y_{1,2} \\ \vdots \\ a^y_{m,n} \end{pmatrix}     (115)


7.2 Representation of norms

We will use the following notational shorthand to represent the L1 norm of the gradient:

E_{tv}(u) = \|\nabla u\|_1     (116)

\|\nabla u\|_1 = \sum_{i=1}^{W} \sum_{j=1}^{H} |\nabla u_{i,j}|     (117)

|\nabla u_{i,j}| = \sqrt{ (\nabla_x u_{i,j})^2 + (\nabla_y u_{i,j})^2 }     (118)

where the partial derivatives are defined on the discrete 2D grid as follows

\nabla_x^{-} u_{i,j} = u(i, j) - u(i-1, j)     (119)

\nabla_y^{-} u_{i,j} = u(i, j) - u(i, j-1)     (120)

\nabla_x^{+} u_{i,j} = u(i+1, j) - u(i, j)     (121)

\nabla_y^{+} u_{i,j} = u(i, j+1) - u(i, j)     (122)

For maximisation we will use a gradient ascent step of the form

\frac{p^{n+1} - p^n}{\sigma_p} = \partial_p E(p)     (123)

For minimisation we will use a gradient descent step of the form

\frac{p^{n} - p^{n+1}}{\sigma_p} = \partial_p E(p)     (124)

NOTE the switch of the iteration indices between maximisation and minimisation. For brevity, u_{i,j} is denoted by u.

NOTE ON COMPUTING GRAD AND DIVERGENCE: to compute the gradient, forward differences \nabla^{+} are used, while for computing the divergence, backward differences \nabla^{-} are used.
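The following Python sketch (our own addition) implements this pair of finite-difference operators with one common boundary convention and checks numerically that the divergence is the negative adjoint of the gradient; the exact boundary handling is not spelled out in the text, so it is an assumption here.

    import numpy as np

    # Forward-difference gradient (Eqns. 121-122) and backward-difference divergence,
    # chosen so that divergence is the negative adjoint of the gradient.
    def gradient(u):
        gx = np.zeros_like(u); gy = np.zeros_like(u)
        gx[:, :-1] = u[:, 1:] - u[:, :-1]     # forward difference along x (columns)
        gy[:-1, :] = u[1:, :] - u[:-1, :]     # forward difference along y (rows)
        return gx, gy

    def divergence(px, py):
        div = np.zeros_like(px)
        div[:, 0]    += px[:, 0]                          # backward differences in x
        div[:, 1:-1] += px[:, 1:-1] - px[:, :-2]
        div[:, -1]   += -px[:, -2]
        div[0, :]    += py[0, :]                          # backward differences in y
        div[1:-1, :] += py[1:-1, :] - py[:-2, :]
        div[-1, :]   += -py[-2, :]
        return div

    # Adjoint check: <grad u, p> should equal -<u, div p>.
    u = np.random.rand(16, 16)
    px, py = np.random.rand(16, 16), np.random.rand(16, 16)
    gx, gy = gradient(u)
    print(np.allclose((gx * px + gy * py).sum(), -(u * divergence(px, py)).sum()))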
7.3 ROF Model

A standard ROF model can be written as

\min_{u \in X} \|\nabla u\|_1 + \frac{\lambda}{2} \|u - g\|_2^2     (125)

We know that the convex conjugate of the \|\cdot\| norm is an indicator function

\delta(p) = \begin{cases} 0 & \text{if } \|p\| \leq 1 \\ \infty & \text{otherwise} \end{cases}

\|\nabla u\|_1 = \max_{p \in P} \ \langle p, \nabla u \rangle - \delta_P(p)     (126)

Therefore, we can write the ROF function as

\min_{u \in X} \max_{p \in P} \ \langle p, \nabla u \rangle + \frac{\lambda}{2} \|u - g\|_2^2 - \delta_P(p)     (127)

Let us call this new function E(u, p).

1. Compute the derivative with respect to p, i.e. \partial_p E(u, p), which is

\partial_p E(u, p) = \partial_p \left( \langle p, \nabla u \rangle + \frac{\lambda}{2} \|u - g\|_2^2 - \delta_P(p) \right)     (128)

\partial_p \langle p, \nabla u \rangle = \nabla u \quad \text{[proof given]}     (129)

\partial_p \frac{\lambda}{2} \|u - g\|_2^2 = 0     (130)

\partial_p \delta_P(p) = 0 \qquad [\because \text{indicator, i.e. constant, function}]     (131)

\partial_p E(u, p) = \nabla u     (132)

2. Compute the derivative with respect to u, i.e. \partial_u E(u, p), which is

\partial_u E(u, p) = \partial_u \left( \langle p, \nabla u \rangle + \frac{\lambda}{2} \|u - g\|_2^2 - \delta_P(p) \right)     (133)

\partial_u \langle p, \nabla u \rangle = \partial_u (-\langle u, \text{div} \, p \rangle) = -\text{div} \, p     (134)

\partial_u \frac{\lambda}{2} \|u - g\|_2^2 = \lambda (u - g)     (135)

\partial_u \delta_P(p) = 0     (136)

\partial_u E(u, p) = -\text{div} \, p + \lambda (u - g)     (137)

3. Use simple gradient descent

\frac{p^{n+1} - p^n}{\sigma} = \partial_p E(u, p) = \nabla u^n     (138)

p^{n+1} = p^n + \sigma \nabla u^n     (139)

p^{n+1} = \frac{p^n + \sigma \nabla u^n}{\max(1, |p^n + \sigma \nabla u^n|)}     (140)

\frac{u^n - u^{n+1}}{\tau} = \partial_u E(u, p) = -\text{div} \, p^{n+1} + \lambda (u^{n+1} - g)     (141)

u^{n+1} = \frac{u^n + \tau (\text{div} \, p^{n+1} + \lambda g)}{1 + \tau \lambda}     (142)

The projection onto the unit ball, max(1, |p|), comes from the indicator function, as explained in the note just below the Primal-Dual Gap paragraph. In the subsequent equations we use this property wherever a projection is made.
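To make the updates concrete, here is a short Python sketch of Eqns. 138-142 (our own, reusing the gradient/divergence helpers defined in the earlier sketch); the step sizes, lambda and iteration count are illustrative choices, not values prescribed by the text.

    import numpy as np

    # Primal-dual ROF denoising following Eqns. 138-142.
    def rof_denoise(g, lam=8.0, sigma=0.25, tau=0.25, iters=200):
        u = g.copy()
        px = np.zeros_like(g); py = np.zeros_like(g)
        for _ in range(iters):
            gx, gy = gradient(u)
            px, py = px + sigma * gx, py + sigma * gy          # Eqn. 139
            norm = np.maximum(1.0, np.sqrt(px**2 + py**2))     # projection, Eqn. 140
            px, py = px / norm, py / norm
            u = (u + tau * (divergence(px, py) + lam * g)) / (1.0 + tau * lam)   # Eqn. 142
        return u

    # Example: denoise a noisy step image.
    clean = np.zeros((64, 64)); clean[:, 32:] = 1.0
    noisy = clean + 0.1 * np.random.randn(64, 64)
    denoised = rof_denoise(noisy)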


7.4 Huber-ROF model

The interesting thing about the Huber model is that it has a continuous first derivative, so a simple gradient descent on the function can bring us to the minimum; if we were to use the Newton-Raphson method, which requires the second-order derivative, this would not be possible because the second derivative of the Huber model is not continuous. So the function we want to minimise is

\min_{u \in X} \|\nabla u\|_h + \frac{\lambda}{2} \|u - g\|_2^2     (143)

where \|\cdot\|_h is the Huber norm, defined as

\|x\|_\epsilon = \begin{cases} \frac{|x|^2}{2\epsilon} & \text{if } |x| \leq \epsilon \\ |x| - \frac{\epsilon}{2} & \text{if } |x| > \epsilon \end{cases}

The convex conjugate of the quadratic part can be written as

f^*(p) = \frac{\epsilon \|p\|_2^2}{2}     (144)

and the conjugate of the \|\cdot\| part is the same indicator function, so that

f^*(p) = \begin{cases} \frac{\epsilon \|p\|_2^2}{2} & \text{if } \|p\| \leq 1 \\ \infty & \text{otherwise} \end{cases}     (145)

Therefore the minimisation can be rewritten as

\min_{u \in X} \max_{p \in P} \ \langle p, \nabla u \rangle - \delta_P(p) - \frac{\epsilon}{2} \|p\|^2 + \frac{\lambda}{2} \|u - g\|_2^2     (146)

Minimisation: the minimisation can be carried out following this series of steps.

1. Compute the derivative with respect to p, i.e. \partial_p E(u, p), which is

\partial_p E(u, p) = \partial_p \left( \langle p, \nabla u \rangle - \delta_P(p) - \frac{\epsilon}{2} \|p\|^2 + \frac{\lambda}{2} \|u - g\|_2^2 \right)     (147)

\partial_p \langle p, \nabla u \rangle = \nabla u     (148)

\partial_p \frac{\lambda}{2} \|u - g\|_2^2 = 0     (149)

\partial_p \delta_P(p) = 0 \qquad [\because \delta_P \text{ is an indicator function}]     (150)

\partial_p \frac{\epsilon}{2} \|p\|^2 = \epsilon p     (151)

\partial_p E(u, p) = \nabla u - \epsilon p     (152)

2. Compute the derivative with respect to u, i.e. \partial_u E(u, p), which is

\partial_u E(u, p) = \partial_u \left( \langle p, \nabla u \rangle - \delta_P(p) - \frac{\epsilon}{2} \|p\|^2 + \frac{\lambda}{2} \|u - g\|_2^2 \right)     (153)

\partial_u \langle p, \nabla u \rangle = \partial_u (-\langle u, \text{div} \, p \rangle) = -\text{div} \, p     (154)

\partial_u \frac{\lambda}{2} \|u - g\|_2^2 = \lambda (u - g)     (155)

\partial_u \delta_P(p) = 0     (156)

\partial_u \frac{\epsilon}{2} \|p\|^2 = 0     (157)

\partial_u E(u, p) = -\text{div} \, p + \lambda (u - g)     (158)

3. Use simple gradient descent

\frac{p^{n+1} - p^n}{\sigma} = \partial_p E(u, p) = \nabla u^n - \epsilon p^{n+1}     (159)

p^{n+1} = \frac{p^n + \sigma \nabla u^n}{1 + \sigma \epsilon}     (160)

p^{n+1} = \frac{ \frac{p^n + \sigma \nabla u^n}{1 + \sigma \epsilon} }{ \max\left( 1, \left| \frac{p^n + \sigma \nabla u^n}{1 + \sigma \epsilon} \right| \right) }     (161)

\frac{u^n - u^{n+1}}{\tau} = \partial_u E(u, p) = -\text{div} \, p^{n+1} + \lambda (u^{n+1} - g)     (162)

u^{n+1} = \frac{u^n + \tau (\text{div} \, p^{n+1} + \lambda g)}{1 + \tau \lambda}     (163)
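Compared with the ROF sketch above, only the dual step changes. The following Python snippet (our own, with an illustrative epsilon) shows Eqns. 160-161 in isolation.

    import numpy as np

    # Huber-ROF dual step: divide by (1 + sigma*eps) before projecting onto the
    # unit ball (Eqns. 160-161); everything else is as in the ROF loop above.
    def huber_dual_step(px, py, gx, gy, sigma=0.25, eps=0.05):
        px = (px + sigma * gx) / (1.0 + sigma * eps)
        py = (py + sigma * gy) / (1.0 + sigma * eps)
        norm = np.maximum(1.0, np.sqrt(px**2 + py**2))
        return px / norm, py / norm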

8 TV-L1 denoising

The TV-L1 denoising model can be written as

\min_{u \in X} \|\nabla u\|_1 + \lambda \|u - f\|_1     (164)

This can be further rewritten as a new equation where \lambda is subsumed inside the norm, i.e.

\min_{u \in X} \|\nabla u\|_1 + \|\lambda (u - f)\|_1     (165)

We know that the convex conjugate of the \|\cdot\| norm is an indicator function

\delta(p) = \begin{cases} 0 & \text{if } \|p\| \leq 1 \\ \infty & \text{otherwise} \end{cases}

\|\nabla u\|_1 = \max_{p \in P} \ \langle p, \nabla u \rangle - \delta_P(p)     (166)

and

\|\lambda (u - f)\|_1 = \max_{q \in Q} \ \langle q, \lambda (u - f) \rangle - \delta_Q(q)     (167)

Therefore, we can write the TV-L1 denoising function as

\min_{u \in X} \max_{p \in P} \max_{q \in Q} \ \langle p, \nabla u \rangle + \langle q, \lambda (u - f) \rangle - \delta_P(p) - \delta_Q(q)     (168)

Let us call this new function E(u, p, q).

1. Compute the derivative with respect to p, i.e. \partial_p E(u, p, q), which is

\partial_p E(u, p, q) = \partial_p \left( \langle p, \nabla u \rangle + \langle q, \lambda (u - f) \rangle - \delta_P(p) - \delta_Q(q) \right)     (169)

\partial_p \langle p, \nabla u \rangle = \nabla u \quad \text{[proof given]}     (170)

\partial_p \langle q, \lambda (u - f) \rangle = 0     (171)

\partial_p \delta_P(p) = 0 \qquad [\because \text{indicator, i.e. constant, function}]     (172)

\partial_p \delta_Q(q) = 0     (173)

\partial_p E(u, p, q) = \nabla u     (174)

2. Compute the derivative with respect to q, i.e. \partial_q E(u, p, q), which is

\partial_q E(u, p, q) = \partial_q \left( \langle p, \nabla u \rangle + \langle q, \lambda (u - f) \rangle - \delta_P(p) - \delta_Q(q) \right)     (175)

\partial_q \langle p, \nabla u \rangle = 0     (176)

\partial_q \langle q, \lambda (u - f) \rangle = \lambda (u - f)     (177)

\partial_q \delta_P(p) = 0     (178)

\partial_q \delta_Q(q) = 0     (179)

\partial_q E(u, p, q) = \lambda (u - f)     (180)

3. Compute the derivative with respect to u, i.e. \partial_u E(u, p, q), which is

\partial_u E(u, p, q) = \partial_u \left( \langle p, \nabla u \rangle + \langle q, \lambda (u - f) \rangle - \delta_P(p) - \delta_Q(q) \right)     (181)

\partial_u \langle p, \nabla u \rangle = \partial_u (-\langle u, \text{div} \, p \rangle) = -\text{div} \, p     (182)

\partial_u \langle q, \lambda (u - f) \rangle = \lambda q     (183)

\partial_u \delta_P(p) = 0     (184)

\partial_u \delta_Q(q) = 0     (185)

\partial_u E(u, p, q) = -\text{div} \, p + \lambda q     (186)

4. Use simple gradient descent

\frac{p^{n+1} - p^n}{\sigma} = \partial_p E(u, p, q) = \nabla u^n     (187)

p^{n+1} = p^n + \sigma \nabla u^n     (188)

p^{n+1} = \frac{p^n + \sigma \nabla u^n}{\max(1, |p^n + \sigma \nabla u^n|)}     (189)

\frac{q^{n+1} - q^n}{\sigma} = \partial_q E(u, p, q) = \lambda (u^n - f)     (190)

q^{n+1} = q^n + \sigma \lambda (u^n - f)     (191)

q^{n+1} = \frac{q^n + \sigma \lambda (u^n - f)}{\max(1, |q^n + \sigma \lambda (u^n - f)|)}     (192)

\frac{u^n - u^{n+1}}{\tau} = \partial_u E(u, p, q) = -\text{div} \, p^{n+1} + \lambda q^{n+1}     (193)

u^{n+1} = u^n + \tau (\text{div} \, p^{n+1} - \lambda q^{n+1})     (194)

Another alternative, used in [8], is to do the projection a bit differently, so that the update equations for q and u take a slightly different form; the \lambda then appears only in the projection step, i.e.

\frac{p^{n+1} - p^n}{\sigma} = \partial_p E(u, p, q) = \nabla u^n     (195)

p^{n+1} = p^n + \sigma \nabla u^n     (196)

p^{n+1} = \frac{p^n + \sigma \nabla u^n}{\max(1, |p^n + \sigma \nabla u^n|)}     (197)

\frac{q^{n+1} - q^n}{\sigma} = \partial_q E(u, p, q) = (u^n - f)     (198)

q^{n+1} = q^n + \sigma (u^n - f)     (199)

q^{n+1} = \frac{q^n + \sigma (u^n - f)}{\max\left( 1, \frac{|q^n + \sigma (u^n - f)|}{\lambda} \right)}     (200)

\frac{u^n - u^{n+1}}{\tau} = \partial_u E(u, p, q) = -\text{div} \, p^{n+1} + q^{n+1}     (201)

u^{n+1} = u^n + \tau (\text{div} \, p^{n+1} - q^{n+1})     (202)
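The first set of updates (Eqns. 187-194) translates into the following Python sketch (our own, reusing the gradient/divergence helpers above; step sizes and lambda are illustrative).

    import numpy as np

    # TV-L1 denoising following Eqns. 187-194.
    def tvl1_denoise(f, lam=1.0, sigma=0.25, tau=0.25, iters=300):
        u = f.copy()
        px = np.zeros_like(f); py = np.zeros_like(f)
        q = np.zeros_like(f)
        for _ in range(iters):
            gx, gy = gradient(u)
            px, py = px + sigma * gx, py + sigma * gy
            norm = np.maximum(1.0, np.sqrt(px**2 + py**2))     # Eqn. 189
            px, py = px / norm, py / norm
            q = q + sigma * lam * (u - f)
            q = q / np.maximum(1.0, np.abs(q))                 # Eqn. 192
            u = u + tau * (divergence(px, py) - lam * q)       # Eqn. 194
        return u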

8.1 Image Deconvolution

\min_{u \in X} \|\nabla u\|_1 + \frac{\lambda}{2} \|Au - g\|_2^2     (203)

The problem can be written as a saddle-point problem as

\min_{u \in X} \max_{p \in P} \ \langle p, \nabla u \rangle + \frac{\lambda}{2} \|Au - g\|_2^2 - \delta_P(p)     (204)


Minimisation: the minimisation can be carried out following this series of steps.

1. Compute the derivative with respect to p, i.e. \partial_p E(u, p), which is

\partial_p E(u, p) = \partial_p \left( \langle p, \nabla u \rangle - \delta_P(p) + \frac{\lambda}{2} \|Au - g\|_2^2 \right)     (205)

\partial_p \langle p, \nabla u \rangle = \nabla u     (206)

\partial_p \frac{\lambda}{2} \|Au - g\|_2^2 = 0     (207)

\partial_p \delta_P(p) = 0 \qquad [\because \delta_P \text{ is an indicator function}]     (208)

\partial_p E(u, p) = \nabla u     (209)

2. Compute the derivative with respect to u, i.e. \partial_u E(u, p), which is

\partial_u E(u, p) = \partial_u \left( \langle p, \nabla u \rangle - \delta_P(p) + \frac{\lambda}{2} \|Au - g\|_2^2 \right)     (210)

\partial_u \langle p, \nabla u \rangle = \partial_u (-\langle u, \text{div} \, p \rangle) = -\text{div} \, p     (211)

\partial_u \frac{\lambda}{2} \|Au - g\|_2^2 = \lambda (A^T A u - A^T g)     (212)

\partial_u \delta_P(p) = 0     (213)

\partial_u E(u, p) = -\text{div} \, p + \lambda (A^T A u - A^T g)     (214)

where the derivative of the data term follows from

\partial_u \|Au - g\|_2^2 = \partial_u (Au - g)^T (Au - g)     (215)

(Au - g)^T (Au - g) = ((Au)^T - g^T)(Au - g)     (216)

((Au)^T - g^T)(Au - g) = (u^T A^T - g^T)(Au - g)     (217)

(u^T A^T - g^T)(Au - g) = u^T A^T A u - u^T A^T g - g^T A u + g^T g     (218)

[\text{Let us say } B = A^T A]     (219)

\partial_u \, u^T A^T A u = \partial_u \, u^T B u     (220)

\partial_u \, u^T B u = (B + B^T) u     (221)

\partial_u \, u^T A^T A u = (A^T A + (A^T A)^T) u = (A^T A + A^T A) u     (222)

\partial_u \, u^T A^T A u = 2 A^T A u     (223)


3. Use simple gradient descent

\frac{p^{n+1} - p^n}{\sigma} = \partial_p E(u, p) = \nabla u^n     (224)

p^{n+1} = p^n + \sigma \nabla u^n     (225)

p^{n+1} = \frac{p^n + \sigma \nabla u^n}{\max(1, |p^n + \sigma \nabla u^n|)}     (226)

\frac{u^n - u^{n+1}}{\tau} = \partial_u E(u, p) = -\text{div} \, p^{n+1} + \lambda (A^T A u^{n+1} - A^T g)     (227)

u^n + \tau \, \text{div} \, p^{n+1} + \tau \lambda A^T g = u^{n+1} + \tau \lambda A^T A u^{n+1}     (228)

u^{n+1} \left( I + \tau \lambda A^T A \right) = u^n + \tau \, \text{div} \, p^{n+1} + \tau \lambda A^T g     (229)

u^{n+1} = \left( I + \tau \lambda A^T A \right)^{-1} \left( u^n + \tau \, \text{div} \, p^{n+1} + \tau \lambda A^T g \right)     (230)

This requires a matrix inversion. In some cases the matrix may be singular, and since it is generally large and sparse, direct inversion is not a feasible solution; one then resorts to Fourier analysis.

Another alternative is to dualise again, this time with respect to the data term, which yields

\min_{u \in X} \max_{p \in P, q \in Q} \ \langle p, \nabla u \rangle + \langle Au - g, q \rangle - \delta_P(p) - \frac{1}{2\lambda} \|q\|^2     (231)

1. Compute the derivative with respect to p, i.e. \partial_p E(u, p, q), which is

\partial_p E(u, p, q) = \nabla u     (232)

2. Compute the derivative with respect to q, i.e. \partial_q E(u, p, q), which is

\partial_q E(u, p, q) = Au - g - \frac{1}{\lambda} q     (233)

3. Compute the derivative with respect to u, i.e. \partial_u E(u, p, q), which is

\partial_u E(u, p, q) = -\text{div} \, p + A^T q     (234)


4. Use simple gradient descent

\frac{p^{n+1} - p^n}{\sigma_p} = \partial_p E(u, p, q) = \nabla u^n     (235)

p^{n+1} = p^n + \sigma_p \nabla u^n     (236)

p^{n+1} = \frac{p^n + \sigma_p \nabla u^n}{\max(1, |p^n + \sigma_p \nabla u^n|)}     (237)

\frac{q^{n+1} - q^n}{\sigma_q} = \partial_q E(u, p, q) = A u^n - g - \frac{1}{\lambda} q^{n+1}     (238)

q^{n+1} = \frac{q^n + \sigma_q (A u^n - g)}{1 + \frac{\sigma_q}{\lambda}}     (239)

\frac{u^n - u^{n+1}}{\tau} = \partial_u E(u, p, q) = -\text{div} \, p^{n+1} + A^T q^{n+1}     (240)

u^{n+1} = u^n + \tau (\text{div} \, p^{n+1} - A^T q^{n+1})     (241)

This saves the matrix inversion, which is one of the benefits of using the Legendre-Fenchel transformation.

Interesting tip: imagine we have a function of the form

E = (h * u - f)^2     (242)

where the operator * denotes convolution. If one wants to take the derivative with respect to u, one can make use of the fact that h * u can be expressed as a linear function of u via a sparse matrix D, i.e. Du. Rewriting the equation we can derive

E = (Du - f)^2 = (Du - f)^T (Du - f)     (243)

Now it is very easy to see the derivative of this function with respect to u. Referring to Eqn. 76 in [2], we can write the derivative of E with respect to u as follows

\frac{\partial E}{\partial u} = 2 D^T (Du - f)     (244)

Du = h * u     (245)

D^T (Du - f) = \bar{h} * (h * u - f)     (246)

where \bar{h} is the mirrored kernel, i.e. \bar{h}(x) = h(-x).
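The identity in Eqn. 246 can be checked numerically. The following Python sketch (our own; the zero-padded 'same'-size convolution, kernel and signal length are arbitrary choices) builds the dense matrix D for a 1-D convolution and verifies that multiplying by its transpose equals convolution with the mirrored kernel.

    import numpy as np

    n, h = 50, np.array([1.0, 4.0, 6.0, 4.0, 1.0])

    def conv_same(x, k):
        # zero-padded convolution returning an output of the same length as x
        return np.convolve(x, k, mode='same')

    # Build the dense matrix D such that D @ x == conv_same(x, h).
    D = np.column_stack([conv_same(e, h) for e in np.eye(n)])

    v = np.random.randn(n)
    print(np.allclose(D.T @ v, conv_same(v, h[::-1])))   # True: D^T v = mirrored-kernel convolution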


8.2 Optic Flow

Optic flow was popularised by Horn and Schunck's seminal paper [4], which has since sparked great interest in minimising the energy function associated with computing optic flow and its various different formulations [5][6][9]. Writing the standard L1-norm-based optic flow equation,

\min_{u \in X, v \in Y} \ \|\nabla u\|_1 + \|\nabla v\|_1 + \lambda |I_1(x + f) - I_2(x)|     (247)

where f is the flow vector (u, v) at any pixel (x, y) in the image; for brevity (x, y) is replaced by x. Substituting p for the dual variable corresponding to \nabla u, q for \nabla v and r for \lambda (I_1(x + f) - I_2(x)), we can rewrite the original energy formulation in its primal-dual form as

\max_{p \in P, q \in Q, r \in R} \ \min_{u \in X, v \in Y} \ \langle p, \nabla u \rangle + \langle q, \nabla v \rangle + \langle r, \lambda (I_1(x + f) - I_2(x)) \rangle - \delta_P(p) - \delta_Q(q) - \delta_R(r)     (248)

We have used the same trick for writing the dual formulation corresponding to \lambda (I_1(x + f) - I_2(x)), subsuming the \lambda inside, that we used when writing the dual formulation of the data term in the TV-L1 denoising equation 167. The various derivatives required for gradient descent can be computed as shown below.
1. Compute the derivative with respect to p, i.e. \partial_p E(u, v, p, q, r), which is

\partial_p E(u, v, p, q, r) = \nabla u     (249)

2. Compute the derivative with respect to q, i.e. \partial_q E(u, v, p, q, r), which is

\partial_q E(u, v, p, q, r) = \nabla v     (250)

3. Compute the derivative with respect to r, i.e. \partial_r E(u, v, p, q, r), which is

\partial_r E(u, v, p, q, r) = \lambda (I_1(x + f) - I_2(x))     (251)

4. Compute the derivative with respect to u, i.e. \partial_u E(u, v, p, q, r), which is

\partial_u E(u, v, p, q, r) = -\text{div} \, p + \partial_u \langle r, \lambda (I_1(x + f) - I_2(x)) \rangle     (252)

Linearising around f^0, we can rewrite the above expression involving r as

\langle r, \lambda (I_1(x + f) - I_2(x)) \rangle = \langle r, \lambda (I_1(x + f^0) - I_2(x) + (f - f^0)^T [I_x \ I_y]^T) \rangle     (253)

We can expand the terms involving f and f^0 as

I_1(x + f^0) - I_2(x) + (f - f^0)^T [I_x \ I_y]^T = I_1(x + f^0) - I_2(x) + (u - u^0)^T I_x + (v - v^0)^T I_y     (254)

It is then easy to expand the dot-product expression involving r as

\langle r, \lambda (I_1(x + f^0) - I_2(x) + (f - f^0)^T [I_x \ I_y]^T) \rangle = \langle r, \lambda (I_1(x + f^0) - I_2(x)) \rangle + \langle r, \lambda (u - u^0)^T I_x \rangle + \langle r, \lambda (v - v^0)^T I_y \rangle     (255)

\langle r, \lambda (u - u^0)^T I_x \rangle = \langle r, \lambda I_x (u - u^0) \rangle = \lambda r^T I_x (u - u^0) = \langle \lambda I_x^T r, (u - u^0) \rangle     (256)

I_x is a diagonal matrix composed of entries corresponding to the image gradient along the x-axis for each pixel, and similarly I_y is composed of the gradients along the y-axis. The derivative of E with respect to u can then be written as

\partial_u E(u, v, p, q, r) = -\text{div} \, p + \lambda I_x^T r     (257)

5. Compute the derivative with respect to v, i.e. \partial_v E(u, v, p, q, r), which is

\partial_v E(u, v, p, q, r) = -\text{div} \, q + \lambda I_y^T r     (258)

The gradient ascent/descent equations then follow straightforwardly.

1. Maximise with respect to p

\frac{p^{n+1} - p^n}{\sigma_p} = \nabla u^n     (259)

p^{n+1} = \frac{p^n + \sigma_p \nabla u^n}{\max(1, |p^n + \sigma_p \nabla u^n|)}     (260)

2. Maximise with respect to q

\frac{q^{n+1} - q^n}{\sigma_q} = \nabla v^n     (261)

q^{n+1} = \frac{q^n + \sigma_q \nabla v^n}{\max(1, |q^n + \sigma_q \nabla v^n|)}     (262)

3. Maximise with respect to r

\frac{r^{n+1} - r^n}{\sigma_r} = \lambda (I_1(x + f^n) - I_2(x))     (263)

r^{n+1} = \frac{r^n + \sigma_r \lambda (I_1(x + f^n) - I_2(x))}{\max(1, |r^n + \sigma_r \lambda (I_1(x + f^n) - I_2(x))|)}     (264)

4. Minimise with respect to u

\frac{u^n - u^{n+1}}{\tau_u} = -\text{div} \, p^{n+1} + \lambda I_x^T r^{n+1}     (265)

5. Minimise with respect to v

\frac{v^n - v^{n+1}}{\tau_v} = -\text{div} \, q^{n+1} + \lambda I_y^T r^{n+1}     (266)
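The following Python sketch (our own) implements a single-scale version of the updates in Eqns. 259-266, linearised once around f^0 = 0 and without the coarse-to-fine warping used in practice; I1 and I2 are same-sized grayscale arrays, lambda and the step sizes are illustrative, and the gradient/divergence helpers from the earlier sketch are reused.

    import numpy as np

    def tvl1_flow_single_scale(I1, I2, lam=40.0, sigma=0.25, tau=0.25, iters=300):
        u = np.zeros_like(I1); v = np.zeros_like(I1)
        px, py = np.zeros_like(I1), np.zeros_like(I1)
        qx, qy = np.zeros_like(I1), np.zeros_like(I1)
        r = np.zeros_like(I1)
        Ix, Iy = gradient(I1)                  # image gradients play the role of the diagonal I_x, I_y
        It = I1 - I2                           # temporal difference at f^0 = 0
        for _ in range(iters):
            ux, uy = gradient(u); vx, vy = gradient(v)
            px, py = px + sigma * ux, py + sigma * uy
            npn = np.maximum(1.0, np.sqrt(px**2 + py**2)); px, py = px / npn, py / npn   # Eqn. 260
            qx, qy = qx + sigma * vx, qy + sigma * vy
            nqn = np.maximum(1.0, np.sqrt(qx**2 + qy**2)); qx, qy = qx / nqn, qy / nqn   # Eqn. 262
            rho = It + u * Ix + v * Iy         # linearised residual I1(x+f) - I2(x)
            r = r + sigma * lam * rho
            r = r / np.maximum(1.0, np.abs(r))                                           # Eqn. 264
            u = u + tau * (divergence(px, py) - lam * Ix * r)                            # Eqn. 265
            v = v + tau * (divergence(qx, qy) - lam * Iy * r)                            # Eqn. 266
        return u, v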


8.3 Super-Resolution

The formulation was first used in [8]; we describe the minimisation procedure below.

\min_{u \in X} \ \|\nabla u\|_{\epsilon_u} + \lambda \sum_{i=1}^N \| DBW_i u - f_i \|_{\epsilon_d}     (267)

With \lambda > 0, let us rewrite the conjugate of \lambda \|\cdot\|:

f^*(p) = \sup_{u} \ \langle p, u \rangle - \lambda \|u\|     (268)

f^*(p) = \lambda \sup_{u} \left\{ \left\langle \frac{p}{\lambda}, u \right\rangle - \|u\| \right\}     (269)

Let us now denote \frac{p}{\lambda} = k; then we can write

f^*(p) = \lambda \sup_{u} \{ \langle k, u \rangle - \|u\| \}     (270)

f^*(p) = \lambda \tilde{f}^*(k)     (271)

But we know that \sup_{u} \{ \langle k, u \rangle - \|u\| \} is an indicator function defined by

\tilde{f}^*(k) = \begin{cases} 0 & \text{if } \|k\| \leq 1 \\ \infty & \text{otherwise} \end{cases}     (272)

Therefore, we can write f^*(p) as

f^*(p) = \begin{cases} 0 & \text{if } \|k\| \leq 1 \\ \infty & \text{otherwise} \end{cases}     (273)

Replacing k by \frac{p}{\lambda} we arrive at the expression

f^*(p) = \begin{cases} 0 & \text{if } \left\| \frac{p}{\lambda} \right\| \leq 1 \\ \infty & \text{otherwise} \end{cases}     (274)

The saddle-point formulation then becomes

\min_{u \in X} \max_{p, q} \ \langle p, h^2 \nabla u \rangle - \frac{\epsilon_u}{2 h^2} \|p\|^2 - \delta\{ |p| \leq h^2 \} + \sum_{i=1}^N \left[ \langle q_i, (\lambda h)^2 (DBW_i u - f_i) \rangle - \frac{\epsilon_d}{2 (\lambda h)^2} \|q_i\|^2 - \delta\{ |q_i| \leq (\lambda h)^2 \} \right]     (275)

Minimisation: the minimisation equations can be written as follows.

1. Compute the derivative with respect to p, i.e. \partial_p E(u, p, q), which is

\partial_p E(u, p, q) = \partial_p \left( \langle p, h^2 \nabla u \rangle - \frac{\epsilon_u}{2 h^2} \|p\|^2 - \delta\{ |p| \leq h^2 \} \right)     (276)

\partial_p \langle p, h^2 \nabla u \rangle = h^2 \nabla u     (277)

\partial_p E(u, p, q) = h^2 \nabla u - \frac{\epsilon_u}{h^2} p     (278)

2. Compute the derivative with respect to q_i, i.e. \partial_{q_i} E(u, p, q_i), which is

\partial_{q_i} \frac{\epsilon_d}{2 (\lambda h)^2} \|q_i\|^2 = \frac{\epsilon_d}{(\lambda h)^2} q_i     (279)

\partial_{q_i} E(u, p, q_i) = (\lambda h)^2 (DBW_i u - f_i) - \frac{\epsilon_d}{(\lambda h)^2} q_i     (280)

3. Compute the derivative with respect to u, i.e. \partial_u E(u, p, q_i), which is

\partial_u E(u, p, q_i) = -\text{div} \, p + (\lambda h)^2 \sum_{i=1}^N \partial_u \left( q_i^T (DBW_i u - f_i) \right)     (281)

\partial_u E(u, p, q_i) = -\text{div} \, p + (\lambda h)^2 \sum_{i=1}^N W_i^T B^T D^T q_i     (282)

4. Use simple gradient descent

\frac{p^{n+1} - p^n}{\sigma_p} = \partial_p E(u, p) = h^2 \nabla u^n - \frac{\epsilon_u}{h^2} p^{n+1}     (283)

p^{n+1} - p^n = \sigma_p h^2 \nabla u^n - \frac{\sigma_p \epsilon_u}{h^2} p^{n+1}     (284)

p^{n+1} = \frac{p^n + \sigma_p h^2 \nabla u^n}{1 + \frac{\sigma_p \epsilon_u}{h^2}}     (285)

p^{n+1} = \frac{p^{n+1}}{\max\left( 1, \frac{|p^{n+1}|}{h^2} \right)}     (286)

\frac{q_i^{n+1} - q_i^n}{\sigma_q} = (\lambda h)^2 (DBW_i u^n - f_i) - \frac{\epsilon_d}{(\lambda h)^2} q_i^{n+1}     (287)

q_i^{n+1} - q_i^n = \sigma_q (\lambda h)^2 (DBW_i u^n - f_i) - \frac{\sigma_q \epsilon_d}{(\lambda h)^2} q_i^{n+1}     (288)

q_i^{n+1} = \frac{q_i^n + \sigma_q (\lambda h)^2 (DBW_i u^n - f_i)}{1 + \frac{\sigma_q \epsilon_d}{(\lambda h)^2}}     (289)

q_i^{n+1} = \frac{q_i^{n+1}}{\max\left( 1, \frac{|q_i^{n+1}|}{(\lambda h)^2} \right)}     (290)

\frac{u^n - u^{n+1}}{\tau} = \partial_u E(u, p, q_i) = -\text{div} \, p^{n+1} + (\lambda h)^2 \sum_{i=1}^N W_i^T B^T D^T q_i^{n+1}     (291)

u^{n+1} = u^n + \tau \left( \text{div} \, p^{n+1} - (\lambda h)^2 \sum_{i=1}^N W_i^T B^T D^T q_i^{n+1} \right)     (292)

8.4 Super Resolution with Joint Flow Estimation

Let us now turn our attention towards full joint tracking and super-resolution image reconstruction. Before we derive anything, let us formulate the problem from a Bayesian point of view. We are given the downsampling and blurring operators, and we want to determine the optical flow between the images and reconstruct the super-resolution image at the same time. The posterior probability can be written as

P(u, \{w_i\}_{i=1}^N \mid \{f_i\}_{i=1}^N, D, B)     (293)

Using the standard Bayes rule, we can write this in terms of likelihoods and priors as

P(u, \{w_i\}_{i=1}^N \mid \{f_i\}_{i=1}^N, D, B) \propto \prod_{i=1}^N P(f_i \mid w_i, u, D, B) \, P(w_i, u, D, B)     (294)

P(f_i \mid w_i, u, D, B) is our standard super-resolution likelihood model and under the L1 norm can be expressed as follows

-\log P(f_i \mid w_i, u, D, B) = \| DBu(x + w_i) - f_i \|     (295)

while P(w_i, u, D, B) is our prior for the super-resolution image and the flow. It can be easily simplified under the assumption that the flow prior is independent of the super-resolution prior:

P(w_i, u, D, B) = \prod_{i=1}^N P(w_i) P(u)     (296)

The priors are standard TV-L1 priors and can be written as

-\log P(w_i) = \alpha_i \{ \|\nabla w_{x_i}\| + \|\nabla w_{y_i}\| \}     (297)

and

-\log P(u) = \|\nabla u\|     (298)

The combined energy function can be written as

E(u, \{w_i\}_{i=1}^N) = \sum_{i=1}^N \| DBu(x + w_i) - f_i \| + \sum_{i=1}^N \alpha_i \{ \|\nabla w_{x_i}\| + \|\nabla w_{y_i}\| \} + \|\nabla u\|     (299)

We dualise with respect to each L1 norm and obtain the following expression

E(u, \{w_i\}_{i=1}^N) = \sum_{i=1}^N \langle q_i, DBu(x + w_i) - f_i \rangle + \sum_{i=1}^N \alpha_i \{ \langle r_{x_i}, \nabla w_{x_i} \rangle + \langle r_{y_i}, \nabla w_{y_i} \rangle \} + \langle p, \nabla u \rangle     (300)

Optimising with respect to q_i:

\frac{q_i^{n+1} - q_i^n}{\sigma_q} = DBu^n(x + w_i^n) - f_i     (301)

Optimising with respect to p:

\frac{p^{n+1} - p^n}{\sigma_p} = \nabla u^n     (302)

Optimising with respect to r_{x_i}:

\frac{r_{x_i}^{n+1} - r_{x_i}^n}{\sigma_{r_x}} = \nabla w_{x_i}^n     (303)

Optimising with respect to r_{y_i}:

\frac{r_{y_i}^{n+1} - r_{y_i}^n}{\sigma_{r_y}} = \nabla w_{y_i}^n     (304)

Optimising with respect to w_{x_i}: linearisation around the current solution leads to expanding the flow term as

DBu(x + w_i^n) - f_i = DBu(x + w_i^{n-1} + dw_i^n) - f_i     (305)

DBu(x + w_i^{n-1} + dw_i^n) - f_i = DB\{ u(x + w_i^{n-1}) + \partial_x u(x + w_i^{n-1}) \, dw_{x_i}^n + \partial_y u(x + w_i^{n-1}) \, dw_{y_i}^n \} - f_i     (307)

Replacing dw_{x_i}^n and dw_{y_i}^n by w_{x_i}^n - w_{x_i}^{n-1} and w_{y_i}^n - w_{y_i}^{n-1} respectively, we can rewrite the above equation as

DBu(x + w_i^{n-1} + dw_i^n) - f_i = DB\{ u(x + w_i^{n-1}) + \partial_x u(x + w_i^{n-1})(w_{x_i}^n - w_{x_i}^{n-1}) + \partial_y u(x + w_i^{n-1})(w_{y_i}^n - w_{y_i}^{n-1}) \} - f_i     (308)

Treating w_i^{n-1} as constant, we can minimise the energy function with respect to w_{x_i}^n and w_{y_i}^n respectively. The obtained update equations can be written as

\frac{w_{x_i}^n - w_{x_i}^{n-1}}{\tau_w} = -\partial_{w_{x_i}} \Big( \langle q_i, DB\{ u(x + w_i^{n-1}) + \partial_x u(x + w_i^{n-1})(w_{x_i}^n - w_{x_i}^{n-1}) + \partial_y u(x + w_i^{n-1})(w_{y_i}^n - w_{y_i}^{n-1}) \} - f_i \rangle + \alpha_i \{ \langle r_{x_i}, \nabla w_{x_i} \rangle + \langle r_{y_i}, \nabla w_{y_i} \rangle \} \Big)     (309)

\frac{w_{x_i}^n - w_{x_i}^{n-1}}{\tau_w} = -\left( I_x^T B^T D^T q_i^n - \alpha_i \, \text{div} \, r_{x_i}^n \right)     (310)

or

\frac{w_{x_i}^{n+1} - w_{x_i}^n}{\tau_w} = -\left( I_x^T B^T D^T q_i^{n+1} - \alpha_i \, \text{div} \, r_{x_i}^{n+1} \right)     (311)

I_x = \text{diag}(\partial_x (u(x + w_i^n)))     (312)

Optimising with respect to w_{y_i}: a similar optimisation scheme with respect to w_{y_i} yields a similar update equation

\frac{w_{y_i}^{n+1} - w_{y_i}^n}{\tau_w} = -\left( I_y^T B^T D^T q_i^{n+1} - \alpha_i \, \text{div} \, r_{y_i}^{n+1} \right)     (313)

I_y = \text{diag}(\partial_y (u(x + w_i^n)))     (314)

The optimisations with respect to q_i, w_{x_i}, w_{y_i}, r_{x_i} and r_{y_i} are done in a coarse-to-fine pyramid fashion.

Optimising with respect to u: given the current solution for w_i^n, we can write u(x + w_i^n) as a linear function of u by multiplying it with a warping matrix W_i^n, so that

\frac{u^n - u^{n+1}}{\tau_u} = \sum_{i=1}^N (W_i^{n+1})^T B^T D^T q_i^{n+1} - \text{div} \, p^{n+1}     (315)

9 Setting the step sizes

The constants \sigma and \tau are usually very easy to set if the operator K in the equation

\min_{x \in X} F(Kx) + G(x)     (316)

is a simple operator, in which case \sigma and \tau can easily be found from the constraint \sigma \tau L^2 \leq 1, where L is the operator norm, i.e. \|K\|. For instance, if we look at our ROF model and dualise it, we see that for

\min_{u \in X} \|\nabla u\|_1 + \frac{\lambda}{2} \|u - g\|_2^2     (317)

using p as a dual to \nabla u and q as a dual to u - g, we can reduce this to its dual form

\min_{u \in X} \max_{p \in P} \max_{q \in Q} \ \langle p, \nabla u \rangle - \delta_P(p) + \langle q, u - f \rangle - \frac{1}{2\lambda} \|q\|^2     (318)

Let us use a single dual variable y as a substitute for the concatenated vector of p and q, only to simplify this equation and obtain an expression in the form of Eqn. 94, so that we treat our u in this equation as x there. We can then rewrite the above expression very simply in x and y as

\min_{x \in X} \max_{y \in Y} \ \langle Kx, y \rangle - F^*(y) + G(x)     (319)

where our K now is

K = \begin{pmatrix} \nabla \\ I \end{pmatrix}, \quad x = u, \quad y = \begin{pmatrix} p \\ q \end{pmatrix}, \quad F^*(y) = \delta_P(p) + q^T f + \frac{1}{2\lambda} \|q\|^2, \quad G(x) = 0     (320)

It is easy to see that if K has a simple form, we can write a closed-form expression for the norm of K, i.e. \|K\|. However, if K has a more complicated structure, e.g. in the case of deblurring or super-resolution, K has different entries in each row and it is hard to come up with a closed-form expression for its norm. In that case one would like to know how to set \sigma and \tau so that the iterations over the variables involved in the minimisation can still be carried out. A formulation from Pock et al. [10] describes a way to set \sigma and \tau such that the convergence condition still holds. It is

\tau_j = \frac{1}{\sum_{i=1}^{M} |K_{ij}|^{2-\alpha}} \qquad \text{and} \qquad \sigma_i = \frac{1}{\sum_{j=1}^{N} |K_{ij}|^{\alpha}}     (321)

where generally \alpha = 1.
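The diagonal preconditioning of Eqn. 321 is easy to compute for a sparse K. The following Python sketch (our own; variable names and the small 1-D example are illustrative, with alpha = 1 by default) returns one dual step per row of K and one primal step per column.

    import numpy as np
    from scipy import sparse

    def preconditioned_steps(K, alpha=1.0):
        # Element-wise |K_ij|, then row sums for the dual steps and column sums
        # for the primal steps, as in Eqn. 321.
        absK = abs(K)
        sigma = 1.0 / np.maximum(np.asarray(absK.power(alpha).sum(axis=1)).ravel(), 1e-12)
        tau = 1.0 / np.maximum(np.asarray(absK.power(2.0 - alpha).sum(axis=0)).ravel(), 1e-12)
        return sigma, tau

    # Example: K stacking a 1-D forward-difference operator and the identity,
    # in the spirit of Eqn. 320 (here for a signal of length 5).
    n = 5
    G = sparse.diags([-np.ones(n), np.ones(n - 1)], [0, 1]).tolil()
    G[-1, :] = 0                                   # Neumann boundary: last difference is zero
    K = sparse.vstack([G.tocsr(), sparse.eye(n)]).tocsr()
    sigma, tau = preconditioned_steps(K)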


10 When to and when not to use Duality: what price do we pay on dualising a function?

It may at first seem a bit confusing that adding more variables using duality makes the optimisation quicker. However, expressing any convex function as a combination of simple linear functions in the dual space makes the whole problem easier to handle. Working on the primal and dual problems at the same time brings us closer to the solution very quickly. Being able to switch the min and max in the optimisation means that strong duality holds.

10.1 If the function is convex and differentiable, should we still dualise?

Let us take an example of a function which is convex and differentiable everywhere. We take the ROF model and replace the L1 norm with the standard squared L2 norm, i.e.

E(u, \nabla u) = \min_{u \in X} \|\nabla u\|_2^2 + \frac{\lambda}{2} \|u - g\|_2^2     (322)

If we were to use the standard Euler-Lagrange equations, we would obtain the following update equation

\frac{u^{n+1} - u^n}{\tau} = -\frac{\partial E(u, \nabla u)}{\partial u}     (323)

where \frac{\partial E(u, \nabla u)}{\partial u} is defined according to the Euler-Lagrange equation as

\frac{\partial E(u, \nabla u)}{\partial u} = \frac{\partial E}{\partial u} - \frac{\partial}{\partial x} \frac{\partial E}{\partial u_x} - \frac{\partial}{\partial y} \frac{\partial E}{\partial u_y}     (324)

where of course our \|\nabla u\|^2 is defined as (u_x^2 + u_y^2), where u_x is the derivative with respect to x and similarly for y. Now, if we were to write the gradient descent update step with respect to u, we would obtain the following update equation

\frac{u^{n+1} - u^n}{\tau} = -\left( \lambda (u - f) - 2 \frac{\partial}{\partial x}(u_x) - 2 \frac{\partial}{\partial y}(u_y) \right)     (325)

It therefore involves the Laplacian, which is nothing but

\nabla^2 u = \frac{\partial u_x}{\partial x} + \frac{\partial u_y}{\partial y} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}     (326)

Therefore our minimisation with respect to u takes us to the final gradient step update

\frac{u^{n+1} - u^n}{\tau} = -\left( \lambda (u - f) - 2 \nabla^2 u \right)     (327)


It is therefore clear that if we were to use the Euler-Lagrange equations, we would still have to compute the derivative of the regulariser in order to update u. However, if we dualise the function, the data term and the regulariser term become decoupled. The following is the primal-dual min-max equation for the same problem:

\min_{u \in X} \max_{p \in P} \max_{q \in Q} \ \langle p, \nabla u \rangle - \frac{1}{4} \|p\|^2 + \langle q, u - f \rangle - \frac{1}{2\lambda} \|q\|^2     (328)

Optimising with respect to p:

\frac{p^{n+1} - p^n}{\sigma_p} = \nabla u^n - \frac{1}{2} p^{n+1}     (329)

p^{n+1} = \frac{p^n + \sigma_p \nabla u^n}{1 + \frac{\sigma_p}{2}}     (330)

Optimising with respect to q:

\frac{q^{n+1} - q^n}{\sigma_q} = u^n - f - \frac{1}{\lambda} q^{n+1}     (331)

Optimising with respect to u:

\frac{u^n - u^{n+1}}{\tau} = -\text{div} \, p^{n+1} + q^{n+1}     (332)

It is clear that while updating q we only work on u - f, and while updating p we only work on \nabla u. The equations we obtain are systems of linear equations and are pointwise separable, whereas in the Euler-Lagrange case we would have to approximate the Laplacian operator with a kernel, which makes the solution at one point dependent on its neighbours. Therefore, the primal-dual form decouples the data and the regulariser terms and makes the problem easier to handle.
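The decoupling is visible in code. The following Python sketch (our own, reusing the gradient/divergence helpers above, with semi-implicit steps and illustrative parameters) implements the pointwise updates of Eqns. 329-332; no Laplacian stencil appears anywhere.

    import numpy as np

    def l2_l2_denoise(f, lam=4.0, sigma=0.25, tau=0.25, iters=200):
        u = f.copy()
        px = np.zeros_like(f); py = np.zeros_like(f); q = np.zeros_like(f)
        for _ in range(iters):
            gx, gy = gradient(u)
            px = (px + sigma * gx) / (1.0 + sigma / 2.0)     # dual of the regulariser, Eqn. 330
            py = (py + sigma * gy) / (1.0 + sigma / 2.0)
            q = (q + sigma * (u - f)) / (1.0 + sigma / lam)  # dual of the data term, Eqn. 331
            u = u + tau * (divergence(px, py) - q)           # primal update, Eqn. 332
        return u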

References

1. Rockafellar, R.T.: Convex Analysis.
2. The Matrix Cookbook: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
3. http://gpu4vision.icg.tugraz.at/
4. Horn, B.K.P., Schunck, B.G.: Determining Optical Flow. Artificial Intelligence, vol. 17, pp. 185-203, 1981.
5. Baker, S., Matthews, I.: Lucas-Kanade 20 Years On: A Unifying Framework. International Journal of Computer Vision, Volume 56, Number 3, pp. 221-255.
6. Zach, C., Pock, T., Bischof, H.: A Duality Based Approach for Realtime TV-L1 Optical Flow. Proceedings of the DAGM Conference on Pattern Recognition, 2007.
7. Chambolle, A., Pock, T.: A First-Order Primal-Dual Algorithm for Convex Problems. Journal of Mathematical Imaging and Vision.
8. Unger, M., Pock, T., Werlberger, M., Bischof, H.: A Convex Approach for Variational Super-Resolution.
9. Steinbruecker, F., Pock, T., Cremers, D.: Large Displacement Optical Flow Computation without Warping. International Conference on Computer Vision, 2009.
10. Pock, T., Chambolle, A.: Diagonal Preconditioning for First Order Primal-Dual Algorithms in Convex Optimization. International Conference on Computer Vision, 2011.
