
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 9, SEPTEMBER 2002

Limits on Super-Resolution
and How to Break Them
Simon Baker and Takeo Kanade, Fellow, IEEE

Abstract: Nearly all super-resolution algorithms are based on the fundamental constraints that the super-resolution image should generate the low resolution input images when appropriately warped and down-sampled to model the image formation process. (These reconstruction constraints are normally combined with some form of smoothness prior to regularize their solution.) In the first part of this paper, we derive a sequence of analytical results which show that the reconstruction constraints provide less and less useful information as the magnification factor increases. We also validate these results empirically and show that, for large enough magnification factors, any smoothness prior leads to overly smooth results with very little high-frequency content (however many low-resolution input images are used). In the second part of this paper, we propose a super-resolution algorithm that uses a different kind of constraint, in addition to the reconstruction constraints. The algorithm attempts to recognize local features in the low-resolution images and then enhances their resolution in an appropriate manner. We call such a super-resolution algorithm a hallucination or recogstruction algorithm. We tried our hallucination algorithm on two different data sets, frontal images of faces and printed Roman text. We obtained significantly better results than existing reconstruction-based algorithms, both qualitatively and in terms of RMS pixel error.

Index Terms: Super-resolution, analysis of reconstruction constraints, learning, faces, text, hallucination, recogstruction.

1 INTRODUCTION

SUPER-RESOLUTION is the process of combining multiple low-resolution images to form a higher resolution one. Numerous super-resolution algorithms have been proposed in the literature [39], [32], [51], [33], [29], [31], [53], [30], [34], [37], [13], [9], [35], [47], [49], [7], [38], [26], [21], [15], [23], [18], dating back to the frequency domain approach of Huang and Tsai [28]. Usually, it is assumed that there is some (small) relative motion between the camera and the scene; however, motionless super-resolution is indeed possible if other imaging parameters (such as the amount of defocus blur) vary instead [21]. If there is relative motion between the camera and the scene, then the first step to super-resolution is to register or align the images, i.e., compute the motion of pixels from one image to the others. The motion fields are typically assumed to take a simple parametric form, such as a translation or a projective warp [8], but instead could be dense optical flow fields [20], [2]. We assume that image registration has already been performed and concentrate on the second half of super-resolution, which consists of fusing the multiple (aligned) low-resolution images into a higher resolution one.

The second, fusion step is usually based on the constraints that the super-resolution image, when appropriately warped and down-sampled to take into account the alignment and to model the image formation process, should yield the low-resolution input images. These reconstruction constraints have been used by numerous authors since first studied by Peleg et al. [39] and Irani and Peleg [29]. The constraints can easily be embedded in a Bayesian framework incorporating a prior on the high-resolution image [47], [26], [21].1 Their solution can be estimated either in batch mode or recursively using a Kalman filter [22], [18]. Several refinements have been proposed, including simultaneously computing structure [13], [49], [50] and removing other degrading effects such as motion blur [7].

In practice, the results obtained using these reconstruction-based algorithms are mixed. While the super-resolution images are generally an improvement over the inputs, the high-frequency components of the images are generally not reconstructed very well. To illustrate this point, we conducted an experiment, the results of which are included in Fig. 1. We took a high-resolution image of a face (shown in the top left of the figure) and synthetically translated it by random subpixel amounts, blurred it with a Gaussian, and then down-sampled it. We repeated this procedure for several different (linear) down-sampling factors: 2, 4, 8, and 16. In each case, we generated multiple down-sampled images, each with a different random translation. We generated enough images so that there were as many low-resolution pixels in total as pixels in the original high-resolution image. For example, we generated four half-size images, 16 quarter-size images, and so on. We then applied the algorithms of Schultz and Stevenson [47] and Hardie et al. [26]. The results for Hardie et al. [26] are shown in the figure. The results for Schultz and Stevenson [47] were very similar and are omitted. We provided the algorithms with exact knowledge of both the point spread function used in the down-sampling and the random subpixel translations (although a minor modification of the iterative algorithm in [26] does a very good job of estimating the translation even for the very low resolution 6 x 8 pixel images

The authors are with the Robotics Institute, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213. E-mail: {simonb, tk}@cs.cmu.edu.

Manuscript received 4 Oct. 2000; revised 6 Sept. 2001; accepted 4 Feb. 2002. Recommended for acceptance by W.T. Freeman. For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number 112943.

1. The three papers [47], [26], [21] all use slightly different priors. Roughly speaking, though, the priors are all smoothness priors that encourage each pixel to take on the average of its neighbors. In our experience with the first two algorithms, the exact details of the prior make fairly little difference to the super-resolution results. To our knowledge, however, there is no paper that rigorously compares the performance of different priors for super-resolution.

0162-8828/02/$17.00 (c) 2002 IEEE



Fig. 1. Results of the reconstruction-based super-resolution algorithm [26] for various magnification factors. The original high-resolution image (shown in the top left) is translated multiple times by random subpixel amounts, blurred with a Gaussian, and then down-sampled. (The algorithm is provided with exact knowledge of the point spread function and the subpixel translations.) Comparing the images in the right-most column, we see that the algorithm does quite well given the very low resolution of the input. The degradation in performance as the magnification increases from left to right is very dramatic, however.

in the right-most column). Restricting attention to the right-most column of Fig. 1, the results look very good. The algorithm is able to do a decent job of reconstructing the face from input images which barely resemble faces. On the other hand, the performance gets much worse as the magnification increases (from left to right).

This paper is divided into two parts. In the first part, we analyze the super-resolution reconstruction constraints. We derive three results which all show that super-resolution becomes much more difficult as the magnification factor increases. First, we show that, for square point spread functions (and integer magnification factors), the reconstruction constraints are not even invertible and, moreover, the dimension of the null space grows as a quadratic function of the linear magnification. In the second result, we show that, even though the constraints are generally invertible for other point spread functions, the condition number always grows at least as fast as a quadratic. (This second result is proven for a large class of point spread functions, including all point spread functions which reasonably model CCD sensors.) This second result, however, does not entirely explain the results shown in Fig. 1. It is frequently possible to invert an ill-conditioned problem by simply imposing a smoothness prior on the solution. In our third result, we use the fact that the pixels in the input images take values in a finite set (typically, integers in the range 0-255) to show that the volume of solutions to the discretized reconstruction constraints grows at an extremely fast rate. This, then, is the underlying reason for the results in Fig. 1. For large magnification factors, there are a huge number of solutions to the reconstruction constraints, including numerous very smooth solutions. The smoothness prior that is typically added to resolve the ambiguity in the large solution space simply ensures that it is one of the overly smooth solutions that is chosen. (Strictly, the final solution to the overall problem is only an approximate solution of the reconstruction constraints since both sets of constraints are added as least squares constraints.)

How, then, can high-magnification super-resolution be performed? Our analytical results hold for an arbitrary number of images. Using more low-resolution images therefore does not help, at least beyond a point (which, in practice, is determined by a wide range of factors; see the discussion at the end of the paper for more details). The additional (8-bit, say) input images simply do not provide any more information because the information was lost when they were quantized to 8 bits. Suppose, however, that the input images contain text. Moreover, suppose that it is possible to perform optical character recognition (OCR) and recognize the text. If the font can also be determined, it would then be easy to perform super-resolution. The text could be reproduced at any resolution by simply rendering it from the recognized text and the definition of the font. In the second half of this paper, we describe a super-resolution algorithm based on this principle, which we call recognition-based super-resolution or hallucination [1], [3]. More generally, we propose the term recogstruction for recognition-based reconstruction techniques. Our hallucination algorithm, however, is based on the recognition of generic local "features" (rather than the characters detected by OCR) so that it can be applied to other phenomena. The recognized local features are used to predict a recognition-based prior which replaces the smoothness priors used in existing algorithms such as [13], [47], [26], [21]. We trained our hallucination algorithm separately on both frontal images of faces and computer generated text. We obtained significantly better results than traditional reconstruction-based super-resolution, both visually and in terms of RMS pixel intensity error.

Our algorithm is closely related to the independent work of [25], in which a learning framework for low-level vision was proposed, one application of which is image interpolation. In their paper, Freeman et al. learn a prior on the higher resolution image using a belief propagation network. Our algorithm has the advantage of being applicable to an arbitrary number of images. Our algorithm is also closely related to [19], in which the parameters of an "active-appearance" model are used for super-resolution. This algorithm can also be interpreted as having a strong, learned face prior.
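The synthetic data generation used for the experiment in Fig. 1 (random subpixel translation, blur, then down-sampling) can be sketched in a few lines. This is our own minimal illustration, not the authors' code: the image is a random stand-in for the face, the translation uses simple bilinear interpolation with wrap-around, and the blur-plus-sampling step is collapsed into block averaging (the box CCD model derived in Section 2 with S = 1) rather than the Gaussian blur used in the paper's experiment.

```python
import numpy as np

def subpixel_shift(img, dy, dx):
    """Translate by a (possibly fractional) amount using bilinear interpolation,
    built from four integer-shifted copies (wrap-around boundary)."""
    fy, fx = dy - np.floor(dy), dx - np.floor(dx)
    a = np.roll(img, (int(np.floor(dy)), int(np.floor(dx))), axis=(0, 1))
    b, c = np.roll(a, 1, axis=0), np.roll(a, 1, axis=1)
    d = np.roll(b, 1, axis=1)
    return (1 - fy) * (1 - fx) * a + fy * (1 - fx) * b + (1 - fy) * fx * c + fy * fx * d

def down_sample(img, f):
    """Average f x f blocks: blur with the box CCD PSF (S = 1), then point-sample."""
    h, w = img.shape
    return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

rng = np.random.default_rng(0)
hi = rng.random((64, 64))      # stand-in for the high-resolution face image
f = 2                          # linear down-sampling factor
# f**2 low-resolution images give as many low-resolution pixels in total
# as there are pixels in the high-resolution image, as in the experiment.
lows = [down_sample(subpixel_shift(hi, *rng.uniform(0, 1, 2)), f) for _ in range(f ** 2)]
print(lows[0].shape)           # (32, 32)
```

Repeating this for f = 4, 8, and 16 reproduces the input sets of the experiment (with the hypothetical box blur standing in for the paper's Gaussian point spread function).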

Fig. 2. Our image formation model. We assume that the low-resolution input images Lo_i(m) are formed by the convolution of the irradiance E_i(.) with the camera point spread function PSF_i(.). We model the point spread function itself as the convolution of two terms: 1) w_i models the optical effects caused by the lens and the finite aperture and 2) a_i models the spatial integration performed by the CCD sensor.

The remainder of this paper is organized as follows: We begin in Section 2 by deriving the super-resolution reconstruction constraints, before analyzing them in Section 3. We present our hallucination algorithm (with results) in Section 4. We end with a discussion in Section 5.

2 THE SUPER-RESOLUTION RECONSTRUCTION CONSTRAINTS

Denote the low-resolution input images by Lo_i(m), where i = 1, ..., N and m = (m, n) is a vector in Z^2 containing the (column and row) pixel coordinates. The starting point in the derivation of the reconstruction constraints is then the continuous image formation equation (see Fig. 2) [27], [36]:

Lo_i(m) = (E_i * PSF_i)(m) = \int E_i(x) \cdot PSF_i(x - m) \, dx,   (1)

where E_i(.) is the continuous irradiance light-field that would have reached the image plane of Lo_i under the pinhole model, PSF_i(.) is the point spread function of the camera, and x = (x, y) \in R^2 are coordinates in the image plane of Lo_i (over which the integration is performed). All that (1) says is that the pixel intensity Lo_i(m) is the result of convolving the irradiance function E_i(.) with the point-spread function of the camera PSF_i(.) and then sampling it at the discrete pixel locations m = (m, n) \in Z^2. (In a more general formulation, E_i(.) may also be a function of both time t and wavelength \lambda. Equation (1) would then also contain integrations over these two variables as well. We do not model these effects because they do not affect the spatial analysis.)

2.1 Modeling the Point Spread Function

We decompose the point spread function of the camera into two components (see Fig. 2):

PSF_i(x) = (w_i * a_i)(x),   (2)

where w_i(x) models the blurring caused by the optics and a_i(x) models the spatial integration performed by the CCD sensor [5]. The optical blurring w_i(.) is typically further split into a defocus factor that can be approximated by a pill-box function and a diffraction-limited optical transfer function that can be modeled by the square of the first-order Bessel function of the first kind [10]. We aim to be as general as possible and so avoid making any assumptions about w_i(x). Instead, (most of) our analysis is performed for arbitrary functions w_i(x). We do, however, assume a parametric form for a_i. We assume that the photo-sensitive areas of the CCD pixels are square and uniformly sensitive to light, as in [5], [6]. If the length of the side of the square photosensitive area is S_i, the spatial integration function is then:

a_i(x) = 1/S_i^2 if |x| <= S_i/2 and |y| <= S_i/2, and 0 otherwise.   (3)

In general, the photosensitive area is not the entire pixel since space is needed for the circuitry to read out the charge. Therefore, S_i is just assumed to take some value in the range [0, 1]. Our analysis of the super-resolution problem is then in terms of this parameter (not the interpixel distance). (Although a detailed model of the point spread function is needed to analyze the limits of super-resolution, typically this modeling is not performed for super-resolution algorithms because the point spread function is a very complex function which depends upon a large number of parameters. In practice, a simple parametric form is assumed for PSF_i(.), more often than not that it is Gaussian. The parameter \sigma is then estimated empirically. Since the point spread function describes "the image of an isolated point object located on a uniformly black background" [36], the parameter(s) can be estimated from the image of a light placed a large distance from the camera.)

2.2 What is Super-Resolution Anyway?

We wish to estimate a super-resolution image Su(p), where p = (p, q) \in Z^2 are pixel coordinates. Precisely what does this mean? Let us begin with the coordinate frame of Su(p). The coordinate frame of Su is typically defined by that of one of the low resolution input images, say Lo_1(m). If the linear magnification of the super-resolution process is M, the pixels in Su will be M times closer to each other than those in Lo_1. The coordinate frame of Su can therefore be defined in terms of that for Lo_1 using the equation:

p = (1/M) m.   (4)

In the introduction, we said that we would assume that the input images Lo_i have been registered with each other. We can therefore assume that they have been registered with the coordinate frame of the super-resolution image Su defined by (4). Then, denote the pixel in image Lo_i that corresponds to pixel p in Su by r_i(p). From now on, we assume that r_i(.) is known.

The integration in (1) is performed over the low-resolution image plane. Transforming to the super-resolution image plane of Su using the registration x = r_i(z) gives:

Lo_i(m) = \int_{Su} E_i(r_i(z)) \cdot PSF_i(r_i(z) - m) \cdot |\partial r_i / \partial z| \, dz,   (5)

where |\partial r_i / \partial z| is the determinant of the Jacobian of the registration transformation r_i(.). (Note that we have assumed here that r_i is invertible. A similar analysis, albeit approximate, can be conducted wherever r_i is locally invertible by truncating the point spread function.)

Now, E_i(r_i(z)) is the irradiance that would have reached the image plane of the ith camera under the pinhole model, transformed onto the super-resolution image plane. Assuming the registration r_i(.) is correct and that the radiance of the scene does not change, E_i(r_i(z)) should be the same for all i = 1, ..., N and, moreover, equal to the irradiance that would have reached the super-resolution image plane of Su under a pinhole model. Denoting this value by E(z), we have:

Lo_i(m) = \int_{Su} E(z) \cdot PSF_i(r_i(z) - m) \cdot |\partial r_i / \partial z| \, dz.   (6)

Given this equation, we distinguish two processes:

Deblurring is estimating a representation of E(z) (that is, as opposed to estimating E * PSF_i), i.e., deblurring is removing the effects of the convolution with the point spread function PSF_i(.). Deblurring is independent of whether the representation of E(z) is on a denser grid than that of the input images. The resolution may or may not change during deblurring.

Resolution Enhancement consists of estimating either of the irradiance functions (E or E_i * PSF_i) on a denser grid than that of the input image(s). For example, enhancing the resolution by the linear magnification factor M consists of estimating the irradiance function on the grid defined by (4). If the number of input images is one, resolution enhancement is known as interpolation. If there is more than one input image, resolution enhancement is known as super-resolution. Resolution is therefore synonymous with pixel grid density.

In this paper, we study the most general case, i.e., the combination of super-resolution and deblurring. We estimate Su(p), a representation of E(z) on the grid defined by (4).

2.3 Representing Continuous Images

In order to proceed, we need to specify which continuous function E(z) is represented by the discrete image Su(p). The simplest case is that Su(p) represents the piecewise constant function:

E(z) = Su(p)   (7)

for all z \in (p - 0.5, p + 0.5] x (q - 0.5, q + 0.5], where p = (p, q) \in Z^2 are the coordinates of a pixel in Su. Then, (6) can be rearranged to give:

Lo_i(m) = \sum_p Su(p) \cdot \int_p PSF_i(r_i(z) - m) \cdot |\partial r_i / \partial z| \, dz,   (8)

where the integration is performed over the pixel p, i.e., over (p - 0.5, p + 0.5] x (q - 0.5, q + 0.5]. The super-resolution reconstruction constraints are therefore:

Lo_i(m) = \sum_p W_i(m, p) \cdot Su(p), where W_i(m, p) = \int_p PSF_i(r_i(z) - m) \cdot |\partial r_i / \partial z| \, dz,   (9)

for i = 1, ..., N, i.e., a set of linear constraints on the unknown super-resolution pixels Su(p) in terms of the known low resolution pixels Lo_i(m). The constant coefficients W_i(m, p) depend on both the point spread function PSF_i(.) and on the registration r_i(.). (Similar derivations can be performed for other representations of E(z), such as piecewise linear or quadratic ones [14].)

3 ANALYSIS OF THE RECONSTRUCTION CONSTRAINTS

We now analyze the super-resolution reconstruction constraints defined by (9). As can be seen, the equations depend upon two imaging properties: 1) the point spread function PSF_i(.) and 2) the registration r_i(.). Without some assumptions about these functions, any analysis would be meaningless. If the point spread function is arbitrary, it can be chosen to simulate the "small pixels" of the super-resolution image. Similarly, if r_i(.) is arbitrary, it can be chosen (in effect) to move the camera toward the scene and thereby directly capture the super-resolution image. We therefore have to make some (reasonable) assumptions about the imaging conditions.

Assumptions Made about the Point Spread Function. We assume that the point spread function is the same for all of the images Lo_i and takes the form:

PSF_i(x) = (w_i * a_i)(x), where a_i(x) = 1/S^2 if |x| <= S/2 and |y| <= S/2, and 0 otherwise.   (10)

In particular, we assume that the width of the photosensitive area S is the same for all images. In the first part of the analysis, we also assume that w_i(x) = \delta(x), the Dirac delta function. Afterward, we allow w_i(x) to be an arbitrary function, i.e., the analysis holds for arbitrary optical blurring.
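For a purely translational registration, the weights W_i(m, p) reduce to overlap areas, and the constraints in (9) become an ordinary matrix equation. The following 1D sketch is our own illustration (the shift value and toy signal are hypothetical): it builds W for the square PSF of (10) with no optical blur and applies it to a toy super-resolution signal.

```python
import numpy as np

def weights_1d(n_hi, M, S, d):
    """1D analogue of W_i(m, p) in eq. (9): low-resolution pixel m integrates the
    super-resolution axis over the window [M*m + d, M*m + d + M*S); each entry is
    the overlap with the unit cell [p, p + 1), normalized by the window length."""
    n_lo = int(n_hi / M)
    W = np.zeros((n_lo, n_hi))
    for m in range(n_lo):
        lo, hi = M * m + d, M * m + d + M * S
        for p in range(n_hi):
            W[m, p] = max(0.0, min(hi, p + 1) - max(lo, p)) / (M * S)
    return W

M, S, n_hi = 2, 1.0, 16
Su = np.arange(n_hi, dtype=float)     # a toy super-resolution "image"
W = weights_1d(n_hi, M, S, d=0.5)     # one input image, shifted by half a pixel
Lo = W @ Su                           # the reconstruction constraints, eq. (9)
print(W.shape, Lo.shape)              # (8, 16) (8,)
```

Stacking one such W per input image gives the full linear system analyzed in the remainder of this section.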

Assumptions Made about the Registration. To outlaw motions which (effectively) allow the camera to be moved toward the scene, we assume that the registration between each pair of low resolution images is a translation. When combined with the super-resolution coordinate frame, as defined in (4), this assumption means that each registration takes the form:

r_i(z) = (1/M) z + c_i,   (11)

where c_i = (c_i, d_i) \in R^2 is a constant (which is different for each low resolution image Lo_i) and M > 0 is the linear magnification of the super-resolution problem.

Even given these assumptions, the performance of any super-resolution algorithm will depend upon the number of input images N, the exact values of c_i, and, moreover, how well the algorithm can register the low resolution images to estimate the c_i. Our goal is to show that super-resolution becomes fundamentally more difficult as the linear magnification M increases. We therefore assume that the conditions are as favorable as possible and perform the analysis for an arbitrary number of input images N, with arbitrary values of c_i. Moreover, we assume that the algorithm has estimated these values perfectly. Any results derived under these conditions will only be stronger in practice, where the c_i might take degenerate values or might be estimated inaccurately.

Fig. 3. The pixel p over which the integration is performed in (12) is indicated by the small square at the top left of the figure. The larger square on the bottom right is the region in which a_i(.) is nonzero. Since a_i takes the value 1/S^2 in this region, the integral in (12) equals A/S^2, where A is the area of the intersection of the two squares. This figure is used to illustrate the proof of Theorem 1.

3.1 Invertibility Analysis for Square Point Spread Functions

We analyze the reconstruction constraints in three different ways. The first analysis is concerned with when the constraints are invertible and what the dimension of the null space is when they are not invertible. In order to get an easily interpretable result, the analysis in this section is performed under the simplified scenario that the optical blurring can be ignored and, so, w_i(x) = \delta(x), the Dirac delta function. This assumption will be removed in the following two sections, where the analysis is for arbitrary optical blurring models w_i(x). Assuming a square point spread function PSF_i(x) = a_i(x) (and that the registration r_i(.) is a translation), (9) simplifies to:

Lo_i(m) = \sum_p W_i(m, p) \cdot Su(p), where W_i(m, p) = (1/M^2) \int_p a_i((1/M) z + c_i - m) \, dz,   (12)

where the integration is performed over the pixel p, i.e., over (p - 0.5, p + 0.5] x (q - 0.5, q + 0.5]. Using the definition of a_i(.), it is easy to see that W_i(m, p) is equal to 1/(M \cdot S)^2 times the area of the intersection of the two squares in Fig. 3. We then have:

Theorem 1. If M \cdot S is an integer greater than 1, then, for all choices of c_i, the set of equations (12) is not invertible. Moreover, the minimum achievable dimension of the null space is (M \cdot S - 1)^2. If M \cdot S is not an integer, the c_i can always be chosen so that the set of equations (12) is invertible.

Proof. We provide a proof for 1D images. (See Fig. 3.) The extension to 2D is straightforward. The null space of (12) is defined by the constraints \sum_p W'_i(m, p) \cdot Su(p) = 0, where W'_i(., .) is the area of intersection of the two squares in Fig. 3. For 1D, we just consider one row of the figure. Any element of the null space therefore corresponds to an assignment of values to the small squares in a way that their weighted sum (over the large square) equals zero, where the weights are the areas of intersection with the large square. (To be able to conduct this argument for every pixel in the super-resolution image, we need to assume that the number of pixels in every row and every column of the super-resolution image is greater than 2 \cdot M \cdot S. This is a minor assumption since it corresponds to assuming that the low resolution images are bigger than 2 x 2 pixels. This follows from the fact that S is physically constrained to be less than one.)

Changing c_i to slide the large square along the row by a small amount, we get a similar constraint on the elements in the null space. The only difference is in the left-most and right-most squares. Subtracting these two constraints shows that the left-most square and the right-most square must have the same value. This means that Su(p) must equal both Su(p + (\lceil M \cdot S \rceil, 0)) and Su(p + (\lfloor M \cdot S \rfloor, 0)) if the assignment is to lie in the null space.

If M \cdot S is not an integer (or is 1), this proves that neighboring values of Su(p) must be equal and, hence, 0. Therefore, values can always be chosen for the c_i so that the null space only contains the zero vector, i.e., the linear system is, in general, invertible. (The equivalence of the null space being nontrivial and the linear system being not invertible requires the assumption that the number of pixels in the super-resolution image is finite, i.e., the super-resolution image is bounded.)

If M \cdot S is an integer, this constraint places an upper bound of M \cdot S - 1 on the dimension of the null space (since the null space is contained in the set of assignments to Su that are periodic with period M \cdot S). This value can also be shown to be a lower bound on the dimension of the null space by the space of periodic assignments for which \sum_{i=0}^{M \cdot S - 1} Su(p + (i, 0)) = 0. All of these assignments can easily be seen to lie in the null space (for any choice of the translations c_i). []

To validate this theorem, we solved the reconstruction constraints using gradient descent for the two cases M = 2.0 and M = 1.5, where S = 1.0. The results are presented in

Fig. 4. Validation of Theorem 1: The results of solving the reconstruction constraints using gradient descent for a square point spread function with S = 1.0. (a) M = 2.0. When M \cdot S is an integer, the equations are not invertible and, so, a random periodic image in the null space is added to the original image. (b) M = 1.5. When M is not an integer, the reconstruction constraints are invertible and, so, a smooth solution is found, even without a prior. (The result for M = 1.5 has been interpolated to make it the same size as that for M = 2.0.) (c) M = 2.0, with prior. When a smoothness prior is added to the reconstruction constraints, the difficulties seen in (a) disappear.
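Theorem 1 can also be checked numerically with a small 1D sketch (our own illustration, not the paper's experiment). It uses a circular boundary so that every low-resolution pixel sees a full window of length M*S, stacks the constraints from several randomly translated images, and counts the zero singular values. For M*S = 2 the 1D null space has dimension M*S - 1 = 1 (spanned by the alternating image), whereas for M*S = 1.5 random translations make the system invertible:

```python
import numpy as np

def weights_periodic(n_hi, M, S, d):
    """1D translational W_i of eq. (12) on a circular super-resolution grid."""
    n_lo = int(n_hi / M)
    W = np.zeros((n_lo, n_hi))
    for m in range(n_lo):
        lo = M * m + d
        for p in range(n_hi):
            for k in (-n_hi, 0, n_hi):   # wrap the unit cell [p, p+1) around
                W[m, p] += max(0.0, min(lo + M * S, p + k + 1) - max(lo, p + k))
            W[m, p] /= M * S
    return W

n_hi, S = 24, 1.0
rng = np.random.default_rng(1)
shifts = rng.uniform(0, 2, 8)            # 8 input images with random translations
null_dims = {}
for M in (2.0, 1.5):
    A = np.vstack([weights_periodic(n_hi, M, S, d) for d in shifts])
    sv = np.linalg.svd(A, compute_uv=False)
    null_dims[M] = int((sv < 1e-10).sum())
print(null_dims)                          # {2.0: 1, 1.5: 0}
```

Adding more shifted images never removes the M*S = 2 null direction, mirroring the statement that the degeneracy holds for all choices of the c_i.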

Fig. 4. In this experiment, no smoothness prior is used and gradient descent is run for a sufficiently long time that the starting image (which is smooth) does not bias the results. The input in both cases consisted of multiple down-sampled images similar to the one at the top of the second column in Fig. 1. Specifically, 1,024 randomly translated images were used as input. Exactly the same inputs are used for the two experiments. The only difference is the magnification factor in the super-resolution algorithm. The output for M = 1.5 is therefore actually smaller than that for M = 2.0 (and was enlarged to the same size in Fig. 4 for display purposes only).

As can be seen in Fig. 4, for M = 2.0, the (additive) error is approximately a periodic image with period 2 pixels. For M = 1.5, the equations are invertible and, so, a smooth solution is found, even though no smoothness prior was used. For M = 2.0, the fact that the problem is not invertible does not have any practical significance. Adequate solutions can be obtained by simply adding a smoothness prior to the reconstruction constraints, as shown in Fig. 4. For M >> 2.0, the situation is different, however. As will be shown in the third part of our analysis, it is the rapid rate of increase of the dimension of the null space that is the root cause of the problems for large M.

3.2 Conditioning Analysis for Arbitrary Point Spread Functions

Any linear system that is close to being not invertible is usually ill-conditioned. It is no surprise then that changing from a square point spread function to an arbitrary function PSF_i = w_i * a_i results in an ill-conditioned system, as we now show in the second part of our analysis:

Theorem 2. Suppose w_i(x) is any function for which w_i(x) >= 0 for all x and \int w_i(x) dx = 1. Then, the condition number of the following linear system grows at least as fast as (M \cdot S)^2:

Lo_i(m) = \sum_p W_i(m, p) \cdot Su(p), where W_i(m, p) = (1/M^2) \int_p PSF_i((1/M) z + c_i - m) \, dz,   (13)

where PSF_i = w_i * a_i.

Proof. We first prove the theorem for the square point spread function a_i(.) (i.e., for (12)) and then generalize. The condition number of an m x n matrix A is defined [42] as:

Cond(A) = w_1 / w_n,   (14)

where w_1 >= ... >= w_n >= 0 are the singular values of A. The one property of singular values that we need is that, if x is any vector:

w_1 >= ||A x||_2 / ||x||_2 >= w_n,   (15)

where ||.||_2 is the L2 norm. (This result follows immediately from the SVD A = U S V^T. The matrices U and V^T do not affect the L2 norm of a vector since their columns are orthonormal. Equation (15) clearly holds for S.) It follows immediately that, if x and y are any two vectors, then:

Cond(A) >= (||x||_2 / ||y||_2) \cdot (||A y||_2 / ||A x||_2).   (16)

It follows from (12) that, if Su(p) = 1 for all p, then Lo_i(m) = 1 for all m. Setting Su(p) = Su(p, q) to be the checkerboard pattern (1 if p + q is even, -1 if odd), we find that |Lo_i(m)| <= 1/(M \cdot S)^2 since the integration of the checkerboard over any square in the real plane lies in the range [-1, 1]. (Proof omitted.) By setting y to be the first of these vectors and x the second, it follows immediately from (16) that Cond(A) >= (M \cdot S)^2.

To generalize to arbitrary point spread functions, note that (13) can be rewritten as:

Lo_i(m) = \int_{Su} (Su(z) / M^2) \cdot PSF_i((1/M) z + c_i - m) \, dz = (PSF_i * \bar{Su})(c_i - m) = (w_i * a_i * \bar{Su})(c_i - m),   (17)

where we have changed variables x = (1/M) z and set \bar{Su}(x) = Su(-M \cdot x). The example vectors x and y used above can still be used to prove the same result with a_i replaced by w_i * a_i using the standard properties of the convolution operator: 1) the convolution of a function that takes the value 1 everywhere with a function that is positive and has unit area is also one everywhere and 2) the maximum absolute value of the convolution of a function with a positive function that has unit area cannot increase during the convolution. Hence, the desired (more general) result follows immediately from the last line of (17) and the properties of x and y used above. []
This theorem is more general than the previous one because it applies to (essentially) arbitrary point spread functions. On the other hand, it is a weaker result (in some situations) because it only predicts that super-resolution is ill-conditioned (rather than not invertible). This theorem on its own, therefore, does not entirely explain the poor performance of super-resolution. As we showed in Fig. 4, problems that are ill-conditioned (or even not invertible, where the condition number is infinite) can often be solved by simply adding a smoothness prior. The not invertible super-resolution problem in Fig. 4a is solved in Fig. 4c in this way. Several researchers have performed conditioning analysis of various forms of super-resolution, including [21], [48], [43]. Although useful, none of these results fully explain the drop-off in performance with the magnification M. The weakness of conditioning analysis is that an ill-conditioned system may be ill-conditioned because of a single "almost singular value." As indicated by the rapid growth in the dimension of the null space in Theorem 1, super-resolution has a large number of "almost singular values" for large magnifications. This is the real cause of the difficulties seen in Fig. 1. One way to show this is to derive the volume of solutions, as we now do in the third part of our analysis.

3.3 Volume of Solutions for Arbitrary Point Spread Functions

If we could work with noiseless, real-valued quantities and perform arbitrary precision arithmetic, then the fact that the reconstruction constraints are ill-conditioned might not be a problem. In reality, however, images are always intensity discretized (typically to 8-bit values in the range 0-255 gray levels). There will therefore always be noise in the measurements, even if it is only plus-or-minus half a gray-level. Suppose that int[·] denotes the operator which takes a real-valued irradiance measurement and turns it into an integer-valued intensity. If we incorporate this quantization into our image formation model, then (17) becomes:

    Lo_i(m) = int[ ∫ (Su(z)/M²) · PSF_i(z/M + c_i − m) dz ].                 (18)

Suppose that Su is a fixed size image² with n pixels. We then have:

Theorem 3. If int[·] is the standard rounding operator which replaces a real number with the nearest integer, then the volume of the set of solutions of (18) grows asymptotically at least as fast as (M · S)^{2n} (treating n as a constant and M and S as variables).

Proof. First, note that the space of solutions is convex since integration is a linear operation. Next, note that one solution of (18) is the solution to:

    Lo_i(m) − 0.5 = ∫ (Su(z)/M²) · PSF_i(z/M + c_i − m) dz.                  (19)

The definition of the point spread function as PSF_i = ω_i ∗ a_i and the properties of the convolution give 0 ≤ PSF_i ≤ 1/S². Therefore, adding (M · S)² to any pixel in Su is still a solution since the right-hand side of (19) increases by at most 1. (The integrand is increased by less than one gray-level in the pixel, which only has an area of one unit.) The volume of solutions of (18), therefore, contains an n-dimensional simplex, where the angles at one vertex are all right-angles and the sides are all (M · S)² units long. The volume of such a simplex grows asymptotically like (M · S)^{2n} (treating n as a constant and M and S as variables). The desired result follows. □

This third and final theorem provides the best explanation of the super-resolution results presented in Fig. 1. For large magnification factors M, there is a huge volume of solutions to the discretized reconstruction constraints in (18). The smoothness prior which is added to resolve this ambiguity simply ensures that it is one of the overly smooth solutions that is chosen. (Of course, without the prior, any solution might be chosen, which would, generally, be even worse. As mentioned in Section 1, the final solution is really only an approximate solution of the reconstruction constraints since both sets of constraints are added as least squares constraints.)

In Fig. 5, we present quantitative results to illustrate Theorems 2 and 3. We again used the reconstruction-based algorithm [26]. We verified our implementation in two ways: 1) we checked that, for small magnification factors and no prior, our implementation does yield (essentially) perfect reconstructions and 2) for magnifications of four, we checked that our numerical results are consistent with those in [26]. We also tried the related algorithm of [47] and obtained very similar results.

Using the same inputs as Fig. 1, we plot the reconstruction error against the magnification, i.e., the difference between the reconstructed high-resolution image and the original. We compare this error with the residual error, i.e., the difference between the low-resolution inputs and their predictions from the reconstructed high-resolution image. As expected for an ill-conditioned system, the reconstruction error is much higher than the residual. We also compare with a rough prediction of the reconstruction error obtained by multiplying the lower bound on the condition number, (M · S)², by an estimate of the expected residual assuming that the gray-levels are discretized from a uniform distribution. For low magnification factors, this estimate is an underestimate because the prior is unnecessary for noise free data, i.e., better results would be obtained without the prior. For high magnifications, the prediction is an overestimate because the local smoothness assumption does help the reconstruction (albeit at the expense of overly smooth results).

We also plot interpolation results in Fig. 5, i.e., just using the reconstruction constraints for one image (as was proposed, for example, in [46]). The difference between this curve and the reconstruction error curve is a measure of how much information the reconstruction constraints provide. Similarly, the difference between the predicted error and the reconstruction error is a measure of how much information the smoothness prior provides. For a magnification of 16, we see that the prior provides more information than the super-resolution reconstruction constraints. This, then, is an alternative interpretation of why the results in Fig. 1 are so smooth.

2. There are at least two ways we could analyze (18). One is to assume that the super-resolution image is of fixed size and fixed resolution. It is the resolution of the inputs that varies. The other way is to assume that the input images are fixed and the resolution of the super-resolution image varies. We chose to analyze (18) in the first case. We assume that the super-resolution image is fixed and that we are given a sequence of super-resolution tasks, each with different input images and different pixel sizes. The advantage of this approach is that the quantity that we are trying to estimate stays the same. Moreover, the size of the space of all super-resolution images stays the same.
Fig. 5. An illustration of Theorems 2 and 3 using the same inputs as in Fig. 1. The reconstruction error is much higher than the residual, as would be
expected for an ill-conditioned system. For low magnifications, the prior is unnecessary and, so, the results are worse than predicted. For high
magnifications, the prior does help, but at the price of overly smooth results. (See Fig. 1.) A rough estimate of the amount of information provided by
the reconstruction constraints is given by the improvement of the reconstruction error over the single image interpolation error. Similarly, the
improvement from the predicted error to the reconstruction error is an estimate of the amount of information provided by the smoothness prior. By
this measure, the smoothness prior provides more information than the reconstruction constraints for a magnification of 16.
4 RECOGNITION-BASED SUPER-RESOLUTION OR HALLUCINATION

How then is it possible to perform high magnification super-resolution without the results looking overly smooth? As we have just shown, the required high-frequency information was lost from the reconstruction constraints when the input images were discretized to 8-bit values. Generic smoothness priors may help regularize the problem, but cannot replace the missing information.

As outlined in the introduction, our goal in this section is to develop a super-resolution algorithm that uses the information contained in a collection of recognition decisions (in addition to the reconstruction constraints). Our approach is to embed the results of the recognition decisions in a recognition-based prior on the solution of the reconstruction constraints, thereby resolving the inherent ambiguity in their solution (see Section 3.3).

4.1 Bayesian MAP Formulation of Super-Resolution

We begin with the (standard) Bayesian formulation of super-resolution [13], [47], [26], [21]. In this approach, super-resolution is posed as finding the maximum a posteriori (or MAP) super-resolution image Su, i.e., estimating arg max_Su Pr[Su | Lo_i]. Bayes law for this estimation problem is:

    Pr[Su | Lo_i] = Pr[Lo_i | Su] · Pr[Su] / Pr[Lo_i].                       (20)

Since Pr[Lo_i] is a constant because the images Lo_i are inputs (and so are "known") and since the logarithm function is a monotonically increasing function, we have:

    arg max_Su Pr[Su | Lo_i] = arg min_Su ( −ln Pr[Lo_i | Su] − ln Pr[Su] ). (21)

The first term in this expression, −ln Pr[Lo_i | Su], is the (negative log) probability of reconstructing the low-resolution images Lo_i, given that the super-resolution image is Su. It is therefore normally set to be a quadratic (i.e., energy) function of the error in the reconstruction constraints:

    −ln Pr[Lo_i | Su] = (1/(2σ²)) Σ_{m,i} [ Lo_i(m) − Σ_p Su(p) ∫_p PSF_i(r_i(z) − m) · |∂r_i/∂z| dz ]².   (22)

In using this expression, we are implicitly assuming that the noise is independently and identically distributed (across both the images Lo_i and the pixels m) and is Gaussian with covariance σ². (All of these assumptions are standard [13], [47], [26], [21].) Minimizing the expression in (22) is then equivalent to finding the (unweighted) least-squares solution of the reconstruction constraints.

4.2 Recognition-Based Priors for Super-Resolution

The second term on the right-hand side of (21) is (the negative logarithm of) the prior, −ln Pr[Su]. Usually, this prior on the super-resolution image is chosen to be a simple smoothness prior [13], [47], [26], [21]. Instead, we would like to choose it so that it depends upon a set of recognition decisions. Suppose that the outputs of the recognition decisions partition the set of inputs (i.e., the low-resolution input images Lo_i) into a set of subclasses {C_{i,k} | k = 1, 2, . . .}. We then define a recognition-based prior as one that can be written in the following form:

    Pr[Su] = Σ_k Pr[Su | Lo_i ∈ C_{i,k}] · Pr[Lo_i ∈ C_{i,k}].               (23)

Essentially, there is a separate prior Pr[Su | Lo_i ∈ C_{i,k}] for each possible partition C_{i,k}. Once the low-resolution input images Lo_i are available, the various recognition algorithms can be applied and it can be determined which partition the inputs lie in. The recognition-based prior Pr[Su] then
Fig. 6. The (a) Gaussian pyramid, (b) Laplacian pyramid, and (c) First Derivative pyramids of an image of a face. (We also use two second derivatives, but omit them from the figure.) We combine these pyramids into a single multivalued pyramid, where we store a vector of the Laplacian and the derivatives at each pixel. The Parent Structure vector PS_l(m, n) in the lth level of the pyramid consists of the vector of values for that pixel, the vector for its parent in the (l+1)th level, the vector of its parent's parent, etc. [16]. The Parent Structure vector is therefore a high-dimensional vector of derivatives computed at various scales. In our algorithms, recognition means finding the training sample with the most similar Parent Structure vector.
reduces to the more specific prior Pr[Su | Lo_i ∈ C_{i,k}]. This prior can be made more powerful than the overall prior Pr[Su] because it can be tailored to the (smaller) subset of the input domain C_{i,k}.

4.3 Multiscale Derivative Features: The Parent Structure

We decided to try to recognize generic local image features (rather than higher level concepts such as human faces or ASCII characters) because we want to apply our algorithm to a variety of phenomena. Motivated by [16], [17], we also decided to use multiscale features. In particular, given an image I, we first form its Gaussian pyramid G_0(I), . . . , G_N(I) [11]. Afterwards, we also form its Laplacian pyramid L_0(I), . . . , L_N(I) [12], the horizontal H_0(I), . . . , H_N(I) and vertical V_0(I), . . . , V_N(I) first derivatives of the Gaussian pyramid, and the horizontal H²_0(I), . . . , H²_N(I) and vertical V²_0(I), . . . , V²_N(I) second derivatives of the Gaussian pyramid [1]. (See Fig. 6 for examples of these pyramids for an image of a face.) Finally, we form a pyramid of features:

    F_j(I) = ( L_j(I), H_j(I), V_j(I), H²_j(I), V²_j(I) )  for j = 0, . . . , N.   (24)

The pyramid F_0(I), . . . , F_N(I) is a pyramid where there are five values stored at each pixel, the Laplacian and the four derivatives, rather than the single value typically stored in most pyramids. (The choice of the features in (24) is an instance of the "feature selection" problem. For example, steerable filters [24] could be used instead, or the second derivatives could be dropped if they are too noisy. We found the performance of our algorithms to be largely independent of the choice of features. The selection of the optimal features is outside the scope of this paper.)

Then, given a pixel in the low-resolution image that we are performing super-resolution on, we want to find (i.e., recognize) a pixel in a collection of training data that is locally "similar." By similar, we mean that both the Laplacian and the image derivatives are approximately the same, at all scales. To capture this notion, we define the Parent Structure vector [16] of a pixel (m, n) in the lth level of the feature pyramid F_0(I), . . . , F_N(I) to be:

    PS_l(I)(m, n) = ( F_l(I)(m, n), F_{l+1}(I)(⌊m/2⌋, ⌊n/2⌋), . . . , F_N(I)(⌊m/2^{N−l}⌋, ⌊n/2^{N−l}⌋) ).   (25)

As illustrated in Fig. 6, the Parent Structure vector at any particular pixel in the pyramid consists of the feature vector at that pixel, the feature vector of the parent of that pixel, the feature vector of its parent, and so on. Exactly as in [16], our notion of two pixels being similar is then that their Parent Structure vectors are approximately the same (as measured by some norm).

4.4 Recognition as Finding the Nearest-Neighbor Parent Structure

Suppose we have a set of high-resolution training images T_j. We can then form all of their feature pyramids F_0(T_j), . . . , F_N(T_j). Also, suppose that we are given a low-resolution input image Lo_i. Finally, suppose that this image is at a resolution that is M = 2^k times smaller than the training samples. (The image may have to be interpolated to make this ratio exactly a power of two. Since the interpolated image is immediately down-sampled to create the pyramid, it is only the lowest level of the pyramid features that are affected by this interpolation. The overall effect on the prior is therefore very small.) We can then compute the feature pyramid for the input image from level k and upward: F_k(Lo_i), . . . , F_N(Lo_i). Fig. 7 shows an illustration of this scenario for k = 2.

For each pixel (m, n) in the input Lo_i independently, we compare its Parent Structure vector PS_k(Lo_i)(m, n) against all of the training Parent Structure vectors at the same level k, i.e., we compare against PS_k(T_j)(p, q) for all j and for all (p, q). The best matching image BI_i(m, n) = j and the best matching pixel BP_i(m, n) = (p, q) are stored as the output of the recognition decision, independently for each pixel (m, n) in Lo_i. (We found the performance to be largely independent of the distance function used to determine the best matching Parent Structure vector. We actually used a weighted L2-norm, giving the derivative components half as much weight as the Laplacian values and reducing the weight by a factor of two for each increase in the pyramid level.)
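The feature pyramid of (24) and the Parent Structure vector of (25) can be sketched as follows. This is a simplified stand-in, not the paper's implementation: 2 × 2 block averaging replaces the Gaussian kernel of [11], and np.gradient supplies the derivatives.

```python
import numpy as np

def gaussian_pyramid(img, N):
    """Stand-in Gaussian pyramid: 2x2 block averaging (a box filter rather
    than the Gaussian kernel of [11]; illustrative only)."""
    pyr = [img]
    for _ in range(N):
        h, w = pyr[-1].shape
        pyr.append(pyr[-1][: h - h % 2, : w - w % 2]
                   .reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

def feature_pyramid(img, N):
    """F_j of (24): Laplacian plus first/second derivatives at each level."""
    G = gaussian_pyramid(img, N)
    feats = []
    for j in range(N + 1):
        g = G[j]
        H = np.gradient(g, axis=1); V = np.gradient(g, axis=0)
        H2 = np.gradient(H, axis=1); V2 = np.gradient(V, axis=0)
        # Laplacian: level minus expanded parent (top level keeps itself)
        up = (np.kron(G[j + 1], np.ones((2, 2)))[: g.shape[0], : g.shape[1]]
              if j < N else np.zeros_like(g))
        L = g - up
        feats.append(np.stack([L, H, V, H2, V2], axis=-1))
    return feats

def parent_structure(feats, l, m, n):
    """PS_l of (25): the feature vectors of (m, n) and all its ancestors."""
    return np.concatenate([feats[j][m >> (j - l), n >> (j - l)]
                           for j in range(l, len(feats))])
```

For a 16 × 16 image with N = 2, each level-0 Parent Structure vector concatenates three 5-component feature vectors into a 15-dimensional descriptor.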
Fig. 7. (a) High-resolution training images T_j. (b) Low-resolution input image Lo_i. (c) Recognition output. We compute the feature pyramids F_0(T_j), . . . , F_N(T_j) for the training images T_j and the feature pyramid F_k(Lo_i), . . . , F_N(Lo_i) for the low-resolution input image Lo_i. For each pixel in the low-resolution image, we find (i.e., recognize) the closest matching Parent Structure in the high-resolution data. We record and output the best matching image BI_i and the pixel location of the best matching Parent Structure BP_i. Note that these data structures are both defined independently for each pixel (m, n) in the images Lo_i.
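The nearest-neighbor recognition step of Section 4.4 then reduces to a weighted L2-norm search over the training Parent Structure vectors. A minimal sketch (the array shapes and the brute-force search are our assumptions; the weight vector could encode the halving rule described in the text):

```python
import numpy as np

def recognize(ps_input, ps_train, w):
    """Nearest-neighbor recognition of Section 4.4: for each input Parent
    Structure vector, find the training image BI and pixel BP whose Parent
    Structure vector is closest under a weighted L2-norm.
    ps_input: (P, d); ps_train: (J, Q, d) for J training images with Q
    Parent Structure vectors each; w: (d,) per-component weights."""
    BI = np.empty(len(ps_input), dtype=int)
    BP = np.empty(len(ps_input), dtype=int)
    for i, v in enumerate(ps_input):
        d2 = ((ps_train - v) ** 2 * w).sum(axis=-1)   # (J, Q) distances
        BI[i], BP[i] = np.unravel_index(np.argmin(d2), d2.shape)
    return BI, BP
```

In practice, the search would be restricted to spatially corresponding pixels for frontal faces, as the text explains; a k-d tree or similar index could replace the brute-force scan.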
Recognition in our hallucination algorithm therefore means finding the closest matching pixel in the training data in the sense that the Parent Structure vectors of the two pixels are the most similar. This search is, in general, performed over all pixels in all of the images in the training data. If we have frontal images of faces, however, we restrict this search to considering only the corresponding pixels in the training data. In this way, we treat each pixel in the input image differently, depending on its spatial location, similarly to the "class-based" approach of [44].

4.5 A Recognition-Based Gradient Prior

For each pixel (m, n) in the input image Lo_i, we have recognized the pixel that is the most similar in the training data, specifically, the pixel BP_i(m, n) in the kth level of the pyramid for training image T_{BI_i(m,n)}. These recognition decisions partition the inputs Lo_i into a collection of subclasses, as required by the recognition-based prior described in Section 4.2. If we denote the subclasses by C_{i,BP_i,BI_i} (i.e., using a multidimensional index rather than k), (23) can be rewritten as:

    Pr[Su] = Σ_{BP_i,BI_i} Pr[Su | Lo_i ∈ C_{i,BP_i,BI_i}] · Pr[Lo_i ∈ C_{i,BP_i,BI_i}],   (26)

where Pr[Su | Lo_i ∈ C_{i,BP_i,BI_i}] is the probability that the super-resolution image is Su, given that the input images Lo_i lie in the subclass that will be recognized to have BP_i as the closest matching pixel in the training image T_{BI_i} (in the kth level of the pyramid).

We now need to define Pr[Su | Lo_i ∈ C_{i,BP_i,BI_i}]. We decided to make this recognition-based prior a function of the gradient because the base, or average, intensities in the super-resolution image are defined by the reconstruction constraints. It is the high-frequency gradient information that is missing. Specifically, we want to define Pr[Su | Lo_i ∈ C_{i,BP_i,BI_i}] to encourage the gradient of the super-resolution image to be close to the gradient of the closest matching training samples.

Each low-resolution input image Lo_i has a (different) closest matching (Parent Structure) training sample for each pixel. Moreover, each such Parent Structure corresponds to a number of different pixels in the 0th level of the pyramid (2^k of them, to be precise; see also Fig. 7). We therefore impose a separate gradient constraint for each pixel (m, n) in the 0th level of the pyramid (and for each Lo_i). Now, the best matching pixel BP_i is only defined on the kth level of the pyramid. For notational convenience, therefore, given a pixel (m, n) on the 0th level of the pyramid, define the best matching pixel on the 0th level of the pyramid to be:

    BP_i(m, n) ≡ 2^k · BP_i(⌊m/2^k⌋, ⌊n/2^k⌋) + (m, n) − 2^k · (⌊m/2^k⌋, ⌊n/2^k⌋).   (27)

Also, for notational convenience, define the best matching image as BI_i(m, n) ≡ BI_i(⌊m/2^k⌋, ⌊n/2^k⌋).

If (m, n) is a pixel in the 0th level of the pyramid for image Lo_i, the corresponding pixel in the super-resolution image Su is r_i^{−1}(m/2^k, n/2^k). We therefore want to impose the constraint that the first derivatives of Su at this point should equal the derivatives of the closest matching pixel (Parent Structure) in the training data. Parametric expressions for H_0(Su) and V_0(Su) at r_i^{−1}(m/2^k, n/2^k) can easily be derived as linear functions of the unknown pixels in the high-resolution image Su. We assume that the errors in the gradient values between the recognized training samples and the super-resolution image are independently and identically distributed (across both the images Lo_i and the pixels (m, n)) and, moreover, that they are Gaussian with covariance σ_r². Therefore:

    −ln Pr[Su | Lo_i ∈ C_{i,BP_i,BI_i}] =
        (1/(2σ_r²)) Σ_{i,m,n} [ H_0(Su)(r_i^{−1}(m/2^k, n/2^k)) − H_0(T_{BI_i(m,n)})(BP_i(m, n)) ]²
      + (1/(2σ_r²)) Σ_{i,m,n} [ V_0(Su)(r_i^{−1}(m/2^k, n/2^k)) − V_0(T_{BI_i(m,n)})(BP_i(m, n)) ]².   (28)

This expression enforces the constraints that the gradient of the super-resolution image Su should be equal to the gradient of the best matching training image (separately for each pixel (m, n) in each input image Lo_i). These constraints are also linear in the unknown pixels of Su.
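As Section 4.6 notes below, (22) and (28) together form one high-dimensional linear least-squares problem. A toy sketch of that stacking (ours; an invented first-difference prior stands in for the recognition-based gradient prior, and toy matrices replace the real image formation model):

```python
import numpy as np

def map_superres(W_list, lo_list, G, g, sigma, sigma_r):
    """Solve the MAP problem of (21) when both terms are quadratic: stack
    the reconstruction constraints (22) and a linear prior constraint
    G @ u = g (a stand-in for the gradient prior (28)), weighted by the
    inverse noise standard deviations, and take the least-squares solution."""
    A = np.vstack([W / sigma for W in W_list] + [G / sigma_r])
    b = np.concatenate([lo / sigma for lo in lo_list] + [g / sigma_r])
    u, *_ = np.linalg.lstsq(A, b, rcond=None)
    return u

# Toy 1D example: 8 unknowns, one 4-pixel low-res image (M = 2 box PSF),
# and a first-difference prior pulling neighboring pixels together.
n, M = 8, 2
W = np.kron(np.eye(n // M), np.full((1, M), 1.0 / M))   # box down-sampling
D = np.eye(n, k=1)[: n - 1] - np.eye(n)[: n - 1]        # first differences
lo = np.array([1.0, 2.0, 3.0, 4.0])
u = map_superres([W], [lo], D, np.zeros(n - 1), sigma=1.0, sigma_r=20.0)
print(np.round(u, 3))
```

With sigma_r much larger than sigma, the reconstruction constraints act as near-hard constraints (the weighting the text describes), and the recovered signal reproduces the low-resolution measurements almost exactly.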
Fig. 8. (a) Variation with the number of images and (b) variation with additive noise. A comparison of Schultz and Stevenson [47] and Hardie et al. [26]. In (a), we plot the RMS pixel intensity error computed across the 100 image test set against the number of low-resolution input images. Our algorithm outperforms the traditional super-resolution algorithms across the entire range. In (b), we vary the amount of additive noise. Again, we find that our algorithm does better than the traditional super-resolution algorithms, especially as the standard deviation of the noise increases.
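Section 4.7.1 below simulates the low-resolution inputs by randomly translating the high-resolution image by subpixel amounts and down-sampling. A minimal sketch of such a generator (ours; the paper does not specify the interpolator, so a simple bilinear shift is assumed):

```python
import numpy as np

def make_lowres_inputs(hi, num, M, rng):
    """Simulate super-resolution inputs: translate the high-res image by a
    random sub-pixel amount via bilinear interpolation (an assumption; the
    paper does not name the interpolator), then box-average down by the
    magnification factor M."""
    outs = []
    for _ in range(num):
        dy, dx = rng.uniform(0.0, 1.0, size=2)
        # bilinear sample of hi at offsets (dy, dx)
        f = (hi[:-1, :-1] * (1 - dy) * (1 - dx) + hi[1:, :-1] * dy * (1 - dx)
             + hi[:-1, 1:] * (1 - dy) * dx + hi[1:, 1:] * dy * dx)
        h, w = (f.shape[0] // M) * M, (f.shape[1] // M) * M
        lo = f[:h, :w].reshape(h // M, M, w // M, M).mean(axis=(1, 3))
        outs.append(lo)
    return outs

rng = np.random.default_rng(4)
hi = rng.uniform(0, 255, size=(97, 129))
los = make_lowres_inputs(hi, 4, 4, rng)
print(len(los), los[0].shape)   # 4 low-res inputs of size 24 x 32
```

The sizes mirror the experiments: a 96 × 128 face region reduced by M = 4 gives the 24 × 32 pixel inputs used in Fig. 8.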
4.6 Algorithm Practicalities

Equations (21), (22), (26), and (28) form a high-dimensional linear least squares problem. The constraints in (22) are the standard super-resolution reconstruction constraints. Those in (28) are the recognition-based prior. The relative weights of these constraints are defined by the noise covariances σ² and σ_r². We assume that the reconstruction constraints are the more reliable ones and, so, set σ² ≪ σ_r² (typically, σ_r² = 20 · σ²) to make them almost hard constraints.

The number of unknowns in the linear system is equal to the total number of pixels in the super-resolution image Su. Directly inverting a linear system of such size can prove problematic. We therefore implemented a gradient descent algorithm (using a diagonal approximation to the Hessian [42] to set the step size, in a similar way to [52]). Since the error function is quadratic, the algorithm converges to the single global minimum without any problem.

4.7 Experimental Results on Human Faces

Our experiments for human faces were conducted with a subset of the FERET data set [40] consisting of 596 images of 278 individuals (92 women and 186 men). Each person appears between two and four times. Most people appear twice, with the images taken on the same day under approximately the same illumination conditions, but with different facial expressions (one image is usually neutral, the other typically a smile). A small number of people appear four times, with the images taken over two different days, separated by several months.

The images in the FERET data set are 256 × 384 pixels; however, the area of the image occupied by the face varies considerably. Most of the faces are around 96 × 128 pixels or larger. In the class-based approach [44], the input images (which are all frontal) need to be aligned so that we can assume that the same part of the face appears in roughly the same part of the image every time. This allows us to obtain the best results. This alignment was performed by hand marking the location of three points, the centers of the two eyes and the lower tip of the nose. These three points define an affine warp [8], which was used to warp the images into a canonical form. The canonical image is 96 × 128 pixels with the right eye at (31, 63), the left eye at (63, 63), and the lower tip of the nose at (47, 83). These 96 × 128 pixel images were then used as the training samples T_j. (In most of our experiments, we also added eight synthetic variations of each image to the training set by translating the image eight times, each time by a small amount. This step enhances the performance of our algorithm slightly, although it is not vital to obtain good performance.)

We used a "leave-one-out" methodology to test our algorithm. To test on any particular person, we removed all occurrences of that individual from the training set. We then trained the algorithm on the reduced training set and tested on the images of the individual that had been removed. Because this process is quite time consuming, we used a test set of 100 images of 100 different individuals rather than the entire training set. The test set was selected at random from the training set. As will be seen, the test set spans both sex and race reasonably well.

4.7.1 Comparison with Existing Super-Resolution Algorithms

We initially restrict attention to the case of enhancing 24 × 32 pixel images four times to give 96 × 128 pixel images. Later, we will consider the variation in performance with the magnification factor. We simulate the multiple slightly translated images required for super-resolution using the FERET database by randomly translating the original FERET images multiple times by subpixel amounts before down-sampling them to form the low-resolution input images.

In our first set of experiments, we compare our algorithm with those of Hardie et al. [26] and Schultz and Stevenson [47]. In Fig. 8a, we plot the RMS pixel error against the number of low-resolution inputs, computed over the 100 image test set. (We compute the RMS error using the original high-resolution image used to synthesize the inputs.) We also plot results for cubic B-spline interpolation [41] for comparison. Since this algorithm is an interpolation algorithm, only one image is ever used and, so, the performance is independent of the number of inputs.

In Fig. 8a, we see that our hallucination algorithm does outperform the reconstruction-based super-resolution algorithms, from one input image to 25. The improvement is consistent across the number of input images and is around 20 percent. The improvement is also largely independent of the actual input. In particular, Fig. 9 contains the best and worst results obtained across the entire test set in terms of
Fig. 9. The best and worst results in Fig. 8a in terms of the RMS error of the hallucination algorithm for nine input images. In (a) input 24 × 32, (b) hallucinated, (c) Hardie et al. [26], (d) original, and (e) cubic B-spline, we display the results for the best performing image in the 100 image test set. The results for the worst image are presented in (f) input 24 × 32, (g) hallucinated, (h) Hardie et al. [26], (i) original, and (j) cubic B-spline. (The results for Schultz and Stevenson are similar to those for Hardie et al. and are omitted.) There is little difference in image quality between the best and worst hallucinated results. The hallucinated results are also visibly better than those for Hardie et al.
Fig. 10. An example from Fig. 8b of the variation in the performance of the hallucination algorithm with additive zero-mean, white Gaussian noise. The outputs of the hallucination algorithm are shown for various levels of noise. As can be seen, the output is hardly affected until 4-bits of intensity noise have been added to the inputs. This is because the hallucination algorithm uses the strong recognition-based face prior to generate smooth, face-like images, however noisy the input images are. At around 4-bits of noise, the recognition decisions begin to fail and the performance of the algorithm begins to drop off. (a) Std. dev. 1.0, (b) std. dev. 2.0, (c) std. dev. 4.0, (d) std. dev. 8.0, and (e) std. dev. 16.0.
the RMS error of the hallucination algorithm for nine low-resolution inputs. As can be seen, there is little difference between the best results in Figs. 9a, 9b, 9c, 9d, and 9e and the worst ones in Figs. 9f, 9g, 9h, 9i, and 9j. Notice, also, how the hallucinated results are a dramatic improvement over the low-resolution input and, moreover, are visibly sharper than the results for Hardie et al.

4.7.2 Robustness to Additive Intensity Noise

Fig. 8b contains the results of an experiment investigating the robustness of the three super-resolution algorithms to additive intensity noise. In this experiment, we added zero-mean, white Gaussian noise to the low-resolution images before passing them as inputs to the algorithms. In the figure, the RMS pixel intensity error is plotted against the standard deviation of the additive noise. The results shown are for four low-resolution input images and, again, the results are an average over the 100 image test set. (The results for cubic B-spline interpolation just use one input image, of course.) As would be expected, the performance of all four algorithms gets much worse as the standard deviation of the noise increases. The hallucination algorithm (and cubic B-spline interpolation), however, seem somewhat more robust than the traditional reconstruction-based super-resolution algorithms. The reason for this increased robustness is probably that the hallucination algorithm always tends to generate smooth, face-like images (because of the strong recognition-based prior), however noisy the inputs are. So long as the recognition decisions are not affected too much, the results should look reasonable. One example of how the output of the hallucination algorithm degrades with the amount of additive noise is presented in Fig. 10.

4.7.3 Variation in Performance with the Input Image Size

We do not expect our hallucination algorithm to work for all sizes of input. Once the input gets too small, the recognition decisions will be based on essentially no information. In the limit that the input image is just a single pixel, the algorithm will always generate the same face (for a single input image), but with different average gray levels. We therefore investigated the lowest resolution at which our hallucination algorithm works reasonably well.

In Fig. 11, we show example results for one face in the test set for four different input sizes. (All of the results use just four input images.) We see that the algorithm works reasonably well down to 12 × 16 pixels, but, for 6 × 8 pixel images, it produces a face that appears to be a pieced-together
Fig. 11. The variation in the performance of our hallucination algorithm with the input image size. From the example in the top two rows, we see that the algorithm works well down to 12 × 16 pixel images, but not for 6 × 8 pixel images. (See also Fig. 12.) The improvement in the RMS error over the 100 image test set in the last row confirms the fact that the algorithm begins to break down between these two image sizes.
combination of a variety of faces. This is not too surprising because the 6 × 8 pixel input image is not even clearly an image of a face. (Many face detectors, such as [45], use input windows of around 20 × 20 pixels, so it is unlikely that the 6 × 8 pixel image would be detected as a face.)

In the last row of Fig. 11, we give numerical results of the average improvement in the RMS error over cubic B-spline interpolation (computed over the 100 image test set). We see that, for 24 × 32 and 12 × 16 pixel images, the reduction in the error is very dramatic. It is roughly halved. For the other sizes, the results are less impressive, with the RMS error being cut by about 25 percent. For 6 × 8 pixel images, the reason is that the hallucination algorithm is beginning to break down. For 48 × 64 pixel images, the reason is that cubic B-spline does so well that it is hard to do much better.

The results for the 12 × 16 pixel image are excellent, however. (Also see Fig. 12, which contains several more examples.) The input images are barely recognizable as faces and the facial features, such as the eyes, eyebrows, and mouths, only consist of a handful of pixels. The outputs, albeit slightly noisy, are clearly recognizable to the human eye. The facial features are also clearly discernible. The hallucinated results are also a huge improvement over Schultz and Stevenson [47].

4.7.4 Results on Non-FERET Test Images

In our final experiment for human faces, we tried our algorithm on a number of images not in the FERET data set. In Fig. 13, we present hallucination results just using a single input image. As can be seen, the hallucinated results are a big improvement over cubic B-spline interpolation. The facial

algorithm is compared with that of Schultz and Stevenson. (The results for Hardie et al. are similar and, so, are omitted.) Our algorithm marginally outperforms both reconstruction-based algorithms. In particular, the eyebrows, the face contour, and the hairline are all a little sharper in the hallucinated result. The improvement is quite small, however. This is because the hallucination algorithm is currently very sensitive to illumination conditions and other photometric effects. We are working on making our algorithm more robust to such effects, as well as on several other refinements.

4.7.5 Results on Images Not Containing Faces

In Fig. 15, we briefly present a few results on images that do not contain faces, even though the algorithm has been trained on the FERET data set. (Fig. 15a is a random image, Fig. 15b is a miscellaneous image, and Fig. 15c is a constant image.) As might be expected, our algorithm hallucinates an outline of a face in all three cases, even though there is no face in the input. This is the reason we called our algorithm a "hallucination algorithm." (The hallucination algorithm naturally performs worse on images that it was not trained for than reconstruction-based algorithms do.)

4.8 Experimental Results on Text Data

We also applied our algorithm to text data. In particular, we grabbed an image of a window displaying one page of a letter and used the bit-map as the input. The image was split into disjoint training and test samples. (The training and test data therefore contain the same font, are at the
features, such as the eyes, nose, and mouth, are all enhanced same scale, and the data is noiseless. The training and test
and appear much sharper in the hallucinated result than in data are not registered in any way, however.) The results
either the low-resolution input or in the interpolated image. are presented in Fig. 16. The input in Fig. 16a is half the
In Fig. 14, we present results on a short eight-frame resolution of the original in Fig. 16f. The hallucinated result
video. The face region is marked by hard in the first frame in Fig. 16c is the best reconstruction of the text, both visually
and then tracked over the remainder of the sequence. Our and in terms of the RMS intensity error. For example,
1180 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 9, SEPTEMBER 2002

Fig. 12. Selected results for 12 x 16 pixel images, the smallest input size for which our hallucination algorithm works reliably. (The input consists of only four low-resolution input images.) Notice how sharp the hallucinated results are compared to the input and the results for the Schultz and Stevenson [47] algorithm. (The results for Hardie et al. [26] are similar to those for Schultz and Stevenson and so are omitted.) (a) Input 12 x 16 (one of four images). (b) Hallucinated. (c) Schultz and Stevenson. (d) Original. (e) Cubic B-spline.
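The RMS error comparisons reported above (and in the last row of Fig. 11) reduce to a simple computation. The sketch below is our own illustration, not the authors' evaluation code, and the arrays are synthetic stand-ins for the test images; it computes the RMS pixel intensity error of a reconstruction and its percentage improvement over a baseline such as cubic B-spline interpolation.

```python
import numpy as np

def rms_error(estimate, ground_truth):
    """Root-mean-square pixel intensity error between two images."""
    diff = estimate.astype(np.float64) - ground_truth.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))

def percent_improvement(rms_ours, rms_baseline):
    """Reduction in RMS error relative to a baseline, in percent."""
    return 100.0 * (rms_baseline - rms_ours) / rms_baseline

# Toy 8-bit images standing in for the real test set: the "baseline"
# reconstruction carries more noise than "ours" does.
rng = np.random.default_rng(0)
truth = rng.integers(0, 256, size=(96, 128)).astype(np.uint8)
baseline = np.clip(truth.astype(np.int16) + rng.integers(-60, 61, truth.shape), 0, 255)
ours = np.clip(truth.astype(np.int16) + rng.integers(-30, 31, truth.shape), 0, 255)

print(rms_error(baseline, truth), rms_error(ours, truth))
print(percent_improvement(rms_error(ours, truth), rms_error(baseline, truth)))
```

Averaging the per-image improvement over a test set, as done for the last row of Fig. 11, is then a one-line mean over such values.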

Fig. 13. Example results on a single image not in the FERET data set. The facial features, such as eyes, nose, and mouth, which are blurred and
unclear in the original cropped face, are enhanced and appear much sharper in the hallucinated image. In comparison, cubic B-spline interpolation
gives overly smooth results. (a) Cropped. (b) Cubic B-spline. (c) Hallucinated.
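The low-resolution inputs in these experiments are tied to the high-resolution image by the reconstruction constraints: spatial averaging over the photosensitive area of each sensor pixel, followed by quantization to 8-bit gray levels. The following is a minimal sketch of that forward model (our own illustration, assuming a simple block-averaging point spread function), showing how each doubling of the magnification factor averages four times as many high-resolution pixels into each observed value before rounding.

```python
import numpy as np

def downsample(hi, factor):
    """Area-average over factor x factor blocks: models a sensor pixel
    integrating light over its (nonzero) photosensitive area."""
    h, w = hi.shape
    blocks = hi.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def quantize(img):
    """Round to 8-bit gray levels, as assumed in the paper's analysis."""
    return np.clip(np.round(img), 0, 255).astype(np.uint8)

rng = np.random.default_rng(1)
hi = rng.integers(0, 256, size=(64, 64)).astype(np.float64)

# Larger magnification factors leave fewer, more heavily averaged
# constraints on the high-resolution image.
low2 = quantize(downsample(hi, 2))   # 32 x 32 observation
low8 = quantize(downsample(hi, 8))   # 8 x 8 observation
print(low2.shape, low8.shape)
```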

compare the appearance of the word "was" in the second sentence of the text in Figs. 16b, 16c, 16d, 16e, and 16f. The hallucination algorithm also has an RMS error of only 24.5 gray levels, compared to over 48.0 for the three other algorithms, almost a factor of two improvement.

Fig. 14. Example results on a short video of eight frames. (Only one of the input images and the cropped low-resolution face region are shown. The other seven input images are similar, except that the camera is slightly translated.) The results of the hallucination algorithm are slightly better than those of the Schultz and Stevenson algorithm, for example, around the eyebrows, the face contour, and the hairline. The improvement is only marginal because of the harsh illumination conditions. At present, the performance of our hallucination algorithm is very dependent upon such effects. (a) Cropped. (b) Schultz and Stevenson. (c) Hallucinated.

Fig. 15. The results of applying our hallucination algorithm to images not containing faces. (We have omitted the low-resolution input and have just displayed the original high-resolution image.) As is evident, a face is hallucinated by our algorithm even when none is present, hence the term "hallucination algorithm." (a) Random. (b) Miscellaneous. (c) Constant.

Fig. 16. The results of enhancing the resolution of a piece of text by a factor of two. (Just a single input image is used.) Our hallucination algorithm produces a clear, crisp image using no explicit knowledge that the input contains text. In particular, look at the word "was" in the second sentence. The RMS pixel intensity error is also almost a factor of two improvement over the other algorithms. (a) Input image. (b) Cubic B-spline, RMS error 51.3. (c) Hallucinated, RMS error 24.5. (d) Schultz and Stevenson, RMS error 48.4. (e) Hardie et al., RMS error 48.5. (f) Original high-resolution image.

5 DISCUSSION

In the first half of this paper, we showed that the super-resolution reconstruction constraints provide less and less useful information as the magnification factor increases. The major cause of this phenomenon is the spatial averaging over the photosensitive area, i.e., the fact that S is nonzero. The underlying reason that there are limits on reconstruction-based super-resolution is therefore the simple fact that CCD sensors must have a nonzero photosensitive area in order to be able to capture a nonzero number of photons of light.

Our analysis assumes quantized noiseless images, i.e., the intensities are 8-bit values created by rounding noiseless real-valued numbers. (It is this quantization that causes the loss of information which, when combined with spatial averaging, means that high-magnification super-resolution is not possible from the reconstruction constraints.) Without this assumption, however, it might be possible to increase the number of bits per pixel by averaging a collection of quantized noisy images (in an intelligent way). In practice, taking advantage of such information is very difficult. This point also does not affect another outcome of our analysis, which was to show that reconstruction-based super-resolution inherently trades off intensity resolution for spatial resolution.

In the second half of this paper, we showed that recognition processes may provide an additional source of information for super-resolution algorithms. In particular, we developed a "hallucination" algorithm for super-resolution and demonstrated that this algorithm can obtain far better results than existing reconstruction-based super-resolution algorithms, both visually and in terms of RMS pixel intensity error. Similar approaches may aid other (e.g., 3D) reconstruction tasks.

At this time, however, our hallucination algorithm is not robust enough to be used on typical surveillance video. Besides integrating it with a 3D head tracker to avoid the need for manual registration and to remove the restriction to frontal faces, the robustness of the algorithm to illumination conditions must be improved. This lack of robustness to illumination can be seen in Fig. 14, where the performance of our algorithm on images captured outdoors and in novel illumination conditions results in significantly less improvement over existing reconstruction-based algorithms than that seen in some of our other results. (The most appropriate figure to compare Fig. 14 with is Fig. 9.) We are currently working on these and other refinements.

The two halves of this paper are related in the following sense: Both halves are concerned with where the information comes from when super-resolution is performed and how strong that information is. The first half investigates how much information is contained in the reconstruction constraints and shows that the information content is fundamentally limited by the dynamic range of the images. The second half demonstrates that strong class-based priors can provide far more information than the simple smoothness priors that are used in existing super-resolution algorithms.

ACKNOWLEDGMENTS

The authors would like to thank Harry Shum for pointing out the work of Freeman et al. [25], Iain Matthews for pointing out the work of Edwards et al. [19], and Henry Schneiderman for suggesting we perform the conditioning analysis in Section 3.2. The authors would also like to thank numerous people for comments and suggestions, including Terry Boult,

Sundar Vedula, and everyone in the Face Group at Carnegie Mellon University. Finally, the authors would like to thank the anonymous reviewers for their comments and suggestions. The research described in this paper was supported by US Department of Defense Grant MDA-904-98-C-A915. A preliminary version of this paper [4] appeared in June 2000 in the IEEE Conference on Computer Vision and Pattern Recognition. Additional experimental results can be found in the technical report [1].

REFERENCES

[1] S. Baker and T. Kanade, "Hallucinating Faces," Technical Report CMU-RI-TR-99-32, The Robotics Inst., Carnegie Mellon Univ., 1999.
[2] S. Baker and T. Kanade, "Super-Resolution Optical Flow," Technical Report CMU-RI-TR-99-36, The Robotics Inst., Carnegie Mellon Univ., 1999.
[3] S. Baker and T. Kanade, "Hallucinating Faces," Proc. Fourth Int'l Conf. Automatic Face and Gesture Recognition, 2000.
[4] S. Baker and T. Kanade, "Limits on Super-Resolution and How to Break Them," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000.
[5] S. Baker, S.K. Nayar, and H. Murase, "Parametric Feature Detection," Int'l J. Computer Vision, vol. 27, no. 1, pp. 27-50, 1998.
[6] D.F. Barbe, Charge-Coupled Devices. Springer-Verlag, 1980.
[7] B. Bascle, A. Blake, and A. Zisserman, "Motion Deblurring and Super-Resolution from an Image Sequence," Proc. Fourth European Conf. Computer Vision, pp. 573-581, 1996.
[8] J.R. Bergen, P. Anandan, K.J. Hanna, and R. Hingorani, "Hierarchical Model-Based Motion Estimation," Proc. Second European Conf. Computer Vision, pp. 237-252, 1992.
[9] M. Berthod, H. Shekarforoush, M. Werman, and J. Zerubia, "Reconstruction of High Resolution 3D Visual Information," Proc. Conf. Computer Vision and Pattern Recognition, pp. 654-657, 1994.
[10] M. Born and E. Wolf, Principles of Optics. Pergamon Press, 1965.
[11] P.J. Burt, "Fast Filter Transforms for Image Processing," Computer Graphics and Image Processing, vol. 16, pp. 20-51, 1980.
[12] P.J. Burt and E.H. Adelson, "The Laplacian Pyramid as a Compact Image Code," IEEE Trans. Comm., vol. 31, no. 4, pp. 532-540, 1983.
[13] P. Cheeseman, B. Kanefsky, R. Kraft, J. Stutz, and R. Hanson, "Super-Resolved Surface Reconstruction from Multiple Images," Technical Report FIA-94-12, NASA Ames Research Center, 1994.
[14] M.-C. Chiang and T.E. Boult, "Imaging-Consistent Super-Resolution," Proc. DARPA Image Understanding Workshop, 1997.
[15] M.-C. Chiang and T.E. Boult, "Local Blur Estimation and Super-Resolution," Proc. Conf. Computer Vision and Pattern Recognition, pp. 821-826, 1997.
[16] J.S. De Bonet, "Multiresolution Sampling Procedure for Analysis and Synthesis of Texture Images," Computer Graphics Proc., Ann. Conf. Series (SIGGRAPH '97), pp. 361-368, 1997.
[17] J.S. De Bonet and P. Viola, "Texture Recognition Using a Non-Parametric Multi-Scale Statistical Model," Proc. Conf. Computer Vision and Pattern Recognition, pp. 641-647, 1998.
[18] F. Dellaert, S. Thrun, and C. Thorpe, "Jacobian Images of Super-Resolved Texture Maps for Model-Based Motion Estimation and Tracking," Proc. Fourth Workshop Applications of Computer Vision, pp. 2-7, 1998.
[19] G.J. Edwards, C.J. Taylor, and T.F. Cootes, "Learning to Identify and Track Faces in Image Sequences," Proc. Third Int'l Conf. Automatic Face and Gesture Recognition, pp. 260-265, 1998.
[20] M. Elad, "Super-Resolution Reconstruction of Image Sequences—Adaptive Filtering Approach," PhD thesis, The Technion—Israel Inst. of Technology, Haifa, Israel, 1996.
[21] M. Elad and A. Feuer, "Restoration of Single Super-Resolution Image from Several Blurred, Noisy, and Down-Sampled Measured Images," IEEE Trans. Image Processing, vol. 6, no. 12, pp. 1646-1658, 1997.
[22] M. Elad and A. Feuer, "Super-Resolution Reconstruction of Image Sequences," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 817-834, Sept. 1999.
[23] M. Elad and A. Feuer, "Super-Resolution Restoration of an Image Sequence—Adaptive Filtering Approach," IEEE Trans. Image Processing, vol. 8, no. 3, pp. 387-395, 1999.
[24] W.T. Freeman and E.H. Adelson, "The Design and Use of Steerable Filters," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, pp. 891-906, 1991.
[25] W.T. Freeman, E.C. Pasztor, and O.T. Carmichael, "Learning Low-Level Vision," Int'l J. Computer Vision, vol. 20, no. 1, pp. 25-47, 2000.
[26] R.C. Hardie, K.J. Barnard, and E.E. Armstrong, "Joint MAP Registration and High-Resolution Image Estimation Using a Sequence of Undersampled Images," IEEE Trans. Image Processing, vol. 6, no. 12, pp. 1621-1633, 1997.
[27] B.K.P. Horn, Robot Vision. McGraw-Hill, 1996.
[28] T.S. Huang and R. Tsai, "Multi-Frame Image Restoration and Registration," Advances in Computer Vision and Image Processing, vol. 1, pp. 317-339, 1984.
[29] M. Irani and S. Peleg, "Improving Resolution by Image Restoration," Computer Vision, Graphics, and Image Processing, vol. 53, pp. 231-239, 1991.
[30] M. Irani and S. Peleg, "Motion Analysis for Image Enhancement: Resolution, Occlusion, and Transparency," J. Visual Comm. and Image Representation, vol. 4, no. 4, pp. 324-335, 1993.
[31] M. Irani, B. Rousso, and S. Peleg, "Image Sequence Enhancement Using Multiple Motions Analysis," Proc. 1992 Conf. Computer Vision and Pattern Recognition, pp. 216-221, 1992.
[32] D. Keren, S. Peleg, and R. Brada, "Image Sequence Enhancement Using Sub-Pixel Displacements," Proc. Conf. Computer Vision and Pattern Recognition, pp. 742-746, 1988.
[33] S. Kim, N. Bose, and H. Valenzuela, "Recursive Reconstruction of High Resolution Image from Noisy Undersampled Multiframes," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, pp. 1013-1027, 1990.
[34] S. Kim and W.-Y. Su, "Recursive High-Resolution Reconstruction of Blurred Multiframe Images," IEEE Trans. Image Processing, vol. 2, pp. 534-539, 1993.
[35] S. Mann and R.W. Picard, "Virtual Bellows: Constructing High Quality Stills from Video," Proc. First Int'l Conf. Image Processing, pp. 363-367, 1994.
[36] V.S. Nalwa, A Guided Tour of Computer Vision. Addison-Wesley, 1993.
[37] T. Numnonda, M. Andrews, and R. Kakarala, "High Resolution Image Reconstruction by Simulated Annealing," Image and Vision Computing, vol. 11, no. 4, pp. 213-220, 1993.
[38] A. Patti, M. Sezan, and A. Tekalp, "Super-Resolution Video Reconstruction with Arbitrary Sampling Lattices and Nonzero Aperture Time," IEEE Trans. Image Processing, vol. 6, no. 8, pp. 1064-1076, 1997.
[39] S. Peleg, D. Keren, and L. Schweitzer, "Improving Image Resolution Using Subpixel Motion," Pattern Recognition Letters, pp. 223-226, 1987.
[40] P.J. Phillips, H. Moon, P. Rauss, and S.A. Rizvi, "The FERET Evaluation Methodology for Face-Recognition Algorithms," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR '97), 1997.
[41] W.K. Pratt, Digital Image Processing. Wiley-Interscience, 1991.
[42] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes in C, second ed. Cambridge Univ. Press, 1992.
[43] H. Qi and Q. Snyder, "Conditioning Analysis of Missing Data Estimation for Large Sensor Arrays," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000.
[44] T. Riklin-Raviv and A. Shashua, "The Quotient Image: Class Based Recognition and Synthesis under Varying Illumination," Proc. Conf. Computer Vision and Pattern Recognition, pp. 566-571, 1999.
[45] H.A. Rowley, S. Baluja, and T. Kanade, "Neural Network-Based Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38, Jan. 1998.
[46] R. Schultz and R. Stevenson, "A Bayesian Approach to Image Expansion for Improved Definition," IEEE Trans. Image Processing, vol. 3, no. 3, pp. 233-242, 1994.
[47] R. Schultz and R. Stevenson, "Extraction of High-Resolution Frames from Video Sequences," IEEE Trans. Image Processing, vol. 5, no. 6, pp. 996-1011, 1996.
[48] H. Shekarforoush, "Conditioning Bounds for Multi-Frame Super-Resolution Algorithms," Technical Report CAR-TR-912, Computer Vision Laboratory, Center for Automation Research, Univ. of Maryland, 1999.
[49] H. Shekarforoush, M. Berthod, J. Zerubia, and M. Werman, "Sub-Pixel Bayesian Estimation of Albedo and Height," Int'l J. Computer Vision, vol. 19, no. 3, pp. 289-300, 1996.
[50] V. Smelyanskiy, P. Cheeseman, D. Maluf, and R. Morris, "Bayesian Super-Resolved Surface Reconstruction from Images," Proc. 2000 IEEE Conf. Computer Vision and Pattern Recognition, 2000.
[51] H. Stark and P. Oskoui, "High-Resolution Image Recovery from Image-Plane Arrays, Using Convex Projections," J. Optical Soc. Am. A, vol. 6, pp. 1715-1726, 1989.
[52] R. Szeliski and P. Golland, "Stereo Matching with Transparency and Matting," Proc. Sixth Int'l Conf. Computer Vision (ICCV '98), pp. 517-524, 1998.
[53] H. Ur and D. Gross, "Improved Resolution from Subpixel Shifted Pictures," Computer Vision, Graphics, and Image Processing, vol. 54, no. 2, pp. 181-186, 1992.

Simon Baker received the BA degree in mathematics from the University of Cambridge in June 1991, the MSc degree in computer science from the University of Edinburgh in November 1992, and the MA degree in mathematics from the University of Cambridge in February 1995. He is a research scientist in the Robotics Institute at Carnegie Mellon University, where he conducts research in computer vision. Before joining the Robotics Institute in September 1998, he was a graduate research assistant at Columbia University, where he obtained his PhD degree in the Department of Computer Science. He also spent a summer visiting the Vision Technology Group at Microsoft Research. His current research focuses on a wide range of computer vision problems, from stereo reconstruction and the estimation of 3D scene motion to illumination modeling and sensor design. His work has appeared in a number of international computer vision conferences and journals.

Takeo Kanade received the doctoral degree in electrical engineering from Kyoto University, Japan, in 1974. He is the U.A. and Helen Whitaker University Professor of Computer Science and Robotics at Carnegie Mellon University. After holding a faculty position in the Department of Information Science, Kyoto University, he joined Carnegie Mellon University in 1980, where he was the director of the Robotics Institute from 1992 to 2001. Dr. Kanade has worked in multiple areas of robotics: computer vision, multimedia, manipulators, autonomous mobile robots, and sensors. He has written more than 250 technical papers and reports in these areas, as well as more than 15 patents. He has been the principal investigator of a dozen major vision and robotics projects at Carnegie Mellon. He has been elected to the National Academy of Engineering. He is a fellow of the IEEE, the ACM, and the American Association for Artificial Intelligence (AAAI), and the founding editor of the International Journal of Computer Vision. He has received several awards, including the C&C Award, the Joseph Engelberger Award, the Allen Newell Research Excellence Award, the JARA Award, the Otto Franc Award, the Yokogawa Prize, and the Marr Prize Award. Dr. Kanade has served on government, industry, and university advisory or consultant committees, including the Aeronautics and Space Engineering Board (ASEB) of the National Research Council, NASA's Advanced Technology Advisory Committee, the PITAC Panel for Transforming Healthcare, and the Advisory Board of the Canadian Institute for Advanced Research.
