Title: Mutual information-based depth estimation and 3D reconstruction for image-based rendering systems
Advisor(s): Chan, SC; Chang, C
Author(s): Zhu, Zhenyu
Citation:
Issued Date: 2012
URL: http://hdl.handle.net/10722/173910
Rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.

Mutual Information-based Depth Estimation


and 3D Reconstruction for Image-based
Rendering Systems

by
ZHU Zhenyu
B.Eng.
Ph.D. Thesis

A thesis submitted in partial fulfillment of the requirements for


the Degree of Doctor of Philosophy
at the University of Hong Kong
July 2012

Abstract of thesis entitled

Mutual Information-based Depth Estimation


and 3D Reconstruction for Image-based
Rendering Systems
Submitted by
ZHU Zhenyu
for the degree of Doctor of Philosophy
at the University of Hong Kong
in July 2012

Image-based rendering (IBR) is an emerging technology for


rendering photo-realistic views of scenes from a collection of densely
sampled images or videos. It provides a framework for developing
revolutionary virtual reality and immersive viewing systems. There has
been considerable progress recently in the capturing, storage and
transmission of image-based representations. This thesis proposes two
image-based rendering (IBR) systems for improving the viewing
freedom and environmental modeling capability of conventional static
IBR systems. The first system consists of a circular array with 13 still
cameras (Canon 550D) for capturing ancient Chinese artifacts at high
resolution. The second one is constructed by mounting a linear array of 8
video cameras (Sony HDR-TGIE) on an electrically controllable wheel chair, whose motion can be controlled manually or remotely through a wireless local area network (LAN) by means of additional hardware circuitry.
Both systems support object-based rendering and 3D reconstruction
capability and consist of two main components. 1) A novel view
synthesis algorithm using a new segmentation and mutual information
(MI)-based algorithm for dense depth map estimation, which relies on
segmentation, local polynomial regression (LPR)-based depth map
smoothing and MI-based matching algorithm to iteratively estimate the
depth map. The method is very flexible and both semi-automatic and
automatic segmentation methods can be employed. They rank fourth and
sixth, respectively, in the Middlebury comparison of existing depth
estimation methods. This allows high quality renderings of outdoor and
indoor scenes with improved mobility/freedom to be obtained. This
algorithm can also be extended to object tracking. Experimental results
also show that the proposed MI-based algorithms are applicable to
robust registration in noisy dynamic ultrasound images. 2) A new 3D
reconstruction algorithm which utilizes sequential-structure-from-motion
(S-SFM) technique and the dense depth maps estimated previously. It
relies on a new iterative point cloud refinement algorithm based on
Kalman filter (KF) for outlier removal and the segmentation-MI-based
algorithm to further refine the correspondences and the projection
matrices. The mobility of our system allows us to more conveniently recover the 3D models of static objects from the improved point cloud, using a new robust radial basis function (RBF)-based modeling algorithm to further suppress possible outliers and generate smooth 3D
meshes of objects. Moreover, a new rendering technique named view
dependent texture mapping is used to enhance the final rendering effect.

Experimental results show that the proposed 3D reconstruction


algorithm significantly reduces the adverse effect of the outliers and
produces high quality renderings using view dependent texture mapping
and the model reconstructed.
Overall, this study provides a framework for designing IBR systems
with improved viewing freedom and ability to cope with moving and
static objects in indoor and outdoor environments.
An abstract of exactly 439 words


Declaration

I hereby declare that this dissertation, submitted in partial


fulfillment of the requirements for the degree of Doctor of Philosophy and entitled
Mutual Information-based Depth Estimation and 3D Reconstruction for
Image-based Rendering Systems represents my own work except where
due acknowledgement is made, and has not been previously included in
a thesis, dissertation, or report submitted to this or any other institution
for a degree, diploma or other qualification.

Zhu Zhenyu
August 2012


Acknowledgement

First of all, I would like to extend my sincere gratitude to my


supervisors, Dr. S. C. Chan and Dr. C. Q. Chang, for their instructive advice and useful suggestions on my thesis. Without their consistent and illuminating instruction, this thesis would not have been possible.
Besides, I highly appreciate all the postgraduate students and staff in the Digital Signal Processing (DSP) Laboratory for their helpful discussions and support. They are: Dr. K. T. Ng, Dr. Z. G. Zhang, Dr. K.
M. Tsui, Mr. James Koo, Mr. B. Liao, Mr. C. Wang, Mr. S. Zhang, Mr.
H. C. Wu and Miss Y. J. Chu.
Last but not least, my thanks go to my beloved family for their patience and love all through these years.

Contents

DECLARATION .................................................................................. IV
ACKNOWLEDGEMENT ..................................................................... V
CONTENTS .......................................................................................... VI
LIST OF FIGURES ............................................................................... X
LIST OF TABLES .............................................................................. XV
LIST OF ABBREVIATIONS .......................................................... XVI
CHAPTER 1 INTRODUCTION ........................................................... 1
1.1 BACKGROUND .................................................................................. 1
1.2 THESIS OUTLINE .............................................................................. 4
CHAPTER 2 REVIEW OF BASIC TOPICS IN IMAGE-BASED
RENDERING .......................................................................................... 9
2.1 INTRODUCTION ................................................................................ 9
2.2 REVIEW OF PLENOPTIC FUNCTION ................................................... 9
2.2.1 Basic Theory .......................................................................... 10

2.3 REVIEW OF LIGHT FIELD ................................................................ 13


2.3.1 Creating/capturing light field ................................................ 13
2.4 REVIEW OF RENDERING TECHNIQUES ............................................ 14
2.5 SUMMARY ...................................................................................... 18
CHAPTER 3 THE PROPOSED IMAGE-BASED RENDERING
SYSTEMS .............................................................................................. 20
3.1 INTRODUCTION .............................................................................. 20
3.2 CONSTRUCTION OF THE PROPOSED IBR SYSTEMS .......................... 23
3.2.1 Still Camera System ............................................................... 23
3.2.2 Moveable Camera System ...................................................... 26
3.3 PRE-PROCESSING ........................................................................... 30
3.3.1 Still Camera System ............................................................... 30
3.3.1.1 Camera Calibration ............................................................. 30
3.3.1.2 Color-Tensor-based Segmentation and Matting .................. 33
3.3.2 Moveable Camera System ...................................................... 36
3.3.2.1 Video Stabilization ............................................................... 36
3.4 SUMMARY ...................................................................................... 44


CHAPTER 4 A NEW COMBINED SEGMENTATION-MUTUAL-INFORMATION (MI)-BASED ALGORITHM FOR DENSE DEPTH MAP ESTIMATION .............................................................. 45
4.1 INTRODUCTION .............................................................................. 45
4.2 COMBINED SEGMENTATION-MI-BASED DEPTH ESTIMATION ......... 46
4.2.1 Object Segmentation Using Level-Set Method ...................... 47
4.2.2 Mutual Information Matching ............................................... 49
4.3 DEPTH MAP REFINEMENT .............................................................. 54
4.3.1 Occlusion Detection and Inpainting ...................................... 56
4.3.2 Smoothing of Depth Maps...................................................... 56
4.4 MUTUAL INFORMATION (MI)-BASED OBJECT TRACKING .............. 63
4.5 MORE RESULTS AND COMPARISON ................................................ 65
4.6 SUMMARY ...................................................................................... 69
CHAPTER 5 3D RECONSTRUCTION AND MODELING .......... 71
5.1 INTRODUCTION .............................................................................. 71
5.2 HOMOGENEOUS GEOMETRY........................................................... 74
5.3 POINT MATCHING IN THE STILL IBR SYSTEM ................................ 76
5.3.1 Epipolar Geometry................................................................. 76

5.3.2 Finding Correspondent Points............................................... 77


5.4 VIEW DEPENDENT TEXTURE MAPPING .......................................... 82
5.5 POINT MATCHING AND REFINEMENT IN THE MOVEABLE IBR
SYSTEM ................................................................................................. 88
5.5.1 Structure-from-motion ........................................................... 88
5.5.2 Point Cloud Generation and Refinement: KF-based Outlier
detection and Point Cloud Fusion ................................................... 90
5.6 RBF MODELING AND MESH GENERATION ...................................... 98
5.7 SUMMARY .................................................................................... 102
CHAPTER 6 CONCLUSION AND FUTURE RESEARCH ........ 103
6.1 CONCLUSION ................................................................................ 103
6.2 FUTURE RESEARCH ...................................................................... 105
APPENDIX I PUBLICATIONS ...................................................... 107
REFERENCES .................................................................................... 109


List of Figures

Figure 1-1   Spectrum of IBR representations.
Figure 2-1   Light field describes the amount of light in radiance along light rays traveling in every direction through every point in empty space [Ikeu 2012]. ........ 10
Figure 2-2   Forward mapping. ........ 16
Figure 2-3   Example renderings using (a) forward mapping in point rendering [Chan 2005], (b) layered representation (with two layers, dancer and background) [Chan 2009], (c) monolithic rendering using 3D polygonal mesh (left) and rendering results (right) [Zhu 2010]. ........ 19
Figure 3-1   Plenoptic videos: multiple linear camera array of 4D simplified dynamic light field with viewpoints constrained along line segments. The camera arrays were developed in [Chan 2009]. Each consists of 6 JVC video cameras. ........ 21
Figure 3-2   Circular camera array constructed. ........ 24
Figure 3-3   Snapshots: (a) Buddha, (b) Dragon Vase. ........ 25
Figure 3-4   Block diagram of the proposed IBR system. ........ 26
Figure 3-5   The proposed moveable image-based rendering system. ........ 27
Figure 3-6   Snapshots of the plenoptic videos at a given time instance: (a) is the Podium outdoor video from camera 1 to camera 4 and (b) is the Presentation indoor video from camera 1 to camera 4. ........ 29
Figure 3-7   Block diagram of the proposed M-IBR system constructed. ........ 30
Figure 3-8   Relationship between the world coordinate and the camera coordinate. ........ 31
Figure 3-9   Planar pattern. ........ 33
Figure 3-10  (a) Extraction results using the color-tensor-based method. Left: original, middle: hard segmentation, right: after matting. (b) Close-up of segmentations in (a). Left: hard segmentation, right: after matting. ........ 36
Figure 3-11  Motion smoothing results for horizontal (Translation-x) and vertical (Translation-y) directions. The original motion path and the smoothed motion path with different methods are shown. In (a)-(b), the blue dotted lines correspond to the shaky original motion path. Green and black lines correspond to the smoothed motion path using the method in [Mats 2005] with a small and a large kernel size respectively. ........ 43
Figure 3-12  Video stabilization result. The first row shows the original images captured by our system; the second row shows the stabilized images without video completion; the third row shows the completed results. ........ 43
Figure 4-1   Segmentation results using the level-set-based tracking method. (a) is the initial segmentation obtained by lazy snapping, (b) is the initial segmentation obtained by the graph cut method. ........ 48
Figure 4-2   A regular grid for Local Transformation. ........ 51
Figure 4-3   (a) is an example depth map obtained by using MI matching without segmentation information; (b) shows the depth map obtained by using automatic segmentation MI matching; (c) shows the depth map obtained by using semi-automatic segmentation MI matching. Green areas in (c) are the occlusion areas detected by our algorithm. (d)-(e) show the refined depth maps of (c) by inpainting and smoothing (c) using SK-LPR-R-ICI and a 25x25 ideal low-pass filter, respectively. ........ 55
Figure 4-4   (a) and (b) show the renderings obtained by Figs. 4-2(d) and (b); (c) and (d) are the enlargements of the red boxes. ........ 59
Figure 4-5   Rendering results obtained by the proposed algorithm. (a) shows the depth maps corresponding to images in (b). The highlighted images in (b) show the rendered views from the adjacent views in (b) using depth maps in (a). (c) shows depth maps at other positions. ........ 62
Figure 4-6   Example rendering results. The first row shows the original images captured by our M-IBR system. The second and third rows show renderings with a step-in ratio of about 1.15 to 1.25 times. ........ 63
Figure 4-7   Object tracking at different time instances. ........ 65
Figure 4-8   Teddy test images [Scha 2002] and depth maps for comparison. (a) LEFT image; (b) RIGHT image; (c) ground truth depth map; (d) depth map calculated by semi-automatic segmentation-based MI matching; (e) depth map calculated by automatic segmentation-based MI matching. ........ 67
Figure 4-9   Results for the conference. (a) and (c) are two sample frames. (b) and (d) are the depth maps of (a) and (c), respectively. ........ 68
Figure 4-10  Ultrasound images of RF muscle under relaxed condition and at 50% maximal voluntary contraction (MVC) contraction level and the corresponding images with outlined boundary contours. The tracked boundaries are highlighted in green. ........ 69
Figure 5-1   Epipolar Geometry. ........ 76
Figure 5-2   Feature Points Detection. The red points in (a) and (b) are the feature points. (a) is from the first camera. (b) is from the second camera. ........ 80
Figure 5-3   Epipolar Line. ........ 81
Figure 5-4   Rectified images. (a) is the rectified left image, (b) is the rectified right image. (c) is part of (a), (d) is part of (b). ........ 81
Figure 5-5   An initial point cloud extracted with noise and outliers. ........ 82
Figure 5-6   View Dependent Texture Mapping. ........ 83
Figure 5-7   View Dependent Texture. Left: blurred texture. Right: texture after the proposed view dependent texture. ........ 85
Figure 5-8   3D models of Ancient Chinese Artifacts. (a) Dragon Vase, (b) Buddha, (c) Green Bottle, (d) Bowl, (e) Brush Pot, (f) Tri-Pot, (g) Wine Glass. ........ 87
Figure 5-9   Rendering Results of Ancient Chinese Artifacts. ........ 88
Figure 5-10  Iterative refinement of point cloud: (a) initial point cloud, (b) point cloud after outlier detection and Kalman filtering, (c) point cloud after the proposed iteration method. ........ 91
Figure 5-11  (a)-(b) show the 3D to 2D re-projection at frame 20 and frame 21, respectively. Blue points are inliers. Green points are outliers detected by the segmentation consistency check. Red points are the outliers detected by intensity and location consistency checks. (c) shows the enlargement of the highlighted area in (a). The point cloud is down-sampled for better visualization. ........ 97
Figure 5-12  Convergence behavior of the root mean square distance (RMSD) versus the number of iterations for the proposed iterative 3D reconstruction algorithm. The blue line shows the RMSD values with the KF-based outlier detection. The red line shows the RMSD values without KF-based outlier detection. ........ 97
Figure 5-13  3D reconstruction results (a) without using RBF, (b) using RBF without outlier detection and (c) using RBF with outlier removal. ........ 100
Figure 5-14  Object-based rendering results of Podium sequences using the estimated 3D model and shadow field at different lighting conditions. ........ 101
Figure 5-15  Object-based rendering results of the conference sequence. (a) and (b) are the 3D reconstruction results at two time instances. (c) and (d) are the rendering results of (a) and (b). Note that only partial geometry of the dynamic object is recovered, since it is only partially observable. ........ 101

List of Tables

Table 2-1   A taxonomy of plenoptic functions. ........ 12
Table 4-1   Comparison of the rank using a standard threshold of 1 pixel on the Middlebury test stereo images. ........ 67

List of Abbreviations

BRDF      Bidirectional Reflectance Distribution Function
BP        Belief Propagation
CLF       Circular Light Field
CPU       Central Processing Unit
CSA       Cross-Sectional Area
DCP       Disparity Compensation Prediction
DCT       Discrete Cosine Transform
DSCs      Digital Still Cameras
DSP       Digital Signal Processing
Fig.      Figure
fps       frames per second
GC        Graph Cut
GPU       Graphic Processing Unit
HD        High Definition
IBR       Image-Based Rendering
i.i.d.    independent identically distributed
ISKR      Iterative Steering Kernel Regression
JVT       Joint View Triangulation
KF        Kalman Filter
LAN       Local Area Network
L-BFGS    Limited-memory Broyden-Fletcher-Goldfarb-Shanno
LDIs      Layered Depth Images
LPR       Local Polynomial Regression
LS        Least Square
MCU       Micro-Controller Unit
MI        Mutual Information
M-IBR     Moveable Image-Based Rendering
MRF       Markov Random Field
MVC       Maximal Voluntary Contraction
PCA       Principal Component Analysis
pdf       probability density function
PRT       Pre-computed Radiance Transfer
QPP       Quadratic Programming Problem
RANSAC    RANdom SAmple Consensus
RBF       Radial Basis Function
RF        Rectus Femoris
R-ICI     Refined Intersection of Confidence Intervals
RMSD      Root-Mean Squared Distance
SCLF      Simplified Circular Light Field
SFM       Structure-From-Motion
SPIHT     Set Partitioning In Hierarchical Trees
S-SFM     Sequential-Structure-From-Motion

Chapter 1 Introduction
1.1 Background
Image-based rendering/representation (IBR) [Chen 1995], [Debe
1996], [Gort1996], [Levo 1996], [McMi 1995], [Pele 1997], [Szel 1997],
[Shad 1998], [Shum 1999] is a promising technology for rendering new
views of scenes from a collection of densely sampled images or videos.
It has potential applications in virtual reality, immersive television and
visualization systems. Central to IBR is the plenoptic function [Adel 1991], which describes the intensity of each light ray in the world as a function of visual angle, wavelength, time, and viewing position. The plenoptic function is thus a 7-dimensional function of the viewing position (Vx, Vy, Vz), the azimuth and elevation angles (θ, φ), time τ, and wavelength λ. Traditional images and videos are just 2D and 3D special cases of the plenoptic function. In principle, one can reconstruct any view in space and time if a sufficient number of samples of the
plenoptic function is available. The rendering of novel views can
therefore be viewed as the reconstruction of the plenoptic function from
its samples. Image-based representations are usually densely sampled
high dimensional data with large data sizes, but their samples are highly
correlated. Because of the multidimensional nature of image-based
representations and scene geometry, much research has been devoted to
the efficient capturing, sampling, rendering and compression of IBR.
Depending on the functionality required, there is a spectrum of IBR
as shown in Fig. 1-1. They differ from each other in the amount of
geometry information of the scenes/objects being used. At one end of the
spectrum, like traditional texture mapping, we have very accurate

geometric models of the scenes and objects, say generated by animation


techniques, but only a few images are required to generate the textures.
Given the 3-D models and the lighting conditions, novel views can be
rendered using conventional graphic techniques. Moreover, interactive
rendering with moveable objects and light sources can be supported
using advanced graphic hardware.
Figure 1-1: Spectrum of IBR representations [Chan 2010]. Representations range from rendering with no geometry (e.g. light field, lumigraph, mosaicking, concentric mosaics), through rendering with implicit geometry (e.g. view interpolation, view morphing), to rendering with explicit geometry (e.g. layered-depth images, texture-mapped models, 3D warping, view-dependent geometry, view-dependent texture, shadow light field); representations with less geometry require more images, and vice versa.

At the other extreme, light field or lumigraph rendering relies on


dense sampling (by capturing more images/videos) with no or very little
geometry information for rendering without recovering the exact 3-D
models. An important advantage of the latter is its superior image quality,
compared with 3-D model building for complicated real world scenes.
Another important advantage is that it requires much less computational
resources for rendering regardless of the scene complexity, because most
of the quantities involved are pre-computed or recorded. This has
attracted considerable attention in the computer graphic community
recently in developing fast and efficient rendering algorithms for real-time relighting and soft-shadow generation [Agra 2000], [Ng 2004],
[Sloa 2002], [Zhou 2005].
Broadly speaking, image-based representations can be classified
according to the geometry information used into three main categories: 1)

representations with no geometry, 2) representations with implicit


geometry and 3) representations with explicit geometry. 2-D Panoramas,
McMillan and Bishop's plenoptic modeling [McMi 1995], 3-D concentric mosaics and light field/lumigraph belong to the first category and they can be viewed as direct interpolation of the plenoptic function. Layer-based and object-based representations [Chan 2009] and the pop-up light field [Shum 2004], which use depth maps, fall into the second. Finally,
conventional 3-D computer graphic models and other more sophisticated
representations [Deve 1998], [Wang 2005] belong to the last category.
Although these representations also sample the plenoptic function,
further processing of the plenoptic function has been performed to infer
the scene geometry or surface property such as bidirectional reflectance
distribution function (BRDF) of objects. Such an image-based modeling approach has emerged as a more promising way to enrich the photorealism and user interactivity of IBR. Moreover, since 3-D models of the scenes are unavailable, conventional image-based representations are limited to changes of viewpoint and sometimes a limited amount of relighting. Recently, it was found that real-time relighting and soft-shadow computation are feasible using IBR concepts and the associated 3-D models through pre-computed radiance transfer (PRT) [Sloa 2002] and pre-computed shadow fields [Zhou 2005].
For multiple camera arrays, the huge amount of data and vast
amount of viewpoints to be provided present one of the major challenges
to IBR. Advanced algorithms for processing and manipulation of the
high dimensional representation to achieve such functions as
segmentation, depth estimation, object tracking, 3D reconstruction, etc.
are all major challenges to be addressed. Finally, the efficient
transmission, compression and display of dynamic IBR and models are

also urgent issues waiting for satisfactory solutions in order for IBR to establish itself as an essential medium for communication and presentation.
All of these motivate us to study the design and construction of the
image-based rendering systems based on plenoptic videos. The system
can potentially provide improved viewing freedom to users and the ability to cope with moving and static objects for 3D reconstruction.

1.2 Thesis Outline


This thesis is devoted to the design of image-based rendering
systems and their associated algorithms so as to provide improved viewing freedom and object modeling of stationary and moveable objects in outdoor and indoor environments. The major contributions of
this thesis are summarized as follows:
1) The construction of a high resolution IBR system for capturing and
rendering of ancient Chinese artifacts and a moveable IBR system
for capturing and rendering indoor and outdoor objects.
2) Development of a novel mutual information (MI)-based algorithm
combined with segmentation for dense depth map estimation and
object tracking.
3) A 3D reconstruction algorithm for objects, which employs the
estimated dense depth maps to obtain dense point correspondences
from multiple views for 3D reconstruction.

Details of these contributions are briefly described below:


1) The first prototype system uses a multiple still camera array to
capture ancient Chinese artifacts. Because of the high resolution of
the still camera (Canon 550D), we can obtain excellent rendering
quality. This system can be used for digital preservation and
dissemination of cultural artifacts with high digital quality. To avoid
possible damage to the artifacts and speed up the capturing process,
we propose to employ the image-based approach instead of using
traditional 3D laser scanners. A circular array consisting of multiple
digital still cameras was therefore constructed in this work. Using
this circular camera array, we developed novel techniques for
rendering new views of the artifacts from the images captured using
the object-based approach. The multiple views so synthesized enable
the ancient artifacts to be displayed in modern multi-view displays.
A number of ancient Chinese artifacts from the University Museum
and Art Gallery at the University of Hong Kong were captured and
excellent rendering results were obtained. The second prototype
system uses a linear camera array consisting of 8 video cameras
(Sony HDR-TGIE) mounted on an electrically controllable wheel
chair. Its motion can be controlled manually or remotely by means of
additional hardware circuitry. Unlike previous multiple-camera systems, which are not designed to be moveable, so that their viewpoints are somewhat limited and they usually cannot cope with moving objects or perform 3D reconstruction of objects in an open environment, our moveable image-based rendering system can be used to render large environments and moving objects.

2) A new combined segmentation-mutual-information (MI)-based


algorithm for dense depth map estimation is presented. It relies on
segmentation, local polynomial regression (LPR)-based depth map
smoothing and an MI-based matching algorithm to iteratively estimate the depth map (a small illustrative sketch of the MI matching cost is given after this outline). The method is very flexible and both semi-automatic and automatic segmentations can be used. The semi-automatic and automatic versions rank fourth and sixth, respectively, in the Middlebury
comparison of existing depth estimation methods. Using the depth
maps captured and the object-based approach, high quality
renderings of outdoor scenes along the trajectory can be obtained,
which considerably improved the viewing freedom. The mutual
information-based matching algorithm is also extended to object
tracking algorithms. It can be used to track the boundary of an object
in a video sequence. Experimental results show that its performance
is reliable even for noisy videos such as dynamic ultrasound images.
3) Using the IBR systems, correspondences from different views can be
integrated together for 3D reconstruction. For both of the systems,
camera calibration is firstly used to determine the value of internal
and external parameters of the cameras. For the still camera array, a major technique for finding correspondent points is epipolar geometry, which constrains the corresponding points to lie on conjugate epipolar lines. Meanwhile, by combining the epipolar constraint with scale-invariant feature transform (SIFT) [Lowe 2004] feature detection, accurate sparse correspondent points can be located. Then a Gabor filter, which is rather insensitive to noise, is used to obtain the dense correspondent points. For the moveable image-based rendering
(M-IBR) system, the sequential-structure-from-motion (S-SFM)
technique is adopted to estimate the locations of the M-IBR system

so as to obtain an initial set of fairly reliable 3D point cloud from the


2D correspondences. New iterative Kalman filter (KF)-based and
segmentation-MI-based algorithms are proposed to fuse the
correspondences from different views and remove possible outliers
to obtain an improved point cloud. More precisely, the proposed
algorithm relies on the KF to track the correspondences across
different views so as to suppress possible outliers while fusing
correspondences from different views. With these reliable matched
points, the camera parameters and hence the image correspondences
can be further refined by re-projecting the updated correspondences
to successive views to serve as prior features/correspondences for
MI-based matching. By iterating these processes, an improved point
cloud with reliable correspondences can be recovered. Simulation
results show that the proposed algorithm significantly reduces the
adverse effect of the outliers and generates a more reliable point
cloud. To recover the 3D model from the improved point cloud, a
new robust RBF-based modeling algorithm is proposed to further
suppress possible outliers and generate smooth 3D surfaces from the
raw 3D point cloud. Compared with the conventional RBF-based
smoothing, it is more robust and reliable. Finally, view dependent
texture is incorporated to enhance the final rendering effect.
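As a small, hedged illustration of the mutual information matching criterion at the heart of contribution 2, the sketch below estimates the MI between two image patches from their joint grey-level histogram. It is only an illustration: the exact windowing, segmentation coupling and optimization are described in Chapter 4 and are not reproduced here.

```python
import numpy as np

def mutual_information(patch1, patch2, bins=32):
    """Estimate the mutual information between two equally sized image
    patches from their joint grey-level histogram.  A higher value means
    the patches are statistically more consistent, which is the kind of
    score used to compare candidate depth hypotheses."""
    hist, _, _ = np.histogram2d(patch1.ravel(), patch2.ravel(),
                                bins=bins, range=[[0, 256], [0, 256]])
    p_xy = hist / hist.sum()                    # joint pdf
    p_x = p_xy.sum(axis=1, keepdims=True)       # marginal of patch1
    p_y = p_xy.sum(axis=0, keepdims=True)       # marginal of patch2
    nz = p_xy > 0                               # avoid log(0)
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

# Toy usage: MI of a patch with itself exceeds MI with unrelated noise.
a = np.random.randint(0, 256, (21, 21))
b = np.random.randint(0, 256, (21, 21))
print(mutual_information(a, a), mutual_information(a, b))
```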

This thesis is divided into 6 chapters. In Chapter 2, some background


materials on IBR are briefly reviewed. They include the plenoptic function, the light field and rendering techniques. In Chapter 3, the
design and construction of two IBR systems are presented. Some
pre-processing techniques for capturing the ancient Chinese artifacts

including camera calibration and color-tensor-based segmentation


are also introduced. Chapter 4 is devoted to a new combined
segmentation-MI-based depth estimation algorithm. The 3D reconstruction and modeling algorithms will be presented in Chapter 5. Two different point matching algorithms are studied first and then an RBF modeling algorithm is proposed for mesh generation. Lastly, a view dependent texture mapping method for improving the
rendering quality will be presented. Finally, conclusion and future
research topics are given in Chapter 6.

Chapter 2 Review of Basic Topics in Image-Based


Rendering
2.1 Introduction
In this chapter, the fundamental topics in image-based rendering are
reviewed briefly. In Section 2.2, the plenoptic function and its history are introduced. The theory of the light field is discussed in Section 2.3. Section 2.4 is devoted to the rendering techniques in IBR.

2.2 Review of Plenoptic Function


The plenoptic function was proposed by Adelson and Bergen [Adel 1991]. It is a function of visual angle, wavelength, time and viewing position that describes the intensity of each light ray in the world. All the information captured by an optical sensor can be described by this function. The plenoptic function is a 7-dimensional (7D) function consisting of a 3-dimensional (3D) position, a 2D visual angle, wavelength and time.
Sampling and processing of the plenoptic function were main research topics in early computer vision studies. For example, object motion can be described by the derivatives of the plenoptic function with respect to position and time. Because the wavelength is usually represented by red, green and blue channels in digital image processing, the plenoptic function of images and videos can be simplified into two-dimensional and three-dimensional special cases. Theoretically, if the sampling rate is high enough, novel views at intermediate positions can be recovered from its samples. The algorithms that try to solve this problem are usually called image-based rendering.

Figure 2-1: Light field describes the amount of light in radiance l(x, y, z, θ, φ) along light rays traveling in every direction through every point (x, y, z) in empty space [Ikeu 2012].

Because the plenoptic function also describes the geometry and


surface properties, many algorithms have been proposed to integrate the
geometry and surface information into image-based rendering to
improve user interaction and reduce the amount of samples required.
The capturing, sampling, rendering and processing of the plenoptic
function are all important research topics in IBR and related applications
such as computational photography, 3D/multiview videos and displays,
etc.
2.2.1 Basic Theory
The 7D plenoptic function is usually defined as l(Vx, Vy, Vz, θ, φ, λ, τ), where (Vx, Vy, Vz) is the viewing position, θ and φ are the elevation and azimuth angles respectively, as shown in Fig. 2-1, and λ and τ denote the wavelength and time respectively. By employing different parameterizations and simplifications, different image-based rendering algorithms can be derived from the plenoptic function.

There are several camera systems that are usually used in image-based rendering for capturing. For a static scene, one camera can be rotated around the camera centre at a given position V with different elevation and azimuth angles. The plenoptic function is then simplified to a panorama lV(θ, φ). A spherical camera array can provide another panoramic representation, because the captured images can be projected onto a cylinder. If a multiple video camera array is employed instead, a panoramic video can be obtained, and the plenoptic function is simplified to a 3D panorama lV(θ, φ, τ) for dynamic scenes. The close relationship between the plenoptic function and image-based rendering is due to McMillan and Bishop [McMi 1995], who proposed plenoptic modeling using the 5D complete plenoptic function for static scenes, l(Vx, Vy, Vz, θ, φ).

For a static scene, the radiance along a ray is constant. Therefore, the plenoptic function can be re-written as a 4D function, which is called the light field [Levo 1996] or lumigraph [Gort 1996] in computer graphics. The set of light rays in a 4D static light field can be parameterized in many ways; for example, the two-plane-based parameterization is commonly used. By adding time to the static light field, a 5D plenoptic function can be obtained. The lumigraph incorporates depth maps into image-based rendering to improve the rendering quality, which can produce more accurate representations. In [Shum 1999], an outward facing camera moving on a circle was used to capture a series of densely sampled images.
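The successive simplifications described above can be summarized compactly as follows. This is a summary in the thesis's notation; the two-plane coordinates (u, v, s, t) of the 4D light field are introduced in the next paragraph.

```latex
% Successive simplifications of the plenoptic function
\begin{align*}
\text{7D plenoptic function:} \quad & l(V_x, V_y, V_z, \theta, \phi, \lambda, \tau)\\
\text{5D plenoptic modeling (static scene):} \quad & l(V_x, V_y, V_z, \theta, \phi)\\
\text{4D light field / lumigraph (static, two-plane):} \quad & l(u, v, s, t)\\
\text{3D panoramic video (fixed viewpoint } V\text{):} \quad & l_V(\theta, \phi, \tau)\\
\text{2D panorama (fixed viewpoint, static):} \quad & l_V(\theta, \phi)
\end{align*}
```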
A commonly used parameterization is the two-plane parameterization, where a light ray in the light field is parameterized by its intersections (coordinates) with two parallel planes. These rays can be captured by taking a series of pictures on a 2D rectangular plane, which results in an array of images. The light field concept can be similarly extended to time-varying or dynamic scenes, which results in a 5D function. The lumigraph is different from the light field because geometry, in the form of depth maps, is used to improve the rendering quality, which can produce more sophisticated representations in image-based modeling. In [Shum 1999], a set of densely sampled images is captured by an outward facing camera moving on a circle, which is called a concentric mosaic. This system can render new views inside the circle. Some simplified systems have then been proposed to reduce the complexity, such as restricting the camera locations to line segments [Zitn 2004], [Chan 2005], [Chan 2009]. For time-varying or dynamic scenes, similar representations can be used. Because the light may change at different viewing locations, the light has to be captured continuously, which can be done by video camera arrays. For static scenes, the light directions can be recorded first; then one can relight the rendering with arbitrary lightings. A brief summary of these plenoptic function representations is given in Table 2-1 [Ikeu 2012].

Dimension   Year   View space        Name
7D          1991   Free              Plenoptic function
5D          1995   Free              Plenoptic modeling
4D          1996   Bounding box      Light field / Lumigraph
3D          1999   Bounding circle   Concentric mosaics
2D          1994   Fixed point       Cylindrical/Spherical panorama

Table 2-1: A taxonomy of plenoptic functions.
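To make the two-plane parameterization concrete, the following is a minimal sketch (not from the thesis) of how the radiance of a new ray could be synthesized from a 4D light field array by blending the four nearest captured views. The array layout L[u, v, s, t] and the interpolation choices are assumptions made for this example only.

```python
import numpy as np

def render_ray(L, u, v, s, t):
    """Interpolate the radiance of a ray hitting the camera plane at (u, v)
    and the image plane at (s, t) from a 4D light field array L[u, v, s, t].
    Bilinear interpolation over the camera plane, nearest sample on the
    image plane, purely as a simple illustration."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, L.shape[0] - 1), min(v0 + 1, L.shape[1] - 1)
    du, dv = u - u0, v - v0
    s_i = int(np.clip(round(s), 0, L.shape[2] - 1))
    t_i = int(np.clip(round(t), 0, L.shape[3] - 1))
    # Blend the four nearest captured views (the weights sum to 1).
    return ((1 - du) * (1 - dv) * L[u0, v0, s_i, t_i]
            + du * (1 - dv) * L[u1, v0, s_i, t_i]
            + (1 - du) * dv * L[u0, v1, s_i, t_i]
            + du * dv * L[u1, v1, s_i, t_i])

# Example: a tiny synthetic 2x2-view light field with 8x8 images.
L = np.random.rand(2, 2, 8, 8)
print(render_ray(L, 0.3, 0.7, 4.2, 5.8))
```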

2.3 Review of Light Field


The light field was first introduced in a paper by A. Gershun [Gers 1939] for studying surface illumination by artificial lighting. A similar concept was introduced to the computer graphics community as the light field in [Levo 1996] and the lumigraph in [Gort 1996]. The motivation is to render new views or images of objects or scenes from densely sampled images previously taken, so as to avoid building or capturing complicated 3D models. Light field or lumigraph rendering is a special representation of image-based rendering, requiring either no geometry [Levo 1996] or limited geometry in terms of depth maps [Gort 1996]. The light field and lumigraph are four-dimensional (4D) simplifications of the plenoptic function for static scenes.
2.3.1 Creating/capturing light field
A light field can be created by rendering 3D models with computer graphics or by capturing real objects with camera arrays. For real and static scenes, a light field can be captured by one still camera controlled by a mechanical arm, as in lumigraph rendering [Bueh 2001]. In [Adel 1991], [Ng 2005], a lenticular lens array was used to capture the light field. In [Veer 2007], [Lian 2008], a coded aperture, which can map rays from different directions to nearby pixels in the sensor array, was used to record images. Each of these images consists of a set of pixels recording light from different directions. Novel views can be estimated by combining these 4D samples in the light field. In [Ng 2005], a microlens array was placed in front of a handheld digital camera. The images captured in this way can be refocused after they have been taken. Light field videos can be obtained in a similar way.
Multiple camera systems are usually used to achieve large disparity
in dynamic scenes. Much research effort has been devoted to the
construction of 2D camera arrays. To simplify the capturing hardware, light fields captured on line segments and circular arcs, as mentioned in Section 2.2, have also been reported.

2.4 Review of Rendering Techniques


Rendering is the process of creating new views from several images and other auxiliary information contained in the representations. In the early stage of image-based rendering, no geometry information was employed; image blending in panoramas [Chen 1995] and ray space interpolation in light fields [Levo 1996] were used for rendering. In ray space interpolation, each ray that goes through a target pixel is mapped to nearby sampled rays. Since some sophisticated representations
use more geometry information such as layered depth images [Shad
1998], surface light field [Wood 2000], and pop-up light field [Shum
2004], graphics hardware has been exploited to accelerate the rendering
process. The geometry information can either implicitly rely on
positional correspondences or explicitly in the form of depth along
known lines-of-sight or 3D coordinates. Representations of the former
usually involve weakly calibrated cameras and rely on image
correspondences to render new views, say by triangulating two reference
images into patches according to the correspondences as in joint view
triangulation (JVT) [Lhui 2003]. These include view interpolation, view
morphing, JVT and transfer methods with fundamental matrices and
trifocal tensors. Representations employing explicit geometry include

sprites, relief textures, Layered Depth Images (LDIs), view-dependent


texture, surface light field, pop-up light field, shadow light field, etc.
In general, the rendering methods can be broadly classified into
three categories: 1) point-based, 2) layer-based, and 3) monolithic.
Point-based rendering works on 3D point clouds or point
correspondences and typically each point is rendered independently.
Points are mapped to the target image plane through forward mapping.
For the 3D point X in Fig. 2-2, the mapping can be written as
X = Cr + λr Pr⁻¹ xr = Ct + λt Pt⁻¹ xt ,                (2-4-1)

where xt and xr are the homogeneous coordinates of the projections of X on the target screen and the reference image, respectively, C and P are the camera center and projection matrix respectively, and λ is a scale factor. Since Ct, Pt and the focal length ft are known for the target image, λt can be computed using the depth of X. Given xr and λr, one can compute the exact position of xt on the target screen and transfer the color accordingly. Gaps or holes may exist due to magnification. Disocclusion and splatting techniques have been proposed to solve this problem. The painter's algorithm is frequently used to resolve the case where multiple pixels from the reference view are mapped to the same pixel in the target image.

Figure 2-2: Forward mapping (reference and target camera centres Cr and Ct, image points xr and xt, and epipoles er and et).
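As a hedged illustration of the forward mapping in (2-4-1), the following numpy sketch back-projects a reference pixel with known depth to 3D and re-projects it into the target view. The camera convention x ~ K(RX + t) and the toy matrices are assumptions for this example, not the thesis's calibration data.

```python
import numpy as np

def forward_map(x_r, depth, K_r, R_r, t_r, K_t, R_t, t_t):
    """Map a reference pixel x_r (2-vector) with known depth along the
    reference camera's optical axis to its position in the target image,
    following the forward-mapping idea of Eq. (2-4-1)."""
    # Back-project the pixel to a 3D point in reference camera coordinates.
    x_h = np.array([x_r[0], x_r[1], 1.0])
    X_cam = depth * np.linalg.solve(K_r, x_h)    # ray direction scaled by depth
    # Reference camera coordinates -> world coordinates.
    X_world = R_r.T @ (X_cam - t_r)
    # Project into the target camera.
    x_t_h = K_t @ (R_t @ X_world + t_t)
    return x_t_h[:2] / x_t_h[2]                  # inhomogeneous target pixel

# Toy example with two identical cameras 0.1 units apart along x.
K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
R = np.eye(3)
print(forward_map([320, 240], 2.0, K, R, np.zeros(3), K, R, np.array([-0.1, 0, 0])))
```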
Layered techniques usually separate the scene into a group of
planar layers consisting of a 3D plane with texture and optionally a
transparency map. The layers can be thought of as a continuous set of
polygonal models, which are amenable to conventional texture mapping
and view-dependent texture mapping. Usually, each layer is rendered
using either point-based or polygon meshes as in monolithic rendering

techniques before being composed in back-to-front order using the painter's algorithm to produce the final view. Layer-based rendering can be implemented easily using the graphics processing unit (GPU). Since the rendering of IBR requires very low complexity, it is even possible to perform the calculation on the central processing unit (CPU) by working on individual layers or objects [Chan 2009].
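A minimal sketch of the back-to-front compositing step just described, assuming each layer has already been rendered to an RGB image with a per-pixel alpha (transparency) map; the layer format is an assumption for illustration only.

```python
import numpy as np

def composite_layers(layers):
    """Compose pre-rendered layers back-to-front (painter's algorithm).
    `layers` is a list of (rgb, alpha) pairs ordered from the farthest
    layer to the nearest; rgb has shape (H, W, 3), alpha has shape (H, W)."""
    h, w, _ = layers[0][0].shape
    out = np.zeros((h, w, 3))
    for rgb, alpha in layers:                 # far layers first, near layers last
        a = alpha[..., None]                  # broadcast alpha over color channels
        out = a * rgb + (1.0 - a) * out       # "over" operator: near layer wins
    return out

# Two toy layers: a grey background and a half-transparent red foreground.
H, W = 4, 4
background = (np.full((H, W, 3), 0.5), np.ones((H, W)))
foreground = (np.zeros((H, W, 3)), np.full((H, W), 0.5))
foreground[0][..., 0] = 1.0                   # red channel of the foreground
print(composite_layers([background, foreground])[0, 0])
```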
Monolithic rendering usually represents the geometry as continuous polygon meshes with textures, which can be readily rendered using graphics hardware. The 3D model normally consists of vertices, normals


of vertices, faces, and texture mapping coordinates. The data can be
stored in a variety of data formats. The most popular formats
are .obj, .3ds, .max, .stl, .ply, .wrl, .dxf, etc.
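The block below is a small, hedged illustration of reading such a monolithic model: it parses only the 'v' (vertex) and 'f' (face) records of a Wavefront .obj file and ignores normals, texture coordinates and materials.

```python
def load_obj_vertices_faces(path):
    """Read only vertex ('v') and face ('f') records from a Wavefront .obj
    file, returning a vertex list and a list of 0-based vertex-index faces."""
    vertices, faces = [], []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == 'v':                  # geometric vertex: v x y z
                vertices.append(tuple(float(c) for c in parts[1:4]))
            elif parts[0] == 'f':                # face: f v1[/vt1[/vn1]] v2 ...
                faces.append(tuple(int(tok.split('/')[0]) - 1 for tok in parts[1:]))
    return vertices, faces

# Usage (the file name is hypothetical):
# vertices, faces = load_obj_vertices_faces('model.obj')
```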
Relighting, shadow generation and interactivity have played an
increasingly important role in 3D interactive rendering. The most
popular algorithms are shadow mapping, shadow volume, ray-tracing,
pre-computed radiance transfer, pre-computed shadow field, etc. Some
of them have better rendering quality, while others are more efficient for
real time rendering. Thanks to the development of GPU, basic lighting
and shading algorithms like shadow mapping and shadow volume have
been realized on the fly. Modern GPUs can even offer programmable rendering pipelines for customized rendering effects, and a shader is a set of software instructions running on these GPUs to control the pipelines. Using shader programming, high quality shadow rendering algorithms like the pre-computed shadow field can run in real time. Fig. 2-3 shows example renderings of the three techniques.
Though there has been substantial progress in capturing,
representing, rendering and modeling scenes, the ability to handle
general complex scenes remains challenging for IBR. A lot of work is
still required to ensure robustness in handling reflection, translucency, highlights, depth estimation, capturing complexity, object manipulation, etc. Interacting with IBR representations remains challenging because IBR uses images for rendering. Recent approaches have focused on using advanced computer vision techniques, such as stereo/multiview vision and photometric stereo, and depth sensing devices to extract more geometry information from the scene so as to enhance the functionalities of IBR representations. While there has been considerable progress in relighting and interactive rendering of individual real static objects, such operations are still difficult for real and complicated scenes.
For dynamic scenes, the huge amount of data and vast amount of viewpoints
to be provided present one of the major challenges to IBR. Advanced
algorithms for processing and manipulation of the high dimensional
representation to achieve such functions as object extraction, model
completion, scene inpainting, etc. are all major challenges to be
addressed. Finally, the efficient transmission, compression and display
of dynamic IBR and models are also urgent issues waiting for
satisfactory solution in order for IBR to establish itself as an essential
media for communication and presentation. All of these motivate us to
study the design and construction of new image-based rendering systems
based on plenoptic videos. The system can potentially provide improved
viewing freedom to users and ability to cope with moving and static
objects and perform 3D reconstruction.

2.5 Summary
In this chapter, the basic topics in image-based rendering have been
reviewed. The plenoptic function, which serves as an important concept for describing visual information in our world, was introduced. Then a brief review of the light field was given. How to achieve high quality rendering and display of light fields with a wide range of viewing positions in large-scale environments will be studied in Chapters 3 and 4. Finally,
some rendering techniques including point-based, layer-based, and
monolithic methods are discussed. An extension of these rendering
techniques will be further studied in Chapter 5.


Figure 2-3: Example renderings using (a) forward mapping in point rendering [Chan 2005], (b) layered representation (with two layers, dancer and background) [Chan 2009], (c) monolithic rendering using a 3D polygonal mesh (left) and rendering results (right) [Zhu 2010].

Chapter 3 The Proposed Image-based Rendering


Systems
3.1 Introduction
Both IBR systems are based on the simplified light field. As
mentioned earlier, two IBR systems are constructed and studied in this
thesis, one for capturing and rendering ancient Chinese artifacts and the
other for environmental modeling. They belong to the general class of
image-based representations. Since capturing 3D models in real-time is
still a very difficult problem, light field- or lumigraph-based dynamic
IBR representations with little amount of geometry information have
received considerable attention in immersive TV (also called 3D or
multi-view TVs) applications. Because of the multidimensional nature of
the plenoptic function and the scene geometry, much research has been
devoted to the efficient capturing, sampling, rendering and compression
of IBR. There has been considerable progress in these areas since the pioneering work on the lumigraph by Gortler et al. [Gort 1996] and the light field by Levoy and Hanrahan [Levo 1996]. Other IBR representations include the 2D panorama [Szel 1997, Pele 1997], Chen and Williams' view interpolation [Chen 1993], McMillan and Bishop's plenoptic modeling [McMi 1995], layered depth images [Shad 1998] and the 3D concentric mosaics [Shum 1996], etc. Motivated by the light field and lumigraph, the predecessors in the author's lab developed a real-time system for capturing and rendering a simplified dynamic light field called the plenoptic videos [Chan 2003], [Chan 2004], [Chan 2005], [Chan 2009], [Gan 2005] with four dimensions. It is a simplified dynamic light
field, where videos are taken along line segments as shown in Fig. 3-1,


instead of a 2D plane, to simplify the capturing hardware for dynamic


scenes.

Figure 3-1: Plenoptic videos: Multiple linear camera array of 4D simplified dynamic
light field with viewpoints constrained along line segments. The camera arrays
developed in [Chan 2009]. Each consists of 6 JVC video cameras.

Pioneering projects in cultural heritage preservation of large-scale structures and sculptures include the Digital Michelangelo Project [Levo 2002], the 3D facial reconstruction and visualization of ancient Egyptian mummies [Atta 1999], and the Great Buddha Project [Ikeu 2003], to name
just a few. To avoid possible damage to the ancient artifacts and speed
up the capturing process, we propose to employ the image-based
approach instead of using 3D laser scanners. A circular array consisting
of multiple digital still cameras (DSCs) was therefore constructed in this
thesis to capture the simplified light field of the ancient artifacts along
circular arcs, which we shall call the simplified circular light field
(SCLF) or circular light field (CLF) in short. The circular array is chosen
to provide users with a better visual experience, because it supports a fly-over effect and close-ups of the artifacts uniformly in the angular domain.
We also developed novel techniques for rendering new views of the
ancient artifacts from the images captured using the object-based
approach. The details will be discussed later in Chapters 4 and 5. A
number of ancient Chinese artifacts from the University Museum and

Art Gallery at The University of Hong Kong were captured and


excellent rendering results in ordinary as well as 3D/multiview displays
were achieved. The proposed IBR system and associated algorithms serve as a framework for the cultural preservation of medium-sized ancient artifacts.
While a considerable number of IBR systems have been proposed previously, few of them are moveable. Therefore, another objective of this thesis is to design a moveable IBR system for modeling objects in outdoor environments. The proposed moveable IBR system uses a linear camera array consisting of 8 video cameras mounted on an electrically controllable wheel chair. Its motion can be controlled manually or remotely by means of additional hardware circuitry. Unlike previous multiple-camera systems, which are not designed to be moveable, so that their viewpoints are somewhat limited and they usually cannot cope with moving objects or perform 3D reconstruction of objects in an open environment, our moveable image-based rendering system can be used to render large environments and moving objects. In particular, the
system supports object-based rendering and 3D reconstruction capability
and consists of two main components. 1) A novel view synthesis
algorithm using a new segmentation and mutual-information (MI)-based
algorithm for dense depth map estimation, which relies on segmentation,
LPR-based depth map smoothing and MI-based matching algorithm to
iteratively estimate the depth map. The method is very flexible and both
semi-automatic and automatic segmentation methods can be employed.
They rank fourth and sixth, respectively, in the Middlebury comparison
of existing depth estimation methods. This allows high quality
renderings of outdoor scenes with improved mobility/freedom to be
obtained. 2) A new 3D reconstruction algorithm which utilizes

sequential-structure-from-motion (S-SFM) technique and the dense


depth maps estimated previously. It relies on a new iterative point cloud
refinement algorithm based on Kalman filter (KF) for outlier removal
and the segmentation-MI-based algorithm to further refine the
correspondences and the projection matrices. The mobility of our system
allows us to recover more conveniently 3D model of static objects from
the improved point cloud using a new robust Radial basis function
(RBF)-based modeling algorithm to further suppress possible outliers
and generate smooth 3D meshes of objects. Experimental results show
that the proposed 3D reconstruction algorithm significantly reduces the
adverse effect of the outliers and produces high quality renderings using
shadow light field and the model reconstructed. The details will be
discussed later in Chapters 4 and 5.
The rest of this chapter is devoted to the general design and
construction of the systems. More precisely, Section 3.2 is devoted to the
design and configuration of the IBR systems. Section 3.3 presents some pre-processing steps, including camera calibration and color-tensor-based segmentation and matting. Finally, conclusions are drawn in Section 3.4.

3.2 Construction of the Proposed IBR systems


3.2.1 Still Camera System
As mentioned previously in Section 3.1, the first prototype system
consists of an array of 13 Canon 550D cameras mounted on a camera
stand. The images/videos are captured and then processed and viewed on multiview TVs. A circular array is chosen to provide users with a better visual experience, because it emulates fly-over and rotate kinds of special effects. Fig. 3-2 shows the proposed capturing system. Fig. 3-3 shows some snapshots captured by this system, called Buddha and Dragon Vase. The resolution of these images is 3465x2304.


The operation flow is illustrated in Fig. 3-4. Firstly, the objects are captured by this system from different angles. Then we segment the objects using the color tensor, which is insensitive to shadow and shading. Natural matting can be adopted to improve the rendering quality when objects are composited onto other backgrounds. From the segmented objects, approximate geometry information for each object can be estimated by point-based matching for rendering and 3D reconstruction. Finally, other rendering techniques such as shadow-field relighting and view dependent texture mapping are added in the rendering. The details of these algorithms will be discussed in the rest of the current chapter and the next chapter.

Figure 3-2: Circular camera array constructed.


Figure 3-3: Snapshots: (a) Buddha, (b) Dragon Vase.


Figure 3-4: Block diagram of the proposed IBR system.

3.2.2 Moveable Camera System


The second, moveable IBR (M-IBR) system consists of a linear array of cameras mounted on an electrically controllable wheel chair so as to cope with moving objects in large environments and hence improve the viewing freedom of users. Fig. 3-5 shows the moveable IBR system that we have constructed. It consists of a linear array of 8 Sony HDR-TGIE high definition (HD) video cameras mounted on an FS122LGC wheel chair.
The motion of the wheel chair is originally controlled manually through a VR2 joystick and power controller modules from PG Drives Technology [PGDT]. To make it electronically controllable, we examined the output of the joystick and generated the (x, y) motion control voltages to the power controller using a Devasys USB-I2C/IO [USBI] micro-controller unit (MCU). By appropriately controlling these
voltages, we can control the motion of the wheel chair electronically.


Figure 3-5: The proposed moveable image-based rendering system.

Moreover, by using the wireless LAN of a portable notebook mounted


on the wheel chair, its motion can be controlled remotely. By improving
the mobility of the IBR capturing system, we are able to cope with
moving objects in large environments.
The HD videos are captured in real-time into the storage cards of
cam-corders. They can be downloaded to PC for further processing such
as calibration, depth estimation, and rendering using the object-based
approach. For real-time transmission, the cam-corders are equipped with
a composite video output which can be further compressed and
transmitted. To illustrate the concept of multiview conferencing, a
ThinkSmart IVS-MV02 Intelligent Video Surveillance system [IVS] was used to compress the (320x240) 30 frames/sec videos online, which can be retrieved remotely through the wireless LAN for viewing or further processing. The system is built around an Analog Devices DSP and performs real-time compression at a bit rate of 400 kbps.
Before the cameras can be used for depth estimation, they must be
calibrated to determine the intrinsic parameters as well as their extrinsic
parameters, i.e. their relative positions and poses. This can be
accomplished by using a sufficiently large checkerboard calibration pattern. We follow the plane-based calibration method [Zhan 2000] to determine the projective matrix of each camera, which connects the world coordinates and the image coordinates. The projection matrix of a camera allows a 3D point in the world coordinates to be mapped to the corresponding 2D coordinates in the image captured by that camera. This will facilitate depth estimation. Fig. 3-6 shows snapshots of the outdoor and indoor videos captured by the proposed system, called podium and presentation, respectively. The resolution of these real-scene videos is 1920×1080i at 25 frames per second (fps) in 24-bit
RGB format. The system flow of the proposed moveable IBR system is
summarized in Fig. 3-7. Firstly we need to stabilize the video to reduce
the shaky motion frequently encountered in typical moveable IBR
systems. Then, a novel view synthesis algorithm using a new
segmentation and mutual-information (MI)-based algorithm for dense
depth map estimation is used to iteratively estimate the depth map.
Finally we need to reconstruct the 3D model using a new 3D
reconstruction algorithm which utilizes sequential-structure-from-motion
(S-SFM) technique and the dense depth maps estimated previously. A
new robust radial basis function (RBF)-based modeling algorithm is
used to further suppress possible outliers and generate smooth 3D
meshes of objects.

(a)

(b)
Figure 3-6: Snapshots of the plenoptic videos at a given time instance: (a) is the
Podium outdoor video from camera 1 to camera 4 and (b) is the Presentation
indoor video from camera 1 to camera 4.


[Figure 3-7 blocks: Video stabilization → Segmentation-MI-based depth estimation → Depth map refinement → 3D reconstruction → Image-based rendering]

Figure 3-7: Block diagram of the proposed M-IBR system constructed.

3.3 Pre-Processing
3.3.1 Still Camera System
In order to speed up the whole processing procedure of the proposed still camera system, some pre-processing needs to be done at the start. First, all the cameras need to be calibrated. Because this system is static, the intrinsic and extrinsic parameters of the cameras can be obtained precisely by following the plane-based calibration method. The proposed still camera system focuses only on the objects we are interested in. These objects are segmented out of the images to reduce interference from the background.
3.3.1.1 Camera Calibration
In computer vision, the link between 3D real-world points and image pixels is given by the camera parameters. The camera parameters comprise the extrinsic parameters and the intrinsic parameters. Estimation of the extrinsic and intrinsic parameters is called camera calibration [Truc

1998]. The extrinsic parameters define the transformation between the camera reference frame and the world reference frame. A 3D translation vector $T$ and a $3 \times 3$ rotation matrix $R$ are used to represent the extrinsic parameters [Truc 1998]. The relationship (see Fig. 3-8) between a point in the world frame and the camera frame is

$P_c = R(P_w - T)$ .  (3-3-1)

Figure 3-8: Relationship between the world coordinate and the camera coordinate.

The intrinsic parameters are defined in the form of the camera matrix $C$:

$C = \begin{bmatrix} f_x & d & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$,  (3-3-2)

where $f_x$ and $f_y$ represent the focal lengths of the camera in the x and y directions, $c_x$ and $c_y$ are the coordinates of the principal point, and $d$ is the skew parameter, which is zero for an ideal pinhole camera. Because the exact camera characteristics are uncertain, the skew parameter is retained and estimated. By combining the extrinsic parameters and the intrinsic parameters, the perspective projection equation becomes:


$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = C R [I \mid -T] \begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}$,  (3-3-3)

where $(x_c, y_c, z_c)$ is the point in the image coordinate system and $(X_W, Y_W, Z_W, 1)$ is the point in the world coordinate system. Both coordinate systems are expressed in homogeneous coordinates. By defining the projective matrix P as

$P = C R [I \mid -T]$,  (3-3-4)

where P is a $3 \times 4$ matrix, equation (3-3-3) can be rewritten as


XW
xc

y P YW .
c
ZW
zc

1

(3-3-5)

Camera calibration is to estimate the matrix P. Zhang has proposed an algorithm for camera calibration using planar patterns [Zhan 1999]. The planar pattern is usually chosen as a chessboard-like plane as shown in Fig. 3-9, which is used in our system.


Figure 3-9: Planar pattern.

The basic procedure of Zhang's algorithm is:

1. Take a few images of the test pattern in different orientations.
2. Detect the feature points in the test images (often the corners).
3. Estimate the five intrinsic parameters (no skew parameter) and the extrinsic parameters by using the closed-form solutions.
4. Estimate the radial distortion by solving a linear least-squares problem.
5. Refine all the parameters by minimizing an error function.
In this work, the plane-based algorithm is changed slightly to fit our situation: the skew parameter is included, and the distortion is not estimated in the first pass. When distortion is estimated, not only the radial distortion but also the tangential distortion is estimated, as sketched below.
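To make the calibration step concrete, the following is a minimal sketch using OpenCV's implementation of Zhang's plane-based method. The board dimensions, square size and image folder are illustrative assumptions, not the actual settings used in this work; note also that OpenCV estimates the radial and tangential distortion coefficients by default, while the skew parameter requires extra flags.

```python
# Minimal calibration sketch (assumed board size and file names).
import glob
import cv2
import numpy as np

board_cols, board_rows = 9, 6          # inner corners of the checkerboard (assumed)
square_size = 25.0                      # square size in mm (assumed)

# 3D coordinates of the corners on the planar pattern (Z = 0)
obj_pts = np.zeros((board_rows * board_cols, 3), np.float32)
obj_pts[:, :2] = np.mgrid[0:board_cols, 0:board_rows].T.reshape(-1, 2) * square_size

object_points, image_points = [], []
for fname in glob.glob("calib_images/*.jpg"):   # hypothetical image folder
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, (board_cols, board_rows))
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        object_points.append(obj_pts)
        image_points.append(corners)

# Closed-form initialization followed by nonlinear refinement of the
# intrinsic matrix C, the distortion coefficients and the per-view extrinsics.
rms, C, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, gray.shape[::-1], None, None)
print("reprojection RMS error:", rms)
print("camera matrix C:\n", C)
```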
3.3.1.2 Color-Tensor-based Segmentation and Matting
The first step to process these images is to segment the objects out
of the images. In the still camera system, we employ the photometric
invariant features [Weij 2006] to extract the foreground from the

monochromatic screen background. More precisely, the color tensor


describes the local orientation of a color vector f(x, y) as:
$T(x, y) = \begin{bmatrix} f_x^T f_x & f_x^T f_y \\ f_y^T f_x & f_y^T f_y \end{bmatrix}$,  (3-3-6)

where f(x, y) is a vector which contains the color component values at position (x, y), and the subscripts x and y in $f_x(x, y)$ and $f_y(x, y)$ denote respectively the derivative of f(x, y) with respect to x and y, the image coordinates. According to [Weij 2006], the color vector can be seen as a weighted sum of two component vectors: $[R, G, B]^T = e(m_b \mathbf{c}_b + m_i \mathbf{c}_i)$, where $\mathbf{c}_b$ is the color vector of the body reflectance, $\mathbf{c}_i$ is the color vector of the interface reflectance (i.e. specularities or highlights), $m_b$ and $m_i$ are scalars representing the corresponding magnitudes of reflection, and $e$ is the intensity of the light source. Thus

$[R, G, B]_x^T = e m_b (\mathbf{c}_b)_x + (e_x m_b + e (m_b)_x)\mathbf{c}_b + (e (m_i)_x + e_x m_i)\mathbf{c}_i$,  (3-3-7)

which suggests that the spatial derivative is a sum of three weighted vectors, successively caused by body reflectance, shading-shadow and specular changes. For matte surfaces, the intensity of interface reflectance is zero (i.e. $m_i = 0$) and the projection of the spatial derivative $f_x$ on the shadow-shading axis is the shadow-shading variant containing all energy which can be explained by changes due to shadow and shading. The shadow-shading axis direction is $\mathbf{c}_b$, which is parallel to $\mathbf{f} = e m_b \mathbf{c}_b$ for matte surfaces. So the projection $\mathbf{s}_1$ of the spatial derivative $f_x$ on the shadow-shading axis is


$\mathbf{s}_1 = (f_x^T \mathbf{f} / \|\mathbf{f}\|)\, \mathbf{f} / \|\mathbf{f}\|$ .  (3-3-8)

Subtraction of the shadow-shading variant $\mathbf{s}_1$ from the total derivative $f_x$ results in the shadow-shading quasi-invariant $\mathbf{s}_2 = f_x - \mathbf{s}_1$. In summary, the spatial derivative can be separated into a shadow-shading variant part $\mathbf{s}_1$ and a shadow-shading quasi-invariant part $\mathbf{s}_2$. The shadow-shading quasi-invariant part does not contain the derivative energy caused by shadows and shading. To construct a shadow-shading-specular quasi-invariant, this part is combined with the hue direction, which is perpendicular to the light source direction $\mathbf{c}_i$ and the shadow and shading direction $\mathbf{c}_b$. Therefore the hue direction is

$\mathbf{h} = (\mathbf{c}_i \times \mathbf{c}_b) / \|\mathbf{c}_i \times \mathbf{c}_b\|$ .  (3-3-9)

The projection of the derivative on the hue direction is the desired


shadow-shading-specular-quasi-invariant part:

$\mathbf{H} = (f_x^T \mathbf{h} / \|\mathbf{h}\|)\, \mathbf{h} / \|\mathbf{h}\|$ .  (3-3-10)

By replacing $f_x$ in the color tensor equation (3-3-6) by $\mathbf{s}_2$ or $\mathbf{H}$, we can obtain the shadow-shading quasi-invariant color tensor and the shadow-shading-specular quasi-invariant color tensor, respectively. By setting a suitable
threshold value for the color tensor, we can detect the boundary of the
object. Fig. 3-10 shows some segmentation results that were obtained
using the color tensor method, followed by Bayesian matting for
extracting a foreground from the background. After segmentation, the
hard boundary of the object will be obtained. Matting can then be
applied to obtain soft segmentation information, called the matte, of the
object. The matte, which is an image containing the portion of
foreground with respect to the background (from 0 to 1) at a particular

location, greatly improves the visual quality of mixing the objects onto
other backgrounds.
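To make the quasi-invariant computation concrete, the following is a simplified sketch (not the exact implementation used in this work): it computes the spatial color derivatives, removes their component along the shadow-shading axis to form $\mathbf{s}_2$ as in (3-3-8), builds the corresponding color tensor of (3-3-6), and thresholds its trace to obtain a rough boundary map. The input file name and the threshold are illustrative assumptions.

```python
# Sketch of the shadow-shading quasi-invariant color tensor.
import numpy as np
import cv2

img = cv2.imread("buddha_view01.png").astype(np.float64)  # hypothetical input
f = img                                                    # H x W x 3 color vector field

# Spatial derivatives of each color channel
fx = np.stack([cv2.Sobel(f[..., c], cv2.CV_64F, 1, 0, ksize=3) for c in range(3)], -1)
fy = np.stack([cv2.Sobel(f[..., c], cv2.CV_64F, 0, 1, ksize=3) for c in range(3)], -1)

# Shadow-shading direction is parallel to f itself (matte surface assumption)
norm = np.linalg.norm(f, axis=-1, keepdims=True) + 1e-8
f_hat = f / norm

def quasi_invariant(d):
    """Remove the derivative component along the shadow-shading axis (s2 = d - s1)."""
    s1 = np.sum(d * f_hat, axis=-1, keepdims=True) * f_hat   # eq. (3-3-8)
    return d - s1

s2x, s2y = quasi_invariant(fx), quasi_invariant(fy)

# Color tensor (3-3-6) built from the quasi-invariant derivatives
Txx = np.sum(s2x * s2x, axis=-1)
Tyy = np.sum(s2y * s2y, axis=-1)

# Simple edge strength: trace of the tensor; threshold to get object boundaries
edge_strength = Txx + Tyy
boundary = edge_strength > 0.05 * edge_strength.max()        # illustrative threshold
```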

(a)

(b)
Figure 3-10:(a) Extraction results using color-tensor-based method. Left: original,
middle: hard segmentation, right: after matting. (b) Close up of segmentations in
(a). Left: hard segmentation, Right: after matting.

3.3.2 Moveable Camera System


Unlike the static camera system described before, the moveable
camera system will experience shaky motion during movement and
hence video stabilization has to be performed.
3.3.2.1 Video Stabilization
To ensure good tracking of objects and to obtain more image
samples for high quality rendering, the wheel chair is usually driven
steadily during capturing. However, one problem with the M-IBR system is

that the ground surfaces may not be smooth and the whole mechanical
structure can vibrate considerably during movement. In our M-IBR
system, the shaky motion of the camera array in the outdoor environment comes mainly from the roughness of the ground surfaces and the vibration of the mechanical structure during movement. Besides, the captured video may also appear shaky when the system is moving and about to settle down in an indoor environment. To reduce these
annoying effects, video stabilization [Hu 2007], [Mats 2005], [Mats
2006], [Rata 1998] is frequently employed to eliminate the undesired
motion fluctuation in the captured videos.
As mentioned above, our M-IBR system was driven steadily during
capturing. Therefore, the undesired motion fluctuation will usually
appear as high frequency components compared to the intentional
motion. As a result, the problem of video stabilization can also be
viewed as the removal of high frequency components in the estimated
velocity. To this end, one needs to estimate the global motion of the
camera, say by means of optical flow on the video sequence, so that this
annoying high frequency local motion can be removed to stabilize the
videos.
The proposed video stabilization algorithm is divided into three
major steps as follows. 1) Global motion estimation: firstly, the
geometric transformation between a location $\mathbf{x} = [x_1, x_2]^T$ in a frame and the corresponding location $\mathbf{x}'$ in an adjacent frame is modeled by an affine transformation $\mathbf{x}' = T[\mathbf{x}] = A\mathbf{x} + \mathbf{t}$, where $\mathbf{t} = [t_{x_1}, t_{x_2}]^T$ is the translational component and the affine rotation, scaling, and stretch are represented by the matrix $A = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix}$. In homogeneous coordinates, $\mathbf{x}_h = [x_1, x_2, 1]^T$, $T$ can be conveniently represented by a matrix multiplication $T_h \mathbf{x}_h$, where $T_h = \begin{bmatrix} A & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix}$. $T$ is estimated from the tracked features in adjacent video frames using the scale invariant feature transformation (SIFT) [Lowe 2004], instead of the Lucas-Kanade tracker in [Chan 2010]. 2) Local
smoothing of motion: the intentional motion, which is assumed to be
slow and smooth, is then obtained by smoothing the global motion
estimated using local polynomial regression (LPR) with adaptive
bandwidth selection [Zhan 2009]. Unlike conventional methods, the
bandwidth or window size for smoothing can be automatically
determined. This will be further discussed below. 3) Video Completion:
the uncovered areas are filled using motion inpainting [Mats 2005],
[Mats 2006].
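A minimal sketch of step 1 (global motion estimation between two adjacent frames) is given below, using SIFT feature matching followed by a robust affine fit. The ratio-test value and the RANSAC threshold are illustrative assumptions; OpenCV's robust fit is used here as a stand-in for the exact estimation procedure of this work.

```python
# Sketch of SIFT-based global affine motion estimation between frames.
import cv2
import numpy as np

def estimate_global_affine(prev_gray, curr_gray):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(prev_gray, None)
    kp2, des2 = sift.detectAndCompute(curr_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe ratio test

    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])

    # 2x3 matrix [A | t]; rows correspond to (a1, a2, t_x1) and (a3, a4, t_x2)
    M, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
    # Homogeneous form T_h = [[A, t], [0, 0, 1]]
    return np.vstack([M, [0.0, 0.0, 1.0]])
```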
We now describe each step in more detail. Let $\{I_t(\mathbf{x}) \mid t = 0, \ldots, N\}$, where $\mathbf{x} = [x_1, x_2]^T$, $1 \le x_1 \le \Gamma_1$, $1 \le x_2 \le \Gamma_2$, be a video sequence consisting of $N$ video frames with resolution $\Gamma_1 \times \Gamma_2$ captured by our M-IBR system. Consider the global motion transformations up to time instant $t$, $\{T_0^1, \ldots, T_t^{t+1}\}$, where $T_i^{i+1}$ is the coordinate transformation from the $i$-th to the $(i+1)$-th frame. If each $T_i^{i+1}$ is smoothed separately, a smoothed transformation chain $\{\bar{T}_0^1, \ldots, \bar{T}_t^{t+1}\}$ is obtained and the $t$-th compensated image frame $I'_t$ can be obtained as:

$I'_t\big(\prod_{i=t-1}^{0} (T_{i+1}^{i} \bar{T}_i^{i+1})[\mathbf{x}]\big) = I_t(\mathbf{x})$,  (3-3-11)

where $T_{i+1}^{i}$ and $\bar{T}_i^{i+1}$ denote respectively the transformation from frame $i+1$ to $i$ and the smoothed transformation from frame $i$ to $i+1$. In order to avoid error accumulation due to the cascade of original and smoothed transformation chains, [Mats 2005] proposed to compute directly the transformation $\tilde{T}_t$ from the current frame $I_t(\mathbf{x})$ to the corresponding motion-compensated frame $I'_t(\mathbf{x})$ using only the neighboring transformation matrices as $\tilde{T}_t = \sum_{i \in \mathcal{N}_t} T_t^i \ast G(i)$, where $\mathcal{N}_t = \{f : t - \beta \le f \le t + \beta\}$ denotes the indices of neighboring frames, $G(x) = (\sqrt{2\pi}\sigma)^{-1} e^{-x^2 / 2\sigma^2}$ is a Gaussian kernel, $2\beta$ is the support of $\mathcal{N}_t$ or window size, and $\ast$ denotes the element-wise convolution operation.


It can be seen that the selection of the kernel size affects the degree
of smoothing. A large kernel size will lead to the problem of over-smoothing, while a small kernel size may not be able to remove the high frequency undesirable motion. The green and black lines in Fig. 3-11 illustrate the effect of using a small kernel size of 3 and a large kernel size of 20, respectively, using the method in [Mats 2005].
To address this issue, we propose a new method for choosing
adaptively the kernel size using local polynomial regression (LPR) with
adaptive bandwidth selection. The close relationship between curve
fitting and video stabilization has been recognized for example in [Hu
2007], where a local parabolic fitting is used to compute the smoothed
motion path. However, the kernel size is also fixed. The advantage of
our method is that the kernel size can be adaptively selected from the
data.
LPR is a very flexible and efficient nonparametric regression
method in statistics, and it has been widely applied in many research
areas such as data smoothing, density estimation, and nonlinear
modeling. Given a set of noisy samples of a signal, the data points are
fitted locally by a polynomial using the least-squares (LS) criterion with

a kernel function having certain bandwidth parameters. Since signals


may vary considerably over time, it is crucial to choose a proper kernel size or local bandwidth to achieve the best bias-variance tradeoff. In this thesis, we use the refined intersection of confidence intervals (R-ICI) method to perform bandwidth selection. Here, we follow the homoscedastic data model of the time series:

$Y_i = m(X_i) + \sigma(X_i)\varepsilon_i$,  (3-3-12)

where $\{(Y_i, X_i) \mid i = 1, 2, \ldots, n\}$ is a set of univariate observations,


$m(X_i)$ is a smooth function specifying the conditional mean of $Y_i$ given $X_i$, and $\varepsilon_i$ is an independent identically distributed (i.i.d.) additive white Gaussian noise. The problem is to estimate $m(X_i)$ and its $k$-th derivative $m^{(k)}(X_i)$ from the noisy samples $Y_i$ so as to achieve smoothing. Since $m(X_i)$ is a smooth function, we can approximate it
locally as a general degree-$p$ polynomial at a given point $x_0$:

$m(x) \approx m(x_0) + m'(x_0)(x - x_0) + \dfrac{m''(x_0)}{2!}(x - x_0)^2 + \cdots + \dfrac{m^{(p)}(x_0)}{p!}(x - x_0)^p = \beta_0 + \beta_1 (x - x_0) + \cdots + \beta_p (x - x_0)^p$,  (3-3-13)
where $x$ is in the neighborhood of $x_0$ and $\beta_k$ ($k = 0, 1, \ldots, p$) is the $k$-th polynomial coefficient. The coefficient vector $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_p]^T$ at location $x_0$ can be obtained by solving the following weighted least-squares (WLS) regression problem:

$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} K_h(X_i - x_0) \Big[ Y_i - \sum_{k=0}^{p} \beta_k (X_i - x_0)^k \Big]^2$,  (3-3-14)

where $K_h(X_i - x_0) = K\big((X_i - x_0)/h\big)/h$, and $K(\cdot)$ is a kernel function with bandwidth parameter $h$, which emphasizes the influence of neighboring observations around $x_0$ in the estimation. The parameter $h$ is adaptively chosen at different locations $x_0$ so as to adapt to the local characteristics of the signal (i.e. the intentional motion path). Differentiating the objective function in (3-3-14) with respect to $\boldsymbol{\beta}$ and setting the derivative to zero, we get the following LS solution in matrix form:

$\hat{\boldsymbol{\beta}}(x_0, h) = (X^T W X)^{-1} X^T W \mathbf{y}$,  (3-3-15)

where
$X = \begin{bmatrix} 1 & (X_1 - x_0) & \cdots & (X_1 - x_0)^p \\ 1 & (X_2 - x_0) & \cdots & (X_2 - x_0)^p \\ \vdots & \vdots & & \vdots \\ 1 & (X_n - x_0) & \cdots & (X_n - x_0)^p \end{bmatrix}$, $\mathbf{y} = [Y_1, Y_2, \ldots, Y_n]^T$, and $W = \mathrm{diag}\{K_h(X_i - x_0)\}$ is the weighting matrix.

By estimating $\hat{\boldsymbol{\beta}}(x_0, h)$ with an optimized bandwidth $h$ at different $x_0$, we obtain a smoothed representation of the data from the noisy observations. In the context of video stabilization, a key problem of applying LPR is thus to select an optimal bandwidth parameter $h$ to achieve the best bias-variance tradeoff in estimation. Here, we use the R-ICI bandwidth selection algorithm [Zhan 2008] to select the optimal
bandwidth. The basic idea of the R-ICI adaptive bandwidth selection
method is to calculate a set of smoothing results with different
bandwidths and then to examine a sequence of confidence intervals of
these smoothing results to determine and refine the optimal bandwidth.
In this thesis, the kernel $K(u)$ is chosen as the Epanechnikov kernel $K(u) = (3/4)(1 - u^2)$ for $|u| \le 1$, and the bandwidth parameter set for the R-ICI rule is $\{h_j : h_j = a^j / N,\ j = 1, \ldots, 10\}$ with $a = 1.2$, where $N$ is the total number of frames. The details of the algorithm are omitted and interested readers are referred to [Zhan 2008].
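For illustration, the following sketch performs degree-1 local polynomial regression of a 1D motion path with an Epanechnikov kernel. A fixed bandwidth is used for clarity; the R-ICI rule of [Zhan 2008] would instead select h adaptively at each location. The synthetic motion signal at the end is an assumption for demonstration only.

```python
# Minimal LPR smoothing of a 1D motion path (fixed bandwidth for clarity).
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def lpr_smooth(x, y, h, degree=1):
    """Return the LPR estimate m_hat(x0) at every sample location x0 in x."""
    y_hat = np.empty_like(y, dtype=float)
    for j, x0 in enumerate(x):
        w = epanechnikov((x - x0) / h) / h                   # kernel weights K_h
        X = np.vander(x - x0, degree + 1, increasing=True)   # design matrix
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)     # WLS solution (3-3-15)
        y_hat[j] = beta[0]                                   # m_hat(x0) = beta_0
    return y_hat

# Example: a smooth intentional motion corrupted by high-frequency jitter
t = np.arange(200, dtype=float)
intentional = 0.05 * t + 5.0 * np.sin(t / 40.0)
shaky = intentional + np.random.normal(scale=0.8, size=t.shape)
smoothed = lpr_smooth(t, shaky, h=12.0)
```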
After video stabilization, some of the pixels at the boundaries may
be missing as illustrated in the second row of Fig. 3-12. These missing
areas can be filled in or completed by video completion using motion
inpainting [Mats 2005], [Mats 2006], which can propagate the motion field into the missing
areas rather than simply propagating the RGB color values.
Figs. 3-11(a) and (b) show the original and smoothed translational
motion in the x and y directions (i.e. $t_{x_1}$ and $t_{x_2}$ in $T$). Green and black
lines show the motion obtained by a fixed small kernel and a fixed large
kernel using the method in [Mats 2005] respectively. The red lines show
the motion obtained by using the LPR with the R-ICI (LPR-R-ICI)
bandwidth selection method over time. It can be seen that the proposed
LPR-R-ICI method is able to suppress the high frequency components
while preserving the smooth intentional motion. Example original
images, stabilized results and inpainted results using the proposed
method are shown in Fig. 3-12.


(a)

(b)

Figure 3-11: Motion smoothing results for horizontal (Translation-x) and vertical
(Translation-y) directions. The original motion path and the smoothed motion path
with different methods are shown. In (a)-(b), the blue dotted lines correspond to the
shaky original motion path. Green and black lines correspond to the smoothed
motion path using the method in [Mats 2005] with a small and a large kernel sizes
respectively.

Figure 3-12: Video Stabilization result. The first row shows the original images
captured by our system; the second row shows the stabilized images without video
completion; the third row shows the completed results.


3.4 Summary
The design and construction of the two proposed IBR systems and
their associated processing flow are presented. The first still IBR system
is designed for capturing ancient Chinese artifacts. This system can be
used for preservation and dissemination of cultural artifacts with high
digital quality. The second moveable IBR system is designed for moving objects and larger environments, which can potentially provide improved
viewing freedom to users. Moreover, some pre-processing techniques
including camera calibration, color-tensor-based segmentation and
matting are presented. After camera calibration, we can obtain the
camera matrix for subsequent 3D reconstruction. Color-tensor-based
segmentation is insensitive to shadow and shading. Natural matting can be adopted to improve the rendering quality when objects are composited onto other backgrounds. Experimental results show that these algorithms work well even in the presence of some shadow and shading. These pilot
studies provide useful experience for the design and construction of
similar and more general IBR systems.


Chapter 4 A New Combined Segmentation-Mutual-Information (MI)-based Algorithm for Dense Depth Map Estimation
4.1 Introduction
Conventional depth estimation techniques are mostly based on
computing the correspondences from stereo or multiple views using
feature point matching. More recent algorithms employ Markov Random
Field (MRF) [Zhan 2005] to model the observation and estimate the
depth map by maximizing the a posteriori probability (MAP). In particular, Graph Cuts (GC)-based [Boyk 2001] and Belief Propagation (BP)-based [Sun 2005] methods for performing the optimization have been widely used because of their good performance.
The success of these methods depends critically on how physical phenomena such as occlusion, edges and color correlation are modeled.
Techniques such as occlusion penalization [Kolm 2001], visibility
checking [Sun 2005], [Yang 2009] and structural information [Yang
2009], [Klau 2006], [Wang 2008], [Bley 2010], [Tagu 2008] are areas of
active research. Another popular direction is to combine segmentation
with GC or BP [Yang 2009], [Klau 2006], [Wang 2008], [Bley 2010],
[Tagu 2008]. In this thesis, we propose a modified MI-based dense matching algorithm that utilizes prior segmentation information. The segmentation information, which can be obtained semi-automatically or automatically, considerably reduces possible matching errors arising from occlusion. Apart from its flexibility, the proposed algorithm only involves the selection of a few parameters and it works well for both indoor and outdoor scenes. Moreover, the algorithm can be extended to
track the boundary of the object. Experimental results show that its

performance is reliable even for noisy videos such as dynamic


ultrasound images. Though the algorithm can be used for multiview images and videos, we shall largely base our discussion on the moveable IBR system with multiple videos. For the static IBR system, the ancient Chinese artifacts contain much texture information and hence a simpler and faster dense depth map estimation method is proposed. The details will be described in Chapter 5.
This chapter is organized as follows: Section 4.2 is devoted to the
combined segmentation-MI-based dense depth map estimation. Depth
map refinement such as noise removal and smoothing will be introduced
in Section 4.3. The MI-based matching will be extended to object
tracking in Section 4.4 using noisy dynamic ultrasound images as an illustration. More comparisons and experimental results are shown in Section 4.5. Finally, conclusions are drawn in Section 4.6.

4.2 Combined Segmentation-MI-based Depth Estimation


Since we wish to perform image-based rendering and 3D
reconstruction of selected object(s) in the scene using multiview videos,
the first step is to establish dense 2D correspondences between adjacent
views so as to generate a dense point cloud for 3D reconstruction or
depth maps for rendering. In our M-IBR system, there are 8 cameras and
hence 8 views are obtained at each time instance for depth estimation.
Here, a modified MI-based dense matching algorithm with segmentation
is employed. Because we have segmented the images into several parts,
the whole matching process is performed on each image segment. In the
sequel, we shall use "image" to denote the segmented parts of the image.


4.2.1 Object Segmentation Using Level-Set Method


For rendering (intermediate view synthesis) and 3D reconstruction
of a given object in the scene, it is usually advantageous to segment the
object for further processing so as to preserve depth discontinuities.
Following the object-based approach [Chan 2010], we first segment the
object at a given frame using a semi-automatic segmentation method.
Object tracking is then employed to track and segment the object
automatically in subsequent frames and adjacent views. In our system,
the initial segmentation for each camera is performed by means of Lazy
snapping [Li 2004]. Then, the object at other time instants of each
camera is tracked using the level-set method [Gan 2005]. Example video
tracking results are shown in Fig. 4-1(a). It can be seen that major depth discontinuities are well delineated. If automatic segmentation methods are used to obtain the initial segmentation, a segmented part may sometimes cover several objects. Moreover, parts of the background and foreground may be grouped together. Though such a segmentation is in principle consistent in terms of image intensity, depth discontinuities may not be preserved as well. As an illustration, a graph-cut-based automatic segmentation method [Felz 2004] is used to obtain an initial segmentation for tracking. Figs. 4-1(a) and (b) compare example results using the semi-automatic and automatic-segmentation-based tracking. It can be seen that the former gives better results than the latter, for example at the light pole and foreground of the scene. To
obtain a better rendering result, the semi-automatic method will be
adopted in this work, though our framework also works for automatic
methods. We now describe the proposed depth estimation method.


(a)

(b)
Figure 4-1: Segmentation results using the level-set-based tracking method. (a) is the
initial segmentation obtained by lazy snapping, (b) is the initial segmentation
obtained by graph cut method.


4.2.2 Mutual Information Matching


The mutual information (MI) of two random variables $X$ and $Y$ is defined as

$I(X; Y) = \int_Y \int_X p(x, y) \log \dfrac{p(x, y)}{p_X(x)\, p_Y(y)}\, dx\, dy$,

where $p_X(x)$ and $p_Y(y)$ are the probability density functions (pdfs) of $X$ and $Y$. $I(X; Y)$ can also be expressed in terms of the entropy and joint entropy of $X$ and $Y$ as follows:

$I(X; Y) = H(X) + H(Y) - H(X, Y)$,  (4-2-1)

where $H(X, Y) = -\int_Y \int_X p(x, y) \log p(x, y)\, dx\, dy$ is the joint entropy of $X$ and $Y$, and $H(X) = -\int_X p(x) \log p(x)\, dx$ and $H(Y) = -\int_Y p(y) \log p(y)\, dy$ are the entropies of $X$ and $Y$, respectively. Intuitively, MI measures the
information that X and Y share by measuring how much knowing one of
these variables reduces the uncertainty about the other.
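As a concrete illustration, the following sketch computes the MI of two grayscale image patches using a discrete joint histogram. This is a simplified stand-in for the kernel (Parzen-window) pdf estimates used later in this chapter; the number of bins is an illustrative choice.

```python
# Mutual information of two patches via a joint histogram (illustrative).
import numpy as np

def mutual_information(a, b, bins=64):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_xy = joint / joint.sum()                 # joint pdf p(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal p_X(x)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal p_Y(y)
    nz = p_xy > 0                              # avoid log(0)
    return np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz]))

# Identical patches share maximal information; unrelated noise shares little.
patch = np.random.rand(64, 64)
print(mutual_information(patch, patch))                    # large value
print(mutual_information(patch, np.random.rand(64, 64)))   # close to zero
```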
In [Huan 2006], a free-form deformation method using a spline function was proposed for 2D shape registration in pattern recognition systems. We now extend it to our segmentation-MI-based depth
estimation algorithm. More precisely, the intensities of the two rectified image segments, say A and B, from two views to be matched are treated as random variables with pdfs $p_A(i_A)$ and $p_B(i_B)$, and joint pdf $p_{AB}(i_A, i_B)$. B is then deformed by means of the disparity transformation function $T(B)$ with parameters to be determined. Ideally, when the two images are registered, the MI, $I(A; T(B))$, will be maximized. Therefore, by maximizing $I(A; T(B))$ over the parameters of $T(\cdot)$, the two original image segments can be registered to infer their correspondences. Consequently, the entropies can be calculated from the probability density functions as follows:

$H(p_A(i_A)) = -\int_{I(A)} p_A(i_A) \log p_A(i_A)\, di_A = -\int_{I(A), I(B)} p_{A, T(B)}(i_A, i_B) \log p_A(i_A)\, di_A\, di_B$,  (4-2-2)

$H(p_{T(B)}(i_B)) = -\int_{I(B)} p_{T(B)}(i_B) \log p_{T(B)}(i_B)\, di_B = -\int_{I(A), I(B)} p_{A, T(B)}(i_A, i_B) \log p_{T(B)}(i_B)\, di_A\, di_B$,  (4-2-3)

$H(p_{A, T(B)}(i_A, i_B)) = -\int_{I(A), I(B)} p_{A, T(B)}(i_A, i_B) \log p_{A, T(B)}(i_A, i_B)\, di_A\, di_B$,  (4-2-4)

where $i_A$ and $i_B$ are the intensity variables of A and T(B) and their ranges are $I(A)$ and $I(T(B)) = I(B)$, respectively. The latter follows from the fact that $T(\cdot)$ is a disparity transformation which does not change the range of the intensity values. To proceed further, one needs to determine the corresponding pdfs. A powerful method for approximating the pdfs is the kernel method [Silv 1986], which approximates the pdfs directly from the image data as follows:
$p_A(i_A) = \dfrac{1}{V} \int_{\Omega} G_1(i_A - I_A(x_1, x_2))\, dx_1\, dx_2$,  (4-2-5)

$p_{T(B)}(i_B) = \dfrac{1}{V} \int_{\Omega} G_1(i_B - I_{T(B)}(x_1, x_2))\, dx_1\, dx_2$,  (4-2-6)

$p_{A, T(B)}(i_A, i_B) = \dfrac{1}{V} \int_{\Omega} G_2(i_A - I_A(x_1, x_2),\, i_B - I_{T(B)}(x_1, x_2))\, dx_1\, dx_2$,  (4-2-7)

where $G_1(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/(2\sigma^2)}$, $G_2(x_1, x_2) = \dfrac{1}{2\pi \sigma_1 \sigma_2} e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}$, $I_A(x_1, x_2)$ and $I_B(x_1, x_2)$ are the intensities of A and B at pixel location $\mathbf{x} = [x_1, x_2]^T$, $I_{T(B)}(x_1, x_2) = I_B(T^{-1}(x_1, x_2))$, $T^{-1}$ is the inverse function of T, $\Omega$ is the image domain and V is the area of $\Omega$. In practice, the integrals are approximated by summing over the pixel coordinates.
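The sketch below illustrates the Parzen-window estimates (4-2-5)-(4-2-7) with the integrals replaced by sums over pixels. The intensity grid and the kernel widths are illustrative assumptions, and the joint kernel $G_2$ is implemented in its separable form.

```python
# Kernel (Parzen-window) pdf estimates of eqs. (4-2-5)-(4-2-7), discretized.
import numpy as np

def kernel_pdfs(I_A, I_TB, levels=np.linspace(0.0, 1.0, 32), sigma=0.05):
    a = I_A.ravel()[None, :]          # 1 x V array of intensities of A
    b = I_TB.ravel()[None, :]         # 1 x V array of intensities of T(B)
    ia = levels[:, None]              # candidate intensity values i_A
    ib = levels[:, None]              # candidate intensity values i_B

    G_a = np.exp(-(ia - a) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    G_b = np.exp(-(ib - b) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    V = a.size
    p_A = G_a.mean(axis=1)            # (4-2-5): average of G1 over the image
    p_TB = G_b.mean(axis=1)           # (4-2-6)
    p_joint = (G_a @ G_b.T) / V       # (4-2-7): separable Gaussian G2
    return p_A, p_TB, p_joint
```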

Figure 4-2: A regular grid for Local Transformation.

For accurate matching, the transformation in our approach is carried


out in two steps, namely global and local transformations. In global
transformation, which is performed first, the parameters of a global
transformation are determined by matching the two images so as to
model their relative scale, translation and rotation. It can be derived as:

$E_{\text{global}} = \int_{I(A), I(B)} p_{A, T(B)}(i_A, i_B) \log \dfrac{p_{A, T(B)}(i_A, i_B)}{p_A(i_A)\, p_{T(B)}(i_B)}\, di_A\, di_B$.  (4-2-8)

In the local transformation step (refinement), local deformation is


performed, which is represented by a 2D spline function. The
transformation parameters, which are the displacement vectors at a
regular grid to interpolate the spline function (See Fig. 4-2), are
determined by minimizing the objective function in (4-2-8).
In the present work, the global deformation function TG is chosen as
an affine transformation. The model parameters can be obtained by
maximizing the objective function in (4-2-8). Let B' be the transformed
image obtained by the affine transformation after the first step. The local
transformation $T_L(B')$, which is a 2D spline function, is parameterized by the displacement vectors at a uniform grid of control points $\{C\}$, $P_c(m, n) = [P_{c_1}(m, n), P_{c_2}(m, n)]^T$, which are indexed by $m = 1, \ldots, M$, $n = 1, \ldots, N$. If $(\Gamma_1, \Gamma_2)$ is the resolution of the input image,



the spacings of the control points in the x and y directions are $\delta_1 = \Gamma_1 / M$ and $\delta_2 = \Gamma_2 / N$, respectively. The deformation of any pixel in the image is obtained by spline interpolation of those at the grid points $\{C\}$. Therefore, the deformation of pixel $(x_1, x_2)$, $P(x_1, x_2) = [P_1(x_1, x_2), P_2(x_1, x_2)]^T$, where $1 \le x_1 \le \Gamma_1$, $1 \le x_2 \le \Gamma_2$, can be written as:

$P(x_1, x_2) = \sum_{\kappa=0}^{3} \sum_{\lambda=0}^{3} \beta_\kappa(l)\, \beta_\lambda(v)\, P_c(m_\kappa, n_\lambda)$,  (4-2-9)

where $l = x_1/\delta_1 - \lfloor x_1/\delta_1 \rfloor$, $v = x_2/\delta_2 - \lfloor x_2/\delta_2 \rfloor$, $\beta_\kappa(u)$ is the cubic spline basis function, and $\{(m_\kappa, n_\lambda) \mid (\kappa, \lambda) \in [0, 3]\}$ are the neighboring control points of $(x_1, x_2)$. The collection of $P_c(m, n)$ forms the parameters of the transformation function $T_L(B')$ and $T_L(B'(x_1, x_2)) = B'(x_1 + P_1(x_1, x_2), x_2 + P_2(x_1, x_2))$. By substituting (4-2-9) into (4-2-8), one gets the local matching data term to be minimized:

$E_{\text{data}} = \int_{I(A), I(B)} p_{A, T_L(B')}(i_A, i_B) \log \dfrac{p_{A, T_L(B')}(i_A, i_B)}{p_A(i_A)\, p_{T_L(B')}(i_B)}\, di_A\, di_B$.  (4-2-10)

In order to reduce the variance of the variables, an additional smoothing term and other prior constraints can be added to the data term above. A popular smoothing term is the squared $L_2$ norm of the displacement vectors, $E_{\text{smooth}} = \sum_{(m, n)} \| P_c(m, n) - \bar{P}_c(m, n) \|^2$, where $\bar{P}_c(m, n)$ denotes the average of the neighboring control-point displacements. Since there are only a few model parameters in the affine transformation, the smoothing term is only required in local matching.
If the pair of images being registered does have distinct geometric
features as correspondences, incorporating these feature information can
greatly improve accuracy and efficiency. In our algorithm, the scale
invariant features can be used as a feature term. These constraints can be

conveniently integrated into our registration framework. More precisely,


assuming that the total number of feature points is R, and their corresponding locations in A and B' are $\mathbf{x}_{A,r}$ and $\mathbf{x}_{B',r}$, $r = 1, \ldots, R$, respectively, then the following energy term can be incorporated as a feature term:

$E_{\text{feature}} = \sum_{r=1}^{R} D(\mathbf{x}_{A,r}, \mathbf{x}_{B',r})$,  (4-2-11)

where $D(\mathbf{x}_{A,r}, \mathbf{x}_{B',r})$ is an appropriate distance measure such as the Euclidean distance between $\mathbf{x}_{A,r}$ and $\mathbf{x}_{B',r}$. Therefore the objective function in the local transformation is

$E_{\text{local}} = E_{\text{data}} + E_{\text{smooth}} + E_{\text{feature}}$.  (4-2-12)

The limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [Noce 1991] is used to solve the unconstrained


nonlinear optimization problem. An advantage is that the explicit evaluation of the Hessian matrix is not required, since it can be estimated recursively. Moreover, it was found to be much faster than using the conventional level-set method [Huan 2006]. Because there is no need to compute the whole Hessian matrix, the storage requirement of L-BFGS is lower than that of other conventional algorithms, such as Belief Propagation. For this reason, the L-BFGS method can deal with large problems such as 1080p-resolution images. Meanwhile, the L-BFGS method converges to a local optimum in non-convex problems under mild conditions, as demonstrated in [Noce 1991]. This is an important advantage over other conventional algorithms. Previous studies have also analyzed and demonstrated the efficiency of L-BFGS [Noce 1991], especially in terms of function evaluations.
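The following sketch illustrates how L-BFGS can optimize transformation parameters by minimizing the negative MI. It uses a simple affine warp and the histogram-based mutual_information() helper from the earlier sketch, which are simplifying assumptions; the objective here is a stand-in for $E_{\text{local}}$ in (4-2-12), and in practice a smooth kernel-based MI estimate is preferable for gradient-based optimization.

```python
# Illustrative L-BFGS maximization of MI over affine registration parameters.
import numpy as np
from scipy.optimize import minimize
from scipy.ndimage import affine_transform

def negative_mi(params, A, B):
    a1, a2, a3, a4, t1, t2 = params
    warped = affine_transform(B, np.array([[a1, a2], [a3, a4]]),
                              offset=[t1, t2], order=1)
    return -mutual_information(A, warped)   # helper from the earlier MI sketch

A = np.random.rand(128, 128)                       # placeholder reference segment
B = np.roll(A, shift=(3, -2), axis=(0, 1))         # placeholder segment to register

x0 = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])      # identity initialization
res = minimize(negative_mi, x0, args=(A, B), method="L-BFGS-B")
print("estimated affine parameters:", res.x)
```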

As mentioned before, the segmented parts are processed one by one


and they will be integrated to form the final depth map for matching one
view to the other at each time instant.

4.3 Depth Map Refinement


Compared with the traditional MI method, the segmentation-MI-based method simplifies the preservation of depth discontinuities, and the smoothing and inpainting of depth maps. This is illustrated in Fig. 4-3, where example depth maps obtained by the MI algorithm without
segmentation (Fig. 4-3(a)), with automatic segmentation (Fig. 4-3(b))
and semi-automatic segmentation techniques (Fig. 4-3(c)) are shown.
The depth maps obtained by incorporating the segmentation information (Figs. 4-3(b) and (c)) are considerably better than the one without segmentation information (Fig. 4-3(a)). Moreover, the depth discontinuities at object boundaries and smoothness at flat regions are seen to be better preserved for the semi-automatic approach, which will
significantly reduce the artifacts during rendering. However, due to
noise, occlusion and lower reliability of the matching process at low
texture areas, the resulting depth maps may still contain errors. These
issues will be addressed below through further refinement of the depth
maps.


(a)

(b)

(c)

(d)

(e)
Figure 4-3: (a) is an example depth map obtained by using MI matching without segmentation information; (b) shows the depth map obtained by using automatic segmentation MI matching; (c) shows the depth map obtained by using semi-automatic segmentation MI matching. Green areas in (c) are the occlusion areas detected by our algorithm. (d)-(e) show the refined depth maps of (c) obtained by inpainting and smoothing (c) using SK-LPR-R-ICI and a 25×25 ideal low-pass filter, respectively.


4.3.1 Occlusion Detection and Inpainting


Let the depth map obtained by the MI-based matching algorithm from the stabilized image $I_{i,t}(\mathbf{x})$ to $I_{i+1,t}(\mathbf{x})$ be $d_{i,t}^{i+1}(\mathbf{x})$, where $x_1 = 1, 2, \ldots, \Gamma_1$, $x_2 = 1, 2, \ldots, \Gamma_2$, $i = 1, \ldots, M$ and $t = 0, \ldots, N$. Similarly, a depth map can be obtained from matching $I_{i+1,t}(\mathbf{x})$ to $I_{i,t}(\mathbf{x})$, which gives $d_{i+1,t}^{i}(\mathbf{x})$. If a pixel is not occluded, then the disparity values of the same pixel in $d_{i,t}^{i+1}(\mathbf{x})$ and $d_{i+1,t}^{i}(\mathbf{x})$ should be similar to each other. If the absolute value of their difference is larger than a certain threshold, this pixel is considered to be occluded. In this work, this threshold value is chosen as two pixels. Therefore, we can obtain a refined depth map $\hat{d}_{i,t}(\mathbf{x})$ of each image after occlusion detection. For a comprehensive survey of occlusion detection algorithms, please see [Egna 2002].
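A minimal sketch of this left-right consistency (cross) check is given below. The 2-pixel threshold follows the text; the disparity sign convention and the pure horizontal-disparity assumption are illustrative simplifications.

```python
# Occlusion detection by cross-checking forward and backward disparity maps.
import numpy as np

def occlusion_mask(disp_lr, disp_rl, threshold=2.0):
    """disp_lr: disparity from view i to i+1; disp_rl: from view i+1 to i."""
    h, w = disp_lr.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    # Location of each left-view pixel in the right view (clipped to the image)
    x_right = np.clip(np.round(xs - disp_lr).astype(int), 0, w - 1)
    disp_back = disp_rl[np.arange(h)[:, None], x_right]
    # Occluded where the forward and backward disparities disagree
    return np.abs(disp_lr - disp_back) > threshold
```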
After these occluded pixels are detected, we need to inpaint the
depth values at these occlusion areas (e.g. the green areas in Fig. 4-3(c)).
The occluded areas are inpainted by interpolation using the samples inside the corresponding segments, which avoids the blurring at depth discontinuities that would occur if conventional interpolation techniques were used.
4.3.2 Smoothing of Depth Maps
As mentioned earlier, the depth map may contain invalid values due to noise, regions with low texture, etc. Therefore, the depth maps should be further smoothed to reduce such estimation errors. Here we adopt 2D LPR with adaptive bandwidth selection [Zhan 2009]. It enables us to preserve the discontinuities at object boundaries while performing smoothing in flat areas. More precisely, we treat the depth map as a 2D function $Y(x_1, x_2)$ of the coordinate $\mathbf{x} = [x_1, x_2]^T$ with $x_1 = 1, 2, \ldots, \Gamma_1$ and $x_2 = 1, 2, \ldots, \Gamma_2$:
$Y_i = m(\mathbf{X}_i) + \sigma(\mathbf{X}_i)\varepsilon_i$,  (4-3-1)

where $\{(Y_i, \mathbf{X}_i)\}$ is a set of observations with $i = 1, \ldots, n$, and $\mathbf{X}_i = [X_{i,1}, X_{i,2}]^T$ is a 2-dimensional explanatory variable. $m(\mathbf{X}_i)$ is a smooth function specifying the conditional mean of $Y_i$ given $\mathbf{X}_i$, and $\varepsilon_i$ is an independent identically distributed (i.i.d.) additive white Gaussian noise. The problem is to estimate $m(\mathbf{X}_i)$ from the noisy samples $Y_i$. Since $m(\mathbf{X}_i)$ is a smooth function, we can approximate it locally as a general degree-$p$ polynomial at a given point $\mathbf{x} = [x_1, x_2]^T$:


$m(\mathbf{X} : \mathbf{x}) = \sum_{0 \le k_1 + k_2 \le p} \beta_{k_1, k_2} \prod_{j=1}^{2} (X_j - x_j)^{k_j}$,  (4-3-2)

where $\{\beta_{k_1, k_2} : k_1 + k_2 = \nu$ and $\nu = 0, \ldots, p\}$ is the vector of coefficients. The polynomial coefficients at a location $\mathbf{x}$ can be determined by minimizing the following weighted least-squares (LS) problem:

$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} K_H(\mathbf{X}_i - \mathbf{x}) \big[ Y_i - m(\mathbf{X}_i : \mathbf{x}) \big]^2$,  (4-3-3)

where $K_H(\cdot)$ is a suitably chosen 2D kernel. When $\mathbf{x}$ is evaluated at a series of 2D grid points, we obtain a smoothed depth map from the noisy depth estimates $Y_i$. Equation (4-3-3) can be solved using the LS method and the solution is:

$\hat{\boldsymbol{\beta}}_{LS}(\mathbf{x}, h) = (\Phi^T \mathbf{W} \Phi)^{-1} \Phi^T \mathbf{W} \mathbf{Y}$,  (4-3-4)

where $\mathbf{W} = \mathrm{diag}\{K_H(\mathbf{X}_1 - \mathbf{x}), \ldots, K_H(\mathbf{X}_n - \mathbf{x})\}$ is the weighting matrix, $\mathbf{Y} = [Y_1, Y_2, \ldots, Y_n]^T$, the design matrix $\Phi$ has $i$-th row $[1,\ (\mathbf{X}_i - \mathbf{x})^T,\ \mathrm{vech}\{(\mathbf{X}_i - \mathbf{x})(\mathbf{X}_i - \mathbf{x})^T\}^T]$ (for $p = 2$),

and vech() is the half-vectorization operation. The following Gaussian


kernel is employed in this work:
$K_H(\mathbf{u}) = \dfrac{1}{h^2\, 2\pi \sqrt{|\det C^{-1}|}} \exp\!\big(-\tfrac{1}{2}\, \mathbf{u}^T C\, \mathbf{u}\big)$,  (4-3-5)

where the positive definite matrix C and scalar bandwidth h determine


respectively the orientation and scale of the smoothing. Since the Gaussian kernel is not of compact support, it should be truncated to a sufficient size $K \times K$ to reduce the arithmetic complexity. Usually C is determined from the principal component analysis (PCA) of the data covariance matrix at $\mathbf{x}$. When h is small, noise in the depth map may not be removed effectively. On the contrary, a large-scale kernel better suppresses additive noise at the expense of possible blurring of the depth maps. Here, we adopt the iterative steering kernel regression (ISKR) method in [Zhan 2009], which was shown to have a better performance than the conventional symmetric kernel [Zhan 2009], especially along image edges. In the ISKR method, the local scaling parameter is obtained as $h_i = h_0 \mu_i$, where $h_0$ and $\mu_i$ are respectively the global smoothing parameter and the local scaling parameter. The scale selection process is fully automatic and it can be performed by using the data-driven adaptive scale selection method with the refined intersection of confidence intervals (R-ICI) rule used in [Zhan 2008]. The resulting method is called the SK-LPR-R-ICI algorithm and more details can be

(a)

(b)

(c)

(d)

Figure 4-4: (a) and (b) show the renderings obtained using the depth maps in Figs. 4-3(d) and (b); (c) and (d) are enlargements of the red boxes.

found in [Zhan 2009]. The depth map smoothed by SK-LPR-R-ICI is denoted by $\tilde{d}_{i,t}(\mathbf{x})$.
As an illustration, we also smooth the example depth map by a 25×25 ideal low-pass filter and the result is shown in Fig. 4-3(e).

Comparing Fig. 4-3(e) with Fig. 4-3(d), we can see that the discontinuity
of the object boundaries using SK-LPR-R-ICI is well preserved, while
the object boundaries are blurred by the lowpass filter due to its fixed
size and relatively large support for noise suppression. In order to
illustrate the effect of these errors in the depth maps on the rendering
qualities, example renderings are also shown in Figs. 4-4(a) and (c) according to the depth maps obtained from Figs. 4-3(d) and (b), respectively. It can be seen that inaccurate depth values produce obvious
distortion of the light pole in Figs. 4-4(c) and (d). By combining this

new MI-based depth estimation algorithm with our moveable IBR


system, we can obtain the depth maps (Figs. 4-5(a) and 4-5(c)) and synthesized views (Fig. 4-5(b) and Figs. 4-6(b) and (c)) at nearby locations of the trajectory in indoor/outdoor environments.



(a)

(b)

(c)
Figure 4-5: Rendering results obtained by the proposed algorithm. (a) shows the depth maps corresponding to the images in (b). The highlighted images in (b) show the rendered views synthesized from the adjacent views in (b) using the depth maps in (a). (c) shows depth maps at other positions.


Figure 4-6: Example rendering results. The first row shows the original images
captured by our M-IBR system. The second and third rows show renderings with a
step-in ratio of about 1.15 to 1.25 times.

4.4 Mutual Information (MI)-based Object Tracking


As mentioned earlier, the MI-based matching algorithm can also be simply extended to object tracking. At first we segment the object at a given frame. Then a binary mask $Mask(x_1, x_2)$ is obtained, where $1 \le x_1 \le \Gamma_1$, $1 \le x_2 \le \Gamma_2$. $Mask(x_1, x_2)$ is the intensity of the mask and $(\Gamma_1, \Gamma_2)$ is the resolution of the mask. The values of pixels in the binary mask are 1 if they are inside the object; on the contrary, the values are 0 if they are outside the object. Only the part inside the object takes part in the MI optimization. More precisely, $T_L(B')$ used in the local transformation is replaced by $T_L(B') \cdot Mask(x_1, x_2)$. The solution procedure is the same as the MI-based matching. Example video tracking results are shown in Fig. 4-7.
More results are shown in Section 4.5.


Figure 4-7: Object tracking at different time instances.

4.5 More Results and Comparison


We now present and further evaluate the experimental results of the proposed algorithm. The testing is performed on an Intel Core i7 990x CPU-based computer with 6GB RAM and GTX580 GPU acceleration. The resolution and the frame rate of the videos are 1920×1080i and 25 fps, respectively. A demonstration video of our video stabilization algorithm can be found at: http://www.youtube.com/watch?v=qPuMNjgUoWs.
The segmentation-MI-based matching algorithm has been evaluated extensively on the stereo test image sets at the Middlebury stereo page and on our outdoor plenoptic video podium. Fig. 4-8 shows the stereo images, the ground truth depth map and the depth maps calculated by our method for the Teddy test images (450×375) [Scha 2002]. Table 4-1 is a reproduction of the upper part of the evaluation at the Middlebury stereo pages. The evaluation value in the table is the percentage of "bad" pixels, i.e., pixels whose absolute disparity error is greater than some threshold. A standard threshold of 1 pixel has been used in Table 4-1. The segmentation-MI-based matching is among the best performing stereo algorithms in the upper part of the table, with the semi-automatic and automatic versions ranking fourth and sixth, respectively. The performance difference between our algorithm and the top algorithm is very small. Moreover, our algorithm is very stable and insensitive to versatile data sets such as real data sets, and there are not many parameters which need to be selected carefully. A simple evaluation sketch for the bad-pixel metric is given below.
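The sketch below computes the Middlebury-style bad-pixel percentage; the handling of invalid ground-truth pixels is an illustrative assumption.

```python
# Percentage of pixels whose absolute disparity error exceeds a threshold.
import numpy as np

def bad_pixel_percentage(estimated, ground_truth, threshold=1.0, valid=None):
    err = np.abs(estimated.astype(float) - ground_truth.astype(float))
    if valid is None:
        valid = np.isfinite(ground_truth)      # evaluate only where GT exists
    return 100.0 * np.count_nonzero(err[valid] > threshold) / np.count_nonzero(valid)
```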


(a)

(b)

(c)

(d)

(e)

Figure 4-8: Teddy test images [Scha 2002] and depth maps for comparison. (a)
LEFT image; (b) RIGHT image; (c) ground truth depth map; (d) depth map
calculated by semi-automatic segmentation-based MI matching; (e) depth map
calculated by automatic segmentation-based MI matching.

Algorithm                      Rank   Tsukuba   Venus   Teddy   Cones
Adapting BP [Klau 2006]         6.7     1.37     0.21    7.06    7.92
CoopRegion [Wang 2008]          6.7     1.16     0.21    8.31    7.18
DoubleBP [Yang 2009]            8.8     1.29     0.45    8.30    8.78
Our Method (Semi-Seg)           9.6     1.30     0.18    5.10    8.88
OutlierConf [Xu 2008]           9.8     1.43     0.26    9.12    8.57
Our Method (Auto-Seg)          12.2     1.30     0.24    7.91    8.88
SubPixDoubleBP [Yang 2007]     13.2     1.76     0.46    8.38    8.73
SurfaceStereo [Bley 2010]      13.8     1.65     0.28    5.10    7.95

Table 4-1: Comparison of the rank using a standard threshold of 1 pixel on the Middlebury test stereo images.

(a)

(b)

(c)

(d)

Figure 4-9: Results for the "conference" sequence. (a) and (c) are two sample frames. (b) and (d) are the depth maps of (a) and (c), respectively.

Fig. 4-9 shows more results for another sequence, "conference", where a person is conducting a conference presentation. The M-IBR system is used to track the motion of the speaker.
This algorithm can also be applied to automatically extract the continuous cross-sectional area (CSA) changes in noisy dynamic ultrasound image sequences [Chen 2012]. The sequence consists of a series of sentinel ultrasound images of the rectus femoris (RF) muscle under various testing conditions. The objective is to extract the RF muscle to study its relationship with the torque generated. To segment the RF muscle over time, the first image in the sequence was selected as the reference and the boundary of the RF muscle was outlined using Lazy snapping [Li 2004]. Then the proposed MI-based matching algorithm was applied to track the boundaries in the subsequent images. Example tracking results are shown in Fig. 4-10. It was found that the boundary of the RF muscle can be tracked satisfactorily even in noisy environments,

which demonstrates the usefulness of the proposed algorithm for image


matching applications.

Figure 4-10: Ultrasound images of the RF muscle under the relaxed condition and at the 50% maximal voluntary contraction (MVC) level, and the corresponding images with outlined boundary contours. The tracked boundaries are highlighted in green.

4.6 Summary
A new iterative segmentation and mutual-information (MI)-based
algorithm for dense depth map estimation is presented. It supports both
semi-automatic and automatic segmentation methods, which rank fourth and sixth respectively in the Middlebury comparison. It combines


segmentation, LPR-based depth map smoothing and MI-based matching
algorithm to iteratively estimate dense depth maps. This allows high
quality renderings of outdoor and indoor scenes with improved viewing
freedom to be obtained. It can also be simply extended to object tracking.
Experimental results show that high quality renderings and robust
tracking results in noisy dynamic ultrasound image sequences can be
achieved. One possible direction for future work is to parallelize the whole algorithm using graphics processing units (GPUs), as parallel computing will substantially increase the processing speed and facilitate real-time applications.


Chapter 5 3D Reconstruction and Modeling


5.1 Introduction
3D model reconstruction is to compute the geometry of real world
objects or scenes from digital images. Multiview reconstruction
algorithms can be broadly categorized into three classes. The first class forms an energy function on a voxel grid or volume. Then 3D level-set [Kutu 2000], [Slab 2004], [Zeng 2005] and graph cut [Kolm 2001] methods are used to minimize the energy function, and the voxel grid or volume is deformed to fit a surface. The second class uses image-based methods to estimate depth maps [Szel 1999], [Kolm 2002], [Zitn 2004]. These methods register multiple depth maps together to form a consistent depth map for 3D reconstruction. The third class consists of algorithms [Mane 2000], [Tayl 2003] that find a set of corresponding feature points. These feature points are used to form sparse 3D point clouds, and 3D models are then fitted to the sparse point clouds.
A full comparison of the results of most reconstruction algorithms can be found at http://vision.middlebury.edu/mview/eval/. The disadvantages of many algorithms are that too many views are needed in
order to construct a high-resolution model and the time required is very high. In this chapter, a new 3D reconstruction algorithm for objects, which uses dense depth maps to obtain dense point correspondences from multiple views, is presented.
3D reconstruction, the camera calibration which has been presented in
Chapter 3.3.1 is first performed to determine the intrinsic parameters as
well as their extrinsic parameters, i.e. their relative positions and poses.
For the proposed moveable IBR system, the structure-from-motion

(SFM) method [Hart 2003] will be used to determine the extrinsic parameters. Because the system is moveable, the extrinsic parameters cannot simply be obtained by the plane-based camera calibration. The whole 3D reconstruction module consists of 1) point matching, 2) 3D mesh generation and 3) 3D rendering.
For our two IBR systems, the methods of computing the
correspondent points are somewhat different. In the still IBR system
introduced in Chapter 3.2.1, we only have a few images from multiple
views. Therefore, one major technique to find the correspondent points is epipolar geometry [Hart 2003], which constrains the corresponding points to lie on the conjugate epipolar lines. Meanwhile, by combining epipolar lines with scale-invariant feature transform (SIFT) [Lowe 2004] feature detection, accurate sparse correspondent points can be located. Then a Gabor filter, which is insensitive to noise, is used to obtain the dense correspondent points line by line. The main advantage of this method is that it is relatively simple and fast.
In the moveable IBR system introduced in Chapter 3.2.2, the
system is driven around the object to obtain sufficient correspondences
from different views. Because a lot of data must be integrated together
for 3D reconstruction, the sequential-structure-from-motion (S-SFM)
technique is first adopted to estimate the locations of the M-IBR system
so as to obtain an initial set of fairly reliable 3D point cloud from the 2D
correspondences.

New segmentation-MI-based and Kalman filter (KF)-based iterative algorithms are proposed to fuse the correspondences from different views and remove possible outliers to obtain an improved point cloud. More precisely, the proposed algorithm
relies on the KF to track the correspondences across different views so

as to suppress possible outliers while fusing correspondences from


different views. With these reliable matched points, the camera
parameters and hence the image correspondences can be further refined
by re-projecting the updated correspondences to successive views to
serve as prior features/correspondences for MI-based matching. By
iterating these processes, an improved point cloud with reliable
correspondences can be recovered. Simulation results show that the
proposed algorithm significantly reduces the adverse effect of the
outliers and generates a more reliable point cloud.
To recover the 3D model from the point cloud, a new robust RBF-based modeling algorithm is proposed to further suppress possible
outliers and generate smooth 3D meshes from the raw 3D point cloud.
Compared with the conventional RBF-based smoothing, it is more
robust and reliable.
After we have reconstructed the 3D model, the texture must be
mapped back to the model for 3D rendering. A novel view-dependent
texture mapping is proposed in this chapter. In traditional texture mapping used in computer graphics, the texture is usually a single image. Even when several images are used as the texture for an object, they are simply registered into a bigger one. In our system, multiple images are captured from different angles. The lighting, shadow and shading are all different in the different textures. If we just register them together, the change of lighting, shadow and shading may not be smooth and the texture will be blurred. View-dependent texture mapping instead assigns weighting factors to the different texture maps to form new blended texture maps depending on the virtual viewing angle. The new blended texture maps can change smoothly with the viewing angle.
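A minimal sketch of this idea is given below: each source texture is weighted by how closely its capture direction agrees with the virtual viewing direction. The cosine-power weighting is an illustrative choice, not the exact scheme used in this work.

```python
# View-dependent blending of texture maps by viewing-angle agreement.
import numpy as np

def view_dependent_blend(textures, capture_dirs, view_dir, sharpness=8.0):
    """textures: list of H x W x 3 images; capture_dirs/view_dir: 3-vectors."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    weights = []
    for d in capture_dirs:
        d = d / np.linalg.norm(d)
        cos_angle = max(np.dot(d, view_dir), 0.0)   # ignore back-facing cameras
        weights.append(cos_angle ** sharpness)
    weights = np.array(weights)
    weights /= weights.sum() + 1e-12
    blended = sum(w * t.astype(np.float64) for w, t in zip(weights, textures))
    return blended.astype(np.uint8)
```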

This chapter is organized as follows: Section 5.2 gives an overview


of homogeneous geometry and describes how to compute 3D points
from multiple views. The method of finding matching points in the still
IBR system is presented in Section 5.3. The view dependent texture
mapping algorithm is presented in Section 5.4. Section 5.5 is devoted to
the Kalman filter (KF)-based outlier detection and iterative point cloud
fusion used in the moveable IBR system. RBF modeling and mesh
generation will be studied in Section 5.6. Finally, conclusions are drawn in Section 5.7.

5.2 Homogeneous Geometry


A line in 2D space can be represented by an equation $ax + by + c = 0$. Thus the line can also be represented by the vector $[a, b, c]^T$, because the vectors $[a, b, c]^T$ and $k[a, b, c]^T$ represent the same line ($k \neq 0$). An equivalence class of vectors under this equivalence relationship is known as a homogeneous vector. The set of equivalence classes of vectors in $\mathbb{R}^3$ (except $[0, 0, 0]^T$) forms the projective space $\mathbb{P}^2$. Following the same idea, a point $X = [x, y]^T$ lying on the line $L = [a, b, c]^T$ can be represented as $[x, y, 1]^T$. For any non-zero constant $k$, the equation $[kx, ky, k] L = 0$ holds. The set of vectors $[kx, ky, k]^T$ for different $k$ values can be represented as the point $[x, y]^T$ in $\mathbb{R}^2$. An arbitrary homogeneous vector representative of a point is of the form $X = [x, y, z]^T$, $z \neq 0$, representing the point $[x/z, y/z]^T$ in $\mathbb{R}^2$. If $z = 0$, the vector $X = [x, y, 0]^T$ represents a point at infinity. Points as homogeneous vectors are elements of $\mathbb{P}^2$. Homogeneous geometry is very useful for translating between 2D and 3D. More details can be found in [Hart 2003].

As mentioned in Chapter 3.3.1, the camera matrix has been


calibrated. In the stereo view system, if the projective matrices $P_1$ and $P_2$ of both cameras are known, we have

$\mathbf{x}_1 = P_1 X$,  (5-2-1)
$\mathbf{x}_2 = P_2 X$,  (5-2-2)

where $\mathbf{x}_1$ and $\mathbf{x}_2$ are a pair of matching points in the two images and $X$ is the 3D projective point corresponding to $\mathbf{x}_1 = (x_1, y_1)$ and $\mathbf{x}_2 = (x_2, y_2)$. The linear triangulation method [Hart 2003] is used to compute the point $X$. It rewrites equations (5-2-1) and (5-2-2) as

$\mathbf{x}_1 \times P_1 X = 0$,  (5-2-3)
$\mathbf{x}_2 \times P_2 X = 0$,  (5-2-4)

where $\times$ is the vector cross product. It then combines and rewrites equations (5-2-3) and (5-2-4) as the linear system

$A X = 0$,  (5-2-5)

$A = \begin{bmatrix} x_1 p_1^{3T} - p_1^{1T} \\ y_1 p_1^{3T} - p_1^{2T} \\ x_2 p_2^{3T} - p_2^{1T} \\ y_2 p_2^{3T} - p_2^{2T} \end{bmatrix}$,  (5-2-6)

where $p_1^{iT}$ and $p_2^{iT}$ are the $i$-th rows of $P_1$ and $P_2$, respectively.


Because the camera parameters and the point positions may contain
noise, Newton iteration will be used to solve (5-2-5) for X.
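For illustration, the sketch below builds the matrix A of (5-2-6) and takes its right null vector via SVD as the homogeneous 3D point. The SVD solution is the standard linear initialization; the Newton refinement mentioned in the text is omitted here.

```python
# Linear (DLT) triangulation of one point from two calibrated views.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: 3x4 projection matrices; x1, x2: matched pixels (x, y)."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],     # x1 * p1^{3T} - p1^{1T}
        x1[1] * P1[2] - P1[1],     # y1 * p1^{3T} - p1^{2T}
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                     # homogeneous solution of A X = 0
    return X[:3] / X[3]            # Euclidean 3D point
```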


5.3 Point Matching in the Still IBR System


5.3.1 Epipolar Geometry
The epipolar geometry between two images is the geometry of the
intersection of the image planes with the pencil of planes having the
baseline as axis. The baseline is the line connecting the camera centers
[Hart 2003]. See Fig. 5-1 for example.


Figure 5-1: Epipolar Geometry.

The epipole ( e1 and e2 in Fig. 5-1) is the point of the intersection


of the baseline with the image plane. The epipolar plane (plane
containing C1 , C 2 , X in Fig. 5-1) is a plane containing the baseline. The
epipolar line (lines ( x1 , e1 ) and ( x2 , e2 ) in Fig. 5-1) is the intersection of
an epipolar plane with the image plane. The relationship between the points in the images and the epipolar lines can be described by a matrix F called the fundamental matrix. If the epipolar lines are written as vectors $L_1$ and $L_2$ and the two correspondent points are $\mathbf{x}_1$ and $\mathbf{x}_2$, then

$F \mathbf{x}_1 = L_2$,  (5-3-1)

$F^T \mathbf{x}_2 = L_1$.  (5-3-2)

Since $\mathbf{x}_2$ lies on $L_2$, combining (5-3-1) and (5-3-2) gives

$\mathbf{x}_2^T F \mathbf{x}_1 = 0$.  (5-3-3)

A seven-point algorithm has been proposed in [Hart 2003] for estimating the fundamental matrix from corresponding points. Because the fundamental matrix F is a rank-2 matrix, only seven pairs of correspondent points are needed to estimate F. Moreover, random sample consensus (RANSAC) is applied in this algorithm to make the estimation more robust. The disadvantage of this algorithm is that it may produce up to three solutions, which need to be examined to select the one with the minimum error.
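The sketch below illustrates robust fundamental matrix estimation from SIFT correspondences. OpenCV's RANSAC-based estimator is used here as a stand-in for the maximum likelihood estimation consensus paradigm [Torr 2000] adopted in the next subsection; the ratio-test and threshold values are illustrative assumptions.

```python
# Robust fundamental matrix estimation from SIFT matches (illustrative).
import cv2
import numpy as np

def estimate_fundamental(img1_gray, img2_gray):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1_gray, None)
    kp2, des2 = sift.detectAndCompute(img2_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                            ransacReprojThreshold=1.0)
    # Epipolar line in image 2 of a point x1 in image 1: L2 = F x1
    return F, pts1[inlier_mask.ravel() == 1], pts2[inlier_mask.ravel() == 1]
```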
5.3.2 Finding Correspondent Points
Given two adjacent views for stereo matching, the correspondent
points are estimated with epipolar constraint as follows:
1. Detect the feature points using the scale-invariant feature transform (SIFT) detector [Lowe 2004]. Then the root mean square of the intensity difference between windows centered at every two feature points is employed to measure the similarity and find correspondent feature points (see Fig. 5-2).
2. Use the maximum likelihood estimation consensus paradigm [Torr 2000] to obtain the fundamental matrix, which is more robust than RANSAC. The epipolar lines are computed from the fundamental matrix (see Fig. 5-3). Then rectify the two images to facilitate matching over two corresponding horizontal lines from the two images (see Fig. 5-4).

77

3. Use the Gabor filter to extract the correspondent points on the


epipolar lines. More precisely the images will pass through Gabor
filters with different orientations and scales. Then all the filtered
images will be combined to form a measurement matrix M ( x, d ) to
find the correspondent points for a point at x with disparity d at the
epipolar line of the image. The Gabor filter-based measurement will
make the result insensitive to changes in contrast between the two
images. For example, Gabor function centered at x0 0, y0 0
having a frequency f0 and orientation 0 is given by:
g0,0, f0 ,0 ( x, y)
exp( ( x y )) exp(i 2f 0 ( x cos 0 y sin 0 ))
2

(5-3-4)

where determines the spatial frequency bandwidth. The images I1


and I 2 , which pass through the filter, are represented by:
~
I1, f , g 0, 0, f , I1 ,
~
I 2, f , g 0, 0, f , I 2 .

(5-3-5)

According to [Ogal 2007], the measurement of local matching is


defined as
M ( x, d ) W ( x, d ) H ( x, d ) (1 W ( x, d )), where
1
H ( x, d ) cos( f ,d ( x)),
n n
W ( x, d ) exp( P( x, d )),
~ ~
P ( x, d ) I1, f , I 2*, f , ,
m

i cos( f , d ( x ))

~ ~
I1, f , I 2*, f ,
~ ~* .
I1, f , I 2, f ,

78

(5-3-6)

n is the number of orientations used. m is the number of


~
~
frequency and d is the shift range. I 2*, f , is the conjugate of I 2, f , .
f ,d ( x) is the phase difference between the two images using the

responses of a particular filter with frequency f at position x and


disparity d. In our experiments n 4 and m 4 . Because the two
cameras are very close, the disparity of the two matched point is set
in a small range in the x direction rectified. The measurement of the
Gabor filter provides a score of all candidates in the disparity range d.
Moreover the measurement of the Gabor filter is also chosen to
evaluate the confidence of each pair of correspondent points. If the
confidence is higher than a threshold value, we think this pair of
correspondent points is good. If the confidence is lower than the
threshold value, this pair of correspondent points will not be further
considered. The reasons of are usually due to lacking of local feature
in that area and occlusions. After that, the original matched lines will
be separated into several short matched line segments. The points on
each short line segment will be matched propotionally to the
matched short line segment in the other image. Of course, these
matched points are not very accurate. The error caused by these
points will be minimized in the reconstruction step. Up to now, the
matching is still based on two adjacent views. Lets extend it to
multiple view-based matching. For example, point A1 is in the first
view and point A2 is the correspondent point of A1 in the second
view. The confidence of correspondent pair (A1, A2) is defined as
C(A1, A2) which is larger than the threshold value. The subscript of
A means view numbers. Let point A3 is the correspondent point of A2
in the third view and so on. Repeat this step until that C(Ai, Ai+1) is
79

less than the threshold value or all the views have been searched. In
our experiments the threshold value is 0.5. The final averaged
confidence for the correspondent points which pass the threshold for
all views

1 N
C ( Ai , Ai1 ) ,
N 1 i 1

(5-3-7)

where N is the number of views.


4. After all the correspondent points are found, the 3D positions will be
obtained by using the projective matrix which has been presented in
Chapter 5.2. An example point cloud is shown in fig. 5-5. It can be
seen that there is still some noise and outliers in this point cloud. A
robust radial basis function modeling method will be used to smooth
this point cloud and generate the resultant 3D mesh. The details of
this robust RBF will be presented in Chapter 5.5.
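The following is a minimal sketch of the Gabor-based matching measure, assuming grayscale rectified images stored as numpy arrays and a small filter bank ($n = m = 4$ as in the experiments). It evaluates only the phase-agreement term $H(x, d)$ of (5-3-6) for candidate disparities along a scanline; the function names and parameter values are illustrative.

import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(f, theta, sigma=4.0, size=21):
    """Complex Gabor kernel of equation (5-3-4)."""
    r = np.arange(size) - size // 2
    xx, yy = np.meshgrid(r, r)
    envelope = np.exp(-(xx**2 + yy**2) / sigma**2)
    carrier = np.exp(1j * 2 * np.pi * f * (xx * np.cos(theta) + yy * np.sin(theta)))
    return envelope * carrier

def gabor_responses(img, freqs, thetas):
    """Filter the image with every (frequency, orientation) pair, as in (5-3-5)."""
    return {(f, t): fftconvolve(img, gabor_kernel(f, t), mode="same")
            for f in freqs for t in thetas}

def phase_score(resp1, resp2, y, x, d):
    """Average phase agreement H(x, d) between pixel (y, x) in image 1
    and pixel (y, x + d) in image 2, following (5-3-6)."""
    cos_terms = []
    for key in resp1:
        prod = resp1[key][y, x] * np.conj(resp2[key][y, x + d])
        if np.abs(prod) > 1e-12:
            cos_terms.append(np.real(prod) / np.abs(prod))  # cos of the phase difference
    return np.mean(cos_terms)

# Usage sketch: pick the disparity with the highest phase score.
# freqs, thetas = [0.05, 0.1, 0.2, 0.4], np.pi * np.arange(4) / 4
# r1, r2 = gabor_responses(I1, freqs, thetas), gabor_responses(I2, freqs, thetas)
# best_d = max(range(d_min, d_max), key=lambda d: phase_score(r1, r2, y, x, d))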

Figure 5-2: Feature Points Detection. The red points in (a) and (b) are the feature points; (a) is from the first camera and (b) is from the second camera.

Figure 5-3: Epipolar Line.

Figure 5-4: Rectified images. (a) is the rectified left image and (b) is the rectified right image; (c) is a part of (a) and (d) is a part of (b).

Figure 5-5: An initial point cloud extracted with noise and outliers.

5.4 View Dependent Texture Mapping


View dependent texture mapping was first introduced by Debevec
et al. [Debe 1998] for computer graphics. The traditional view dependent
texture mapping is used to obtain novel views of a scene while only
using a set of texture maps and a mesh of the object geometry.
Conventional computer graphics relies on an accurate 3D model and a few texture maps for animating the appearance of physical objects. Usually the overlap between the texture maps is very small.
In image based rendering, multiple views of the imaged objects are
available. Therefore, texture maps at adjacent views can be appropriately
combined to form a new texture map. In practice, two images, which are
closest to the present view position, will be chosen to perform texture
mapping in real time (see Fig. 5-6). In addition, the two images will be
multiplied by weighting factors and alpha maps. In our case the
weighting factor is the angle between the view position and the camera
position.


Figure 5-6: View Dependent Texture Mapping.

The alpha maps are produced by the matting described in Chapter 3.3.2. The basic blending function of view dependent texture mapping is:

$Tex(x_i, y_i) = \dfrac{\beta}{\alpha+\beta}\,Tex_1(x_i, y_i) + \dfrac{\alpha}{\alpha+\beta}\,Tex_2(x_i, y_i)$,   (5-4-1)

where $(x_i, y_i)$ are the texture coordinates, $Tex(x_i, y_i)$ is the intensity of the blended texture map, $Tex_k(x_i, y_i)$ is the intensity of the texture map from camera $k$, and $\alpha$ and $\beta$ are the angles between the viewer and camera 1 and camera 2, respectively, as shown in Fig. 5-6.
Because of the point cloud generation step in Chapter 5.3, the 3D points on the model cannot be projected back to the 2D images consistently, since they have been fused and refined. This results in a deviation of the texture coordinates on the two images, and after mapping directly with (5-4-1) the texture is blurred, as shown on the left-hand side of Fig. 5-7. To solve this problem, we shift the texture coordinates so that the two images are blended consistently. The blending equation in (5-4-1) is modified as

$Tex(x_i, y_i) = \dfrac{\beta}{\alpha+\beta}\,Tex_1(x_i, y_i) + \dfrac{\alpha}{\alpha+\beta}\,Tex_2(x_i + \Delta x_i,\, y_i + \Delta y_i)$,   (5-4-2)

where $(\Delta x_i, \Delta y_i)$ denote the shifts of the texture coordinates.


Without loss of generality, we choose the left-view image as the reference image and only compute the shifts for the texture coordinates of the right-view image. The corresponding feature points obtained in Chapter 5.3 are used to determine the shifts. Let the two sets of corresponding feature points be $\{FP_1(x_p, y_p) \mid p = 1, \dots, \kappa\}$ and $\{FP_2(x_q, y_q) \mid q = 1, \dots, \kappa\}$, where $\kappa$ is the number of feature points. The shifts at the corresponding feature points are defined as $\Delta x_q = x_p - x_q$ and $\Delta y_q = y_p - y_q$. B-spline interpolation [Sand 1987] is then employed to obtain the shifts of the remaining points from those at the chosen feature points. To render the image more naturally when the view changes, we further incorporate the viewing angles into (5-4-2) as

$Tex(x_i, y_i) = \dfrac{\beta}{\alpha+\beta}\,Tex_1\!\Big(x_i - \tfrac{\alpha}{\alpha+\beta}\Delta x_i,\; y_i - \tfrac{\alpha}{\alpha+\beta}\Delta y_i\Big) + \dfrac{\alpha}{\alpha+\beta}\,Tex_2\!\Big(x_i + \tfrac{\beta}{\alpha+\beta}\Delta x_i,\; y_i + \tfrac{\beta}{\alpha+\beta}\Delta y_i\Big)$.   (5-4-3)

In this way, the shift of each texture coordinate is modified gradually according to the user's viewing angle. The right-hand side of Fig. 5-7 shows an example of the improved texture using (5-4-3). Consequently, viewers have a better visual experience because the textures change appropriately when the object is observed from different directions.
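The following is a minimal numpy sketch of the angle-weighted blending with shifted texture coordinates described above. The normalized weight form used here follows the reconstruction of (5-4-1) and (5-4-3) given in the text and is an assumption of this sketch; bilinear sampling handles the fractional shifts.

import numpy as np
from scipy.ndimage import map_coordinates

def blend_view_dependent(tex1, tex2, dx, dy, alpha, beta):
    """Blend two texture maps for the current viewing direction.

    tex1, tex2 : (H, W) texture intensities from cameras 1 and 2.
    dx, dy     : (H, W) per-texel coordinate shifts (from B-spline interpolation).
    alpha,beta : angles between the viewing direction and cameras 1 and 2.
    """
    h, w = tex1.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    w1, w2 = beta / (alpha + beta), alpha / (alpha + beta)
    # Shift each texture by an angle-dependent fraction of the total shift,
    # so the blend degenerates to the unshifted camera image at either end.
    s1 = map_coordinates(tex1, [yy - w2 * dy, xx - w2 * dx], order=1, mode="nearest")
    s2 = map_coordinates(tex2, [yy + w1 * dy, xx + w1 * dx], order=1, mode="nearest")
    return w1 * s1 + w2 * s2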


Figure 5-7: View Dependent Texture Mapping. Left: blurred texture. Right: texture after the proposed view dependent texture mapping.

Experimental Results
Using the circular still-camera array constructed in Chapter 3.2.1, we have captured 7 ancient Chinese artifacts from the University Museum and Art Gallery at the University of Hong Kong. Fig. 5-8 shows several examples of the reconstructed 3D models of the captured artifacts. The artifacts can also be displayed on a Newsight multiview TV, which can display 9 views at a time. In order to improve the final rendering effect, the 3D models are put into a scene with lighting; lighting and shadows make the objects appear more realistic. Although OpenGL provides several APIs to simulate basic lighting, they are only suitable for ideal point sources and cannot generate shadows. Therefore, we use the shadow field method [Zhou 2005] to relight the scene. The basic idea of this algorithm is similar to Pre-computed Radiance Transfer [Sloa 2002]. The environment map is used as the global incident light, which makes the scene much more realistic. The source radiance fields for the lights and the object occlusion fields for the occluders in the scene are generated, so that the interaction of all the objects in the scene can be calculated in real time, i.e. the soft shadows of moving objects can be generated quickly. All the data, including the visibility function of each vertex, are compressed considerably with low error using spherical harmonics.
The testing is performed on an Intel Core i7 990X CPU-based computer with 4GB RAM and GTX580 GPU acceleration. The rendering and display are accelerated to a speed of 60 frames per second.
Fig. 5-9 gives example renderings using the reconstructed 3D model and the shadow field [Zhou 2005] in OpenGL, which supports real-time relighting and object movement with soft shadows.

Figure 5-8: 3D models of Ancient Chinese Artifacts. (a) Dragon Vase, (b) Buddha, (c) Green Bottle, (d) Bowl, (e) Brush Pot, (f) Tri-Pot, (g) Wine Glass.


Figure 5-9: Rendering Results of Ancient Chinese Artifacts.

5.5 Point Matching and Refinement in the Moveable IBR System
5.5.1 Structure-from-motion
In order to perform 3D reconstruction with the M-IBR system, the cameras must be calibrated to determine their intrinsic parameters as well as their extrinsic parameters, i.e. their relative positions and poses. In this work, we employ the plane-based calibration method [Zhan 2000] and the structure-from-motion (SFM) method [Hart 2003] to determine the camera projection matrices of our M-IBR system, which connect the world coordinates and the image coordinates. This is accomplished by using a sufficiently large checkerboard calibration pattern to determine the intrinsic parameters, so that only the extrinsic parameters need to be computed by structure-from-motion. This greatly reduces the degrees of freedom of the camera projection matrices and hence improves the accuracy of the calibration.
SFM combined with self-calibration is a useful geometry
reconstruction method, which can estimate 3D object positions and
projection matrices without any prior knowledge of the camera motion
and structure of the scene. Sequential methods (S-SFM) and
factorization methods (F-SFM) are two commonly used approaches in
SFM. S-SFM works with each view sequentially by incorporating the
results obtained in previous views. In contrast, F-SFM works by
computing camera pose and scene geometry using all image
measurements simultaneously. F-SFM is in principle more accurate once
it converges to the global minimum, but it requires accurate initialization
in high dimensions and is computationally more expensive. In practice,
sequential methods are usually adopted and the factorization method can
be applied as a refinement if necessary. Moreover, most factorization
methods only assume simplified linear camera models, e.g. orthographic,
weak perspective and para-perspective. Therefore, we shall employ the
S-SFM method in this paper.
Our S-SFM algorithm consists of four major steps: i) tracking of
2D feature points in the whole image sequence using SIFT; ii)
determination of an initial solution for the camera motion (extrinsic
parameters), since the intrinsic parameters are known from calibration;
iii) extending and optimizing the solution for every additional view and
iv) optimization of the camera motion globally using bundle adjustment.

A comprehensive summary of the S-SFM algorithm can be found in [Hart 2003].
After S-SFM, cameras are fully calibrated. In addition, an initial 3D
point cloud of the object in the scene can be obtained. However, in order
to reconstruct a more accurate 3D model, we need to refine the 3D point
cloud to remove outliers etc.
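As an illustration of step ii), the following is a minimal sketch of recovering an initial relative camera pose from tracked SIFT correspondences when the intrinsic matrix K is known from the checkerboard calibration. Using OpenCV here is an assumption of this sketch rather than the toolchain of the thesis.

import cv2
import numpy as np

def initial_relative_pose(img1, img2, K):
    """Estimate an initial relative camera pose (R, t) from two views,
    assuming the intrinsic matrix K is known from calibration."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    # Ratio-test matching of SIFT descriptors.
    matcher = cv2.BFMatcher()
    matches = [m for m, n in matcher.knnMatch(des1, des2, k=2)
               if m.distance < 0.75 * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Essential matrix with RANSAC, then decompose into rotation and translation.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t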
5.5.2 Point Cloud Generation and Refinement: KF-based Outlier
detection and Point Cloud Fusion
After the dense depth map estimation, a set of point correspondences from multiple views is obtained. For 3D reconstruction of stationary objects, the M-IBR system can be driven
around them to obtain more views for reconstruction. Using S-SFM
technique, the camera projection matrices of the M-IBR system can be
estimated. This allows a set of 3D points to be computed from their
correspondences in adjacent views through triangulation [Hart 2003].
More precisely, from the depth map between views i and i+1, one can
get a set of corresponding image points from views i and i+1 and their
3D locations with the help of the estimated camera parameters and
triangulation. To determine a more accurate location of a 3D point which is visible to all cameras, we need to track its correspondences across multiple views. Suppose that we start with a correspondence between views 1 and 2, and let its estimated 3D location obtained by triangulation be $z(1)$. Using the depth map between views 2 and 3, one can determine another correspondence of this point and its estimated location $z(2)$. By continuing this operation repeatedly for
subsequent views, we get more estimates $z(i)$ from views $i$ and $i+1$, for $i = 1, \dots, M-1$, where $M$ is the total number of views. An example set
of 3D points obtained is shown in Fig. 5-10(a). However, outliers may


exist due to errors in the segmentation-MI-based matching, errors in estimating the projection matrices in S-SFM, occlusions, etc. Therefore, the estimated locations cannot simply be averaged. In this work, a Kalman-filter-based method is proposed to track the location $z(i)$ and detect
possible outliers so that the point cloud can be refined by fusing different
views. Moreover, an iterative method for further refining the point cloud
will be introduced.

Figure 5-10: Iterative refinement of point cloud: (a) initial point cloud. (b) point cloud after outlier detection and Kalman filtering. (c) point cloud after the proposed iteration method.

Kalman filter (KF)-based outlier detection and Point Cloud Fusion

The KF is the minimum mean-square state estimator of the following linear state-space model with Gaussian innovation and measurement noise:

$x(t) = F(t)\,x(t-1) + w(t)$,   (5-5-1)

$z(t) = H(t)\,x(t) + v(t)$,   (5-5-2)

where $x(t) \in \mathbb{R}^n$ and $z(t) \in \mathbb{R}^m$ are respectively the state vector and the observation vector at time $t$, $F(t) \in \mathbb{R}^{n \times n}$ and $H(t) \in \mathbb{R}^{m \times n}$ are the state transition and observation matrices, and the innovation $w(t) \in \mathbb{R}^n$ and measurement noise $v(t) \in \mathbb{R}^m$ are zero-mean Gaussian with covariance matrices $Q_w(t) \in \mathbb{R}^{n \times n}$ and $R(t) \in \mathbb{R}^{m \times m}$, respectively. Assuming that $F(t)$, $H(t)$, $Q_w(t)$ and $R(t)$ are known, the standard KF update for estimating the state $x(t)$ is given by:

$\hat{x}(t|t-1) = F(t)\,\hat{x}(t-1|t-1)$,   (5-5-3)

$P(t|t-1) = F(t)\,P(t-1|t-1)\,F^T(t) + Q_w(t)$,   (5-5-4)

$e(t) = z(t) - H(t)\,\hat{x}(t|t-1)$,   (5-5-5)

$K(t) = P(t|t-1)\,H^T(t)\,\big[H(t)\,P(t|t-1)\,H^T(t) + R(t)\big]^{-1}$,   (5-5-6)

$\hat{x}(t|t) = \hat{x}(t|t-1) + K(t)\,e(t)$,   (5-5-7)

$P(t|t) = \big[I - K(t)\,H(t)\big]\,P(t|t-1)$,   (5-5-8)

where $\hat{x}(t|\tau)$, $\tau \in \{t-1, t\}$, denotes the estimate of $x(t)$ given the measurements $\{z(j), j \le \tau\}$, and $e(t)$ represents the prediction error.

Here, we associate $z(i)$ with the $i$-th observation of the state-space model and the true state $x$ with the true location of the 3D point, and assume that the additive noise is zero-mean and Gaussian distributed. Since the true 3D location does not change across the camera views, the state transition and observation matrices should be $F(t) = I_3$ and $H(t) = I_3$, respectively. Thus, (5-5-1) and (5-5-2) can be rewritten as:

$x(i) = x(i-1) + w(i)$,   (5-5-9)

$z(i) = x(i) + v(i)$,   (5-5-10)

where $w(i)$ and $v(i)$ are Gaussian-distributed innovation and measurement noise with zero mean and covariances $Q_w = qI_3$ and $R = rI_3$, respectively, and $I_3$ is the $3 \times 3$ identity matrix. The KF updates then reduce to:

$P(i|i-1) = P(i-1|i-1) + qI_3$,   (5-5-11)

$K(i) = P(i|i-1)\,\big[P(i|i-1) + rI_3\big]^{-1}$,   (5-5-12)

$\hat{x}(i|i) = \hat{x}(i-1|i-1) + K(i)\,e(i)$,   (5-5-13)

$P(i|i) = \big[I_3 - K(i)\big]\,P(i|i-1)$,   (5-5-14)

where $i = 2, \dots, M-1$. The initial state and covariance are initialized to $\hat{x}(1) = z(1)$ and $P(1|1) = 10\,I_3$, respectively, while $q = r = 0.1$ denotes the expected variance of the estimation error.

As mentioned earlier, outliers may arise due to errors in the matching and in the estimated camera parameters. We now propose a method to detect possible outliers at each KF iteration based on the following three consistency criteria. If they are violated, $z(i)$ is considered an outlier and the KF is terminated.

i) Segmentation consistency: at the $i$-th iteration, $z(i)$ is re-projected back to a 2D point $x_i = P_i z(i)$ in view $i$, where $P_i$ is the camera projection matrix of view $i$, which contains the intrinsic and extrinsic parameters. For notational convenience, we have dropped the additional subscript $t$ denoting the $t$-th time instant. Due to errors in computing the projection matrices and in triangulation, $x_i$ may lie outside the segment it belongs to. In this case, $z(i)$ is considered an outlier.

ii) Location consistency: the 3D distance between $z(i)$ and the location predicted by the KF, $\hat{x}(i)$, should be relatively small. That is, $\|z(i) - \hat{x}(i)\| \le D$ for some constant $D$. If not, $z(i)$ is treated as an outlier.

iii) Intensity consistency: $z(i)$ is re-projected back to two 2D points $x_i = P_i z(i)$ and $x_{i+1} = P_{i+1} z(i)$ in views $i$ and $i+1$, respectively. They should have similar intensity values, i.e. $|I_i(x_i) - I_{i+1}(x_{i+1})| \le \Delta_I$ for some constant $\Delta_I$. However, in order to cope with intensity variations, we employ the normalized cross-correlation (NCC) as the measure for the intensity consistency check. The NCC has a range of $[-1, 1]$ and, in this work, $z(i)$ is treated as an outlier if its NCC score is smaller than 0.8.

To ensure the reliability of the extracted 3D points, we only include those points which satisfy the above consistency tests for $K$ consecutive views. $K$ is chosen to be four in this work since we have at most seven matches for the eight views. The KF is first applied starting from the first view. When it is terminated, say at the $i$-th iteration, a set of potential 3D points $S_F = \{z(1), z(2), \dots, z(i)\}$ is obtained. If $i$ is less than $K$, we then proceed to the second view, and so on, until $K$ consecutive matched views are found. If so, the matched 3D points are fused by computing their mean value; if not, we proceed to the remaining corresponding points. The process is illustrated in Fig. 5-11, where the projections of the point cloud are also shown. Blue points denote the inliers, green points are the points detected by the segmentation consistency check in (i), and red points are the points detected by the location and intensity consistency checks in (ii) and (iii). Fig. 5-10(b) shows the refined point cloud after the outlier detection and Kalman filtering, where we can see that outlying points are effectively suppressed. The advantages of the KF approach are its simplicity of implementation and its flexibility: the views can be processed sequentially while the consistency checks are performed.
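The following is a compact sketch of the reduced KF update (5-5-11)-(5-5-14) combined with the location-consistency check and mean fusion described above. The threshold value D used here is illustrative (only q = r = 0.1 and the initialization are quoted in the text), and the segmentation and intensity checks of criteria (i) and (iii) are omitted for brevity.

import numpy as np

def fuse_track(z_list, q=0.1, r=0.1, D=0.05):
    """Fuse per-view 3D estimates z(1), z(2), ... of one point with the
    reduced Kalman filter; stop at the first estimate that violates the
    location-consistency check ||z(i) - x_hat|| > D.

    Returns the fused location (mean of the accepted estimates) and the
    number of views that passed, which the caller can compare against K."""
    x_hat = np.asarray(z_list[0], dtype=float)   # x_hat(1) = z(1)
    P = 10.0 * np.eye(3)                         # P(1|1) = 10 I_3
    accepted = [x_hat.copy()]
    for z in z_list[1:]:
        z = np.asarray(z, dtype=float)
        P_pred = P + q * np.eye(3)               # (5-5-11)
        if np.linalg.norm(z - x_hat) > D:        # location consistency (ii)
            break                                # treat z(i) as an outlier
        K = P_pred @ np.linalg.inv(P_pred + r * np.eye(3))   # (5-5-12)
        x_hat = x_hat + K @ (z - x_hat)          # (5-5-13)
        P = (np.eye(3) - K) @ P_pred             # (5-5-14)
        accepted.append(z)
    # Fuse the accepted matches by their mean, as described in the text.
    return np.mean(accepted, axis=0), len(accepted)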
Iterative Refinement of Point Cloud
With more reliable matched points, the camera parameters and
hence the image correspondences can be further improved. This suggests
an iterative method for further refining the point cloud and other
parameters.
More precisely, after Kalman filtering, the 2D matching and 3D
geometry are refined as follows:
i) The fused 3D point cloud is first re-projected to successive views
to serve as prior features/correspondences for MI-based matching. By
adding to (4-2-9) the re-projection correspondences as parts of the
feature term, a more reliable depth map can be computed.
ii) The updated matching result is then used to update the 3D point
cloud using the KF-based outlier detection and point fusion algorithm
introduced above.
iii) The process is repeated until the maximum number of iterations, $L_{MAX}$, is reached or no significant improvement of the 3D geometry can be obtained. To measure the change in the 3D geometry, a similarity measure between two consecutive 3D point clouds is therefore needed.
Let the 3D point clouds at the $l$-th and $(l+1)$-th iterations be $M^{(l)} = \{p_j^{(l)}, j = 1, \dots, J\}$ and $M^{(l+1)} = \{p_j^{(l+1)}, j = 1, \dots, J\}$, respectively, where $J$ is the number of estimated 3D points. In this work, the similarity measure is chosen as the root-mean-squared distance (RMSD) between the two point clouds, which is defined as

$\mathrm{RMSD} = \sqrt{\dfrac{1}{J}\sum_{j=1}^{J} D(\tilde{p}_j, p_j)^2}$,   (5-5-15)

where $\tilde{p}_j \in M^{(l+1)}$ is the point closest to $p_j \in M^{(l)}$ and $D(x, y)$ is the Euclidean distance between the vectors $x$ and $y$.

The algorithm is terminated when the RMSD stops decreasing or the maximum number of iterations, $L_{MAX}$, is reached. Fig. 5-12 plots the RMSD versus the number of iterations for refining the point cloud of the statue in Fig. 5-11. Fig. 5-10(c) shows the final point cloud obtained by refining the one in Fig. 5-10(b). Considerable improvement in terms of smoothness and the number of matched points is observed, which demonstrates the effectiveness of the proposed iterative refinement approach.
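As a concrete illustration of (5-5-15), the following short sketch computes the RMSD between two point clouds with a nearest-neighbour search; the function name is illustrative.

import numpy as np
from scipy.spatial import cKDTree

def rmsd_between_clouds(cloud_prev, cloud_next):
    """RMSD of equation (5-5-15): for every point in the previous cloud,
    find its nearest neighbour in the next cloud and average the squared
    distances.  cloud_prev, cloud_next : (J, 3) arrays."""
    tree = cKDTree(cloud_next)
    dists, _ = tree.query(cloud_prev)        # nearest-neighbour distances
    return float(np.sqrt(np.mean(dists**2)))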

Figure 5-11: (a)-(b) show the 3D-to-2D re-projections at frame 20 and frame 21, respectively. Blue points are inliers. Green points are outliers detected by the segmentation consistency check. Red points are the outliers detected by the intensity and location consistency checks. (c) shows an enlargement of the highlighted area in (a). The point cloud is down-sampled for better visualization.

Figure 5-12: Convergence behavior of the root mean square distance (RMSD) versus the number of iterations for the proposed iterative 3D reconstruction algorithm. The blue line shows the RMSD values with the KF-based outlier detection; the red line shows the RMSD values without KF-based outlier detection.

5.6 RBF Modeling and Mesh Generation

After the completion of the iterative refinement procedure, the final point cloud $\tilde{M}$ may still contain holes and may not be smooth enough to yield a good mesh. Therefore, further smoothing of the raw 3D point cloud is necessary. In this work, we employ RBF-based modeling for smoothing and for the construction of the 3D mesh. The basic form of an RBF interpolant is:

$F(x) = \sum_{j=1}^{\ell} c_j\, p_j(x) + \sum_{i=1}^{N} \lambda_i\, \phi(\|x - x_i\|)$,   (5-6-1)

where $c_j$ is the model coefficient of the polynomial $p_j(x)$, $j = 1, \dots, \ell$, which together form a basis of the polynomial part of the RBF; and $\lambda_i$ is the RBF coefficient for the basis function $\phi(\|x - x_i\|)$ with center $x_i$, $i = 1, \dots, N$, where $N$ is the number of data points, which is also equal to the number of RBF centers used. Given a set of 3D points $\{x_1, x_2, \dots, x_N\}$ with values $f = [F(x_1), F(x_2), \dots, F(x_N)]^T$ and the additional conditions [Beat 2004]:

$P^T \boldsymbol{\lambda} = 0$,   (5-6-2)

where $P$ is the matrix with entries $P_{ij} = p_j(x_i)$ and $\boldsymbol{\lambda}$ is the vector containing the RBF coefficients $\{\lambda_i\}$, the RBF coefficients satisfy the following equation:

$\begin{bmatrix} \Phi & P \\ P^T & 0 \end{bmatrix}\begin{bmatrix} \boldsymbol{\lambda} \\ c \end{bmatrix} = \begin{bmatrix} f \\ 0 \end{bmatrix}$,   (5-6-3)

where $[\Phi]_{ji} = \phi(\|x_j - x_i\|)$, $i, j = 1, \dots, N$, and $c = [c_1, c_2, \dots, c_\ell]^T$. By solving the linear equation in (5-6-3), the RBF coefficients can be computed. The complexity of solving (5-6-3) directly is $O(N^3)$.

The fast evaluation method proposed in [Beat 2001] can reduce the complexity to $O(N \log N)$. The basic idea of this fast RBF algorithm is that exact interpolation is not needed in practice; consequently, the value of $F(x_i)$ is only required to lie within an acceptable range to achieve a given accuracy. In this work, we also make use of this property to get rid of possible remaining outliers. More precisely, we set an error bar for the RBF values: $-\epsilon_{i,1} \le F(x_i) - f_i \le \epsilon_{i,2}$, where for simplicity we set $\epsilon_{i,1} = \epsilon_{i,2} = \epsilon_i$. Consequently, the problem becomes:

minimize $\boldsymbol{\lambda}^T \Phi \boldsymbol{\lambda}$,
subject to $|F(x_i) - f_i| \le \epsilon_i$, $P^T \boldsymbol{\lambda} = 0$,   (5-6-4)

which is recognized as a convex constrained quadratic programming problem (QPP) and can be solved readily [Beat 2004]. In this work, $\epsilon_i$ is chosen based on the normalized confidence value obtained from the matching results: the higher the confidence is, the closer the reconstructed points are required to be to the original points. Comparing the reconstructed 3D model (Fig. 5-13(c)) with the one obtained without outlier removal and smoothing (Fig. 5-13(a)) and the one using RBF without outlier removal (Fig. 5-13(b)), a significant improvement is obtained.
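As a concrete illustration of (5-6-1)-(5-6-3), the following is a minimal numpy sketch that fits an exact RBF interpolant with a linear polynomial part and the biharmonic kernel $\phi(r) = r$; the kernel choice and function names are assumptions of this sketch, and the constrained formulation (5-6-4) would replace the linear solve with a quadratic program.

import numpy as np

def fit_rbf(points, values):
    """Solve the block system (5-6-3) for an RBF with kernel phi(r) = r
    and a linear polynomial part [1, x, y, z]."""
    N = points.shape[0]
    Phi = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # phi(||xj - xi||)
    P = np.hstack([np.ones((N, 1)), points])           # P_ij = p_j(x_i), so l = 4
    A = np.block([[Phi, P], [P.T, np.zeros((4, 4))]])
    b = np.concatenate([values, np.zeros(4)])
    sol = np.linalg.solve(A, b)
    return sol[:N], sol[N:]                             # lambda coefficients, polynomial c

def eval_rbf(lam, c, centers, x):
    """Evaluate F(x) of equation (5-6-1) at a query point x."""
    r = np.linalg.norm(x - centers, axis=1)
    return float(np.dot(lam, r) + c[0] + np.dot(c[1:], x))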


Figure 5-13: 3D reconstruction results (a) without using RBF, (b) using RBF without outlier detection and (c) using RBF with outlier removal.

Experimental Results
Figs. 5-14 and 5-15 show example renderings using the 3D model and the shadow field in OpenGL, which supports real-time relighting. Since the other side of the object in Fig. 5-15 is invisible, only part of the object can be recovered.
Using the circular still-camera array constructed in Chapter 3.2.1, we have captured 7 ancient Chinese artifacts from the University Museum. More rendering results can be found in our demonstration video at http://www.youtube.com/watch?v=hZHW5XS9xAg. Moreover, the SK-LPR-RICI method can be applied to the depth map to estimate a smooth gradient field. Combining this gradient field with the depth map, a normal field corresponding to the 2D image can be approximated, which can be used to perform real-time 2D relighting. A demonstration video of the 2D relighting results can be found at: http://www.youtube.com/watch?v=5LRdPgnWapo.


Figure 5-14: Object-based rendering results of the Podium sequence using the estimated 3D model and shadow field under different lighting conditions.

Figure 5-15: Object-based rendering results of the conference sequence. (a) and (b) are the 3D reconstruction results at two time instants; (c) and (d) are the rendering results of (a) and (b). Note that only partial geometry of the dynamic object is recovered, since it is only partially observable.


5.7 Summary
In this chapter, point matching, refinement, 3D reconstruction and rendering methods for the static and moveable IBR systems have been presented. The first method, for the static IBR system, relies on the epipolar constraint and SIFT feature detection to find a set of sparse corresponding points. A Gabor filter-based measurement, which is insensitive to noise, is then used to obtain more reliable corresponding points for interpolating the disparity maps. The second method, for the moveable IBR system, is more involved: it uses the sequential structure-from-motion (S-SFM) technique and the estimated dense depth maps to obtain the point cloud for 3D reconstruction. A new iterative point cloud refinement algorithm based on the Kalman filter (KF) for outlier removal, together with the segmentation-MI-based algorithm for further refining the correspondences and the projection matrices, has been proposed.


Moreover, a new robust RBF-based modeling algorithm is
developed to further suppress possible outliers and generate a smooth 3D
mesh of the objects. View dependent texture and shadow light field are
then incorporated to further improve the rendering quality and user
interactivity. Experimental results show that high quality renderings can
be obtained by using the shadow light field and the 3D model
reconstructed.


Chapter 6 Conclusion and Future Research


6.1 Conclusion
The design and construction of still and moveable image-based rendering systems based on multiple camera arrays have been studied in this thesis. Associated processing algorithms for object-based rendering and 3D reconstruction using the two image-based rendering (IBR) systems, aimed at improved viewing freedom and environmental modeling, have been presented. These include the configuration of the IBR systems, depth estimation, object tracking, 3D model reconstruction, as well as 3D rendering techniques.

Main contributions resulting from this study are summarized as


follows:
1. Two IBR systems have been studied in this thesis. The first one
consists of a multiple still camera array. It is used to capture ancient
Chinese artifacts. Excellent rendering quality is obtained because of
the high resolution of the camera (Canon 550D). This system can be
used for the preservation and dissemination of cultural artifacts with high
digital quality. By using this circular camera array, we developed
novel techniques for rendering new views of the artifacts from the
images captured using the object-based approach. The second IBR
system uses a linear camera array consisting of 8 video cameras
mounted on an electrically controllable wheel chair. Its motion can
be controlled manually or remotely by means of additional hardware
circuitry. Because of the mobility of the system, the viewing range is largely extended, and large environments and moving objects can be rendered.
2. A new combined segmentation-mutual-information (MI)-based
algorithm for dense depth map estimation has been presented. It
relies on segmentation, LPR-based depth map smoothing and an MI-based matching algorithm to iteratively estimate the depth map. The
method is very flexible and both semi-automatic and automatic
segmentations can be used. Using the depth maps captured and the
object-based approach, high quality renderings of outdoor scenes
along the trajectory can be obtained, which considerably improved
the viewing freedom. It also can be extended simply to object
tracking. Experimental results show that its performance is reliable
even for noisy videos such as dynamic ultrasound images.
3. Two point matching algorithms for the two IBR systems have been proposed. The first algorithm uses epipolar geometry and the scale-invariant feature transform (SIFT) as constraints to find sparse corresponding points; a Gabor filter is then used to obtain dense corresponding points. The second algorithm, for the moveable IBR system, uses the sequential structure-from-motion (S-SFM) technique, which estimates the location of the moveable IBR system so as to obtain an initial set of fairly reliable 3D point clouds from the 2D correspondences. New iterative Kalman filter (KF)-based and segmentation-MI-based algorithms are proposed to fuse the correspondences from different views and remove possible outliers to obtain an improved point cloud. Simulation results show that the proposed algorithm significantly reduces the adverse effect of the outliers and generates more reliable point clouds. After the point cloud estimation, a new robust RBF-based modeling algorithm is proposed to further suppress possible outliers and generate smooth 3D surfaces from the raw 3D point cloud. Finally, view dependent texture mapping has been proposed to enhance the final rendering effect. The rendering results can be displayed on a Newsight multiview TV interactively with the help of GPU acceleration.

6.2 Future Research


The main concern of this thesis is to address the problem of design
and construction of IBR systems and associated processing algorithms
for both static and moveable applications. The results obtained in this
work can be extended in the following aspects.
1. Further research on the design and construction of IBR systems may
focus on improving the mechanical design of the camera array so
that they can be steered to different directions, while the platform or
wheel chair is moving.
2. With the development of parallel computing using GPUs and distributed signal processing, the combined segmentation-mutual-information (MI)-based algorithm and the radial basis function (RBF) modeling can be further accelerated. If the processing complexity of these algorithms can be further reduced, they can be used in many real-time applications, such as industrial problems.
3. Recently, Microsoft has produced a device called KINECT for the Xbox 360. Originally, KINECT was designed as a motion-sensing input device for the game console: it enables users to control and interact with the Xbox 360 without the need to touch a game controller, through a natural user interface using gestures and spoken commands. The depth sensor of KINECT consists of an infrared laser projector combined with a monochrome CMOS sensor, which captures video data in 3D under any ambient light conditions. By combining multiple video cameras with multiple KINECTs, many new applications and algorithms may emerge. By using the rough depth maps provided by KINECT in real time as the initial input, high-resolution depth maps may be generated in real time; such a system can also resolve depth values at texture-less regions. Moreover, it will further accelerate the development of new and promising applications such as 3D photography, multiview TV systems, education, digital entertainment, content generation/production tools, video surveillance, distance learning, etc.


Appendix I Publications
Journal Papers:

[1] Z. Y. Zhu, S. Zhang, S. C. Chan and H. Y. Shum, "Object-Based Rendering and 3D Reconstruction Using a Moveable Image-Based System," IEEE Trans. Circuits and Syst. Video Technol., to be published, 2012.

[2] X. Chen, Y. P. Zheng, J. Y. Guo, Z. Y. Zhu, S. C. Chan and Z. G. Zhang, "Sonomyographic responses during voluntary isometric ramp contraction of the human rectus femoris muscle," European Journal of Applied Physiology, vol. 112, no. 7, pp. 2603-2614, 2012.

[3] K. T. Ng, Z. Y. Zhu, C. Wang, S. C. Chan and H. Y. Shum, "Image-based Rendering of Ancient Chinese Artifacts for Multiview Display - A Multi-camera Approach," IEEE Trans. Multimedia, to be published, 2012.

[4] S. C. Chan, Z. Y. Zhu, K. T. Ng, C. Wang, S. Zhang, Z. Ye and H. Y. Shum, "The Design and Construction of a Moveable Image-Based Rendering System and Its Application to Multiview Conferencing," Journal of Signal Processing Systems, vol. 67, no. 3, pp. 305-316, Jun. 2012.

Conference Papers:

[1] Z. G. Zhang, S. C. Chan and Z. Y. Zhu, "A New Two-Stage Method for Restoration of Images Corrupted by Gaussian and Impulse Noises using Local Polynomial Regression and Edge Preserving Regularization," in Proc. IEEE Int. Symp. Circuits Syst., May 2009, pp. 948-951.

[2] K. T. Ng, Z. Y. Zhu and S. C. Chan, "An Approach to 2D-To-3D Conversion for Multiview Displays," in Proc. IEEE Int. Conf. Info., Commu. Signal Processing, Dec. 2009, pp. 1-5.

[3] S. C. Chan, Z. Y. Zhu, K. T. Ng, C. Wang, S. Zhang, Z. G. Zhang, Z. F. Ye and H. Y. Shum, "A Moveable Image-Based System and Its Applications to Multiview Audio-Visual Conferencing," in Proc. IEEE Int. Symp. Commu. Info. Technol., Tokyo, Japan, 2010, pp. 1142-1145.

[4] Z. Y. Zhu, K. T. Ng, S. C. Chan and H. Y. Shum, "Image-Based Rendering of Ancient Chinese Artifacts for Multi-view Displays using a Multi-Camera Array Approach," in Proc. IEEE Int. Symp. Circuits Syst., May 2010, pp. 3252-3255.

[5] X. Z. Yao, S. C. Chan, Z. Y. Zhu, K. T. Ng and H. Y. Shum, "Image-Based Compression, Prioritized Transmission and Progressive Rendering of Circular Light Fields (CLFS) For Ancient Chinese Artifacts," in Proc. Asia Pacific Conf. Circuits and Systems, Dec. 2010.

[6] C. Wang, Z. Y. Zhu, S. C. Chan and H. Y. Shum, "Realistic and Interactive Image-based Rendering of Ancient Chinese Artifacts using a Multiple Camera Array," in Proc. IEEE Int. Symp. Circuits Syst., May 2011.

References
[Adel 1991] E. H. Adelson and J. Bergen, "The plenoptic function and the elements of early vision," in Comput. Models Visual Process., Cambridge, MA: MIT Press, pp. 3-20, 1991.

[Agra 2000] M. Agrawala, R. Ramamoorthi, A. Heirich and L. Moll, "Efficient image-based methods for rendering soft shadows," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'00), Aug. 2000, pp. 372-384.

[Atta 1999] G. Attardi, M. Betro, M. Forte, R. Gori, A. Guidazzoli, S. Imboden and F. Mallegni, "3D facial reconstruction and visualization of ancient Egyptian mummies using spiral CT data: Soft tissues reconstruction and textures application," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'99), Aug. 1999, pp. 223-239.

[Beat 2001] R. K. Beatson, W. A. Light and S. Billings, "Fast solution of the radial basis function interpolation equations: Domain decomposition methods," SIAM J. Scientific Comput., vol. 22, pp. 1717-1740, Feb. 2001.

[Beat 2004] R. K. Beatson, J. B. Cherrie, T. J. McLennan, T. J. Mitchell, J. C. Carr, W. R. Fright and B. C. McCallum, "Surface reconstruction via smoothest restricted range approximation," in Geometric Modeling and Computing, Brentwood, USA: Nashboro Press, pp. 41-52, 2004.

[Bley 2010] M. Bleyer, C. Rother and P. Kohli, "Surface stereo with soft segmentation," in Proc. IEEE Comput. Soc. Conf. CVPR, Aug. 2010, pp. 1570-1577.

[Boyk 2001] Y. Boykov, O. Veksler and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222-1239, 2001.

[Bueh 2001] C. Buehler, M. Bosse, L. McMillan, S. Gortler and M. Cohen, "Unstructured lumigraph rendering," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'01), Aug. 2001, pp. 425-432.

[Chai 2000] J. X. Chai, X. Tong, S. C. Chan and H. Y. Shum, "Plenoptic sampling," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'00), Aug. 2000, pp. 307-318.

[Chan 2003] S. C. Chan, K. T. Ng, Z. F. Gan, K. L. Chan and H. Y. Shum, "The data compression of simplified dynamic light fields," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 3, Apr. 2003, pp. 653-656.

[Chan 2004] S. C. Chan, K. T. Ng, Z. F. Gan, K. L. Chan and H. Y. Shum, "The plenoptic videos: capturing, rendering and compression," in Proc. IEEE Int. Symp. Circuits Syst., vol. 3, May 2004, pp. 905-908.

[Chan 2005] S. C. Chan, K. T. Ng, Z. F. Gan, K. L. Chan and H. Y. Shum, "The plenoptic videos," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 12, pp. 1650-1659, Dec. 2005.

[Chan 2007] S. C. Chan, H. Y. Shum and K. T. Ng, "Image-based rendering and synthesis: technological advances and challenges," IEEE Signal Process. Mag., vol. 24, no. 7, pp. 22-33, Nov. 2007.

[Chan 2009] S. C. Chan, Z. F. Gan, K. T. Ng, K. L. Ho and H. Y. Shum, "An object-based approach to image/video-based synthesis and processing for 3-D and multiview televisions," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 6, pp. 821-831, 2009.

[Chan 2010] S. C. Chan, Z. Y. Zhu, K. T. Ng, C. Wang, S. Zhang, Z. G. Zhang and H. Y. Shum, "A movable image-based system and its applications to multiview audio-visual conferencing," in Proc. IEEE Int. Symp. Commu. Info. Technol., Oct. 2010, pp. 1142-1145.

[Chan 2012] S. C. Chan, Z. Y. Zhu, K. T. Ng, C. Wang, S. Zhang, Z. Ye and H. Y. Shum, "The Design and Construction of a Moveable Image-Based Rendering System and Its Application to Multiview Conferencing," Journal of Signal Processing Systems, vol. 67, no. 3, pp. 305-316, Jun. 2012.

[Chen 1993] S. E. Chen and L. Williams, "View interpolation for image synthesis," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'93), Aug. 1993, pp. 279-288.

[Chen 1995] S. E. Chen, "QuickTime VR - an image-based approach to virtual environment navigation," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'95), Aug. 1995, pp. 29-38.

[Chen 2012] X. Chen, Y. P. Zheng, J. Y. Guo, Z. Y. Zhu, S. C. Chan and Z. G. Zhang, "Sonomyographic responses during voluntary isometric ramp contraction of the human rectus femoris muscle," European Journal of Applied Physiology, vol. 112, no. 7, pp. 2603-2614, 2012.

[Debe 1996] P. E. Debevec, C. J. Taylor and J. Malik, "Modeling and rendering architecture from photographs: a hybrid geometry and image-based approach," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'96), Aug. 1996, pp. 11-20.

[Debe 1998] P. E. Debevec, Y. Yu and G. Borshukov, "Efficient view-dependent image-based rendering with projective texture-mapping," in Eurographics Workshop on Rendering, pp. 105-116, 1998.

[Dura 2005] F. Durand, N. Holzschuch, C. Soler, E. Chan and F. X. Sillion, "A frequency analysis of light transport," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'05), Jul. 2005, pp. 1115-1126.

[Egna 2002] G. Egnal and R. P. Wildes, "Detecting binocular half-occlusions: Empirical comparisons of five approaches," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 8, pp. 1127-1133, Aug. 2002.

[Felz 2004] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient Graph-Based Image Segmentation," Int. J. Comput. Vision, vol. 59, no. 2, Sep. 2004.

[Flie 2007] M. Flierl and B. Girod, "Multiview video compression: exploiting inter-image similarities," IEEE Signal Processing Magazine: Special Issue on MVI and 3DTV, vol. 24, no. 6, pp. 66-76, 2007.

[Fuji 1996] T. Fujii, T. Kimoto and M. Tanimoto, "Ray space coding for 3D visual communication," in Proc. Picture Coding Symp., Mar. 1996, pp. 447-451.

[Gan 2005] Z. F. Gan, S. C. Chan, K. T. Ng and H. Y. Shum, "An object-based approach to plenoptic videos," in Proc. IEEE Int. Symp. Circuits Syst., May 2005, pp. 3435-3438.

[Gers 1939] A. Gershun, "The light field," translated by P. Moon and G. Timoshenko, Journal of Mathematics and Physics, vol. XVIII, MIT, pp. 51-151, 1939.

[Gort 1996] S. J. Gortler, R. Grzeszczuk, R. Szeliski and M. F. Cohen, "The lumigraph," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'96), Aug. 1996, pp. 43-54.

[Guo 2008] X. Guo, Y. Lu, F. Wu, D. Zhao and W. Gao, "Wyner-Ziv-Based Multiview Video Coding," IEEE Trans. Circuits and Syst. Video Technol., vol. 18, no. 6, pp. 713-724, Jun. 2008.

[Hart 2003] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge, UK: Cambridge Univ. Press, 2003.

[Hu 2007] R. Hu, R. J. Shi, I. F. Shen and W. B. Chen, "Video stabilization using scale-invariant features," in Proc. Int. Conf. Info. Visualization, Jul. 2007, pp. 871-876.

[Huan 2006] X. L. Huang, N. Paragios and D. N. Metaxas, "Shape registration in implicit spaces using information theory and free form deformations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 8, pp. 1303-1318, 2006.

[Ikeu 2003] K. Ikeuchi, A. Nakazawa, K. Hasegawa and T. Ohishi, "The Great Buddha Project: Modeling Cultural Heritage for VR Systems through Observation," in Proc. 2nd IEEE/ACM Int. Symp. Mixed Augmented Reality, Oct. 2003, pp. 7-16.

[Ikeu 2012] K. Ikeuchi (Ed.), Encyclopedia of Computer Vision, Springer, to appear.

[IVS] www.ivs-tech.com

[Klau 2006] A. Klaus, M. Sormann and K. Karner, "Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure," in Proc. IEEE Int. Conf. Pattern Recognit., vol. 3, Sep. 2006, pp. 15-18.

[Kolm 2001] V. Kolmogorov and R. Zabih, "Computing visual correspondence with occlusions using graph cuts," in Proc. Int. Conf. Comput. Vision, vol. 2, Jul. 2001, pp. 508-515.

[Kolm 2002] V. Kolmogorov and R. Zabih, "Multi-camera scene reconstruction via graph cuts," in Proc. Euro. Conf. Computer Vision, 2002, pp. 82-96.

[Konr 2007] J. Konrad and M. Halle, "3-D displays and signal processing," IEEE Signal Processing Magazine, vol. 24, no. 6, pp. 97-111, Nov. 2007.

[Kutu 2000] K. Kutulakos and S. Seitz, "A theory of shape by space carving," Int. J. Comput. Vision, vol. 38, no. 3, pp. 199-218, 2000.

[Lalo 1999] P. Lalonde and A. Fournier, "Interactive rendering of wavelet projected light fields," in Proc. Conf. Graphics Interface, 1999, pp. 107-114.

[Levo 1996] M. Levoy and P. Hanrahan, "Light field rendering," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'96), Aug. 1996, pp. 31-42.

[Levo 2002] M. Levoy et al., "The digital Michelangelo project: 3D scanning of large statues," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'02), Aug. 2002, pp. 131-144.

[Lhui 2003] M. Lhuillier and L. Quan, "Image-based rendering by joint view triangulation," IEEE Trans. Circuits and Syst. Video Technol., vol. 13, no. 11, pp. 1051-1063, Nov. 2003.

[Li 2004] Y. Li, J. Sun, C. K. Tang and H. Y. Shum, "Lazy snapping," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'04), Aug. 2004, pp. 303-308.

[Lian 2008] C. K. Liang, T. H. Lin, B. Y. Wong, C. Liu and H. H. Chen, "Programmable aperture photography: multiplexed light field acquisition," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'08), Aug. 2008, pp. 1-10.

[Lowe 2004] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91-110, 2004.

[Magn 2000] M. Magnor and B. Girod, "Data compression for light-field rendering," IEEE Trans. Circuits and Syst. Video Technol., vol. 10, no. 3, pp. 338-343, Apr. 2000.

[Magn 2003] M. Magnor, P. Ramanathan and B. Girod, "Multi-view coding for image-based rendering using 3-D scene geometry," IEEE Trans. Circuits and Syst. Video Technol., vol. 13, no. 11, pp. 1092-1106, Nov. 2003.

[Mane 2000] A. Manessis, A. Hilton, P. Palmer, P. McLauchlan and X. Shen, "Reconstruction of scene models from sparse 3D structure," in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 1, 2000, pp. 666-673.

[Mats 2005] Y. Matsushita, E. Ofek and H. Y. Shum, "Full-frame video stabilization," in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 1, Jun. 2005, pp. 50-57.

[Mats 2006] Y. Matsushita, E. Ofek, W. Ge, X. Tang and H. Y. Shum, "Full-frame video stabilization with motion inpainting," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 7, pp. 1150-1163, Jun. 2006.

[McMi 1995] L. McMillan and G. Bishop, "Plenoptic modeling: An image-based rendering system," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'95), Aug. 1995, pp. 39-46.

[Ng 2004] R. Ng, R. Ramamoorthi and P. Hanrahan, "Triple product wavelet integrals for all-frequency relighting," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'04), Aug. 2004, pp. 477-487.

[Ng 2009] K. T. Ng, Z. Y. Zhu and S. C. Chan, "An approach to 2D-to-3D conversion for multiview displays," in Proc. IEEE Int. Conf. Info., Commu. Signal Processing, Dec. 2009, pp. 1-5.

[Ng 2010] K. T. Ng, Q. Wu, S. C. Chan and H. Y. Shum, "Object-Based Coding for Plenoptic Videos," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 4, pp. 548-562, Apr. 2010.

[Ng 2012] K. T. Ng, Z. Y. Zhu, C. Wang, S. C. Chan and H. Y. Shum, "Image-based Rendering of Ancient Chinese Artifacts for Multiview Display - A Multi-camera Approach," IEEE Trans. Multimedia, to be published, 2012.

[Noce 1999] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, 1999.

[Ogal 2007] A. S. Ogale and Y. Aloimonos, "A roadmap to the integration of early visual modules," Int. J. Comput. Vision, vol. 72, pp. 9-25, Apr. 2007.

[Pele 1997] S. Peleg and J. Herman, "Panoramic mosaics by manifold projection," in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 1997, pp. 338-343.

[Pete 2001] L. Peter and W. Strasser, "The wavelet stream: Interactive multi resolution light field rendering," in Proc. Eurographics Rendering Workshop, 2001, pp. 262-273.

[PGDT] PG DRIVE VR2 CONTROLLER, URL: http://www.pgdt.com/products/vr2/index.html

[Rama 2005] R. Ramamoorthi, D. Mahajan and P. Belhumeur, "A first-order analysis of lighting, shading, and shadows," ACM Trans. Graphics, vol. 26, no. 1, Jan. 2007.

[Rata 1998] K. Ratakonda, "Real-time digital video stabilization for multi-media applications," in Proc. IEEE Int. Symp. Circuits Syst., vol. 4, May 1998, pp. 69-72.

[Sand 1987] D. T. Sandwell, "Biharmonic Spline Interpolation of Geos-3 and Seasat Altimeter Data," Geophysical Research Letters, vol. 14, pp. 139-142, Feb. 1987.

[Scha 2002] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," Int. J. Comput. Vision, vol. 47, no. 1/2/3, pp. 7-42, Apr.-Jun. 2002.

[Shad 1998] J. Shade, S. Gortler, L. W. He and R. Szeliski, "Layered depth images," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'98), Jul. 1998, pp. 231-242.

[Shum 1999] H. Y. Shum and L. W. He, "Rendering with concentric mosaics," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'99), Aug. 1999, pp. 299-306.

[Shum 2004] H. Y. Shum, J. Sun, S. Yamazaki, Y. Li and C. K. Tang, "Pop-up light field: An interactive image-based modeling and rendering system," ACM Trans. Graphics, vol. 23, no. 2, pp. 143-162, Apr. 2004.

[Shum 2007] H. Y. Shum, S. C. Chan and S. B. Kang, Image-Based Rendering, NY: Springer-Verlag, 2007.

[Silv 1986] B. W. Silverman, Density Estimation for Statistics and Data Analysis, London: Chapman & Hall, 1986.

[Slab 2004] G. Slabaugh, B. Culbertson, T. Malzbender and M. Stevens, "Methods for volumetric reconstruction of visual scenes," Int. J. Comput. Vision, vol. 57, no. 3, pp. 179-199, 2004.

[Sloa 2002] P. Sloan, J. Kautz and J. Snyder, "Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'02), Jul. 2002, pp. 527-536.

[Sun 2005] J. Sun, Y. Li, S. B. Kang and H. Y. Shum, "Symmetric stereo matching for occlusion handling," in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 2, Aug. 2005, pp. 399-406.

[Szel 1997] R. Szeliski and H. Y. Shum, "Creating full view panoramic image mosaics and environment maps," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'97), Aug. 1997, pp. 251-258.

[Szel 1999] R. Szeliski, "A multi-view approach to motion and stereo," in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 1, 1999, pp. 157-163.

[Tagu 2008] Y. Taguchi, B. Wilburn and L. Zitnick, "Stereo reconstruction with mixed pixels using adaptive over-segmentation," in Proc. IEEE Comput. Soc. Conf. CVPR, Aug. 2008, pp. 2720-2727.

[Tayl 2003] C. J. Taylor, "Surface reconstruction from feature based stereo," in Proc. Int. Conf. Comput. Vision, vol. 1, Oct. 2003, pp. 184-190.

[Tong 2003] X. Tong and R. M. Gray, "Interactive rendering from compressed light fields," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 11, pp. 1080-1091, Nov. 2003.

[Torr 2000] P. H. S. Torr and A. Zisserman, "MLESAC: A new robust estimator with application to estimating image geometry," Computer Vision and Image Understanding, vol. 78, pp. 138-156, Apr. 2000.

[Truc 1998] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision, Upper Saddle River, NJ: Prentice Hall, 1998.

[USBI] URL: http://www.devasys.com/usbi2cio.htm

[Veer 2007] A. Veeraraghavan, R. Raskar, A. Agrawal, A. Mohan and J. Tumblin, "Dappled photography: Mask enhanced cameras for heterodyned light fields and coded aperture refocusing," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'07), Jul. 2007, vol. 26, issue 3, article 69.

[Wang 2008] Z. Wang and Z. Zheng, "A region based stereo matching algorithm using cooperative optimization," in Proc. IEEE Comput. Soc. Conf. CVPR, Aug. 2008, pp. 887-894.

[Wang 2011] C. Wang, Z. Y. Zhu, S. C. Chan and H. Y. Shum, "Realistic and Interactive Image-based Rendering of Ancient Chinese Artifacts using a Multiple Camera Array," in Proc. IEEE Int. Symp. Circuits Syst., May 2011.

[Weij 2006] J. van de Weijer, T. Gevers and A. D. Bagdanov, "Boosting color saliency in image feature detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, pp. 150-156, Jan. 2006.

[Wong 2002] T. Wong, C. Fu, P. Heng and C. Leung, "The plenoptic illumination function," IEEE Trans. Multimedia, vol. 4, no. 3, pp. 361-371, 2002.

[Wood 2000] D. N. Wood, D. I. Azuma, K. Aldinger, B. Curless, T. Duchamp, D. H. Salesin and W. Stuetzle, "Surface light fields for 3D photography," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'00), Aug. 2000, pp. 287-296.

[Wu 2005] Q. Wu, K. T. Ng, S. C. Chan and H. Y. Shum, "On object-based compression for a class of dynamic image-based representations," in Proc. Int. Conf. Image Process., Sep. 2005, pp. 405-408.

[Xu 2008] L. Xu and J. Jia, "Stereo matching: an outlier confidence approach," in Proc. Euro. Conf. Computer Vision, vol. 5305, Oct. 2008, pp. 775-787.

[Yang 2007] Q. Yang, R. Yang, J. Davis and D. Nistér, "Spatial-depth super resolution for range images," in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2007, pp. 1-8.

[Yang 2009] Q. Yang, L. Wang, R. Yang, H. Stewénius and D. Nistér, "Stereo matching with color-weighted correlation, hierarchical belief propagation and occlusion handling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 492-504, Mar. 2009.

[Yao 2010] X. Z. Yao, S. C. Chan, Z. Y. Zhu, K. T. Ng and H. Y. Shum, "Image-Based Compression, Prioritized Transmission and Progressive Rendering of Circular Light Fields (CLFS) For Ancient Chinese Artifacts," in Proc. Asia Pacific Conf. Circuits and Systems, Dec. 2010.

[Zeng 2005] G. Zeng, S. Paris, L. Quan and F. Sillion, "Progressive surface reconstruction from images using a local prior," in Proc. Int. Conf. Comput. Vision, vol. 1, 2005, pp. 1230-1237.

[Zhan 1999] Z. Zhang, "Flexible camera calibration by viewing a plane from unknown orientations," in Proc. Int. Conf. Comput. Vision, vol. 1, 1999, pp. 666-673.

[Zhan 2000] Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330-1334, 2000.

[Zhan 2005] L. Zhang and S. Seitz, "Parameter estimation for MRF stereo," in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 2, Aug. 2005, pp. 288-295.

[Zhan 2008] Z. G. Zhang, S. C. Chan, K. L. Ho and K. C. Ho, "On Bandwidth Selection in Local Polynomial Regression Analysis and Its Application to Multi-resolution Analysis of Non-uniform Data," J. Signal Process. Syst. Signal Image and Video Technol., vol. 52, no. 3, pp. 263-280, 2008.

[Zhan 2009] Z. G. Zhang, S. C. Chan and Z. Y. Zhu, "A New Two-Stage Method for Restoration of Images Corrupted by Gaussian and Impulse Noises using Local Polynomial Regression and Edge Preserving Regularization," in Proc. IEEE Int. Symp. Circuits Syst., May 2009, pp. 948-951.

[Zhou 2005] K. Zhou, Y. H. Hu, S. Lin, B. N. Guo and H. Y. Shum, "Precomputed shadow fields for dynamic scenes," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'05), Jul. 2005, pp. 1196-1201.

[Zhu 2010] Z. Zhu, K. T. Ng, S. Chan and H. Shum, "Image-Based Rendering of Ancient Chinese Artifacts for Multiview-Displays - a Multi-Camera Approach," in Proc. IEEE Int. Symp. Circuits Syst., May 2010, pp. 3252-3255.

[Zhu 2012] Z. Y. Zhu, S. Zhang, S. C. Chan and H. Y. Shum, "Object-Based Rendering and 3D Reconstruction Using a Moveable Image-Based System," IEEE Trans. Circuits and Syst. Video Technol., to be published, 2012.

[Zitn 2004] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder and R. Szeliski, "High-quality video view interpolation using a layered representation," in Proc. Annu. Conf. Comput. Graph. (SIGGRAPH'04), Aug. 2004, pp. 600-608.

[Zwic 2007] M. Zwicker, A. Vetro, S. Yea, W. Matusik, H. Pfister and F. Durand, "Resampling, antialiasing, and compression in multiview 3-D display," IEEE Signal Processing Magazine: Special Issue on MVI and 3DTV, vol. 24, no. 6, pp. 88-96, Nov. 2007.
