Author: ZHU Zhenyu, B.Eng.
Ph.D. Thesis, 2012
URL: http://hdl.handle.net/10722/173910

Declaration

Zhu Zhenyu
August 2012

Acknowledgement
Contents
DECLARATION .................................................................................. IV
ACKNOWLEDGEMENT ..................................................................... V
CONTENTS .......................................................................................... VI
LIST OF FIGURES ............................................................................... X
LIST OF TABLES .............................................................................. XV
LIST OF ABBREVIATIONS .......................................................... XVI
CHAPTER 1 INTRODUCTION ........................................................... 1
1.1 BACKGROUND .................................................................................. 1
1.2 THESIS OUTLINE .............................................................................. 4
CHAPTER 2 REVIEW OF BASIC TOPICS IN IMAGE-BASED
RENDERING .......................................................................................... 9
2.1 INTRODUCTION ................................................................................ 9
2.2 REVIEW OF PLENOPTIC FUNCTION ................................................... 9
2.2.1 Basic Theory .......................................................................... 10
VI
VII
IX
List of Figures

Figure 2-2: Forward mapping.
Figure 3-12: Video Stabilization result. The first row shows the original images captured by our system; the second row shows the stabilized images without video completion; the third row shows the completed results.
Figure 5-1: Epipolar Geometry.
Figure 5-3: Epipolar Line.

List of Tables

Table 2-1
Table 4-1
List of Abbreviations

BRDF    Bidirectional Reflectance Distribution Function
BP      Belief Propagation
CLF
CPU     Central Processing Unit
CSA     Cross-Sectional Area
DCP
DCT     Discrete Cosine Transform
DSCs    Digital Still Cameras
DSP     Digital Signal Processor
Fig.    Figure
fps     frames per second
GC      Graph Cut
GPU     Graphics Processing Unit
HD      High Definition
IBR     Image-Based Rendering
i.i.d.  independent and identically distributed
ISKR    Iterative Steering Kernel Regression
JVT     Joint Video Team
KF      Kalman Filter
LAN     Local Area Network
L-BFGS  Limited-memory Broyden-Fletcher-Goldfarb-Shanno
LDIs    Layered-Depth Images
LPR     Local Polynomial Regression
LS      Least Square
MCU     Micro-Controller Unit
MI      Mutual Information
M-IBR   Moveable Image-Based Rendering
MRF     Markov Random Field
MVC     Maximal Voluntary Contraction
PCA     Principal Component Analysis
PRT     Pre-computed Radiance Transfer
QPP
RANSAC  Random Sample Consensus
RBF     Radial Basis Function
RF      Rectus Femoris
R-ICI
RMSD    Root Mean Square Distance
SCLF
SFM     Structure-From-Motion
SPIHT   Set Partitioning In Hierarchical Trees
S-SFM   Sequential-Structure-From-Motion
Chapter 1 Introduction
1.1 Background
Image-based rendering/representation (IBR) [Chen 1995], [Debe 1996], [Gort 1996], [Levo 1996], [McMi 1995], [Pele 1997], [Szel 1997], [Shad 1998], [Shum 1999] is a promising technology for rendering new views of scenes from a collection of densely sampled images or videos. It has potential applications in virtual reality, immersive television and visualization systems. Central to IBR is the plenoptic function [Adel 1991], which describes the intensity of each light ray in the world as a function of visual angle, wavelength, time, and viewing position. The plenoptic function is thus a 7-dimensional function l(V_x, V_y, V_z, θ, φ, λ, t) of the viewing position (V_x, V_y, V_z), the azimuth and elevation angles (θ, φ), time t, and wavelength λ. Traditional images and videos are just 2D and 3D special cases of the plenoptic function. In principle, one can reconstruct any view in space and time if a sufficient number of samples of the plenoptic function is available. The rendering of novel views can therefore be viewed as the reconstruction of the plenoptic function from its samples. Image-based representations are usually densely sampled high-dimensional data with large data sizes, but their samples are highly correlated. Because of the multidimensional nature of image-based representations and scene geometry, much research has been devoted to the efficient capturing, sampling, rendering and compression of IBR.

Depending on the functionality required, there is a spectrum of IBR representations, as shown in Fig. 1-1. They differ from each other in the amount of geometry information of the scenes/objects being used. At one end of the spectrum, like traditional texture mapping, we have very accurate
Figure 1-1: The spectrum of IBR representations: from rendering with no geometry (light field, Lumigraph, mosaicking), which uses more images and less geometry, through rendering with implicit geometry (view morphing), to rendering with explicit geometry (3D warping, layered-depth images, texture-mapped models, view-dependent geometry, view-dependent texture, shadow light field), which uses more geometry and fewer images.
also urgent issues waiting for a satisfactory solution in order for IBR to establish itself as an essential medium for communication and presentation. All of these motivate us to study the design and construction of image-based rendering systems based on plenoptic videos. Such systems can potentially provide improved viewing freedom to users and the ability to cope with moving and static objects for 3D reconstruction.
Figure 2-1: Light field describes the amount of light in radiance along light rays traveling in every direction through every point (x, y, z) in empty space, l(x, y, z, θ, φ) [Ikeu 2012].

The 7D plenoptic function is usually defined as l(V_x, V_y, V_z, θ, φ, λ, t), where (V_x, V_y, V_z) is the viewing position, and θ and φ are the azimuth and elevation angles.
There are several camera systems which are usually used in image-based rendering for capturing. For a static scene, one camera can be rotated around the camera centre at a given position V with different elevation and azimuth angles. The plenoptic function is then simplified to a panorama l_V(θ, φ). A spherical camera array can provide another panoramic representation because the captured images can be projected onto a cylinder. If a multiple-video-camera array is employed instead, a panoramic video can be obtained, and the plenoptic function is simplified to a 3D panorama l_V(θ, φ, t) for dynamic scenes. The close relationship between the plenoptic function and image-based rendering is due to McMillan and Bishop [McMi 1995], who proposed plenoptic modeling using the 5D complete plenoptic function for static scenes, l(V_x, V_y, V_z, θ, φ).
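The reduction from the full plenoptic function to a panorama can be made concrete with a small sketch: for a camera rotating about its centre, every pixel maps to a pair of angles (θ, φ), so the captured rays are exactly samples of l_V(θ, φ). The intrinsic values and the angle conventions below are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def pixel_ray_angles(K, R, u, v):
    """Back-project pixel (u, v) of a rotating camera into azimuth/elevation.

    K is the 3x3 intrinsic matrix and R the rotation of the camera about its
    centre; since the viewing position V is fixed, the plenoptic function
    reduces to a panorama l_V(theta, phi) indexed by these two angles.
    """
    d = R.T @ np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray in world frame
    d = d / np.linalg.norm(d)
    theta = np.arctan2(d[0], d[2])            # azimuth
    phi = np.arcsin(np.clip(d[1], -1.0, 1.0)) # elevation
    return theta, phi
```

With R = I, the principal point maps to (0, 0), and pixels to its right map to positive azimuth, which is the panorama coordinate used above.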
A commonly used parameterization is the two-plane parameterization.

Dimension  Year  View space       Name
7          1991  Free             Plenoptic function
5          1995  Free             Plenoptic modeling
4          1996  Bounding box     Light field/Lumigraph
3          1999  Bounding circle  Concentric Mosaics
2          1994  Fixed point      Cylindrical/Spherical panorama
captured in this way can be refocused after they have been taken. The light field video can be obtained in a similar way.

Multiple camera systems are usually used to achieve large disparity in dynamic scenes. Much research effort has been devoted to the construction of 2D camera arrays. To simplify the capturing hardware, light fields captured on line segments and circular arcs have also been reported in Section 2.2.
(2-4-1)

computed using the depth of X. Given x_r and its depth, one can compute the exact position of x_t on the target screen and transfer the color accordingly. Gaps or holes may exist due to magnification and disocclusion, and splatting techniques have been proposed to solve this problem. The painter's algorithm is frequently used to handle the case where multiple pixels from the reference view are mapped to the same pixel in the target image.
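As a concrete illustration of forward mapping and the multiple-mapping problem, the following sketch lifts each reference pixel to 3D with its depth, projects it into the target camera, and resolves collisions with a z-buffer (a common alternative to painter's-style back-to-front ordering). The pinhole parameterization and all names here are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def forward_warp(color_r, depth_r, K_r, K_t, R, t):
    """Forward-map a reference view into a target view.

    Each pixel x_r is lifted to a 3D point using its depth, projected by the
    target camera, and its colour transferred.  A z-buffer keeps the nearest
    surface when several reference pixels land on the same target pixel;
    holes caused by magnification/disocclusion are left unfilled.
    """
    h, w = depth_r.shape
    out = np.zeros_like(color_r)
    zbuf = np.full((h, w), np.inf)
    Kr_inv = np.linalg.inv(K_r)
    for v in range(h):
        for u in range(w):
            X = depth_r[v, u] * (Kr_inv @ np.array([u, v, 1.0]))  # lift to 3D
            x = K_t @ (R @ X + t)                                 # project
            if x[2] <= 0:
                continue  # behind the target camera
            ut, vt = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
            if 0 <= ut < w and 0 <= vt < h and x[2] < zbuf[vt, ut]:
                zbuf[vt, ut] = x[2]   # nearest surface wins
                out[vt, ut] = color_r[v, u]
    return out
```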
Layered techniques usually separate the scene into a group of planar layers, each consisting of a 3D plane with texture and optionally a transparency map. The layers can be thought of as a continuous set of polygonal models, which are amenable to conventional texture mapping and view-dependent texture mapping. Usually, each layer is rendered using either point-based rendering or polygon meshes, as in monolithic rendering
15
xt
xr
et
er
Ct
Cr
xt
xr
et
er
Ct
Cr
For dynamic scenes, the huge amount of data and the vast number of viewpoints to be provided present one of the major challenges to IBR. Advanced algorithms for processing and manipulating the high-dimensional representation to achieve such functions as object extraction, model completion, scene inpainting, etc. are all major challenges to be addressed. Finally, the efficient transmission, compression and display of dynamic IBR and models are also urgent issues waiting for a satisfactory solution in order for IBR to establish itself as an essential medium for communication and presentation. All of these motivate us to study the design and construction of new image-based rendering systems based on plenoptic videos. Such systems can potentially provide improved viewing freedom to users and the ability to cope with moving and static objects and perform 3D reconstruction.
2.5 Summary
In this chapter, the basic topics in image-based rendering have been reviewed. The plenoptic function, which serves as an important concept for describing visual information in our world, was introduced. Then a brief review of light fields was given. How to achieve high-quality rendering and display of light fields with a wide range of viewing positions in large-scale environments will be studied in Chapters 3 and 4. Finally, some rendering techniques, including point-based, layer-based, and monolithic methods, were discussed. An extension of these rendering techniques will be further studied in Chapter 5.
Figure 2-3: Example renderings using (a) forward mapping in point rendering [Chan 2005], (b) layered representation (with two layers, dancer and background) [Chan 2009], (c) monolithic rendering using a 3D polygonal mesh (left) and rendering results (right) [Zhu 2010].
Figure 3-1: Plenoptic videos: multiple linear camera arrays of the 4D simplified dynamic light field with viewpoints constrained along line segments. The camera arrays were developed in [Chan 2009]. Each consists of 6 JVC video cameras.
including camera calibration, color-tensor-based segmentation and matting.
Figure 3-3: Snapshots: (a) Buddha, (b) Dragon Vase.
processing. The system is built from Analog Devices DSPs and supports real-time compression at a bit rate of 400 kbps.

Before the cameras can be used for depth estimation, they must be calibrated to determine their intrinsic parameters as well as their extrinsic parameters, i.e. their relative positions and poses. This can be accomplished by using a sufficiently large checkerboard calibration pattern. We follow the plane-based calibration method [Zhan 2000] to determine the projection matrix of each camera, which connects the world coordinates and the image coordinates. The projection matrix of a camera allows a 3D point in world coordinates to be mapped to the corresponding 2D coordinate in the image captured by that camera. This will facilitate depth estimation. Fig. 3-6 shows snapshots of the outdoor and indoor videos captured by the proposed system, called podium and presentation, respectively. The resolution of these real-scene videos is 1920×1080i at 25 frames per second (fps) in 24-bit RGB format. The system flow of the proposed moveable IBR system is summarized in Fig. 3-7. First, we stabilize the video to reduce the shaky motion frequently encountered in typical moveable IBR systems. Then, a novel view synthesis algorithm using a new segmentation and mutual-information (MI)-based algorithm for dense depth map estimation is used to iteratively estimate the depth map. Finally, we reconstruct the 3D model using a new 3D reconstruction algorithm which utilizes the sequential-structure-from-motion (S-SFM) technique and the dense depth maps estimated previously. A new robust radial basis function (RBF)-based modeling algorithm is used to further suppress possible outliers and generate smooth 3D meshes of objects.
Figure 3-6: Snapshots of the plenoptic videos at a given time instance: (a) the Podium outdoor video from camera 1 to camera 4, and (b) the Presentation indoor video from camera 1 to camera 4.
Figure 3-7: System flow of the proposed moveable IBR system: video stabilization, segmentation-MI-based depth estimation, depth map refinement, 3D reconstruction, and image-based rendering.
3.3 Pre-Processing
3.3.1 Still Camera System
In order to speed up the whole processing procedure on the proposed still camera system, some pre-processing needs to be done at the start. First, all the cameras need to be calibrated. Because this system is stationary, the intrinsic and extrinsic parameters of the cameras can be obtained precisely by following the plane-based calibration method. The proposed still camera system will only focus on the objects we are interested in. The objects will be segmented out of the images to reduce the noise from the background.
3.3.1.1 Camera Calibration
In computer vision, the camera parameters link 3D real-world points to image pixels. The camera parameters comprise the extrinsic parameters and the intrinsic parameters. Estimation of the extrinsic and intrinsic parameters is called camera calibration [Truc
P_c = R(P_w − T) .    (3-3-1)

Figure 3-8: Relationship between the world coordinate (X_W, Y_W, Z_W) and the camera coordinate (X_c, Y_c, Z_c), related by the rotation R and translation T.

C = [ f_x   0   c_x
       0   f_y  c_y
       0    0    1 ] ,    (3-3-2)

[x_c, y_c, z_c]^T = C R [I | −T] [X_W, Y_W, Z_W, 1]^T ,    (3-3-3)

(3-3-4)
(3-3-5)
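A minimal numerical check of the projection pipeline (3-3-1)-(3-3-3) can be written directly; the focal lengths and principal point below are placeholder values, not those of the actual cameras.

```python
import numpy as np

def project(C, R, T, Xw):
    """Project a world point to pixel coordinates following (3-3-1)-(3-3-3).

    The point is first expressed in the camera frame, P_c = R(P_w - T),
    then mapped through the intrinsic matrix C and de-homogenised.
    """
    Pc = R @ (Xw - T)        # world -> camera frame, eq. (3-3-1)
    x = C @ Pc               # camera frame -> image plane, eqs. (3-3-2)/(3-3-3)
    return x[:2] / x[2]      # de-homogenise to (u, v)
```

For a camera at the origin looking down the Z axis, a point on the optical axis projects to the principal point (c_x, c_y), which is a quick sanity check of any calibration result.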
f_x^T f_y ,  f_y^T f_y ,    (3-3-6)

(3-3-7)

s_1 = (f_x^T f / ||f||) f / ||f|| .    (3-3-8)

h = (c_i − c_b) / ||c_i − c_b|| .    (3-3-9)

H = (f_x^T h / ||h||) h / ||h|| .    (3-3-10)
location, greatly improves the visual quality of mixing the objects onto other backgrounds.

Figure 3-10: (a) Extraction results using the color-tensor-based method. Left: original; middle: hard segmentation; right: after matting. (b) Close-up of the segmentations in (a). Left: hard segmentation; right: after matting.
that the ground surfaces may not be smooth and the whole mechanical structure can vibrate considerably during movement. In our M-IBR system, the shaky motion of the camera array in the outdoor environment comes mainly from the roughness of the ground surfaces and the vibration of the mechanical structure during movement. Besides, the captured video may also appear shaky when the system is moving and about to settle down in an indoor environment. To reduce these annoying effects, video stabilization [Hu 2007], [Mats 2005], [Mats 2006], [Rata 1998] is frequently employed to eliminate the undesired motion fluctuation in the captured videos.

As mentioned above, our M-IBR system was driven steadily during capturing. Therefore, the undesired motion fluctuation will usually appear as high-frequency components compared to the intentional motion. As a result, the problem of video stabilization can also be viewed as the removal of high-frequency components in the estimated velocity. To this end, one needs to estimate the global motion of the camera, say by means of optical flow on the video sequence, so that this annoying high-frequency local motion can be removed to stabilize the videos.
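The high-frequency removal described above can be sketched as a Gaussian low-pass filter applied to the estimated motion path; the kernel width below is an illustrative choice, and the [Mats 2005]-style smoothing the thesis later compares against differs in detail.

```python
import numpy as np

def smooth_motion(path, sigma=5.0):
    """Low-pass a 1-D camera motion path with a Gaussian kernel.

    The intentional (low-frequency) motion survives while high-frequency
    shake is removed; the per-frame stabilizing correction is then the
    difference between the raw and smoothed paths.  sigma is in frames.
    """
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()                                  # unit-gain kernel
    padded = np.pad(path, radius, mode='edge')    # avoid boundary shrinkage
    return np.convolve(padded, k, mode='valid')
```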
The proposed video stabilization algorithm is divided into three major steps as follows. 1) Global motion estimation: firstly, the geometric transformation between a location x = [x_1, x_2]^T in a frame and that in an adjacent frame, x', is modeled by an affine transformation x' = Ax + t with A = [a_1 a_2; a_3 a_4]. In homogeneous coordinates, x_h = [x_1, x_2, 1]^T, and the transformation can be written as

T_h = [ A  t
        0  1 ] .    (3-3-11)

T_h is estimated from the tracked features in adjacent video frames.

G(x) = (√(2π) σ)^{-1} e^{−x²/(2σ²)}    (3-3-12)
Given the samples Y_i, the function m(x) can be locally approximated around x_0 by a Taylor expansion:

m(x) ≈ m(x_0) + m'(x_0)(x − x_0) + (m''(x_0)/2!)(x − x_0)² + … + (m^(p)(x_0)/p!)(x − x_0)^p
     = β_0 + β_1(x − x_0) + … + β_p(x − x_0)^p ,    (3-3-13)

where x is in the neighborhood of x_0 and β_k (k = 0, 1, …, p) is the k-th polynomial coefficient. The coefficient vector β = [β_0, β_1, …, β_p]^T at location x_0 can be obtained by solving the following weighted least-squares (WLS) regression problem:

min_β Σ_{i=1}^{n} K_h(X_i − x_0) [Y_i − Σ_{k=0}^{p} β_k (X_i − x_0)^k]² ,    (3-3-14)

where

X = [ 1  (X_1 − x_0)  …  (X_1 − x_0)^P
      1  (X_2 − x_0)  …  (X_2 − x_0)^P
      ⋮
      1  (X_n − x_0)  …  (X_n − x_0)^P ] ,    (3-3-15)

y = [Y_1, Y_2, …, Y_n]^T, the kernel K(u) = (3/4)(1 − u²), and the bandwidth parameter set for R-ICI
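The WLS problem (3-3-14) has a closed-form solution; the sketch below fits the polynomial coefficients at one location using the Epanechnikov kernel quoted above. Function and variable names are illustrative, and β_0 is the regression estimate of m(x_0).

```python
import numpy as np

def lpr_fit(X, Y, x0, p=2, h=1.0):
    """Local polynomial regression at x0, solving (3-3-13)-(3-3-14).

    Builds the polynomial design matrix in powers of (X - x0), weights the
    samples with the Epanechnikov kernel K(u) = (3/4)(1 - u^2), and returns
    the coefficient vector beta; beta[0] estimates m(x0).
    """
    u = (X - x0) / h
    w = np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)  # kernel weights
    D = np.vander(X - x0, N=p + 1, increasing=True)      # [1, (X-x0), ...]
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * D, sw * Y, rcond=None)
    return beta
```

Fitting an exactly quadratic sample set recovers the true coefficients, which is a convenient unit test for any LPR implementation.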
Figure 3-11: Motion smoothing results for horizontal (Translation-x) and vertical (Translation-y) directions. The original motion path and the smoothed motion paths with different methods are shown. In (a)-(b), the blue dotted lines correspond to the shaky original motion path. Green and black lines correspond to the smoothed motion paths using the method in [Mats 2005] with a small and a large kernel size, respectively.

Figure 3-12: Video Stabilization result. The first row shows the original images captured by our system; the second row shows the stabilized images without video completion; the third row shows the completed results.
3.4 Summary
The design and construction of the two proposed IBR systems and their associated processing flows have been presented. The first, still IBR system is designed for capturing ancient Chinese artifacts. This system can be used for the preservation and dissemination of cultural artifacts at high digital quality. The second, moveable IBR system is designed for moving objects and larger environments, and can potentially provide improved viewing freedom to users. Moreover, some pre-processing techniques, including camera calibration, color-tensor-based segmentation and matting, were presented. After camera calibration, we can obtain the camera matrix for subsequent 3D reconstruction. Color-tensor-based segmentation is insensitive to shadow and shading. Natural matting can be adopted to improve the rendering quality when objects are mixed onto other backgrounds. Experimental results show that these algorithms work well with some shadow and shading. These pilot studies provide useful experience for the design and construction of similar and more general IBR systems.
Graph Cuts (GC)-based [Boyk 2001] and Belief Propagation
Figure 4-1: Segmentation results using the level-set-based tracking method. (a) is the initial segmentation obtained by lazy snapping; (b) is the initial segmentation obtained by the graph cut method.
as

I(X; Y) = ∫_Y ∫_X p(x, y) log [ p(x, y) / (p_X(x) p_Y(y)) ] dx dy ,

where p(x, y) is the joint pdf. It can also be computed from the entropies of X and Y as follows:

I(X; Y) = H(X) + H(Y) − H(X, Y) ,    (4-2-1)

H(p_A(i_A)) = −∫_{I(A)} p_A(i_A) log p_A(i_A) di_A ,    (4-2-2)

H(p_{T(B)}(i_B)) = −∫_{I(B)} p_{T(B)}(i_B) log p_{T(B)}(i_B) di_B
               = −∫_{I(A),I(B)} p_{A,T(B)}(i_A, i_B) log p_{T(B)}(i_B) di_A di_B ,    (4-2-3)

H(p_{A,T(B)}(i_A, i_B)) = −∫_{I(A),I(B)} p_{A,T(B)}(i_A, i_B) log p_{A,T(B)}(i_A, i_B) di_A di_B ,    (4-2-4)

where i_A and i_B are the intensity variables of A and T(B) and their ranges are I(A) and I(T(B)) = I(B), respectively. The latter follows from the fact that T(·) is a disparity transformation which does not change the range of the intensity values. To proceed further, one needs to determine the corresponding pdfs. A powerful method for approximating the pdfs is the kernel method [Silv 1986], which approximates the pdfs directly from the image data as follows:

p_A(i_A) = (1/V) ∫ G_1(i_A − I_A(x_1, x_2)) dx_1 dx_2 ,    (4-2-5)

p_{T(B)}(i_B) = (1/V) ∫ G_1(i_B − I_{T(B)}(x_1, x_2)) dx_1 dx_2 ,    (4-2-6)

p_{A,T(B)}(i_A, i_B) = (1/V) ∫ G_2(i_A − I_A(x_1, x_2), i_B − I_{T(B)}(x_1, x_2)) dx_1 dx_2 ,    (4-2-7)

where G_1(x) = (√(2π) σ)^{-1} e^{−x²/(2σ²)}, G_2(x_1, x_2) = (2π σ_1 σ_2)^{-1} e^{−(1/2)(x_1²/σ_1² + x_2²/σ_2²)}, and I_A(x_1, x_2) and

I(A; T(B)) = ∫ p_{A,T(B)}(i_A, i_B) log [ p_{A,T(B)}(i_A, i_B) / (p_A(i_A) p_{T(B)}(i_B)) ] di_A di_B .    (4-2-8)
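A discrete sketch of the MI computation in (4-2-8): instead of the kernel density estimates (4-2-5)-(4-2-7), a joint intensity histogram stands in for the pdfs, which is a common practical approximation; the bin count is an arbitrary choice.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram estimate of the mutual information between two patches.

    The joint intensity histogram is normalised into p(i_A, i_B), the
    marginals are obtained by summing over rows/columns, and
    I = sum p log(p / (p_A p_B)) is accumulated over non-empty cells.
    """
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)   # marginal of the first patch
    pb = p.sum(axis=0, keepdims=True)   # marginal of the second patch
    mask = p > 0                        # 0 log 0 terms contribute nothing
    return float((p[mask] * np.log(p[mask] / (pa @ pb)[mask])).sum())
```

MI of a patch with itself equals its entropy, and MI against a constant patch is zero, which makes both easy sanity checks for a matching cost.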
which are indexed by (μ, ν) ∈ [0, 3], as:

P(x_1, x_2) = Σ_{μ=0}^{3} Σ_{ν=0}^{3} β_μ(l) β_ν(v) P_c(m + μ, n + ν) ,    (4-2-9)

Substituting the transformation function T_L(B') and (4-2-9) into (4-2-8), one gets the local matching data term to be minimized:

I(A; T_L(B')) = ∫ p_{A,T_L(B')}(i_A, i_B) log [ p_{A,T_L(B')}(i_A, i_B) / (p_A(i_A) p_{T_L(B')}(i_B)) ] di_A di_B .    (4-2-10)

(4-2-11)
(4-2-12)

The limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) … information (Fig. 4-3(a)). Moreover, the depth
Figure 4-3: (a) is an example depth map obtained by using MI matching without segmentation information; (b) shows the depth map obtained by using automatic segmentation MI matching; (c) shows the depth map obtained by using semi-automatic segmentation MI matching. Green areas in (c) are the occlusion areas detected by our algorithm. (d)-(e) show the refined depth maps of (c) by inpainting and smoothing (c) using SK-LPR-R-ICI and a 25×25 ideal low-pass filter, respectively.
(4-3-1)

m(X; x) = Σ_{0 ≤ k_1 + k_2 ≤ 2} β_{k_1, k_2} Π_{j=1}^{2} (X_j − x_j)^{k_j} ,    (4-3-2)

(4-3-3)
(4-3-4)

X_x = [ 1  (X_1 − x)^T  vech^T{(X_1 − x)(X_1 − x)^T}
        1  (X_2 − x)^T  vech^T{(X_2 − x)(X_2 − x)^T}
        ⋮
        1  (X_n − x)^T  vech^T{(X_n − x)(X_n − x)^T} ] ,
Y = [Y_1, Y_2, …, Y_n]^T ,    (4-3-5)
Figure 4-4: (a) and (b) show the renderings obtained from Figs. 4-2(d) and (b); (c) and (d) are the enlargements of the red boxes.

Comparing Fig. 4-3(e) with Fig. 4-3(d), we can see that the discontinuity of the object boundaries using SK-LPR-R-ICI is well preserved, while the object boundaries are blurred by the low-pass filter due to its fixed size and relatively large support for noise suppression. In order to illustrate the effect of these errors in the depth maps on the rendering quality, example renderings are also shown in Figs. 4-4(a) and (c) according to the depth maps obtained from Figs. 4-2(d) and (b), respectively. It can be seen that inaccurate depth values produce obvious distortion of the light pole in Figs. 4-4(c) and (d). By combining this
Figure 4-5: Rendering results obtained by the proposed algorithm. (a) shows the depth maps corresponding to the images in (b). The highlighted images in (b) show the rendered views from the adjacent views in (b) using the depth maps in (a). (c) shows depth maps at other positions.
Figure 4-6: Example rendering results. The first row shows the original images captured by our M-IBR system. The second and third rows show renderings with a step-in ratio of about 1.15 to 1.25 times.

The values of pixels in the binary mask are 1 if they are inside the object; conversely, they are 0 if they are outside the object. Only the part inside the object will take part in the MI optimization. More precisely, we use T_L(B') · Mask(x_1, x_2). The solution procedure is the same as for the MI-based matching. Example video tracking results are shown in Fig. 4-7. More results are shown in Section 4.5.
The resolution and the frame rate of the videos are 1920×1080i and 25 fps, respectively. A demonstration video of our video stabilization algorithm can be found at: http://www.youtube.com/watch?v=qPuMNjgUoWs.

The segmentation-MI-based matching algorithm has been evaluated extensively on the stereo test image sets at the Middlebury stereo page and on our outdoor plenoptic video podium. Fig. 4-8 shows the stereo images, ground-truth depth map and depth maps calculated by our method for the Teddy test images (450×375) [Scha 2002]. Table 4-1 is a reproduction of the upper part of the evaluation at the Middlebury stereo pages. The evaluation value in the table is the percentage of "bad" pixels, i.e., pixels whose absolute disparity error is greater than some threshold. A standard threshold of 1 pixel has been used in Table 4-1. The segmentation-MI-based matching is among the best performing stereo algorithms in the upper part of the table, with the semi-automatic and automatic versions ranking fourth and sixth, respectively. The performance difference between our algorithm and the top algorithm is very small. Moreover, our algorithm is very stable and insensitive to versatile data sets such as real data sets, and it does not have many parameters that need to be selected carefully.
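The "bad pixel" score quoted from the Middlebury evaluation is straightforward to compute; this sketch assumes dense float disparity maps and an optional validity mask (the mask handling is an assumption, and the real evaluation additionally distinguishes occluded regions).

```python
import numpy as np

def bad_pixel_percentage(disp, truth, threshold=1.0, mask=None):
    """Percentage of pixels whose absolute disparity error exceeds `threshold`.

    This is the Middlebury-style score reported in Table 4-1, with 1 pixel
    as the standard threshold.
    """
    err = np.abs(disp.astype(float) - truth.astype(float))
    if mask is None:
        mask = np.ones_like(err, dtype=bool)    # evaluate every pixel
    bad = np.count_nonzero(err[mask] > threshold)
    return 100.0 * bad / np.count_nonzero(mask)
```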
Figure 4-8: Teddy test images [Scha 2002] and depth maps for comparison. (a) LEFT image; (b) RIGHT image; (c) ground-truth depth map; (d) depth map calculated by semi-automatic segmentation-based MI matching; (e) depth map calculated by automatic segmentation-based MI matching.
Algorithm                      Rank  Tsukuba  Venus  Teddy  Cones
Adapting BP [Klau 2006]         6.7    1.37    0.21   7.06   7.92
CoopRegion [Wang 2008]          6.7    1.16    0.21   8.31   7.18
DoubleBP [Yang 2009]            8.8    1.29    0.45   8.30   8.78
Our Method (Semi-Seg)           9.6    1.30    0.18   5.10   8.88
OutlierConf [Xu 2008]           9.8    1.43    0.26   9.12   8.57
Our Method (Auto-Seg)          12.2    1.30    0.24   7.91   8.88
SubPixDoubleBP [Yang 2007]     13.2    1.76    0.46   8.38   8.73
SurfaceStereo [Bley 2010]      13.8    1.65    0.28   5.10   7.95

Table 4-1: Comparison of the ranks using a standard threshold of 1 pixel on the Middlebury test stereo images.
Figure 4-9: Results for the conference. (a) and (c) are two sample frames. (b) and (d) are the depth maps of (a) and (c), respectively.

Figure 4-10: Ultrasound images of the RF muscle under the relaxed condition and at the 50% maximal voluntary contraction (MVC) level, and the corresponding images with outlined boundary contours. The tracked boundaries are highlighted in green.
4.6 Summary
A new iterative segmentation and mutual-information (MI)-based algorithm for dense depth map estimation has been presented. It supports both semi-automatic and automatic segmentation methods, which rank
be found at http://vision.middlebury.edu/mview/eval/. A disadvantage of many algorithms is that too many views are needed in order to construct a high-resolution model, and the time required is very high. In this chapter, a new 3D reconstruction algorithm for objects, which uses dense depth maps to obtain dense point correspondences from multiple views for 3D reconstruction, is presented. In order to perform 3D reconstruction, the camera calibration presented in Chapter 3.3.1 is first performed to determine the intrinsic parameters as well as the extrinsic parameters, i.e. the relative positions and poses of the cameras. For the proposed moveable IBR system, the structure-from-motion
New segmentation-MI-based iterative algorithms and Kalman filter (KF)-based … are proposed to fuse the
[a, b, c]^T. Because the vectors [a, b, c]^T and k[a, b, c]^T can represent the

(5-2-1)

x_2 = P_2 X ,    (5-2-2)

(5-2-3)

x_2 × (P_2 X) = 0 ,    (5-2-4)

(5-2-5)

A = [ x_1 p_1^{3T} − p_1^{1T}
      y_1 p_1^{3T} − p_1^{2T}
      x_2 p_2^{3T} − p_2^{1T}
      y_2 p_2^{3T} − p_2^{2T} ] ,    (5-2-6)
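The null-space solution of the linear system built from (5-2-4)-(5-2-6) can be sketched directly with an SVD; the camera matrices in the usage check are synthetic, not the calibrated ones from the actual rig.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of a 3D point from two views.

    Each image point contributes two rows of the form x * p^{3T} - p^{iT},
    as in (5-2-6); the homogeneous point X is the right null vector of A,
    taken from the last row of V^T in the SVD.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # de-homogenise
```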
Figure 5-1: Epipolar Geometry.

(5-3-1)

F x_2 = L_1 .    (5-3-2)

(5-3-3)
(5-3-4)
(5-3-5)

ρ_i = cos(f, d(x)) ⟨Ĩ_{1,f}, Ĩ*_{2,f}⟩ / |⟨Ĩ_{1,f}, Ĩ*_{2,f}⟩| ,    (5-3-6)

less than the threshold value ε or all the views have been searched. In our experiments the threshold value is 0.5. The final averaged confidence for the correspondent points which pass the threshold for all views is

C̄ = (1/(N − 1)) Σ_{i=1}^{N−1} C(A_i, A_{i+1}) ,    (5-3-7)
Figure 5-2: Feature point detection. The red points in (a) and (b) are the feature points. (a) is from the first camera; (b) is from the second camera.

Figure 5-4: Rectified images. (a) is the rectified left image; (b) is the rectified right image. (c) is part of (a); (d) is part of (b).

Figure 5-5: An initial point cloud extracted with noise and outliers.
The alpha maps are made by matting as in Chapter 3.3.2. The basic function of view-dependent texture is:

Tex(x_i, y_i) = Tex_1(x_i, y_i) + Tex_2(x_i, y_i) ,    (5-4-1)

Tex(x_i, y_i) = Tex_1(x_i, y_i) + Tex_2(x_i + Δx_i, y_i + Δy_i) ,    (5-4-2)

(5-4-3)

In this way, the shift values (Δx_i, Δy_i) for each texture coordinate will be modified gradually according to the user's viewing angle. The right-hand side of Fig. 5-7 shows an example of improved texture using (5-4-3). Consequently, audiences will have a better visual experience because the textures change when users observe the object from different directions.
Figure 5-7: View-dependent texture. Left: blurred texture. Right: texture after applying the proposed view-dependent texture.
Experimental Results
Using the circular still camera array that we constructed in Chapter 3.2.1, we have captured 7 ancient Chinese artifacts from the University Museum and Art Gallery at the University of Hong Kong. Fig. 5-8 shows several examples of the reconstructed 3D models of the artifacts captured. The artifacts can also be displayed on a Newsight multiview TV which can display 9 views at a time. In order to improve the final rendering effect, the 3D models are put into a scene with lighting. Lighting and shadows make the objects more realistic to people. Although OpenGL provides several APIs to simulate basic lighting, it is only suitable for ideal point sources and it cannot generate shadows. Therefore, we use the Shadow Field method [Zhou 2005] to relight the scene. The basic idea of this algorithm is similar to Pre-computed Radiance Transfer [Sloa 2002]. The environment map is used as the global incident light, which makes the scene much more realistic. The source radiance fields for lights and the object occlusion fields for occluders in the scene are generated. Hence, the interaction of all the objects in the scene can be calculated in real time, i.e. the soft shadows of moving objects can be generated quickly. All the data, including the visibility function of each vertex, are compressed considerably with low error by using spherical harmonics.

The testing is performed on an Intel Core i7-990X CPU-based computer with 4GB RAM and GTX580 GPU acceleration. The rendering and display are accelerated to a speed of 60 frames per second.
Fig. 5-9 gives example renderings using the reconstructed 3D model and the shadow field [Zhou 2005] in OpenGL, which supports real-time relighting and object movement with soft shadows.
Figure 5-8: 3D models of Ancient Chinese Artifacts. (a) Dragon Vase, (b) Buddha, (c) Green Bottle, (d) Bowl, (e) Brush Pot, (f) Tri-Pot, (g) Wine Glass.
from multiple views are obtained. For 3D
Figure 5-10: Iterative refinement of the point cloud: (a) initial point cloud; (b) point cloud after outlier detection and Kalman filtering; (c) point cloud after the proposed iteration method.
x(t) = F(t) x(t − 1) + w(t) ,    (5-5-1)

z(t) = H(t) x(t) + v(t) ,    (5-5-2)

where w(t) and v(t) are zero-mean noise processes with covariance matrices Q_w(t) ∈ R^{n×n} and R(t) ∈ R^{m×m}, respectively.

x̂(t/t−1) = F(t) x̂(t−1/t−1) ,    (5-5-3)

P(t/t−1) = F(t) P(t−1/t−1) F^T(t) + Q_w(t) ,    (5-5-4)

e(t) = z(t) − H(t) x̂(t/t−1) ,    (5-5-5)

K(t) = P(t/t−1) H^T(t) [H(t) P(t/t−1) H^T(t) + R(t)]^{-1} ,    (5-5-6)

x̂(t/t) = x̂(t/t−1) + K(t) e(t) ,    (5-5-7)

P(t/t) = [I − K(t) H(t)] P(t/t−1) ,    (5-5-8)

(5-5-9)
(5-5-10)
(5-5-11)

K(i) = P(i/i−1) [P(i/i−1) + r I_3]^{-1} ,    (5-5-12)

x̂(i/i) = x̂(i − 1) + K(i) e(i) ,    (5-5-13)

P(i/i) = [I_3 − K(i)] P(i/i−1) ,    (5-5-14)
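One predict/correct cycle of (5-5-3)-(5-5-8) can be written compactly as below; the constant-position model and noise levels in the usage check are illustrative, while the simplified gain (5-5-12) further assumes H = I and R = rI_3.

```python
import numpy as np

def kf_update(x, P, z, F, H, Q, R):
    """One Kalman filter predict/correct cycle, following (5-5-3)-(5-5-8).

    x, P are the previous state estimate and covariance, z the new
    measurement.  Returns the updated estimate, covariance and the
    innovation e, which the outlier tests above threshold.
    """
    x_pred = F @ x                                  # (5-5-3) state predict
    P_pred = F @ P @ F.T + Q                        # (5-5-4) covariance predict
    e = z - H @ x_pred                              # (5-5-5) innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)             # (5-5-6) gain
    x_new = x_pred + K @ e                          # (5-5-7) correct
    P_new = (np.eye(len(x)) - K @ H) @ P_pred       # (5-5-8)
    return x_new, P_new, e
```

Iterating the update on a static 3D point drives the estimate to the measured location, which mirrors how repeated observations refine a triangulated point here.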
i) Segmentation consistency: at the i-th iteration, z(i) is re-projected back to a 2D point x_i = P_i z(i) in view i, where P_i is the camera projection matrix of view i, which contains the intrinsic and extrinsic parameters. For notational convenience, we have dropped the additional subscript t denoting the t-th time instant. Due to errors in computing the projection matrices and triangulation, x_i may lie outside the segment it belongs to. In this case, z(i) is considered an outlier.

ii) Location consistency: the 3D distance between z(i) and the predicted location of the KF, x̂(i), should be relatively small. That is, ||z(i) − x̂(i)|| ≤ D for some constant D. If not, it is treated as an outlier.
Σ_{j=1}^{N} D(p_j, p̃_j) ,    (5-5-15)
the statue in Fig. 5-11. Fig. 5-10(c) shows the final point cloud obtained by refining the one in Fig. 5-10(b). Considerable improvement in terms of the smoothness and number of matched points is observed, which demonstrates the effectiveness of the proposed iterative refinement approach.
(a)
(b)
(c)
Figure 5-11: (a)-(b) shows the 3D to 2D re-projection at frame 20 and frame 21,
respectively. Blue points are inliers. Green points are outliers detected by the
segmentation consistency check. Red points are the outliers detected by intensity and
location consistency checks. (c) shows the enlargement of the highlight area in (a).
The point cloud is down-sampled for better visualization.
Figure 5-12: Convergence behavior of the root mean square distance (RMSD) versus the number of iterations for the proposed iterative 3D reconstruction algorithm. The blue line shows the RMSD values with KF-based outlier detection; the red line shows the RMSD values without it.
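The RMSD plotted in Figure 5-12 can be computed as below; a minimal sketch under the assumed reading that Eq. (5-5-15) accumulates distances D(p_j, \hat{p}_j) between corresponding points and their refined estimates:

```python
import numpy as np

def rmsd(P, P_hat):
    """Root mean square distance between corresponding 3D points.
    P and P_hat are (N, 3) arrays of matched point positions."""
    d = np.linalg.norm(P - P_hat, axis=1)    # per-point distances
    return np.sqrt(np.mean(d ** 2))
```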
get a good mesh. Therefore, further smoothing of the raw 3D point cloud is necessary. In this thesis, we employ RBF-based modeling for the smoothing and the construction of the 3D mesh. The basic form of an RBF is:
F(x) = \sum_{j=1}^{m} c_j p_j(x) + \sum_{i=1}^{n} \lambda_i \phi(\|x - x_i\|),  (5-6-1)
[Beat 2004]:
P^{T}\lambda = 0,  (5-6-2)
\begin{bmatrix} A & P \\ P^{T} & 0 \end{bmatrix} \begin{bmatrix} \lambda \\ c \end{bmatrix} = \begin{bmatrix} f \\ 0 \end{bmatrix},  (5-6-3)
O(N \log N). The basic idea of this fast RBF algorithm is that exact interpolation is not needed in practice. Consequently, the value of F(x_i) is only required to lie in an acceptable range to achieve a given accuracy. In this thesis, we also make use of this property to remove any remaining outliers. More precisely, we set an error bar for the RBF values: -\epsilon_{i1} \le F(x_i) \le \epsilon_{i2}, where for simplicity we set \epsilon_{i1} = \epsilon_{i2} = \epsilon_i. Consequently, the problem becomes:
minimize \lambda^{T} A \lambda,
subject to |F(x_i)| \le \epsilon_i,  P^{T}\lambda = 0,  (5-6-4)
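The block linear system of Eq. (5-6-3) can be solved densely for small point sets, as in the sketch below. This assumes a biharmonic kernel phi(r) = r with a linear polynomial term; the thesis instead relies on a fast approximate solver with the error-bar relaxation above, so this is only an illustration of the underlying system:

```python
import numpy as np

def fit_rbf(X, f):
    """Solve the exact RBF interpolation system of Eqs. (5-6-2)-(5-6-3)
    for centers X (n x 3) and values f (n,), using phi(r) = r and a
    linear polynomial p = (1, x, y, z). Returns the interpolant F."""
    n = X.shape[0]
    # Kernel matrix A_ij = phi(||x_i - x_j||) = ||x_i - x_j||
    A = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Polynomial block P_ij = p_j(x_i)
    Pm = np.hstack([np.ones((n, 1)), X])
    m = Pm.shape[1]
    # Block system [[A, P], [P^T, 0]] [lam; c] = [f; 0]
    M = np.block([[A, Pm], [Pm.T, np.zeros((m, m))]])
    sol = np.linalg.solve(M, np.concatenate([f, np.zeros(m)]))
    lam, c = sol[:n], sol[n:]

    def F(x):
        r = np.linalg.norm(X - x, axis=1)
        return r @ lam + np.concatenate([[1.0], x]) @ c
    return F
```

The side condition P^T lambda = 0 appears as the lower block rows; without it the system would be underdetermined for the polynomial coefficients.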
Figure 5-13: 3D reconstruction results (a) without using RBF, (b) using RBF without outlier detection, and (c) using RBF with outlier removal.
Experimental Results
Figs. 5-14 and 5-15 show example renderings using the 3D model and shadow field in OpenGL, which supports real-time relighting. Since the other side of the object in Fig. 5-15 is invisible, only part of the object can be recovered.
Using the circular still camera array that we constructed in Section 3.2.1, we have captured 7 ancient Chinese artifacts from the University Museum. More rendering results can be found in our demonstration video at http://www.youtube.com/watch?v=hZHW5XS9xAg. Moreover, the SK-LPR-RICI method can be applied to the depth map to estimate a smooth gradient field. Combining this gradient field with the depth map, a normal field corresponding to the 2D image can be approximated, which can be used to perform real-time 2D relighting. A demonstration video of the 2D relighting results can be found at http://www.youtube.com/watch?v=5LRdPgnWapo.
Figure 5-15: Object-based rendering results of the conference sequence. (a) and (b) are the 3D reconstruction results at two time instants. (c) and (d) are the rendering results of (a) and (b). Note that only partial geometry of the dynamic object is recovered, since it is only partially observable.
5.7 Summary
In this chapter, point matching refinement, 3D reconstruction and rendering methods for the static and moveable IBR systems are presented. The first, for the static IBR system, relies on the epipolar constraint and SIFT feature detection to find a set of sparse corresponding points. A Gabor filter-based measurement, which is insensitive to noise, is then used to obtain more reliable corresponding points for interpolating the disparity maps. The other, for the moveable IBR system, is more involved: it uses the sequential structure-from-motion (S-SFM) technique and the densely estimated depth maps to obtain the point cloud for 3D reconstruction. A new iterative point cloud refinement algorithm based on the Kalman filter (KF) for outlier removal and a segmentation-MI-based algorithm for further refining the
Appendix I Publications
Journal Papers:
[1] Z. Y. Zhu, S. Zhang, S. C. Chan and H. Y. Shum, Object-Based Rendering and 3D Reconstruction Using a Moveable Image-Based System, IEEE Trans. Circuits and Syst. Video Technol., to be published, 2012.
[2] X. Chen, Y. P. Zheng, J. Y. Guo, Z. Y. Zhu, S. C. Chan and Z. G. Zhang, Sonomyographic responses during voluntary isometric ramp
Conference Papers:
[1] Z. G. Zhang, S. C. Chan and Z. Y. Zhu, A New Two-Stage Method for Restoration of Images Corrupted by Gaussian and Impulse Noises using Local Polynomial Regression and Edge Preserving Regularization, in Proc. IEEE Int. Symp. Circuits Syst., May 2009, pp. 948-951.
[2] K. T. Ng, Z. Y. Zhu and S. C. Chan, An Approach to 2D-to-3D Conversion for Multiview Displays, in Proc. IEEE Int. Conf. Info., Commu. Signal Processing, Dec. 2009, pp. 1-5.
[3] S. C. Chan, Z. Y. Zhu, K. T. Ng, C. Wang, S. Zhang, Z. G. Zhang, Z. F. Ye and H. Y. Shum, A moveable image-based system and its applications to multiview audio-visual conferencing, in Proc. IEEE Int. Symp. Commu. Info. Technol., Tokyo, Japan, 2010, pp. 1142-1145.
[4] Z. Y. Zhu, K. T. Ng, S. C. Chan and H. Y. Shum, Image-Based Rendering of Ancient Chinese Artifacts for Multi-view Displays using a Multi-Camera Array Approach, in Proc. IEEE Int. Symp. Circuits Syst., May 2010, pp. 948-951.
[5] X. Z. Yao, S. C. Chan, Z. Y. Zhu, K. T. Ng and H. Y. Shum, Image-Based Compression, Prioritized Transmission and Progressive
References
[Adel 1991]
[Agra 2000]
[Atta 1999]
[Beat 2001]
[Beat 2004]
109
[Boyk 2001]
[Bueh 2001]
[Chai 2000]
[Chan 2003]
[Chan 2004]
[Chan 2005]
[Chan 2009]
[Chan 2010]
[Chan 2012]
[Chen 1993]
[Chen 2012]
[Debe 1996]
[Debe 1998]
Eurographics Workshop on
[Egna 2002]
[Flie 2007]
inter-image similarities, IEEE Signal
[Gan 2005]
[Gers1939]
[Gort 1996]
[Guo 2008]
[Hu 2007]
[Huan 2006]
[Ikeu 2003]
[Ikeu 2012]
[IVS]
www.ivs-tech.com
[Klau 2006]
[Kolm 2002]
[Konr 2007]
[Kutu 2000]
[Lalo 1999]
[Levo 1996]
[Levo 2002]
[Lhui 2003]
[Li 2004]
[Lian 2008]
Programmable aperture photography:
[Lowe 2004]
D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision, vol. 60, no. 2, pp. 91-110, 2004.
[Magn 2000] M. Magnor and B. Girod, Data compression for light-field rendering, IEEE Trans. Circuits and Syst. Video Technol., vol. 10, no. 3, pp. 338-343, Apr. 2000.
[Magn 2003] M. Magnor, P. Ramanathan and B. Girod, Multi-view
coding for image-based rendering using 3-D scene
geometry, IEEE Trans. Circuits and Syst. Video
Technol., vol. 13, no. 11, pp. 1092-1106, Nov. 2003.
[Mane 2000]
[Mats 2006]
[Ng 2009]
[Ng 2010]
K. T. Ng, Q. Wu, S. C. Chan and H. Y. Shum, Object-Based Coding for Plenoptic Videos, IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 4, pp. 548-562, Apr. 2010.
[Ng 2012]
K. T. Ng, Z. Y. Zhu, C. Wang, S. C. Chan and H. Y. Shum,
[Ogal 2007]
[Pele 1997]
[Pete 2001]
[PGDT]
PG Drive VR2 Controller, URL: http://www.pgdt.com/products/vr2/index.html
[Rama 2005] R. Ramamoorthi, D. Mahajan and P. Belhumeur, A first-order analysis of lighting, shading, and shadows, ACM Trans. Graphics, vol. 26, no. 1, Jan. 2007.
[Rata 1998]
[Sand 1987]
[Scha 2002]
[Shad 1998]
light field: An interactive image-based
[Slab 2004]
179-199, 2004.
[Sloa 2002]
[Sun 2005]
[Szel 1997]
[Szel 1999]
[Tagu 2008]
[Tayl 2003]
[Tong 2003]
[Torr 2000]
[Truc 1998]
[USBI]
URL: http://www.devasys.com/usbi2cio.htm
[Veer 2007]
[Xu 2008]
[Yang 2007]
Q. Yang, R. Yang, J. Davis and D. Nistér, Spatial-depth super resolution for range images, in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2007, pp. 1-8.
[Yang 2009]
Nistér, Stereo matching with color-weighted correlation, hierarchical belief propagation and
Image-Based Compression, Prioritized
[Zhan 1999]
[Zhan 2000]
[Zhan 2005]
[Zhan 2008]
Z. G. Zhang, S. C. Chan and Z. Y. Zhu, A New Two-Stage Method for Restoration of Images Corrupted by
Gaussian and Impulse Noises using Local Polynomial
Regression and Edge Preserving Regularization, in
Proc. IEEE Int. Symp. Circuits Syst., May 2009, pp.
948-951.
[Zhou 2005]
[Zhu 2010]
[Zhu 2012]
Z. Y. Zhu, S. Zhang, S. C. Chan and H. Y. Shum, Object-Based Rendering and 3D Reconstruction Using a Moveable Image-Based System, IEEE Trans. Circuits
and Syst. Video Technol., to be published, 2012.
[Zitn 2004]
F. Durand, Resampling, antialiasing, and