Chapter 1
INTRODUCTION
Embedded smart cameras have driven a dramatic shift towards distributed surveillance
systems by combining sensing, processing and communication on a single platform. A critical
issue in embedded smart cameras is their limited resources, which makes designing fast and
efficient vision algorithms very challenging. Therefore, it is important to consider a vision
algorithm's efficiency, memory requirements and portability to an embedded processor during
algorithm design.
Face detection has been one of the most studied topics in the computer vision literature,
and is the stepping stone to all facial analysis algorithms. As a fundamental computer vision
problem, the goal of face detection is, given an arbitrary image, to determine whether or not
there are any faces in the image and, if present, to return the image location and extent of
each face.
Most face detection approaches can be categorized as knowledge-based, feature-based,
template-based or appearance-based methods. However, little attention has been paid to the
algorithms' processing-time efficiency or to meeting real-time requirements in
resource-limited applications. For example, the colour-image face detection method proposed
by Hsu et al. needs 540 seconds to process a 640x480 image on a 1.7 GHz CPU. The neural
network-based upright frontal face detection system presented by Rowley et al. takes
approximately 383 seconds to process a 320x240 image. The Viola-Jones detector, regarded as
the most successful and fastest, can process 384x288 images at 15 FPS (frames per second) on a
conventional desktop and at 2 FPS on a low-power 200 MIPS StrongARM processor. Nevertheless,
this image size is too small to be preferable (640x480 resolution is commonly used), and 2 FPS
on the StrongARM is unacceptable for a real-time application based on embedded smart cameras.
To obtain real-time performance on video streams, several optimized face detectors have
appeared. Most of them use the Viola-Jones face detector and optimize the software and/or
hardware implementation to improve the system's performance. Table 1.1 summarizes some
embedded system-oriented implementations.
Dept. of Electronics and Communication Engineering, MBCET.
TABLE 1.1
IMPLEMENTATION OF VIOLA JONES DETECTOR
Chapter 2
DESIGN GOALS
2.1 PROBLEM DEFINITION
In traditional algorithm design, more attention is paid to detection accuracy than to
processing efficiency under resource-limited conditions. On the other hand, some of the
hardware- and/or software-optimized implementations in Table 1.1 achieve moderate frame rates.
However, all of these implementations target ASIC, DSP or FPGA platforms, which are highly
specialised and customised processors. In a real smart camera network application, every
camera mote may have its task change as the situation varies. So, our goal was to design a
light-weight face detector for an embedded smart camera with a general-purpose processor, one
that consumes few resources and achieves real-time operation with acceptable detection
performance. Experimental results demonstrate P-FAD's resource-aware properties: it can
process a VGA image in just 7.23 ms on a notebook and 28.3 ms on a light-weight embedded smart
camera while still holding acceptable detection accuracy compared with the OpenCV
implementation of the Viola-Jones Haar detector. Moreover, P-FAD is not customised or
optimised for any given hardware platform, so its resource-aware properties carry over to
other general-purpose smart camera platforms.
Chapter 3
PYRAMID-LIKE FACE DETECTION SCHEME
In this section, the hierarchical framework for face detection on embedded smart cameras is
briefly introduced, and the focus then turns to the challenging issues in constructing the
hierarchical scheme. P-FAD is a hierarchical detection scheme. More specifically, as shown in
Fig. 3.1, P-FAD consists of five layers: skin detection, contour point detection, dynamic
group, region merge and filter, and Haar face detection. P-FAD first uses a relatively coarse
skin detection to find skin pixels. Then, through contour point detection, dynamic group, and
region merge and filter, P-FAD shifts the operating unit from pixels to contour points, regions
and face candidates. Finally, the results are refined by the Haar face detection. The
hierarchical detection scheme is tailored to real-time detection with low computation and
storage overhead: the number of operating units decreases dramatically from top to bottom,
while the number of operations on each unit increases. This keeps pixel manipulation to a
minimum, which significantly cuts the time cost, and guarantees detection accuracy through the
more complex later processing. Thus, P-FAD has an inverted pyramid-like shape in the number of
operating units per layer, while the processing complexity grows with a pyramid-like shape. To
achieve low overhead and high detection accuracy, the following critical issues must be
addressed in P-FAD:
Derive the effective detection regions while operating on full images as little as possible.
Achieve a robust detector with high accuracy under changing conditions such as
illumination and different individuals.
Fig 3.1.
Chapter 4
LAYERS OF PYRAMID-LIKE FACE DETECTION SCHEME
4.1 SKIN DETECTION
In this implementation of the scheme, there was no interface to scale the graphics engine
frequency, so the only metric that came into play is the processor frequency. Additional knobs
would definitely have a more profound effect on the energy consumption as well as on peak
power. An algorithm for predicting the EM (such as a weighted average of the previous samples)
as an enhanced way of fine-tuning was not developed, but instead the failure of traditional
schemes in some scenarios where the proposed mechanism succeeds was showcased.
Skin detection is the first layer of P-FAD and operates on pixels. Because pixel
manipulation accounts for most of the processing time in image processing, a crucial issue in
skin detection is processing complexity. To reduce processing time significantly, the basic
design principle is to use a relatively coarse but highly time-saving skin detection. In P-FAD,
skin detection is based on skin-colour information, as skin colour is computationally cheap
yet robust against rotation, scaling and partial occlusion. Further, the skin colour is
modelled in the CbCr subset of the YCbCr colour space. The CbCr subset can eliminate the
luminance effect and provides nearly the best performance among different colour spaces. To
classify a pixel as a skin pixel or a non-skin pixel, we choose the widely used Gaussian
mixture model (GMM), which has relatively simple parameters without losing accuracy, to
represent the skin-colour distribution through its probability density function (PDF) in the
CbCr subspace, defined as:
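A sketch of Eq. (1), assuming the standard mixture form with K components, is given here. The component count K, the weight symbol $\omega_i$ and the normalisation of the weights are notational assumptions that the text does not state; the second line is the usual two-dimensional Gaussian density.

```latex
p\bigl(x(t)\bigr) = \sum_{i=1}^{K} \omega_i \, n_i\bigl(x(t), \mu_{i,t}, \Sigma_{i,t}\bigr),
\qquad \sum_{i=1}^{K} \omega_i = 1 \tag{1}

n_i(x, \mu, \Sigma) = \frac{1}{2\pi\,|\Sigma|^{1/2}}
\exp\!\Bigl(-\tfrac{1}{2}\,(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\Bigr)
```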
where t denotes the frame index, $x(t)$ is a two-dimensional colour vector in the CbCr
subspace, and $n_i(x(t), \mu_{i,t}, \Sigma_{i,t})$ is the i-th single Gaussian model (SGM)
component, which contributes to the mixture with weight $\omega_i$. The SGM is an elliptical
(two-dimensional) Gaussian joint probability density function, determined by its mean vector
$\mu_{i,t}$ and covariance matrix $\Sigma_{i,t}$. Finally, the pixel with colour vector $x(t)$
is judged to be a skin-colour pixel or not by comparing $p(x(t))$ with a predefined threshold.
The main difficulties in implementing the GMM in P-FAD are the following:
The fixed Gaussian parameters $\mu_{i,t}$, $\Sigma_{i,t}$, obtained by an offline training
procedure on a large face dataset, are not robust to a changing environment.
Evaluating Eq. (1) on every pixel to judge whether it is a skin-colour pixel incurs a high
computation overhead.
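To make the per-pixel cost of the second difficulty concrete, the following minimal sketch evaluates a two-component GMM at one CbCr vector and compares it with a threshold. It is an illustration under stated assumptions, not the P-FAD implementation: the mixture parameters in `SKIN_GMM` and the value of `THRESHOLD` are hypothetical numbers chosen only so the example runs, and the function names are invented here.

```python
import math


def gaussian_2d(x, mean, cov):
    """Density of a two-dimensional Gaussian at x.

    cov is a 2x2 covariance matrix given as ((s11, s12), (s21, s22)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean), using the closed-form
    # inverse of a 2x2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def gmm_pdf(x, components):
    """Weighted sum of the single Gaussian components at x."""
    return sum(w * gaussian_2d(x, mean, cov) for w, mean, cov in components)


def is_skin(cb, cr, components, threshold):
    """Judge a pixel as skin when the mixture density reaches the threshold."""
    return gmm_pdf((cb, cr), components) >= threshold


# Hypothetical parameters (weight, mean (Cb, Cr), covariance); not from the text.
SKIN_GMM = [
    (0.6, (110.0, 150.0), ((60.0, 0.0), (0.0, 40.0))),
    (0.4, (120.0, 140.0), ((90.0, 10.0), (10.0, 70.0))),
]
THRESHOLD = 1e-4  # hypothetical decision threshold

print(is_skin(110, 150, SKIN_GMM, THRESHOLD))
print(is_skin(30, 220, SKIN_GMM, THRESHOLD))
```

Every pixel costs one exponential and several multiplications per component, which is the overhead the text sets out to avoid.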
length of axis
rectangle's width and height, denoted as $W_{cr}(t)$ and $W_{cb}(t)$ respectively. Obviously,
the centre of the rectangle coincides with the centre of the approximating ellipse.
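A rectangle test of this kind can be sketched as follows. This is a minimal illustration, assuming an axis-aligned rectangle in the CbCr plane described by its centre and its side lengths along the Cb and Cr axes; the function names and the numeric values are hypothetical, and how P-FAD derives the side lengths from the ellipse is not reproduced here.

```python
def make_rect(centre_cb, centre_cr, w_cb, w_cr):
    """Return (cb_lo, cb_hi, cr_lo, cr_hi) for a rectangle centred on the ellipse centre."""
    half_cb, half_cr = w_cb / 2.0, w_cr / 2.0
    return (centre_cb - half_cb, centre_cb + half_cb,
            centre_cr - half_cr, centre_cr + half_cr)


def in_skin_rect(cb, cr, rect):
    """Classify a pixel with at most four comparisons.

    The `and` chain short-circuits, so a pixel outside the rectangle is
    rejected after one to four comparisons and no arithmetic is needed.
    """
    cb_lo, cb_hi, cr_lo, cr_hi = rect
    return cb_lo <= cb and cb <= cb_hi and cr_lo <= cr and cr <= cr_hi


# Hypothetical centre and side lengths, for illustration only.
rect = make_rect(centre_cb=110, centre_cr=150, w_cb=30, w_cr=20)
print(in_skin_rect(110, 150, rect))
print(in_skin_rect(126, 150, rect))
```

Compared with evaluating the mixture density, the test needs only comparisons per pixel, which is consistent with the N to 4N comparison instructions counted for Layer 1 in Chapter 5.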
Here D is a diagonal matrix and $d_{11}$, $d_{22}$ are its diagonal elements. We can use
threshold
where $t = [t(1) \cdots t(n)]^T$, $t(x)$ denotes the time consumption if the start stage is x,
and there are n stages in the cascade structure. $N_{reject}$ and $N_{accept}$ are the numbers
of rejected and accepted sub-windows in the whole cascade structure respectively, and
$t_{reject}$ and $t_{accept}$ are the expected time consumptions to reject and accept a
sub-window respectively.
where $p_{ij}$ in matrix P denotes the probability that a sub-window starting from the i-th
stage is rejected in stage j, and $a_{ij}$ in matrix A is the sum of the features from stage i
to stage j. The time consumption is assumed to be proportional (by a factor k) to the number
of processed features. The operator $\mathrm{diag}(\cdot)$ maps a matrix to the vector of its
diagonal elements. Thus, the detector's time consumption is determined by two sets of
arguments: $F = [F(1) \cdots F(n)]^T$, where $F(x)$ is the number of features in stage x; and
$P = [P(1) \cdots P(n)]^T$, where $P(x)$ is the probability of rejecting
the non-face sub-windows in stage x. From the OpenCV 2.3 baseline face detector, we can get F,
the cascade detector's distribution of feature counts over the stages. Details can be found in
[8]. P(x) is assumed to increase linearly from 50% to 99%, which conforms to the cascade
structure [11]. Practical scanning conditions are simulated using the above formula, and an
implementation is also run on a 2.2 GHz notebook. Simulation and experimental results in
Fig. 3 show that the time cost is a concave function of the start stage, so its minimum can
only be attained at the first or the last stage. In P-FAD, according to Fig. 3, the optimal
start stage depends on the ratio $N_{reject}/N_{accept}$. The overhead rate,
$(t_{total}(1) - t_{total}(n)) / (t_{total}(1) + t_{total}(n))$, is defined as the effect of
choosing the first stage to start; a negative value means that this choice saves time. Fig. 4
shows that when the ratio is extremely large, which is the situation of the traditional
Viola-Jones detector, the overhead rate is close to -1, indicating that starting from the first
stage undoubtedly saves the most processing time. However, when the ratio is smaller than 12,
the overhead rate becomes positive.
Thus, in P-FAD, when the number of non-face sub-windows is small, choosing the last
stage is optimal for time consumption. Based on the above discussion, we may implement a
modified Viola-Jones detector in P-FAD according to an online estimate of the ratio. Note that
the total number of sub-windows is determined by the scan strategy before classification, so
we only need to estimate $N_{accept}$. In our implementation, $N_{accept}$ is simply assumed
to be approximately equal to the number of face candidates, which is reasonable given the
statistical data in Table III. Obviously, further work could be done to improve the estimate's
accuracy.
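The trade-off between the first and the last start stage can be explored with a small cost model. The sketch below is not the authors' formula. It assumes that a cascade started at stage s evaluates only stages s to n, that a rejected window pays for the features of every stage it visited, and that an accepted window pays for all of them. The feature counts in `FEATURES` are hypothetical placeholders (not the OpenCV values), the proportionality factor k defaults to 1, and all function names are invented for this illustration.

```python
def linear_reject_probs(n, first=0.50, last=0.99):
    """Per-stage rejection probabilities rising linearly from `first` to `last`."""
    if n == 1:
        return [last]
    step = (last - first) / (n - 1)
    return [first + step * i for i in range(n)]


def expected_reject_cost(features, reject_probs, start):
    """Expected feature count for a window that is eventually rejected.

    `start` is a 0-based stage index. The probabilities are renormalised
    because only windows that do get rejected are considered.
    """
    survive = 1.0   # probability of having passed every stage visited so far
    used = 0        # features evaluated from stage `start` through stage j
    weighted = 0.0
    total_p = 0.0
    for j in range(start, len(features)):
        used += features[j]
        p_reject_here = survive * reject_probs[j]
        weighted += p_reject_here * used
        total_p += p_reject_here
        survive *= 1.0 - reject_probs[j]
    return weighted / total_p


def total_cost(features, reject_probs, start, n_reject, n_accept, k=1.0):
    """Time cost of classifying all sub-windows when starting at `start`."""
    accept_cost = sum(features[start:])
    reject_cost = expected_reject_cost(features, reject_probs, start)
    return k * (n_reject * reject_cost + n_accept * accept_cost)


def overhead_rate(features, reject_probs, n_reject, n_accept):
    """(t(1) - t(n)) / (t(1) + t(n)); negative means the first stage is cheaper."""
    first = total_cost(features, reject_probs, 0, n_reject, n_accept)
    last = total_cost(features, reject_probs, len(features) - 1, n_reject, n_accept)
    return (first - last) / (first + last)


FEATURES = [2, 5, 10, 20, 40]  # hypothetical features per stage
PROBS = linear_reject_probs(len(FEATURES))

print(overhead_rate(FEATURES, PROBS, n_reject=10000, n_accept=100))
print(overhead_rate(FEATURES, PROBS, n_reject=50, n_accept=100))
```

Under these assumptions the overhead rate is negative when rejected windows dominate and positive when they are few, which is the qualitative behaviour described above. The crossover ratio of this toy model differs from the value of 12 reported in the text, since it depends on the real per-stage feature counts.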
Fig 4.2. Viola-Jones detector's time cost for different start stages
Chapter 5
P-FAD'S ALGORITHMIC COMPLEXITY
In this section, the computational complexity of our scheme is summarised. First of all,
note that the scheme's overall computation is mainly determined by P-FAD's first and second
layers, while the last three layers can be neglected because they have far fewer operating
units. Specifically, Layer 1 and Layer 2 are pixel manipulations, and they are completed
simultaneously in a single image scan, which reduces repeated memory accesses. Suppose the
image size is N. Layer 1 needs N or 2N memory accesses to get the CbCr values, N to 4N
comparison instructions to run our simplified rectangle judgment, and N instructions to link
to the second layer. Layer 2 needs at most N divided by a constant (usually 10) memory
accesses to store the contour points, and 3N normal instructions. As a result, pixel
manipulation needs N to 2.1N memory accesses and 5N to 8N normal instructions in total.
Secondly, the time consumption in Layer 3 and Layer 4 is extremely low because of their
dynamic properties and much fewer operating units; see Table II. Lastly, the time consumption
of the Viola-Jones detector in P-FAD is reduced significantly because there are only hundreds
of sub-windows to classify. In the traditional Viola-Jones detector, the number of sub-windows
is nearly O(N^2) because of its scaling and shifting, and the processing time is proportional
to it. Moreover, our modified Viola-Jones detector reduces the time consumption further.
Overall, the computation overhead in P-FAD is O(N), which is similar to that of simple image
processing functions and much lower than the O(N^2) computation overhead of the traditional
Viola-Jones detector.
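As a worked example of the counts above, the following sketch evaluates the stated bounds for a VGA frame (640x480 pixels). It only restates the arithmetic of this section; the function name and the `divisor` parameter (the constant, usually 10, that limits the number of stored contour points) are names chosen for this illustration.

```python
def pixel_layer_bounds(n_pixels, divisor=10):
    """Lower and upper operation counts for Layers 1 and 2 on an image of n_pixels."""
    # Memory: Layer 1 reads N or 2N values; Layer 2 stores at most N/divisor points.
    mem_min = n_pixels
    mem_max = 2 * n_pixels + n_pixels // divisor
    # Instructions: N to 4N comparisons, N to link the layers, 3N in Layer 2.
    instr_min = n_pixels + n_pixels + 3 * n_pixels
    instr_max = 4 * n_pixels + n_pixels + 3 * n_pixels
    return {"memory": (mem_min, mem_max), "instructions": (instr_min, instr_max)}


bounds = pixel_layer_bounds(640 * 480)
print(bounds)
```

For N = 307,200 this gives between 307,200 and 645,120 memory accesses (N to 2.1N) and between 1,536,000 and 2,457,600 instructions (5N to 8N), all linear in N.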
Chapter 6
EXPERIMENTAL RESULTS
The face detection scheme is implemented on our embedded smart camera platform and on a
2.20 GHz notebook to evaluate its processing speed as well as its robustness. The embedded
smart camera platform consists of an Intel XScale PXA270 microprocessor and an OV9655 image
sensor, and is based on the CITRIC architecture.
First, the scheme's adaptive GMM algorithm is evaluated on a video sequence. Fig. 5 compares
the skin-tone detection's PD (probability of detection) and FA (probability of false alarm)
for the adaptive GMM algorithm and two fixed rectangle models. The six selected frames
correspond to pictures 2, 3, 5, 8, 10 and 16 respectively in Fig. 6. It can be seen that the
last three frames are darker than the first three because a man stood by the windows, and
different
persons in different frames have various poses. The adaptive model can be robust to this kind of
Chapter 7
CONCLUSIONS
P-FAD, a hierarchical framework for reducing the computing and storage cost of face
detection in embedded cameras, was implemented. The goal was to reduce the pixel manipulation
without compromising the detection performance. This goal was met by devising a three-stage
coarse, shift and refine process, which shifts the operating unit from pixels to contour
points, regions and face candidates, and reserves more complex processing for the promising
units. The experimental results exhibit the P-FAD scheme's resource-aware properties: it can
process a VGA image in just 7.23 ms on a notebook and 28.3 ms on a light-weight embedded
smart camera while still holding acceptable detection accuracy compared with the OpenCV
implementation of the Viola-Jones Haar detector.
REFERENCES
[1] Qiang Wang, Jing Wu, Chengnian Long and Bo Li, "P-FAD: Real-time Face Detection
Scheme on Embedded Smart Cameras", Shanghai Jiao Tong University, Shanghai, China.
[2] L. Acasandrei and A. Barriga, "Accelerating Viola-Jones face detection for embedded and
SoC environments", in Proc. ICDSC Conf., 2011.