
Video Object Segmentation by Hierarchical Localized Classification of Regions


Chenguang Zhang, Haizhou Ai
Dept. of Computer Science and Technology
Tsinghua University, Beijing, P.R. China
zhangcg06@mails.tsinghua.edu.cn, ahz@mail.tsinghua.edu.cn

Abstract—Video Object Segmentation (VOS) aims to cut out a
selected object from video sequences, where the main difficulties
are shape deformation, appearance variations and background
clutter. To cope with these difficulties, we propose a novel
method, named Hierarchical Localized Classification of Regions (HLCR). We suggest that appearance models, as well as
the spatial and temporal coherence between frames, are the
keys to breaking through this bottleneck. Locally, in order to identify
foreground regions, we propose Hierarchical Localized
Classifiers, which organize regional features as decision trees.
Globally, we adopt Gaussian Mixture Color Models (GMMs).
After integrating the local and global results into a probability
mask, we achieve the final segmentation result by graph cut.
Experiments on various challenging video sequences demonstrate
the efficiency and adaptability of the proposed method.
Index Terms—video object segmentation, classification, tracking, graph cut

I. INTRODUCTION
In computer vision, Video Object Segmentation (VOS) is
an attractive task with many applications, such as video
editing, video composition, object recognition, etc. Generally, a
VOS system mainly faces two basic problems in computer
vision: object tracking and segmentation. There are numerous
algorithms for object tracking [1], such as mean shift [2],
particle filter [3], online boosting [4], random forest [5], etc.
There is also a great deal of work on object segmentation,
such as level set methods [6], graph cut [7] and grab cut [8].
It is well known that, for a VOS system, dealing with
general video sequences is an extremely challenging objective,
due to appearance variations, irregular motion
and background clutter. On the basis of object tracking and
segmentation, various approaches have been proposed for VOS
in recent years. Li et al. [9] directly extend the traditional
graph cut [7] algorithm from 2D images to 3D image sequences,
and optimize a global energy function to yield the segmentation
result. Apart from its heavy reliance on Gaussian
mixture color models, this 3D graph cut method is quite time-consuming
and does not allow user interaction. Afterward,
localized color and shape models were introduced by Bai et al. [10] in the
Video SnapCut system, which shows increased discriminative
ability and proves to be more efficient. However, due to
unexpected optical flow errors when the object is occluded
by itself or by others, it is not reliable to perform classification
along the object boundary by shifting local windows. An alternative
method by Brendel et al. [11], focusing on tracking regions
across frames, is attractive for its computational benefit and
spatial-temporal coherence. However, when matching the contours
of regions fails, this method lacks the
ability to deal with complex deformations of non-rigid objects.
Meanwhile, Niebles et al. [12] demonstrate how to combine
model-based information (e.g. part-based detection results for
humans) and appearance approaches to extract human body
regions. Nevertheless, for general objects, high-performance
detectors are usually not available, which limits the generalization of that method.
Inspired by previous work on localized windows [10] and
tracking regions [11], we propose a novel method, named
Hierarchical Localized Classification of Regions (HLCR),
for video object segmentation. The main contribution of our
approach is to overcome the limitations of directly shifting
local windows and of unreliable region tracking, by taking the
spatial-temporal relationship between corresponding regions
in neighboring frames as the inference strategy.
The rest of this paper is organized as follows. In Section II,
we first give a formulation, and then show a brief overview of
our system. Section III introduces the whole pipeline of our
approach. Experimental results on different video sequences
are presented in Section IV. Finally, in Section V, we offer a
conclusion of our method, followed by a discussion of
future work.
II. PROBLEM FORMULATION AND SYSTEM OVERVIEW
Given an input video sequence I = {I_0, I_1, ..., I_{N-1}}, the
VOS system is initialized by a selected key frame I_k with a
known foreground mask F(I_k). The output of a typical VOS
system is the foreground mask M(I_t) for each
frame I_t.
Taking the foreground mask of a particular frame as input,
as illustrated in Fig. 1, our system is designed to generate
the foreground mask of the next frame. With the help of
the Regional Back-Track Method for motion estimation, we
assign regions to a series of Hierarchical Localized Classifiers,
which predict potential foreground and background regions locally.
Combining the classification results with Gaussian Mixture
Color Models (GMMs), we produce a probability mask,
followed by an optimization based on this mask that yields the final
segmentation result with the graph cut [7] algorithm.
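Putting the overview together, the frame-by-frame propagation could be organized roughly as in the sketch below; every helper (oversegment, back_track, hierarchical_classify, fit_gmms, combine_probabilities, graph_cut_refine) is a hypothetical placeholder standing in for a component detailed in Section III, not an API defined in the paper.

```python
# Sketch of the per-frame propagation loop described above. All helper
# functions are hypothetical placeholders for the components of Section III.

def segment_video(frames, key_index, key_mask):
    """Propagate a user-supplied foreground mask through the whole clip."""
    masks = {key_index: key_mask}
    # Forward pass: key frame -> last frame.
    for t in range(key_index, len(frames) - 1):
        masks[t + 1] = propagate(frames[t], masks[t], frames[t + 1], key_mask)
    # Backward pass: key frame -> first frame.
    for t in range(key_index, 0, -1):
        masks[t - 1] = propagate(frames[t], masks[t], frames[t - 1], key_mask)
    return masks


def propagate(ref_frame, ref_mask, target_frame, key_mask):
    labels_ref, regions_ref = oversegment(ref_frame)            # SLIC superpixels (Sec. III)
    labels_tgt, regions_tgt = oversegment(target_frame)
    matches = back_track(regions_tgt, regions_ref)               # Sec. III-A, Eqs. (1)-(2)
    q = hierarchical_classify(regions_tgt, matches, ref_mask)    # Sec. III-B, Eq. (4)
    gmms = fit_gmms(regions_ref, ref_mask, key_mask)             # Sec. III-C
    prob = combine_probabilities(q, gmms, regions_tgt, matches)  # Eqs. (6)-(7)
    return graph_cut_refine(prob, labels_tgt, target_frame)      # Eq. (5) + refinement
```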

Fig. 1. Outline of our approach.

III. OUR APPROACH
First of all, the initial foreground mask F(I_k) is provided
by the user. Since video frames are spatio-temporally coherent,
we can propagate the foreground mask between neighboring
frames. From the reference frame (Fig. 2(a)) to the target
frame (Fig. 2(b)), propagation in both directions is feasible.
Without loss of generality, the following analysis only describes
the forward direction, from frame I_t to frame I_{t+1}.
Naturally, by using the selected key frame I_k as the first reference
frame and repeatedly applying this propagation procedure,
we can obtain foreground masks for all frames.
For computational benefit as well as distinctiveness and robustness, each frame is over-segmented into SLIC superpixels
[13], which converts the original pixel-connected graph G_P
(Fig. 2(b)) into a region-connected graph G_R (Fig. 2(c)).
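As one concrete way to obtain such a region graph, the sketch below uses the SLIC implementation in scikit-image and collects the simple per-region statistics used later (mean color, center, radius); the parameter values are illustrative assumptions, not those of the paper.

```python
# A minimal sketch of the over-segmentation step using the SLIC implementation
# in scikit-image; the parameter values are illustrative assumptions, not the
# ones used in the paper.
import numpy as np
from skimage.segmentation import slic

def oversegment(frame_rgb, n_segments=800, compactness=10.0):
    """Return the SLIC label map and simple per-region statistics."""
    labels = slic(frame_rgb, n_segments=n_segments,
                  compactness=compactness, start_label=0)
    regions = {}
    for lab in np.unique(labels):
        ys, xs = np.nonzero(labels == lab)
        pixels = frame_rgb[ys, xs].astype(np.float64)
        regions[lab] = {
            "mean_color": pixels.mean(axis=0),           # used by Diff(.,.) and the GMMs
            "center": np.array([xs.mean(), ys.mean()]),  # c_R
            "radius": np.sqrt(len(xs) / np.pi),          # r_R, later used as search radius
            "size": len(xs),
        }
    return labels, regions
```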

Fig. 2. An example of the procedure of processing a single frame: (a) frame I_t; (b) frame I_{t+1}; (c) SLIC regions; (d) optical flow; (e) classification; (f) GMMs probability; (g) graph probability; (h) segmentation result.

A. Regional Back-Track Method
For a region in frame t+1, the Regional Back-Track Method
is introduced to find the best matching region in frame t
and to determine whether the two regions essentially correspond.
There is no doubt that pixel-level optical flow (Fig. 2(d))
is not reliable when heavy occlusion happens. Although it is
claimed in [10] that a flow-averaging approach within a local window
can generate a more robust result, it still produces meaningless
motion vectors when there is no truly matched region.
Based on this observation, we suggest that a reliable region
tracking method should not only be insensitive to minor optical
flow errors, but also judge whether the matched regions
essentially correspond. For an arbitrary region R_a in frame t+1,
the Regional Back-Track Method is defined as

BackTrack(R_a) = \arg\min_{R_b :\ \|c_{R_a} - v_{R_a} - c_{R_b}\| \le \sigma} Diff(R_a, R_b)    (1)

where R_b is a region in frame t, c_{R_a} denotes the center of region R_a,
v_{R_a} denotes the motion vector averaged over all pixels in region
R_a, \sigma is the search radius, and Diff(R_a, R_b) denotes the difference between regions
R_a and R_b. Obviously, a larger \sigma would be more robust to
optical flow errors while more likely to introduce mistaken
regions. On the other hand, \sigma is highly related to the radius
r_{R_a}, since the centers of large regions drift more easily than those of small
ones. Consequently, in our experiments, \sigma is set to r_{R_a} and
Diff(R_a, R_b) is set to the Euclidean distance between the mean
colors of the two regions.
A key issue of the Regional Back-Track Method is how
to convert Diff(R_a, R_b) into a binary decision. Traditional
methods, such as selecting a global threshold or using a chi-square test,
are tricky and unstable. Here, inspired by
the Statistical Region Merging method [14], we choose the
independent bounded difference inequality as the decision
function (treating each pixel in R_a as a bounded independent
random variable). As a result, the predicate is

B(R_a, R_b) = \begin{cases} \text{true} & \text{if } |\bar{R}_a - \bar{R}_b| \le \sqrt{b^2(R_a) + b^2(R_b)} \\ \text{false} & \text{otherwise} \end{cases}    (2)

To summarize, for an arbitrary region R_a in frame t+1, the
Regional Back-Track Method provides the best matching region
R_b in frame t if the two essentially correspond. Otherwise,
the method marks R_a as a mismatched region.
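A minimal sketch of Eqs. (1)-(2) is given below. Each region dictionary is assumed to carry a "motion" entry (its averaged optical-flow vector), and the bound b(.) follows the general form used in Statistical Region Merging [14]; the constants in it are assumptions, not values stated in the paper.

```python
# Hedged sketch of the Regional Back-Track Method (Eqs. 1-2). The constants in
# b(.) (g = 256, delta) follow the general SRM form [14] and are assumptions.
import numpy as np

def b_squared(region, g=256.0, delta=1e-3):
    # Bounded-difference deviation for one region (assumed SRM-style form).
    return (g ** 2 / (2.0 * region["size"])) * np.log(2.0 / delta)

def back_track(region_a, regions_prev):
    """Return the best matching region of frame t for region_a of frame t+1, or None."""
    predicted = region_a["center"] - region_a["motion"]   # c_Ra - v_Ra
    sigma = region_a["radius"]                            # search radius set to r_Ra
    best, best_diff = None, np.inf
    for r_b in regions_prev.values():
        if np.linalg.norm(predicted - r_b["center"]) > sigma:
            continue                                      # outside the search radius
        diff = np.linalg.norm(region_a["mean_color"] - r_b["mean_color"])
        if diff < best_diff:
            best, best_diff = r_b, diff
    if best is None:
        return None                                       # no candidate: mismatched region
    # Bounded-difference predicate (Eq. 2): accept only if the color difference
    # is within the combined statistical bound of the two regions.
    if best_diff ** 2 <= b_squared(region_a) + b_squared(best):
        return best
    return None
```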

B. Hierarchical Localized Classifiers

In this section, we introduce Hierarchical Localized Classifiers to evaluate the probability that a region in frame t+1
belongs to the foreground.
Localized classifiers for a VOS system were introduced in the
Video SnapCut system [10], in which a series of overlapping
local windows of fixed size are created along the foreground boundary
and then propagated through frames. However, due to
large boundary variations and local window drift, that method
is limited when facing topology changes. In addition, since
the size of each local window is fixed, the ability to benefit from
multi-scale analysis is sacrificed. To overcome these
limitations, we propose a new solution called Hierarchical
Localized Classifiers.
Given a foreground mask M(I_t) and the corresponding
foreground bounding box B(I_t) in the reference frame t, we
define a potential searching box S(I_t) by extending B(I_t)
by a fixed ratio \lambda (\lambda = 0.3 in our experiments), using the
following equations:

center(S(I_t)) = center(B(I_t))
height(S(I_t)) = (1 + \lambda) \cdot height(B(I_t))    (3)
width(S(I_t)) = (1 + \lambda) \cdot width(B(I_t))
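As a small illustration of Eq. (3), the following helper expands a bounding box by the ratio λ; clipping to the image border is an added assumption.

```python
# A small illustration of Eq. (3): expanding the foreground bounding box B(I_t)
# by the ratio lambda to obtain the searching box S(I_t).
def expand_box(box, ratio=0.3, image_shape=None):
    """box = (x0, y0, x1, y1); `ratio` plays the role of lambda in Eq. (3)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0                # center is preserved
    w, h = (x1 - x0) * (1 + ratio), (y1 - y0) * (1 + ratio)  # scaled width/height
    nx0, ny0, nx1, ny1 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    if image_shape is not None:                              # clip to the image (assumption)
        height, width = image_shape[:2]
        nx0, ny0 = max(0.0, nx0), max(0.0, ny0)
        nx1, ny1 = min(float(width), nx1), min(float(height), ny1)
    return nx0, ny0, nx1, ny1
```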
Next, we build a hierarchical quad-tree structure by splitting
the searching box S(I_t), in which each tree node corresponds
to a local window. The partition rules are shown in Fig. 3.
Then, we generate a localized classifier L(W_i) for each
window W_i, trained on all inner regions which have already
been labeled as foreground or background according to the
foreground mask M(I_t). Here, we build a multi-dimensional
feature vector f(R) = (r, g, b, y, u, v, cx, cy) for region R,
where (r, g, b, y, u, v) denotes the average value of all pixels
in region R in the RGB and YUV color spaces and (cx, cy) denotes
the center of region R. If W_i contains both foreground and
background regions, we use a decision tree for classification.
Otherwise, the localized classifier L(W_i) degenerates into
a constant function (returning 1 if the window contains only foreground
regions, and 0 otherwise).
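The following sketch shows one way such a quad-tree of local windows could be built, with a scikit-learn decision tree per mixed window; the minimum window size and the tree depth are illustrative assumptions.

```python
# One possible realization of the quad-tree of local windows (Fig. 3), with a
# scikit-learn decision tree per mixed window. MIN_WINDOW and max_depth are
# illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

MIN_WINDOW = 32  # stop splitting below this side length (assumed value)

class WindowNode:
    def __init__(self, x0, y0, x1, y1):
        self.box = (x0, y0, x1, y1)
        self.children = []
        self.classifier = None            # DecisionTreeClassifier or a constant 0/1

    def contains(self, cx, cy):
        x0, y0, x1, y1 = self.box
        return x0 <= cx < x1 and y0 <= cy < y1

def build_window_tree(box, features, labels):
    """features: list of f(R) = (r, g, b, y, u, v, cx, cy); labels: 1 = fg, 0 = bg."""
    node = WindowNode(*box)
    inside = [i for i, f in enumerate(features) if node.contains(f[-2], f[-1])]
    if len({labels[i] for i in inside}) <= 1:
        # Pure (or empty) window: the classifier degenerates to a constant.
        node.classifier = labels[inside[0]] if inside else 0
        return node
    node.classifier = DecisionTreeClassifier(max_depth=5).fit(
        [features[i] for i in inside], [labels[i] for i in inside])
    x0, y0, x1, y1 = box
    if min(x1 - x0, y1 - y0) > MIN_WINDOW:    # split mixed windows that are large enough
        mx, my = (x0 + x1) // 2, (y0 + y1) // 2
        for sub in [(x0, y0, mx, my), (mx, y0, x1, my),
                    (x0, my, mx, y1), (mx, my, x1, y1)]:
            node.children.append(build_window_tree(sub, features, labels))
    return node
```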

Fig. 3. Hierarchical Localized Classifiers based on quad-tree partition. If a
local window is larger than a fixed size and contains both foreground and
background regions, e.g. W_i, we split it into four sub-windows. Otherwise,
the partition terminates and the window becomes a leaf node, e.g.
W_j. For each window W_i, a localized classifier L(W_i) is trained on all the
inside regions.

As for prediction, instead of shifting local windows, we
prefer to assign each region R_a in frame t+1 to a series
of windows {W_{i_0}, W_{i_1}, ..., W_{i_{n-1}}} in frame t. Recalling the
Regional Back-Track Method introduced in Section III-A, and
assuming we have found the best matching region R_b in frame t
(if not, we will discuss how to handle the mismatched R_a
later in Section III-C), R_b must be covered by a unique
leaf node of the quad-tree partition. Tracing back through all
its ancestor nodes in the quad-tree, we obtain the series
of windows {W_{i_0}, W_{i_1}, ..., W_{i_{n-1}}}. For each window W_{i_k},
we use the pre-trained localized classifier L(W_{i_k}) to predict
whether R_a belongs to the foreground or not. (Note that here we use
(r, g, b, y, u, v, cx - vx_{R_a}, cy - vy_{R_a}) as the feature vector,
where (vx_{R_a}, vy_{R_a}) is the averaged motion vector of R_a.)
To produce the final classification result q_{R_a}, we integrate the localized classifiers together using

q_{R_a} = \frac{\sum_{k=0}^{n-1} \omega_k q_k}{\sum_{k=0}^{n-1} \omega_k}    (4)

where q_k denotes the binary prediction of L(W_{i_k}) and \omega_k
denotes the weight of classifier L(W_{i_k}). Obviously, classifiers with high confidence should be weighted more than
those with low confidence. Therefore, in our experiments, the
classification accuracy on the training set is used as \omega_k.
In summary, for an arbitrary region R_a in frame t+1 which
has a corresponding region R_b in frame t, the Hierarchical
Localized Classifiers make an integrated prediction of the
probability that R_a will be included in the foreground mask.
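A sketch of this prediction step (Eq. 4) is given below, assuming each window node stores, besides its classifier, a `weight` equal to its classification accuracy on the training set (an attribute recorded at training time; it is not part of the sketch above).

```python
# Sketch of the prediction step (Eq. 4): region R_a of frame t+1 is evaluated
# by every window on the ancestor chain covering its matched region R_b, and
# the predictions are fused by a weighted average. `node.weight` is an assumed
# attribute holding the training accuracy of that window's classifier.
def classify_region(root, feature_a, center_b):
    """feature_a: (r, g, b, y, u, v, cx - vx, cy - vy); center_b: center of matched R_b."""
    num, den = 0.0, 0.0
    node = root
    while node is not None and node.contains(*center_b):
        if isinstance(node.classifier, (int, float)):            # pure window: constant 0/1
            q_k = float(node.classifier)
        else:
            q_k = float(node.classifier.predict([feature_a])[0])
        num += node.weight * q_k
        den += node.weight
        # Descend into the child window that still covers R_b, if any.
        node = next((c for c in node.children if c.contains(*center_b)), None)
    return num / den if den > 0 else 0.5                         # q_{R_a} in [0, 1]
```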
C. Combined Probability Mask and Iterative Refinement
The Combined Probability Mask is introduced to integrate the localized classification results with the global GMMs. As a result,
we can use the graph cut algorithm to optimize the segmentation
result.
For the graph cut method, we need to optimize the following
energy function:

E = \sum_i E_d(R_i) + \sum_{i \ne j} E_c(R_i, R_j)    (5)

where E_d(R_i) is the data energy and E_c(R_i, R_j) is the regional
connection energy. In our framework, E_c(R_i, R_j) is the color
difference between regions R_i and R_j, the same as in the
traditional graph cut method [7], and E_d(R_i) is the combined probability of the global Gaussian Mixture Color Models
(GMMs) and the Hierarchical Localized Classifier predictions,
as described below.
GMMs are widely used in segmentation and tracking tasks
and turn out to be quite effective. In our system, both foreground and background GMMs are acquired by clustering
regions in the reference frame t according to the given mask.
Note that directly updating the foreground GMMs is very risky.
Considering that the initial foreground mask provided by the user
in the key frame is extremely important, we suggest
that combining the foreground of the initial key frame
with that of the reference frame is necessary. In general, although
the discrimination ability of the Hierarchical Localized Classifiers
is better than that of the GMMs, they may suffer from over-fitting and are incapable of handling the mismatched regions described in
Section III-A. Consequently, we combine these two responses
to generate a more reliable foreground probability p(R_a),
using the formula shown below.
1) If R_a has a corresponding region R_b in frame t, then

p(R_a) = \frac{q_{fg}(R_a) \cdot q_{R_a}}{q_{fg}(R_a) \cdot q_{R_a} + q_{bg}(R_a) \cdot (1 - q_{R_a})}    (6)

2) Otherwise, R_a is mismatched. Since q_{R_a} is not available,
we have

p(R_a) = \frac{q_{fg}(R_a)}{q_{fg}(R_a) + q_{bg}(R_a)}    (7)

where q_{fg}(R_a) is the probability that R_a belongs to the foreground GMMs,
q_{bg}(R_a) is the probability that R_a belongs to the background GMMs, and
q_{R_a} is the classification response from Section III-B.
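The sketch below illustrates Eqs. (6)-(7), with the foreground and background GMMs fitted on region mean colors via scikit-learn; the number of mixture components is an assumption.

```python
# Sketch of Eqs. (6)-(7): foreground/background GMMs fitted on region mean
# colors with scikit-learn; n_components is an illustrative assumption.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmms(fg_colors, bg_colors, n_components=5):
    fg = GaussianMixture(n_components=n_components).fit(np.asarray(fg_colors))
    bg = GaussianMixture(n_components=n_components).fit(np.asarray(bg_colors))
    return fg, bg

def combined_probability(color_a, q_a, fg_gmm, bg_gmm):
    """color_a: mean color of R_a; q_a: classifier response q_{R_a}, or None if mismatched."""
    q_fg = np.exp(fg_gmm.score_samples([color_a]))[0]   # likelihood under the foreground GMM
    q_bg = np.exp(bg_gmm.score_samples([color_a]))[0]   # likelihood under the background GMM
    if q_a is None:                                     # mismatched region: Eq. (7)
        return q_fg / (q_fg + q_bg)
    return (q_fg * q_a) / (q_fg * q_a + q_bg * (1.0 - q_a))   # Eq. (6)
```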
Given the combined probability p(R_a) as the data energy
E_d(R_i), we can solve this two-label graph cut problem
with the max-flow method. However, since complex videos
often contain unexpected noise, the combined probability
p(R_a) may drift in a few regions. Therefore, we apply an
iterative refinement to the graph cut result, as follows.
1) Perform graph cut based on the combined probability
p(R_a) to get the foreground regions.
2) Perform max-connected-component detection on the
foreground regions to filter out falsely alarmed regions.
3) Update the foreground and background GMMs and the
combined probability p(R_a). Repeat Steps 1) and 2) until
convergence.
In our experiments, only 2 or 3 iterations of this refinement
are needed to produce a convincing result.
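A rough sketch of this loop is shown below; graph_cut, update_gmms and recompute_probabilities are hypothetical placeholders for a standard two-label max-flow solver [7], GMM re-fitting, and the probability mask of Eqs. (6)-(7).

```python
# Rough sketch of the iterative refinement loop. `graph_cut`, `update_gmms`
# and `recompute_probabilities` are hypothetical placeholders.
import numpy as np
from scipy import ndimage

def iterative_refine(prob_mask, labels, frame, n_iters=3):
    """prob_mask: per-region p(R_a) rendered to pixels; labels: SLIC label map."""
    fg_mask = None
    for _ in range(n_iters):
        # 1) Two-label graph cut with p(R_a) as the data term (Eq. 5).
        fg_mask = graph_cut(prob_mask, labels, frame)
        # 2) Keep only the largest connected foreground component (false-alarm filtering).
        comps, n = ndimage.label(fg_mask)
        if n > 1:
            sizes = ndimage.sum(fg_mask, comps, index=range(1, n + 1))
            fg_mask = comps == (np.argmax(sizes) + 1)
        # 3) Re-fit the GMMs on the current result and recompute p(R_a).
        fg_gmm, bg_gmm = update_gmms(frame, labels, fg_mask)
        prob_mask = recompute_probabilities(frame, labels, fg_mask, fg_gmm, bg_gmm)
    return fg_mask
```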

IV. EXPERIMENTS
Currently, since there are no standard datasets for video
segmentation, the testing datasets used in our experiments are
collected from [15] and [12]. The first video clip is water-skiing
from [15], 97 frames, 544 × 280. The second one is diving
from [15], 179 frames, 880 × 488. The third one is skating from
[15], 573 frames, 552 × 310. The fourth one is dancing from
[12], 138 frames, 320 × 240. Note that these videos are very
challenging in terms of dynamic camera, background clutter,
motion blur, object shadows, etc.
We quantitatively analyze our approach on these test
datasets. We randomly select 10 frames from each video clip
for evaluation and manually label the true foreground. The
metric is the standard F-Measure, defined as

F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (8)

where Precision is the probability that an auto-segmented
foreground pixel is a true foreground pixel and Recall is the
probability that a true foreground pixel is detected.
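For reference, the metric can be computed directly from binary masks as in the short sketch below.

```python
# The evaluation metric (Eq. 8) computed directly from binary masks.
import numpy as np

def f_measure(pred_mask, gt_mask):
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)   # fraction of predicted fg pixels that are true fg
    recall = tp / max(gt.sum(), 1)        # fraction of true fg pixels that are detected
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```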
Since there is no available source code or executable binary
for current VOS methods such as [10] and [11], we choose
the Grab Cut [8] algorithm for comparison, where we
draw foreground bounding boxes several times and select
the best result for each frame. Table I summarizes the
comparison, from which we can see that our approach is
much better than Grab Cut. Note that our method works very
well when handling visually similar foreground and background (such as dark legs against a black background in Fig. 4(d)),
which improves the F-Measure by as much as twenty percentage
points. Some examples are shown in Fig. 4, which demonstrate
that our method significantly improves the subjective quality
of segmentation.
TABLE I
EXPERIMENTAL RESULTS

Video Clip      Method        Precision   Recall    F-Measure
Water-skiing    Grab Cut      0.753       0.911     0.836
                Our Method    0.938       0.849     0.891
                +/-           0.185       -0.062    0.067
Diving          Grab Cut      0.823       0.849     0.836
                Our Method    0.914       0.950     0.931
                +/-           0.091       0.101     0.096
Skating         Grab Cut      0.956       0.905     0.930
                Our Method    0.973       0.919     0.945
                +/-           0.017       0.014     0.015
Dancing         Grab Cut      0.873       0.620     0.725
                Our Method    0.946       0.947     0.947
                +/-           0.073       0.327     0.221

In terms of complexity, our method only takes about 300
milliseconds per frame on an Intel Core Quad 2.40 GHz
CPU with 3 GB memory. With the help of the initial labeled
foreground mask and a reliable frame-by-frame inference
strategy, our method can deal with very complex videos. Nevertheless, our method fails when an unexpected sudden change
of foreground appearance occurs.
V. CONCLUSION
In this paper, we propose a novel method that regards VOS
as a problem of tracking and classifying regions in local
windows. The Regional Back-Track Method, which is based on
optical flow, is applied to track regions across frames. The
Hierarchical Localized Classifiers are introduced to predict potential foreground regions. A combined probability
mask based on the classification results and the GMMs is used in the
graph cut algorithm with iterative refinement, which produces
reliable segmentation results. Experiments on various videos
demonstrate its strong performance.
In the current version, we only use single-frame propagation,
which may lead to unexpected drift in certain
extreme scenarios. Although the foreground GMMs of the
initial key frame are used as global constraints, which enhance
the stability of our method, we believe that multi-frame
propagation will benefit more from the spatio-temporal space.
Another direction is extending this work to multi-object
cutout, which has broader application prospects. We
expect to investigate these issues in our future work.
ACKNOWLEDGMENT
This work is supported by the National Science Foundation of
China under grant No. 61075026.
REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.
[2] D. Comaniciu and P. Meer, "Mean shift analysis and applications," in Proceedings of the IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1197-1203.
[3] K. Nummiaro, E. Koller-Meier, and L. J. V. Gool, "An adaptive color-based particle filter," Image Vision Comput., vol. 21, no. 1, pp. 99-110, 2003.
[4] H. Grabner and H. Bischof, "On-line boosting and vision," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 260-267.
[5] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof, "On-line random forests," in IEEE International Conference on Computer Vision Workshops, 2009, pp. 1393-1400.
[6] A.-R. Mansouri and J. Konrad, "Motion segmentation with level sets," in IEEE International Conference on Image Processing, 1999, pp. 126-130.
[7] Y. Boykov and M.-P. Jolly, "Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images," in IEEE International Conference on Computer Vision, 2001, pp. 105-112.
[8] C. Rother, V. Kolmogorov, and A. Blake, "Grab cut: interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics, vol. 23, pp. 309-314, 2004.
[9] Y. Li, J. Sun, and H.-Y. Shum, "Video object cut and paste," ACM Transactions on Graphics, vol. 24, pp. 595-600, 2005.
[10] X. Bai, J. Wang, D. Simons, and G. Sapiro, "Video snapcut: robust video object cutout using localized classifiers," ACM Transactions on Graphics, vol. 28, 2009.
[11] W. Brendel and S. Todorovic, "Video object segmentation by tracking regions," in IEEE International Conference on Computer Vision, 2009, pp. 833-840.
[12] J. C. Niebles, B. Han, A. Ferencz, and L. Fei-Fei, "Extracting moving people from internet videos," in European Conference on Computer Vision, 2008, pp. 527-540.
[13] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, "SLIC superpixels," EPFL, Tech. Rep., Jun. 2010.
[14] R. Nock and F. Nielsen, "Statistical region merging," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1452-1458, 2004.
[15] M. Grundmann, V. Kwatra, M. Han, and I. Essa, "Efficient hierarchical graph based video segmentation," in IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 2141-2148.

Fig. 4. Experimental results: (a) water-skiing sequence, frames 27, 48, 57, 67; (b) diving sequence, frames 35, 64, 83, 122; (c) skating sequence, frames 12, 18, 63, 111; (d) dancing sequence, frames 5, 20, 101, 130. From left to right, 1st row: original key frame image, segmentation results of our approach; 2nd row: initial labeled foreground mask, segmentation results of Grab Cut [8]. Please zoom in to check segmentation details.
