I. INTRODUCTION
In computer vision, Video Object Segmentation (VOS) is an attractive task with many applications, such as video editing, video composition, and object recognition. Generally, a VOS system faces two basic problems in computer vision: object tracking and segmentation. There are numerous algorithms for object tracking [1], such as mean shift [2], particle filters [3], online boosting [4], and random forests [5]. There is also a great deal of work on object segmentation, such as level set methods [6], graph cut [7], and grab cut [8].
It is well known that, for a VOS system, dealing with general video sequences is an extremely challenging objective, due to appearance variations, irregular motion, and background clutter. Building on object tracking and segmentation, various approaches have been proposed for VOS in recent years. Li et al. [9] directly extend the traditional graph cut [7] algorithm from 2D images to 3D image sequences, and optimize a global energy function to yield the segmentation result. Besides relying heavily on Gaussian mixture color models, this 3D graph cut method is quite time-consuming and does not allow user interaction. Afterward, localized color and shape models were introduced by Xue [10] in the Video SnapCut system, which shows increased discriminative ability and proves to be more efficient. However, due to unexpected optical flow errors when the object is occluded by itself or by others, it is not reliable to perform classification on the object boundary and shift local windows. An alternative method by Brendel et al. [11] focuses on tracking regions.
III. OUR APPROACH
First of all, the initial foreground mask $F(I_k)$ is provided by the user. Since video frames are spatio-temporally cohesive, we can propagate the foreground mask between neighboring frames. From the reference frame (Fig. 2(a)) to the target frame (Fig. 2(b)), propagation is feasible in both directions. Without loss of generality, the following analysis only explains the forward direction, from frame $I_t$ to frame $I_{t+1}$. Naturally, using the selected key frame $I_k$ as the first reference frame and repeatedly applying this propagation procedure, we can obtain foreground masks in all frames, as sketched below.
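As a minimal sketch of this outer propagation loop, where `propagate_mask` is a hypothetical placeholder for the per-frame Regional Back-Track and classification steps described in the rest of this section:

```python
def segment_video(frames, k, user_mask, propagate_mask):
    """Propagate a user-provided key-frame mask through the whole clip.

    frames: list of images; k: key-frame index; user_mask: F(I_k);
    propagate_mask(src_frame, dst_frame, src_mask): hypothetical routine
    implementing one propagation step between neighboring frames.
    """
    masks = [None] * len(frames)
    masks[k] = user_mask
    # Forward pass: I_k -> I_{k+1} -> ...
    for t in range(k, len(frames) - 1):
        masks[t + 1] = propagate_mask(frames[t], frames[t + 1], masks[t])
    # Backward pass is symmetric: I_k -> I_{k-1} -> ...
    for t in range(k, 0, -1):
        masks[t - 1] = propagate_mask(frames[t], frames[t - 1], masks[t])
    return masks
```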
For computational benefit as well as distinctiveness and robustness, each frame is over-segmented into SLIC superpixels [13], which converts the original pixel-connected graph $G_P$ (Fig. 2(b)) into a region-connected graph $G_R$ (Fig. 2(c)).
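As an illustrative sketch of this step (not the authors' code), using scikit-image's SLIC implementation with assumed parameter values:

```python
import numpy as np
from skimage import io, segmentation, graph

frame = io.imread("frame_t.png")  # one video frame (RGB)

# Over-segment the frame into SLIC superpixels; n_segments and
# compactness are assumed values, not taken from the paper.
labels = segmentation.slic(frame, n_segments=400, compactness=10.0)

# Build the region-connected graph G_R: one node per superpixel,
# edges between spatially adjacent regions, weighted by mean color.
# (In older scikit-image this lives in skimage.future.graph.)
rag = graph.rag_mean_color(frame, labels)
print(rag.number_of_nodes(), "regions,", rag.number_of_edges(), "edges")
```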
For each region $R_a$ in frame $I_{t+1}$, the best match region in frame $I_t$ is found by minimizing a regional difference measure:

$$\min_{R_b \in I_t} \; \mathrm{Diff}(R_a, R_b) \tag{1}$$

$$\mathrm{Diff}(R_a, R_b) = \begin{cases} \dfrac{|R_a \cap R_b|}{b^2(R_a) + b^2(R_b)} & \text{if } \ldots \\[4pt] \ldots & \text{otherwise} \end{cases} \tag{2}$$
To summarize, for an arbitrary region $R_a$ in frame $t+1$, the Regional Back-Track method provides the best match region $R_b$ in frame $t$ if the two genuinely correspond; otherwise, it marks $R_a$ as a mismatched region.
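A minimal sketch of this matching step follows. Since the condition in Eq. (2) is not fully recoverable, a simple color-histogram distance stands in for $\mathrm{Diff}$, and `mismatch_thresh` is a hypothetical threshold:

```python
import numpy as np

def color_hist(pixels, bins=8):
    """Normalized bins^3 RGB histogram of a region's pixels (N x 3 array)."""
    h, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    return h.ravel() / max(pixels.shape[0], 1)

def back_track(region_a, regions_t, mismatch_thresh=0.5):
    """Return the best-match region in frame t for region_a in frame t+1,
    minimizing a stand-in Diff as in Eq. (1); return None to mark
    region_a as a mismatched region."""
    hist_a = color_hist(region_a)
    best, best_diff = None, np.inf
    for rb in regions_t:
        diff = np.abs(hist_a - color_hist(rb)).sum()  # stand-in Diff
        if diff < best_diff:
            best, best_diff = rb, diff
    return best if best_diff < mismatch_thresh else None
```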
B. Hierarchical Localized Classifiers
Fig. 2. (a) Frame $I_t$; …; (e) Classification.
In this section, we introduce Hierarchical Localized Classifiers to evaluate the probability that a region in frame $t+1$ belongs to the foreground.
Localized classifiers for a VOS system were introduced in the Video SnapCut system [10], in which a series of overlapping local windows of fixed size are created along the foreground boundary and then propagated through frames. However, due to large boundary variations and local window drift, that method is limited when facing topology changes. In addition, since the size of each local window is fixed, it sacrifices the ability to benefit from multi-scale space. To overcome these limitations, we propose a new solution called Hierarchical Localized Classifiers.
Given a foreground mask $M(I_t)$ and the corresponding foreground bounding box $B(I_t)$ in reference frame $t$, we define a potential searching box $S(I_t)$ by extending $B(I_t)$ by a fixed ratio $\epsilon$ ($\epsilon = 0.3$ in our experiments), using the following equations:

$$\begin{aligned} \mathrm{center}(S(I_t)) &= \mathrm{center}(B(I_t)) \\ \mathrm{height}(S(I_t)) &= (1+\epsilon)\,\mathrm{height}(B(I_t)) \\ \mathrm{width}(S(I_t)) &= (1+\epsilon)\,\mathrm{width}(B(I_t)) \end{aligned} \tag{3}$$
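A minimal realization of Eq. (3), representing boxes as `(cx, cy, w, h)` tuples (the representation is an assumption):

```python
def searching_box(bbox, eps=0.3):
    """Expand the foreground bounding box B(I_t) into the searching
    box S(I_t) per Eq. (3): same center, each side scaled by (1 + eps)."""
    cx, cy, w, h = bbox
    return (cx, cy, (1.0 + eps) * w, (1.0 + eps) * h)

# Example: a 100x60 box centered at (200, 150) grows to 130x78.
print(searching_box((200, 150, 100, 60)))  # (200, 150, 130.0, 78.0)
```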
Next, we build a hierarchical quad-tree structure by splitting the searching box $S(I_t)$, in which each tree node corresponds to a local window. The partition rules are shown in Fig. 3. Then, we generate a localized classifier $L(W_i)$ for each local window $W_i$, as sketched below.
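As an illustrative sketch of the quad-tree construction (the exact partition rules of Fig. 3 are not recoverable here, so a fixed maximum depth is assumed):

```python
def quad_tree_windows(box, depth=0, max_depth=2):
    """Recursively split the searching box S(I_t) into a quad-tree of
    local windows W_i; every node (not only the leaves) is a window,
    so the hierarchy covers multiple scales.
    box: (x, y, w, h) with (x, y) the top-left corner (an assumption)."""
    x, y, w, h = box
    windows = [box]
    if depth < max_depth:
        hw, hh = w / 2.0, h / 2.0
        for child in [(x, y, hw, hh), (x + hw, y, hw, hh),
                      (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]:
            windows += quad_tree_windows(child, depth + 1, max_depth)
    return windows

# Root window plus two levels of quadrants: 1 + 4 + 16 = 21 windows.
print(len(quad_tree_windows((0, 0, 130, 78))))
```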
The outputs of these localized classifiers are combined into a per-region foreground probability:

$$p_{fg}(R_a) = \frac{q_{fg}(R_a)\, q_{R_a}}{q_{fg}(R_a)\, q_{R_a} + q_{bg}(R_a)\,(1 - q_{R_a})} \tag{6}$$
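A one-line realization of this fusion (names mirror Eq. (6); how $q_{fg}$, $q_{bg}$, and $q_{R_a}$ are produced by the localized classifiers is not shown here):

```python
def fuse_foreground_probability(q_fg, q_bg, q_ra):
    """Combine local classifier likelihoods with the region prior q_ra
    as in Eq. (6); returns the posterior foreground probability."""
    num = q_fg * q_ra
    den = num + q_bg * (1.0 - q_ra)
    return num / den if den > 0 else 0.5  # neutral value if undefined

print(fuse_foreground_probability(0.8, 0.3, 0.6))  # 0.8
```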
IV. EXPERIMENTS
Currently, since there are no standard datasets for video segmentation, the testing datasets in our experiments are collected from [15] and [12]. The first video clip is water-skiing from [15] (97 frames, 544×280). The second is diving from [15] (179 frames, 880×488). The third is skating from [15] (573 frames, 552×310). The fourth is dancing from [12] (138 frames, 320×240). Note that these videos are very challenging in terms of dynamic cameras, background clutter, motion blur, object shadows, etc.
We quantitatively analyze our approach on these test datasets. We randomly select 10 frames from each video clip for evaluation and manually label the true foreground. The metric is the standard $F\text{-}Measure$, defined as

$$F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{8}$$

where $Precision$ is the probability that an auto-segmented foreground pixel is a true foreground pixel and $Recall$ is the probability that a true foreground pixel is detected.
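A minimal sketch of this evaluation for a pair of boolean masks (manual ground truth vs. automatic segmentation):

```python
import numpy as np

def f_measure(pred_mask, true_mask):
    """Precision, recall, and F-Measure (Eq. (8)) of a predicted
    foreground mask against a manually labeled one (boolean arrays)."""
    tp = np.logical_and(pred_mask, true_mask).sum()
    precision = tp / max(pred_mask.sum(), 1)
    recall = tp / max(true_mask.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```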
Since there is no available source code or executable binary for current VOS methods, such as [10] and [11], we choose the Grab Cut [8] algorithm for comparison, drawing foreground bounding boxes several times and selecting the best result for each frame. Table I summarizes the comparison, from which we can see that our approach performs much better than Grab Cut. Note that our method works very well when handling visually similar foreground and background (such as the dark legs against the black background in Fig. 4(d)), improving the $F\text{-}Measure$ by as much as twenty-two percentage points. Some examples are shown in Fig. 4, which demonstrate that our method significantly improves the subjective quality of segmentation.
TABLE I
EXPERIMENTAL RESULTS

Video Clip     Method        Precision   Recall    F-Measure
Water-skiing   Grab Cut      0.753       0.911     0.824
               Our Method    0.938       0.849     0.891
               +/-           +0.185      -0.062    +0.067
Diving         Grab Cut      0.823       0.849     0.836
               Our Method    0.914       0.950     0.931
               +/-           +0.091      +0.101    +0.096
Skating        Grab Cut      0.956       0.905     0.930
               Our Method    0.973       0.919     0.945
               +/-           +0.017      +0.014    +0.015
Dancing        Grab Cut      0.873       0.620     0.725
               Our Method    0.946       0.947     0.947
               +/-           +0.073      +0.327    +0.221