[Figure 1 (graphs not reproduced): Average F-score achieved using all audio-visual descriptors, and genre classification "performance" for the best run, i.e., SVM with a linear kernel (graph on top). Genre labels appear along the horizontal axis with the number of test sequences in brackets.]
of some of the proposed genres, the genre-specific content is captured mainly by the textual information. Therefore, our tests focused mainly on genres with specific audio-visual content, such as "art", "food", "cars", "sports", etc., for which we provided a representative number of examples.
Tests were performed using a cross-validation approach. We use p% of the existing sequences for training (randomly selected and uniformly distributed with respect to genre) and the remainder for testing. Experiments were repeated for different training/testing partitions (e.g., 1000 repetitions).
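As a concrete illustration, the following is a minimal sketch of this protocol (repeated random sub-sampling with genre-stratified splits). The feature matrix X, the label list y, and all names here are our own illustration, not artifacts of the original experiments:

```python
import numpy as np
from collections import defaultdict

def stratified_split(labels, p, rng):
    """Select fraction p of the sequences per genre for training;
    the remainder is used for testing."""
    by_genre = defaultdict(list)
    for i, genre in enumerate(labels):
        by_genre[genre].append(i)
    train_idx = []
    for idx in by_genre.values():
        idx = rng.permutation(idx)
        train_idx.extend(idx[: max(1, round(len(idx) * p))])
    test_idx = sorted(set(range(len(labels))) - set(train_idx))
    return sorted(train_idx), test_idx

rng = np.random.default_rng(seed=42)
for rep in range(1000):  # e.g. 1000 repetitions, as in the text
    train_idx, test_idx = stratified_split(y, p=0.5, rng=rng)
    # ... train a classifier on (X[train_idx], y[train_idx]) and
    # record per-genre precision P and recall R on the test split ...
```

Stratifying the random draw per genre keeps the training set uniformly distributed with respect to genre, as required above.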
Figure 1 presents the average Fscore = 2·P·R/(P + R) (where P and R are the average precision and recall, respectively, over all repetitions) for p = 50%, the audio-color-action descriptor set (i.e., the descriptor set which provided the most accurate results), and various classification approaches (see Weka at http://www.cs.waikato.ac.nz/ml/weka/). The numbers in brackets represent the number of test sequences used for each genre.
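Aggregating the per-repetition measurements into the plotted per-genre F-score then reduces to the following (a sketch; precisions and recalls stand for hypothetical per-genre lists accumulated across repetitions):

```python
def average_f_score(precisions, recalls):
    """F-score computed from precision and recall averaged over all repetitions."""
    P = sum(precisions) / len(precisions)
    R = sum(recalls) / len(recalls)
    # Guard against genres with zero precision and recall (see Figure 1).
    return 0.0 if P + R == 0 else 2 * P * R / (P + R)
```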
From the global point of view, the best results are obtained with SVM and a linear kernel (depicted in orange), followed by k-NN (k = 3, depicted in dark green) and FT (Functional Trees, depicted in cyan). At the genre level, the best accuracy is obtained for genres with distinctive audio-visual signatures. The graph on top presents a measure of individual genre classification "performance", computed as the F-score times the number of test sequences used. An F-score obtained over a greater number of sequences is more representative than one obtained over only a few (values are normalized with respect to 1 for visualization purposes). The proposed descriptors provided good discriminative power for genres such as (the number in brackets is the F-score): "food and drink" (0.757), "travel" (0.633), "politics" (0.552), and "web development and sites" (0.697), while at the bottom end are genres whose content is less reflected in the audio-visual information, e.g., "citizen journalism", "business", and "comedy" (see Figure 1).
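In code, this weighting could look as follows (a sketch; f_scores and n_test are hypothetical per-genre dictionaries, not names from the original experiments):

```python
# Weight each genre's F-score by its number of test sequences,
# then normalize so the largest value maps to 1 for plotting.
perf = {genre: f_scores[genre] * n_test[genre] for genre in f_scores}
peak = max(perf.values())
performance = {genre: value / peak for genre, value in perf.items()}
```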
Results on test data. For the final official runs, classification was performed on 1727 sequences, with training performed on the previous data set (648 sequences). The overall results in terms of MAP (Mean Average Precision) are less accurate than the previous ones: 0.077 for k-NN on audio-color-action, 0.027 for RandomForest on audio-color-action, 0.121 for SVM linear on audio-color-action (best run), 0.103 for SVM linear on audio, and 0.038 for SVM linear on color-action. This is mainly due to the limited training data set compared to the diversity of the test sequences, and to the inclusion of genres for which we obtain zero precision (i.e., audio-visual information is not discriminant; see Figure 1). Since MAP provides only an overall average precision over all genres, we are unable to conclude which genres are better suited to being retrieved with audio-visual information and which are not.
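For reference, here is a generic sketch of the MAP computation over ranked retrieval results. This is our formulation of the standard metric, not necessarily the task organizers' exact evaluation script:

```python
def average_precision(ranked_relevant, n_relevant):
    """AP for one genre: ranked_relevant is the ranked list of booleans
    (True where the retrieved sequence is relevant); n_relevant is the
    total number of relevant sequences for that genre."""
    hits, ap = 0, 0.0
    for rank, relevant in enumerate(ranked_relevant, start=1):
        if relevant:
            hits += 1
            ap += hits / rank
    return ap / n_relevant if n_relevant else 0.0

def mean_average_precision(per_genre_rankings, per_genre_n_relevant):
    """MAP: the mean of the per-genre average precisions."""
    aps = [average_precision(r, n)
           for r, n in zip(per_genre_rankings, per_genre_n_relevant)]
    return sum(aps) / len(aps)
```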
4. CONCLUSIONS AND FUTURE WORK
The proposed descriptors performed well for some of the genres; however, a more consistent training database is required to improve the classification performance. Also, our approach is better suited to classifying genre patterns at the global level, such as entire episodes of a series, and is not able to detect genre-related content within a sequence. Future tests will consist of performing cross-validation on all 2375 sequences (development + test sets).

5. ACKNOWLEDGMENTS
Part of this work has been supported under the Financial Agreement EXCEL POSDRU/89/1.5/S/62557.