
1. INTRODUCTION
This chapter introduces the basics of video and image processing. An image or
video is stored in a computer only as a set of pixels with RGB values; the
computer knows nothing about the meaning of these pixel values. The content of
an image is quite clear to a person, but it is not so easy for a computer. For
example, it is a piece of cake for you to recognize yourself in an image or
video, even in a crowd, but this is extremely difficult for a computer.
Preprocessing helps the computer understand the content of an image or video.
What is the so-called content of an image or video? Here, content means features
of the image or video, or of their objects, such as color, texture, resolution
and motion. An object can be viewed as a meaningful component in an image or
video picture; for example, a moving car, a flying bird and a person are all
objects. There are many techniques for image and video processing. This chapter
starts with an introduction to general image processing techniques and then
discusses video processing techniques. The reason for introducing image
processing first is that image processing techniques can be applied to video if
we treat each picture of a video as a still image.
Example: Color
2. BACKGROUND
A few years ago, the problems of representation and retrieval of visual media
were confined to specialized image databases (geographical, medical, pilot
experiments in computerized slide libraries), to the professional applications of
the audiovisual industries (production, broadcasting and archives), and to
computerized training or education. The present development of multimedia
technology and information highways has put content processing of visual media
at the core of key application domains: digital and interactive video, large
distributed digital libraries, and multimedia publishing. Though the most
important investments have been targeted at the information infrastructure
(networks, servers, coding and compression, delivery models, multimedia systems
architecture), a growing number of researchers have realized that content
processing will be a key asset in putting together successful applications. The
need for content processing techniques has been made evident from a variety of
angles, ranging from achieving better quality in compression, allowing user
choice of programs in video-on-demand, and achieving better productivity in
video production, to providing access to large still image databases and
integrating still images and video in multimedia publishing and cooperative
work.
Example: Texture
The content of an image includes resolution, color, intensity and texture. Image
resolution is just the size of the image in terms of display pixels. Color is
represented in the computer using the RGB color model: for each pixel on the
screen there are three bytes (the R, G and B color components) to represent its
color, and each color component is in the range 0 to 255. Intensity is the
gray-level information of a pixel, represented by one byte, so the intensity
value is also in the range 0 to 255. Texture characterizes local variations of
image color or intensity. Although texture-based methods have been widely used
in computer vision and graphics, there is no single commonly accepted definition
of texture; each texture analysis method defines texture according to its own
model. We consider texture a descriptor of local color or intensity variation:
image regions detected as having similar texture have a similar pattern of local
variation of color or intensity.
3. COMMON IMAGE PROCESSING TECHNIQUES
3.1 Dithering:
Dithering is a process of using a pattern of solid dots to simulate shades of
gray. Different shapes and patterns of dots have been employed in this process,
but the effect is the same: when viewed from a great enough distance that the
dots are not discernible, the pattern appears as a solid shade of gray.
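For illustration, the following Python sketch simulates shades of gray by
thresholding against a tiled dot pattern (ordered dithering). The function
names and the 2x2 Bayer matrix are assumptions of this sketch, not details from
the text.

import numpy as np

BAYER_2X2 = np.array([[0, 2],
                      [3, 1]]) / 4.0  # normalized 2x2 threshold pattern

def ordered_dither(gray):
    """Turn a gray image (values in [0, 1]) into a binary dot pattern."""
    h, w = gray.shape
    # Tile the threshold matrix over the whole image, then compare per pixel.
    thresh = np.tile(BAYER_2X2, (h // 2 + 1, w // 2 + 1))[:h, :w]
    return (gray > thresh).astype(np.uint8)  # 1 = white, 0 = black dot

ramp = np.linspace(0.0, 1.0, 32).reshape(1, -1).repeat(8, axis=0)
print(ordered_dither(ramp))  # dots get denser as the ramp brightens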
3.2 Erosion:
Erosion is the process of eliminating all the boundary points from an object,
leaving the object smaller in area by one pixel all around its perimeter. If the
object narrows to less than three pixels thick at any point, it will become
disconnected (into two objects) at that point. Erosion is useful for removing,
from a segmented image, objects that are too small to be of interest. Shrinking
is a special kind of erosion in which single-pixel objects are left intact; this
is useful when the total object count must be preserved. Thinning is another
special kind of erosion, implemented as a two-step process: the first step marks
all candidate pixels for removal, and the second step actually removes those
candidates that can be removed without destroying object connectivity.
3.3 Dilation:
Dilation is the process of incorporating into the object all the background
pixels that touch it, leaving it larger in area by that amount. If two objects
are separated by less than three pixels at any point, they will become connected
(merged into one object) at that point. Dilation is useful for filling small
holes in segmented objects. Thickening is a special kind of dilation,
implemented as a two-step process: the first step marks all the candidate pixels
for addition, and the second step adds those candidates that can be added
without merging objects.
3.4 Opening:
The process of erosion followed by dilation is called opening. It has the effect
of eliminating small and thin objects, breaking objects at thin points, and
generally smoothing the boundaries of larger objects without significantly
changing their area.
3.5 Closing:
The process of dilation followed by erosion is called closing. It has the effect
of filling small and thin holes in objects, connecting nearby objects, and
generally smoothing the boundaries of objects without significantly changing
their area.
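All four operators map directly onto standard library routines. As a minimal
sketch, assuming the scipy.ndimage module (one of several libraries that provide
these operators), the example below applies erosion, dilation, opening and
closing to a small binary object:

import numpy as np
from scipy import ndimage

# Small binary image: a 2-pixel-thick bar with a one-pixel hole.
obj = np.zeros((7, 9), dtype=bool)
obj[2:4, 1:8] = True
obj[2, 4] = False  # the hole

eroded  = ndimage.binary_erosion(obj)   # peels one boundary layer; thin parts vanish
dilated = ndimage.binary_dilation(obj)  # adds touching background pixels; fills the hole
opened  = ndimage.binary_opening(obj)   # erosion then dilation: removes thin/small pieces
closed  = ndimage.binary_closing(obj)   # dilation then erosion: fills small holes

for name, img in [("eroded", eroded), ("dilated", dilated),
                  ("opened", opened), ("closed", closed)]:
    print(name, int(img.sum()), "pixels set")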
3.6 Filtering:
Image filtering can be used for noise reduction, image sharpening and image
smoothing. By applying a lowpass or highpass filter to the image, the image can
be smoothed or sharpened, respectively. A lowpass filter reduces the amplitude
of high-frequency components. Simple lowpass filters apply local averaging: the
gray level at each pixel is replaced with the average of the gray levels in a
square or rectangular neighborhood. The Gaussian lowpass filter attenuates high
frequencies smoothly and is commonly applied via the Fourier transform of the
image. A highpass filter increases the amplitude of high-frequency components.
3.7 Segmentation:
Segmentation is useful for detecting sets of pixels that are adjacent or
touching and that share common features such as color, intensity or texture.
When a human observer views a scene, his or her visual system segments the scene
automatically; the process is so fast and efficient that one sees not a complex
scene but rather a collection of objects. A computer, however, must laboriously
isolate the objects in an image by breaking the image into sets of pixels, each
of which is the image of one object.
Image segmentation can be approached in three ways. The first is the region
approach, in which each pixel is assigned to a particular object or region. In
the boundary approach, only the boundaries that exist between the regions are
located. The third is the edge approach, in which edge pixels are identified and
then linked together to form the required boundaries.
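As an illustration of the region approach, the sketch below (the names and
threshold are assumptions of this sketch) assigns pixels to foreground or
background by thresholding and then groups adjacent foreground pixels into
labeled regions:

import numpy as np
from scipy import ndimage

def region_segment(gray, threshold=0.5):
    """Assign every pixel to foreground/background, then split the
    foreground into connected regions (candidate objects)."""
    mask = gray > threshold                  # per-pixel assignment
    labels, n_regions = ndimage.label(mask)  # group adjacent pixels into regions
    return labels, n_regions

img = np.zeros((10, 10))
img[1:4, 1:4] = 0.9   # object 1
img[6:9, 5:9] = 0.8   # object 2
labels, n = region_segment(img)
print(n, "regions")   # expected: 2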
3.8 Object Recognition:
The most difficult part of image processing is object recognition. Although
there are many image segmentation algorithms that can segment an image into
regions with some continuous feature, it is still very difficult to recognize
objects from these regions. There are several reasons for this. First, image
segmentation is an ill-posed task, and there is always some degree of
uncertainty in the segmentation result. Second, an object may contain several
regions, and how to connect different regions is another problem. At present, no
algorithm can segment general images into objects automatically with high
accuracy. When there is some prior knowledge about the foreground objects or the
background scene, the accuracy of object recognition can be quite good. Usually
the image is first segmented into regions according to the pattern of color or
texture, and separate regions are then grouped to form objects. The grouping
process is important for the success of object recognition. Fully automatic
grouping is possible only when prior knowledge about the foreground objects or
background scene exists; in other cases, human interaction may be required to
achieve good object recognition accuracy.
4. BASICS OF VIDEO PROCESSING
4.1 Content of Digital Video
Generally speaking, there is much similarity between digital video and images.
Each picture of a video can be treated as a still image, and all the techniques
applicable to images can also be applied to video pictures. However, there are
still differences. The most significant difference is that video has temporal
information and uses motion estimation for compression. Video is a meaningful
group of pictures that tells a story or something else. Video pictures can be
grouped into shots: a video shot is a set of pictures taken in one camera break.
Within each shot there can be one or more key pictures; a key picture is a
representative of the content of a video shot, and for a long video shot there
may be multiple key pictures. Usually, video processing segments video into
separate shots, selects key pictures from these shots, and then generates
features of these key pictures. The features (color, texture, objects) of the
key pictures are what is searched in a video query.
Video processing includes shot detection, key picture selection, feature
generation and object extraction.
4.1.1 Shot Detection:
Shot detection is the process of detecting camera shots; a camera shot consists
of one or more pictures taken in one camera break. The general approach to shot
detection has been to define a difference metric: if the difference between two
pictures exceeds a threshold, there is a shot boundary between them. An
algorithm can be built on this idea that uses binary search over the picture
differences to locate boundaries, which makes it very fast while achieving good
performance.
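A minimal sketch of such a difference metric and detector, assuming a gray-level
histogram metric and an illustrative threshold (the binary-search refinement
mentioned above is omitted for brevity):

import numpy as np

def hist_diff(frame_a, frame_b, bins=64):
    """Difference metric: L1 distance between normalized gray-level histograms."""
    ha = np.histogram(frame_a, bins=bins, range=(0, 256))[0] / frame_a.size
    hb = np.histogram(frame_b, bins=bins, range=(0, 256))[0] / frame_b.size
    return float(np.abs(ha - hb).sum())

def detect_shots(frames, threshold=0.5):
    """Return indices i where a shot boundary lies between frames i and i+1."""
    return [i for i in range(len(frames) - 1)
            if hist_diff(frames[i], frames[i + 1]) > threshold]

rng = np.random.default_rng(0)
shot1 = [rng.normal(80, 10, (48, 64)) for _ in range(5)]   # dark scene
shot2 = [rng.normal(170, 10, (48, 64)) for _ in range(5)]  # bright scene
print(detect_shots(shot1 + shot2))  # expected: [4]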
4.1.2 Key Picture Selection:
After shot detection, each shot is represented by at least one key picture. The
choice of key picture can be as simple as taking a particular picture in the
shot: the first, the last, or the middle one. However, in situations such as a
long shot, no single picture can represent the content of the entire shot. QBIC
(Query By Image Content) uses a synthesized key picture created by seamlessly
mosaicking all the pictures in a given shot, using the computed motion
transformation of the dominant background; this picture is an authentic
depiction of all the background captured in the whole shot. In a CBIR
(Content-Based Image Retrieval) system, key picture selection is a simple
process that usually chooses the first and last pictures of a shot as key
pictures.
4.1.3 Feature Generation:
After key picture selection, features of the key pictures, such as color,
texture and intensity, are stored as indexes of the video shot. Users can then
perform traditional keyword search as well as content-based queries that specify
a color, intensity or texture pattern. Only the generated features are searched,
so retrieval can run in real time.
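As a sketch of feature generation, assuming a joint RGB histogram as the stored
feature (an illustrative choice, not prescribed by the text):

import numpy as np

def color_feature(picture, bins=8):
    """Illustrative feature: a joint RGB histogram flattened to a vector.
    `picture` is an (H, W, 3) uint8 array."""
    hist, _ = np.histogramdd(picture.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return (hist / hist.sum()).ravel()

def build_index(key_pictures):
    """Toy index: shot id -> feature vector of its key picture."""
    return {shot_id: color_feature(pic) for shot_id, pic in key_pictures.items()}

rng = np.random.default_rng(1)
pics = {0: rng.integers(0, 256, (32, 32, 3), dtype=np.uint8),
        1: rng.integers(0, 128, (32, 32, 3), dtype=np.uint8)}
index = build_index(pics)
print({k: v.shape for k, v in index.items()})  # 512-dimensional vectors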
4.1.4 Object Extraction:
During shot detection and key picture selection, the objects in the video can
also be extracted, using image segmentation techniques or motion information.
Segmentation-based techniques are mainly based on image segmentation: objects
are recognized and tracked by segmentation projection. Motion-based techniques
make use of motion vectors to distinguish objects from the background and to
keep track of their motion. Object extraction is a very difficult problem. The
new standard being developed will address how to obtain objects in the video and
encode them separately into different layers. One hopes this process will not be
manual, but it is also unrealistic to expect it to be fully automatic.
5. CURRENT RESEARCH ON VIDEO INDEXING AND RETRIEVAL
Video indexing and retrieval is a very active research area. In the field of
digital video, computer-assisted content-based indexing is a critical technology
and currently a bottleneck in the productive use of video resources: only an
indexed video can effectively support retrieval and distribution in video
editing and production, video-on-demand, and multimedia information systems. To
achieve this, we need algorithms and systems that can store and retrieve video
in a way that allows flexible and efficient search based on content. This
chapter discusses some important aspects of the state of the art in video
indexing and retrieval. It is organized as follows:
Video Parsing
Video Indexing and Retrieval
Object Recognition and Motion Tracking
5.1 Video Parsing
The first step of video processing is video parsing: the process of segmenting a
video stream into generic shots. These shots are the elementary index units in a
video database, just like words in a text database. Each of these shots is then
represented by one or more key pictures, and only these key pictures are stored
in the video database. Video parsing thus involves several tasks, including shot
detection and key picture selection.
5.1.1 Shot Detection in Video Parsing:
The first step of video parsing is shot detection. Shot detection algorithms
usually belong to two classes: (1) those based on global representations, such
as color/intensity histograms, without any local information; and (2) those
based on measuring local differences, such as intensity change. The former are
relatively insensitive to motion but can miss shot boundaries when scenes look
quite different yet have similar distributions. The latter are sensitive to
moving objects and camera motion. Some systems combine the advantages of the two
classes by using a mixed method; QBIC is one of these systems.
5.1.2 Key Picture Selection in Video Parsing:
The next step after shot detection is key picture selection. Each shot has at
least one key picture, chosen to best represent the visual content of the shot.
The number of key pictures per shot can be constant or can adapt to the shot
content. The first picture is selected as a key picture candidate, and
subsequent pictures are compared against it. A two-threshold technique, similar
to the one described above, is applied to identify a picture significantly
different from the candidate; this new picture is considered another key
picture, and subsequent pictures are then compared against it. Users can control
the density of key pictures by adjusting the two threshold values.
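A hedged sketch of this two-threshold selection; the histogram difference and
the threshold values are illustrative assumptions:

import numpy as np

def select_key_pictures(frames, t_high=0.4, t_low=0.1):
    """Two-threshold sketch: a difference above t_high immediately yields a new
    key picture; smaller differences above t_low accumulate, so slow drift
    eventually triggers a new key picture too. Raising the thresholds lowers
    the key picture density."""
    def diff(a, b):
        ha = np.histogram(a, bins=64, range=(0, 256))[0] / a.size
        hb = np.histogram(b, bins=64, range=(0, 256))[0] / b.size
        return float(np.abs(ha - hb).sum())

    keys, acc = [0], 0.0   # the first picture is always a key picture
    for i in range(1, len(frames)):
        d = diff(frames[keys[-1]], frames[i])
        if d > t_low:
            acc += d
        if d > t_high or acc > 2 * t_high:
            keys.append(i)  # significantly different: new candidate
            acc = 0.0
    return keys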
5.1.3 Feature Generation in Video Parsing:
As in Section 4.1.3, after key picture selection, features of the key pictures
such as color, texture and intensity are stored as indexes of the video shot.
Only these generated features are searched, so both keyword queries and
content-based queries (specifying a color, intensity or texture pattern) can be
answered in real time.
5.2 Video Indexing and Retrieval
After each object in a video shot has been segmented and tracked, its features,
such as color, texture and motion, can be obtained and stored in a feature
database. The resulting database holds simple feature-value pairs, and the
actual query is performed on this feature database. For each feature there is a
function to calculate the distance between the query object and the tracked
objects in the video database; the total distance is a weighted sum of these
per-feature distances. If the total distance is below a certain threshold, the
object is returned as a possible match.
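A minimal sketch of this weighted matching, in which the feature set, the
distance functions and the weights are all illustrative assumptions:

import numpy as np

def color_dist(q, t):   return float(np.abs(q - t).sum())
def texture_dist(q, t): return float(np.linalg.norm(q - t))
def motion_dist(q, t):  return float(np.linalg.norm(q - t))

DISTANCES = {"color": color_dist, "texture": texture_dist, "motion": motion_dist}
WEIGHTS   = {"color": 0.5, "texture": 0.3, "motion": 0.2}

def total_distance(query_obj, tracked_obj):
    """Weighted sum of per-feature distances between two feature dictionaries."""
    return sum(WEIGHTS[f] * DISTANCES[f](query_obj[f], tracked_obj[f])
               for f in DISTANCES)

def retrieve(query_obj, database, threshold=1.0):
    """Return ids of tracked objects whose total distance is below the threshold."""
    return [oid for oid, obj in database.items()
            if total_distance(query_obj, obj) < threshold]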
There are also a number of image retrieval systems, such as Yahoo Image Surfer
Category List, WebSeer, WebSEEK, VisualSEEk, UCB's query-all-images work, Lycos
and MIT's Photobook. Some of them are mainly based on keyword searching: the
images are assigned one or more keywords manually and are categorized into
groups such as photos, arts, people, animals and plants, and users can then
browse the categories that interest them.
5.2.1 Examples of some image processing systems:
Two examples are "Yahoo Image Surfer Category List (YISCL)" and "Lycos". The
YISCL system also provides a visual search function based on color distribution
matching. UCB's query-all-images work presents several interesting ideas such as
"Blobworld" and "body plans". A blob is a region: while Blobworld does not exist
completely in the "thing" domain, it recognizes the nature of images as
combinations of objects, and querying and learning in Blobworld are more
meaningful than they are with simple "stuff" representations. The
Expectation-Maximization (EM) algorithm is used to perform automatic
segmentation based on image features; after segmentation, each region is shown
as an elliptical blob. A body plan is an algorithm for image segmentation: a
body plan is a sophisticated model of the way a horse, for example, is put
together, and as a result the program is capable of recognizing horses in
different poses. MIT's Photobook allows users to perform texture modeling, face
recognition, shape matching, brain matching, and interactive segmentation and
annotation. WebSEEK allows users to draw a query that depicts the spatial
relations between objects.
5.3 Object Recognition and Motion Tracking:
Object recognition and motion tracking are important topics. In video, object
recognition can be more reliable than in still images because more information
is available; the most valuable information is the motion vectors. The motion
vectors of a moving object follow intrinsic patterns that conform to a motion
model. Several papers describe object recognition using an affine motion model,
which allows determining the transformation that occurred between images.
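As a sketch, an affine motion model u = A x + t can be fitted to sparse motion
vectors by least squares; everything below is our illustration of that idea, not
a specific published method:

import numpy as np

def fit_affine(points, motion_vectors):
    """Least-squares fit of an affine motion model [u, v]^T = A [x, y]^T + t.
    points: (N, 2) positions; motion_vectors: (N, 2) displacements."""
    x, y = points[:, 0], points[:, 1]
    ones = np.ones(len(points))
    # Each motion component is a linear function of (x, y, 1).
    M = np.stack([x, y, ones], axis=1)
    pu, *_ = np.linalg.lstsq(M, motion_vectors[:, 0], rcond=None)
    pv, *_ = np.linalg.lstsq(M, motion_vectors[:, 1], rcond=None)
    A = np.array([pu[:2], pv[:2]])
    t = np.array([pu[2], pv[2]])
    return A, t

rng = np.random.default_rng(2)
pts = rng.uniform(0, 100, (50, 2))
A_true = np.array([[0.02, -0.01], [0.01, 0.03]]); t_true = np.array([1.0, -2.0])
mv = pts @ A_true.T + t_true
A, t = fit_affine(pts, mv)
print(np.allclose(A, A_true), np.allclose(t, t_true))  # True True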
6. BASIC REQUIREMENTS
6.1 Video Data Modeling
In a conventional database management system (DBMS), access to data is based on
distinct attributes of well-defined data developed for a specific application.
For unstructured data such as audio, video or graphics, similar attributes can
be defined, but a means for extracting the information contained in the
unstructured data is required. Next, this information must be appropriately
modeled in order to support both user queries for content and data models for
storage.
Fig.: First Stage in Video Data Adaptation: Data Modeling
From a structural perspective, a motion picture can be modeled as data
consisting of a finite length of synchronized audio and still images. This model
is a simple instance of the more general models for heterogeneous multimedia
data objects. Davenport described the fundamental film component as the shot: a
contiguously recorded audio/image sequence. To this basic component, attributes
such as content, perspective and context can be assigned and later used to
formulate specific queries on a collection of shots. Such a model is appropriate
for providing multiple views on the final data schema and has been suggested by
Lippman and Bender. Smith and Davenport use a technique called stratification
for aggregating collections of shots by contextual descriptions called strata.
These strata provide access to frames over a temporal span, rather than to
individual frames or shot endpoints. This technique can be used primarily for
editing and creating movies from source shots, but it also provides quick query
access and a view of desired blocks of video. Because of the linearity of the
medium, we cannot get a coherent description of an item directly, but as a
result of the stratification method the related information is grouped together:
the linear integrity of the raw footage is dissolved, resulting in contextual
information that relates each shot to its environment. Rowe has developed a
video-on-demand system for video data browsing. In this system, the data are
modeled based on a survey of what users would query for, and three types of
indices were identified to satisfy user queries. The first is a textual
bibliographic index, which includes information about the video and the
individuals involved in making it. The second is a textual structural index of
the hierarchy of a movie, i.e., segments, scenes and shots. The third is a
content index, which includes keyword indices for the audio track, object
indices for significant objects, and key images representing important events in
the video.
The above models do not utilize the semantics associated with video data.
Different video data types have different semantics associated with them; we
should take advantage of this fact and model video data based on the semantics
associated with each data type.
6.2 Video Indexing
Video annotation, or indexing, is the process of attaching content-based labels
to video. More precisely, video indexing is the process of extracting from the
video data the temporal location of a feature and its value.
6.2.1 Need for video indexing:
Indexing video data is essential for providing content-based access. Indexing
has typically been viewed either from a manual annotation perspective or from an
image sequence processing perspective. The indexing effort is directly
proportional to the granularity of video access, and as applications demand
finer-grained access to video, automation of the indexing process becomes
essential. Given the current state of the art in computer vision, pattern
recognition and image processing, reliable and efficient automation is possible
for low-level video indices such as cuts and image motion properties.
Existing work on content-based video access and video indexing can be grouped
into three main categories.
6.2.1.1 High-level indexing:
The work by Davis is an excellent instance of high-level indexing. This approach
uses a set of predefined index terms for annotating video; the index terms are
organized based on high-level ontological categories such as action, time and
space. High-level indexing techniques are primarily designed from the
perspective of manual indexing or annotation. This approach is suitable for
dealing with small quantities of new video and for accessing previously
annotated databases.
6.2.1.2 Low-level indexing:
These techniques provide access to video based on properties like color,
texture, etc., and can be classified under the label of low-level indexing. The
driving idea behind this group of techniques is to extract data features from
the video, organize the features based on some distance metric, and use
similarity-based matching to retrieve the video. Their primary limitation is the
lack of semantics attached to the features.
6.2.1.3 Domain-specific indexing:
These techniques use the high-level structure of video to constrain low-level
video feature extraction and processing. They are effective in their intended
domain of application; their primary limitation is their narrow range of
applicability.
Figure 1 shows a conceptual architecture for content-based video indexing and
search.
Figure 1: Parsing, representation and organization of video content
6.3 Video Data Management
We want to know how to extract content from segmented video shots and then index
it effectively, so that users can retrieve and browse large video collections.
Management of sequential video streams includes three steps: parsing, content
extraction and indexing, and retrieval and browsing.
Video parsing is the process of detecting scene changes, or the boundaries
between camera shots, in a video stream; the video stream is segmented into
generic clips. These clips are the elemental index units in a video database,
just like words in a text database. Each clip is then represented visually by
its key frames, and to reduce the mass storage requirements, only these key
frames are stored in the database. There are two types of transitions: abrupt
transitions (camera breaks) and gradual transitions, e.g., fade-in, fade-out,
dissolve and wipe.
Indexing tags video clips when the system inserts them into the database. The
tag includes information based on a knowledge model that guides the
classification according to the semantic primitives of the images; indexing is
thus driven by the image itself and by any semantic descriptors provided by the
model. Two types of indices, text-based and image-based, are needed. The
text-based index is typed in by a human operator based on the key frames, using
a content logger. The image-based index is constructed automatically from the
image features extracted from the key frames.
In retrieval and browsing, users can access the database through queries based
on text and/or visual examples, or browse it by interacting with displays of
meaningful icons; users can also browse the results of a retrieval query. It is
important that both retrieval and browsing appeal to the user's visual
intuition. In a visual query, users want to find video shots that look similar
to a given example; in a concept query, users want to find video shots by the
presence of specific objects or events. Visual queries can be realized by
directly comparing low-level visual features, such as color, texture, shape and
temporal variance, of video shots or their representative frames (i.e., key
frames). The concept query, on the other hand, depends on object detection,
tracking and recognition. Since fully automatic object extraction is still
impossible, some degree of user interaction is necessary in this process, but
manual indexing labour can be greatly reduced with the help of video analysis
techniques.
7. IMAGE PROCESSING TECHNIQUES FOR VIDEO CONTENT EXTRACTION
The increase in the diversity and availability of electronic information has led
to additional processing requirements for retrieving relevant and useful data:
the accessibility problem. This problem is even more relevant for audiovisual
information, where huge amounts of data have to be searched, indexed and
processed. Most of the solutions for this type of problem point towards a common
need: to extract relevant information features for a given content domain, a
process that involves two difficult tasks: deciding what is relevant and
extracting it. In fact, while content extraction techniques are reasonably
developed for text, video data is still essentially opaque. Despite its obvious
advantages as a communication medium, the lack of suitable processing and
communication platforms has delayed its introduction in a generalized way. This
situation is changing, and new video-based applications are being developed.
7.1 Toolkit Overview
VideoCEL is basically a library for video content extraction. Its components
extract relevant features of video data and can be reused by different
applications. The object model includes components for video data modelling and
tools for processing and extracting video content, but currently the video
processing is restricted to images.
At the data modelling level, the most significant concepts are the following:
Images, which represent the frame data: a numerical matrix whose values can be
colors, color map entries, etc.
ColorMaps, which map entries into a color space, allowing an additional
indexation level;
ImageDisplayConverters and ImageIOHandlers, which convert images to and from the
specific formats of the platforms.
The object model of VideoCEL is a subset of a more complete model, which also
includes concepts such as shots, shot sequences and views; these concepts are
modelled in a distinct toolkit that provides functionality for indexing,
browsing and playing annotated video segments.
A shot object is a discrete sequence of images with a set of temporal
attributes, such as frame rate and duration, and represents a video segment. A
shot sequence object groups several shots using some semantic criterion. Views
are used to visualize and browse shots and shot sequences.
7.2 Temporal Segmentation Tools
One of the most important tasks in video analysis is to specify a unit set in
which the video temporal sequence may be organized. The different video
transitions are important for video content identification and for defining the
semantics of the video language, making their detection one of the primary
goals. The basic assumption of transition detection procedures is that video
segments are spatially and temporally continuous, so the images at a boundary
must undergo significant content changes, which depend on the transition type
and can be measured. The original problem is thus reduced to the search for
suitable difference quantification metrics whose maxima identify, with high
probability, the temporal locations of transitions.
7.3 Cut Detection
The process of detecting cuts is quite simple, mainly because the changes in
content are very visible and always occur instantaneously between consecutive
frames. The implemented algorithm simply uses one of the quantification metrics,
and a cut is declared when the difference is above a certain threshold; its
success is therefore greatly dependent on the suitability of the metric. The
results obtained by applying this procedure to some of our metrics are presented
next. The thresholds were selected empirically, trying to maximize the success
of the detection (minimizing simultaneously the false and missed detections).
The captured video segment belongs to an outdoor news report, so its transitions
are not very "artistic" (mainly cuts). There are several well-known strategies
that usually improve this detection. For instance, the use of adaptive
thresholds increases the flexibility of the threshold, allowing the algorithm to
adapt to diverse video content. An approach used with some success in previous
work, to compensate for the specific weaknesses of individual metrics, is simply
to produce a weighted average of the differences obtained with two or more
metrics. Pre-processing the images with noise filters or lower-resolution
operators is also quite usual, reducing image noise as well as processing
complexity. The distinctive treatment of image regions, in order to eliminate
some of the more extreme values, remarkably increases the detection accuracy,
especially when only a few objects move in the captured scene.
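A hedged sketch combining two of the strategies above, a pixel-difference metric
with a sliding adaptive threshold; the constants are illustrative:

import numpy as np

def frame_diff(a, b):
    """Pixel-difference metric: mean absolute gray-level change."""
    return float(np.abs(a.astype(float) - b.astype(float)).mean())

def detect_cuts_adaptive(frames, k=3.0, window=12):
    """Declare a cut when a difference exceeds mean + k*std of the recent
    differences (a sliding adaptive threshold)."""
    diffs = [frame_diff(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    cuts = []
    for i, d in enumerate(diffs):
        recent = diffs[max(0, i - window):i]
        if len(recent) >= 3:
            mu, sigma = np.mean(recent), np.std(recent)
            if d > mu + k * sigma + 1e-6:
                cuts.append(i)  # cut between frame i and i+1
    return cuts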
7.4 Gradual Transition Detection
Gradual transitions such as fades, dissolves and wipes cause more gradual
changes, which evolve over several images. Although the resulting differences
are less distinct from the average values, and can be similar to those caused by
camera operations, there are several successful procedures that were adapted and
are currently supported by the toolkit.
7.4.1 Twin-Comparison algorithm:
This algorithm was developed after verifying that, although the first and last
frames of a gradual transition are quite different, consecutive images remain
very similar. Thus, as in cut detection, the procedure uses one of the
difference metrics, but instead of one threshold it has two: a higher one for
cuts and a lower one for gradual transitions. While this algorithm just detects
gradual transitions and distinguishes them from cuts, there are other approaches
that also classify fades, dissolves and wipes.
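A minimal sketch of the two-threshold logic, assuming (as in the classical
formulation of twin-comparison) that differences during a candidate gradual
transition are accumulated and compared against the cut threshold; the
thresholds are illustrative:

def twin_comparison(diffs, t_cut=0.5, t_grad=0.15):
    """diffs[i] is the difference between frames i and i+1. A value above t_cut
    is a cut; a value above t_grad starts a candidate gradual transition, whose
    accumulated difference must reach t_cut to be confirmed."""
    events = []
    i = 0
    while i < len(diffs):
        if diffs[i] > t_cut:
            events.append(("cut", i, i + 1))
        elif diffs[i] > t_grad:
            start, acc = i, 0.0
            while i < len(diffs) and diffs[i] > t_grad:
                acc += diffs[i]
                i += 1
            if acc > t_cut:  # accumulated change comparable to a cut
                events.append(("gradual", start, i))
            continue
        i += 1
    return events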
7.4.2 Edge-Comparison algorithm:
This algorithm analyses both edge change fractions: exiting and entering.
Distinct gradual transitions generate characteristic variations of these values.
For instance, a fade-in always generates an increase in the entering edge
fraction; conversely, a fade-out causes an increase in the exiting edge
fraction; and a dissolve has the same effect as a fade-out followed by a
fade-in.
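A sketch of computing the entering and exiting edge fractions; the edge detector
and the dilation tolerance are assumptions of this illustration:

import numpy as np
from scipy import ndimage

def edge_map(gray, sigma=1.0, thresh=0.1):
    """Binary edge map from the gradient magnitude of a smoothed image."""
    g = ndimage.gaussian_filter(gray.astype(float), sigma)
    gr, gc = np.gradient(g)  # derivatives along rows and columns
    return np.hypot(gr, gc) > thresh

def edge_change_fractions(prev_gray, next_gray, dilate_iter=2):
    """Entering fraction: new edges far from old ones.
    Exiting fraction: old edges far from new ones."""
    e0, e1 = edge_map(prev_gray), edge_map(next_gray)
    near0 = ndimage.binary_dilation(e0, iterations=dilate_iter)
    near1 = ndimage.binary_dilation(e1, iterations=dilate_iter)
    entering = (e1 & ~near0).sum() / max(e1.sum(), 1)
    exiting  = (e0 & ~near1).sum() / max(e0.sum(), 1)
    return float(entering), float(exiting)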
7.5 Camera Operation Detection
As distinct transitions give different meanings to adjacent video segments, the
possible camera operations are also relevant for content identification. For
example, this information can be used to build salient stills and to select key
frames or segments for video representation. All the methods that detect and
classify camera operations start from the following observation: each operation
generates characteristic global changes in the captured objects and background.
For example, when a pan occurs, they move horizontally in the direction opposite
to the camera motion; the behaviour of tilts is similar but along the vertical
axis; zooms generate convergent or divergent motion.
7.5.1 X-ray based method:
This approach basically produces fingerprints of the global motion flow. After
extracting the edges, each image is reduced to its horizontal and vertical
projections, a column and a row that roughly represent the horizontal and
vertical global motions; these projections are usually referred to as the x-ray
images.
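A sketch of building the x-ray projections and of using them to estimate a
horizontal (pan-like) shift; the edge detector and the alignment search are our
illustration, not the toolkit's exact procedure:

import numpy as np
from scipy import ndimage

def xray_projections(gray, sigma=1.0, thresh=0.1):
    """Reduce an image to its horizontal and vertical edge projections."""
    g = ndimage.gaussian_filter(gray.astype(float), sigma)
    gr, gc = np.gradient(g)
    edges = np.hypot(gr, gc) > thresh
    row_profile = edges.sum(axis=0)  # one row: horizontal distribution of edges
    col_profile = edges.sum(axis=1)  # one column: vertical distribution
    return row_profile, col_profile

def estimate_pan_shift(profile_a, profile_b, max_shift=10):
    """Horizontal global motion estimate: the shift that best aligns two
    consecutive row profiles (a simple 1-D search)."""
    best, best_err = 0, np.inf
    n = len(profile_a)
    for s in range(-max_shift, max_shift + 1):
        lo, hi = max(0, s), min(n, n + s)
        err = np.abs(profile_a[lo - s:hi - s] - profile_b[lo:hi]).mean()
        if err < best_err:
            best, best_err = s, err
    return best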
7.6 Lighting Conditions Characterization
Light effects are usually discussed in the grammar of cinema language, as their
contribution is essential to the overall meaning of video content. The lighting
conditions can easily be extracted by observing the distribution of the light
intensity histogram: its mode and mean are valuable in characterizing its
distribution type and spread. These features also allow quantifying lighting
variations once the similarity of the images is determined.
7.7 Scene Segmentation
Scene segmentation refers to decomposing the image into its main components:
objects, background, captions, etc. It is a first step for identifying and
classifying the main features of the scene and for tracking them throughout the
sequence. The simplest implemented segmentation method is amplitude
thresholding, which is quite successful when the different regions have distinct
amplitudes; it is a particularly useful procedure for binarizing captions. Other
methods are described below.
7.7.1 Region-based segmentation:
Region-based segmentation procedures find the various regions of an image that
have similar features. One such algorithm is the split-and-merge algorithm,
which first divides the image into atomic homogeneous regions and then merges
similar adjacent regions until they are sufficiently different. Two distinct
metrics are needed: one for measuring the homogeneity of the initial regions
(the variance or any other difference measure) and another for quantifying the
similarity of adjacent regions (the average, median, mode, etc.).
7.7.2 Motion-based segmentation:
The main idea in motion-based segmentation techniques is to identify image
regions with similar motion behaviours. These properties are determined by
analysing the temporal evolution of the pixels; the process is carried out on
the frequency image produced over the whole image sequence. When the more
constant pixels are selected, for example, the resulting image is the
background, and the motion is removed. Once the background is extracted, the
same principle can be used to extract and track motion or objects.
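One concrete way to select the "more constant" pixels is a temporal median over
the sequence; the sketch below is our illustration of that idea, not necessarily
the toolkit's exact procedure:

import numpy as np

def extract_background(frames):
    """Temporal median: the most 'constant' value of each pixel across the
    sequence, which removes moving objects."""
    stack = np.stack([f.astype(float) for f in frames])  # shape (T, H, W)
    return np.median(stack, axis=0)

def motion_mask(frame, background, thresh=20.0):
    """Pixels that differ enough from the background are moving regions."""
    return np.abs(frame.astype(float) - background) > thresh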
7.7.3 Scene and object detection:
The process of detecting scenes or scene regions (objects) is, in a certain way,
the opposite of transition detection: we want to find image regions whose
differences are below a certain threshold. Consequently, this procedure also
uses difference quantification metrics. These functions can be computed for the
entire image, or a hierarchical calculation with growing resolution can be
performed to accelerate the process. Another tested algorithm, also
hierarchical, is based on the Hausdorff distance: it retrieves all the possible
transformations (translation, rotation, etc.) between the edges of two images.
Another way of extracting objects is by representing their contours. The toolkit
uses a polygonal line approach to represent contours as a set of connected
segments; the end of a segment is detected when the ratio between the current
segment's polygonal area and its length exceeds a certain threshold.
7.7.4 Caption extraction:
Based on an existing caption extraction method, a new and more effective
procedure was implemented. As captions are usually artificially added to images,
the first step of this procedure is to extract high-contrast regions. This task
is performed by segmenting the edge image, whose contours have previously been
dilated by a certain radius. These regions are then subjected to
caption-characteristic size constraints based on the properties of the x-rays
(projections of edge images); just the horizontal cluster remains. The resulting
image is segmented and two different images are produced: one with a black
background for lighter text and another with a white background for darker text.
The process is completed by binarizing both images and applying further
dimensional region constraints.
7.8 Edge Detection
Two distinct procedures for edge detection were implemented: (1) gradient module
thresholding, where the image gradient vectors are obtained using the Sobel
operator; and (2) the Canny filter, considered the optimal detector, which
analyses the representativity of the gradient module maxima and thus produces
thinner contours. As differential operators amplify high-frequency zones, it is
common practice to pre-process the images with noise filters, a functionality
also supported by the toolkit in the form of several smoothing operators: the
median filter, the average filter and a Gaussian filter.
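A sketch of procedure (1), gradient-module thresholding with a Sobel operator
and Gaussian pre-smoothing, assuming scipy.ndimage; the threshold is
illustrative:

import numpy as np
from scipy import ndimage

def sobel_edges(gray, thresh=50.0, smooth_sigma=1.0):
    """Smooth, take Sobel derivatives, threshold the gradient module."""
    g = ndimage.gaussian_filter(gray.astype(float), smooth_sigma)  # noise pre-filtering
    gx = ndimage.sobel(g, axis=1)  # horizontal derivative
    gy = ndimage.sobel(g, axis=0)  # vertical derivative
    module = np.hypot(gx, gy)      # gradient magnitude
    return module > thresh

img = np.zeros((32, 32)); img[:, 16:] = 255.0  # vertical step edge
edges = sobel_edges(img)
print(edges[:, 14:18].any(axis=0))  # edge pixels appear around column 16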
8. APPLICATIONS
8.1 VideoCEL applications:
Video browser:
This application is used to visualize video streams. The browser can load a
stream and split it into its shot segments using the cut detection algorithms.
Each shot is then represented in the browser's main window by an icon, a reduced
form of its first frame, and the shots can be played using several view objects.
Weather Digest:
The Weather Digest application generates HTML documents from TV weather
forecasts. The temporal sequence of maps presented on TV is mapped to a sequence
of images in the HTML page. This application illustrates the importance of
information models.
News analysis:
The news analysis work developed a set of applications to be used by social
scientists for content analysis of TV news. The analysis consisted of filling in
forms with news item durations, subjects, etc., a task these applications
attempt to automate. The system generates HTML pages with the images, plus CSV
(Comma Separated Values) tables suitable for use in spreadsheets such as Excel.
Additionally, these HTML pages can also be used for news browsing, and there is
a Java-based tool for accessing this information.
8.2 COBRA Model
In order to explore video content and provide a framework for the automatic
extraction of semantic content from raw video data, we propose the Content-Based
Retrieval (COBRA) video data model. The model is independent of
feature/semantic extractors, providing flexibility through the use of different
video processing and pattern recognition techniques. A feature grammar is
exploited to describe the low-level persistent meta-data; the grammar also
describes the dependencies between the extractors.
At the same time, the model is in line with the latest developments in MPEG-7,
distinguishing four distinct layers within video content: the raw data, the
feature, the object and the event layer. The object and event layers are concept
layers, consisting of entities characterized by prominent spatial and temporal
dimensions, respectively.
To provide automatic extraction of concepts (objects and events) from visual
features (which are extracted using existing video/image processing techniques),
the COBRA video model is extended with object and event grammars. These grammars
are aimed at formalizing the descriptions of these high-level concepts, as well
as facilitating their extraction based on features and spatial-temporal
reasoning.
This rule-based approach results in an automatic mapping from features to
high-level concepts. However, we still have the problem of creating object and
event rules manually, which might be very difficult, especially for object rules
that require extensive user familiarity with features and object extraction
techniques.
As the model also provides a framework for stochastic modeling of events, we
have chosen to exploit the learning capability of Hidden Markov Models (HMMs) to
recognize events in video data automatically.
Figure 1 - Video sequences
Figure 2 - Video shots
Figure 3 - Principal color detection
Figure 4 - Detected player
SELECT vi.frame_seq
FROM player pl, video vi
WHERE
Event: vi.frame.e = ((Appr_net_sl:
((e1: player_near_the_net, e2: backhand_slice), (), (), (),
(Before(e2, e1, n), n >= …)), e1.o1.name = e2.o1.name = 'Sampras')
Query 1 - Video query
The WHERE clause of the query shown above constrains player profiles to only
those documents that contain videos with the event 'approaching the net with a
backhand slice stroke'. This new event description, defined inside the query,
demonstrates how complex events can be defined dynamically, based on previously
extracted events and spatial-temporal relations. The first of the two predefined
events is the player_near_the_net event, which is defined using spatial-temporal
rules, whereas the second one, the backhand_slice event, is defined using the
HMM approach. The temporal relation requires e1 to start at least n frames
before event e2. The event descriptions are evaluated by the query processor,
which rewrites each event from its conceptual definition into a standard object
algebra extended by the COBRA video model and spatial-temporal operations.
Therefore, a user is able to explore video content by specifying very detailed,
complex queries that include a combination of features, objects and events, as
well as spatial-temporal relations among them.
CONCLUSION
Visual information has always been an important source of knowledge. With the
advances in information, computing and communication technology, this
information, in the form of digital images and digital video, is now widely
available through the computer. To cope with the explosion of visual
information, an organization of the material that allows fast search and
retrieval is required. This calls for systems that can, in some way, provide
content-based handling of visual information. In this seminar I have tried to
present the basic image processing techniques, the status of content-based
access to image and video databases, and some applications of video content
extraction.
An image retrieval system is necessary for users who have large collections of
images, as in a digital library. During the last few years, some content-based
techniques for image retrieval have become commercially available. These systems
offer retrieval by color, texture or shape, and smart combinations of these help
users find the images they are looking for. A video retrieval system is useful
for video archiving, video editing, production, etc.
BIBLIOGRAPHY
[1] Basis of Video and Image Processing, by Jian Wang.
[2] A survey on content-based access to image and video databases, by Kjersti
Aas and Line Eikvil.
[3] Content-based representation and retrieval of visual media: a
state-of-the-art review, by Philippe Aigrain, HongJiang Zhang and Dragutin
Petkovic.
[4] Advanced content-based semantic scene analysis and information retrieval:
the SCHEMA project, by E. Izquierdo, J.R. Casas, R. Leonardi, P. Migliorati,
Noel E. O'Connor, I. Kompatsiaris and M. G. Strintzis.
[5] Relevance feedback techniques in content-based image retrieval, by Yong Rui,
Thomas S. Huang and Sharad Mehrotra.
[6] Video Extraction in Compressed Domain, IEEE Conference on Advanced Video and
Signal Based Surveillance.
[7] IEEE paper on image segmentation, block-based feature extraction, and large
video/image databases.
[8] IEEE paper on face location in wavelet-based video compression for high
perceptual quality; feature extraction.