
Transportation Research Part C 14 (2006) 96–113

www.elsevier.com/locate/trc

A motion-based image processing system for detecting potentially dangerous situations in underground railway stations

Sergio A. Velastin a,*, Boghos A. Boghossian b, Maria Alicia Vicencio-Silva c

a Digital Imaging Research Centre, Faculty of Computing, Information Systems and Mathematics, Kingston University, Penrhyn Road, Kingston upon Thames, Surrey KT1 2EE, United Kingdom
b Ipsotek Ltd., P.O. Box 54055, London SW19 4WE, United Kingdom
c Centre for Transport Studies, University College London, Gower St, London WC1E 6BT, United Kingdom
Received 14 August 2002; received in revised form 23 March 2006; accepted 25 May 2006

Abstract
The timely detection of potentially dangerous situations involving passengers in public transport sites is vital to improve the safety and confidence of the travelling public. Conventional CCTV systems are monitored manually, so that a single observer is typically responsible for dealing with tens or hundreds of cameras at a time. Thus, important events might be missed or detected too late for effective action. This paper gives an overview of motion-based methods used in a system, developed as part of an EU-funded research project, to detect three important situations of interest to public transport operators. The style has been kept intentionally general so as to provide a broad understanding of the transport needs being addressed. Emphasis is given to the performance of these methods as assessed with a large set of video recordings supplied by metropolitan railway networks in London, Paris and Milan.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Visual surveillance; Personal security; Public transport security; Pedestrian monitoring; Motion estimation; Background estimation

1. Introduction
There is widespread recognition that public transport networks can make a significant contribution towards shifting patterns of travel, especially in big cities, from private to public means. National and supranational policies thus aim to reduce the levels of congestion and pollution and, in general, improve the quality of life of citizens. This is a complex problem that concerns the implementation of truly integrated modes of transport, improvements in mobility and accessibility, long-term investment in infrastructure, taxing regimes
* Corresponding author. Tel.: +44 20 8547 7719; fax: +44 20 8547 7972.
E-mail addresses: sergio.velastin@kingston.ac.uk, sergio.velastin@iee.org (S.A. Velastin), boghos.boghossian@ipsotek.com (B.A. Boghossian), mavs@transport.ucl.ac.uk (M.A. Vicencio-Silva).
0968-090X/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.trc.2006.05.006


and fares policies. As discussed by Sanchez-Svensson et al. (2001), an important part of the necessary improvements is to make public transport systems safer in terms of personal security, both in actual terms and, perhaps more importantly, in how they are perceived by the travelling public, especially those who currently either choose not to use them or are effectively excluded from them. The increasing demands on public transport networks have therefore led to an extensive deployment of Closed Circuit Television (CCTV) systems to improve the safety and confidence of the travelling public. Continuous or time-lapse recordings of CCTV cameras are kept as visual evidence to help with a posteriori investigation of unusual events or criminal actions. CCTV systems can therefore potentially play an important role in the planning of urban crowd management and routine pedestrian data collection.
On-line management of crowds requires continuous monitoring by surveillance officers. In periods with high levels of crowding it is crucial that those managing a facility have timely information on the areas where problems are likely to arise, so that incidents can be prevented before they adversely affect the normal operation of a transport facility. In a typical network, images from a number of cameras are routed to a Control Centre located at a station (in large interchanges) or at a remote location dealing with many stations. A Control Centre might deal with video signals originating from 100 to 500 cameras. A small subset of these (5–30) is then shown, for manual observation, on an array of monitors. Most networks use CCTV in what can be called a "passive" or "reactive" mode, whereby careful monitoring only takes place once an unusual situation has been reported by ground personnel or the public. Some networks use CCTV in an "active" mode, where one or two human observers are responsible for permanently looking at the monitors to detect situations of interest. Trained CCTV operators have an excellent ability to spot abnormal behaviour even when it covers less than 5% of the observed image. However, there is a limit to what such operators can do. For example, in a busy commuter station all the monitoring effort is normally concentrated on detecting congestion on train platforms. Important events at other locations might be missed completely or not detected promptly. Moreover, according to research carried out by the Police Scientific Development Branch at the UK's Home Office (Wallace and Diffey, 1998), CCTV observers suffer from "video blindness" after 20–40 min of observation. Network operators might not be in a position to employ more monitoring staff and, in fact, there is both a desire and public demand for the existing staff to be in closer contact with the public on the ground.
In summary, partly as a response to public demands for increased safety, there has been a rapid increase in the number of CCTV systems installed to monitor public places (e.g. London has over 20,000 cameras for public transport alone). Being operated on a daily basis, these systems generate huge volumes of data that could provide valuable information on routine patterns of behaviour and site usage, but which is too expensive and tedious to analyse manually. Events that require immediate action are missed because the few available human observers cannot see all the cameras simultaneously and spend much of their time dealing with uneventful situations. It is increasingly difficult for those in charge of CCTV systems to deal with and manage the large amount of potentially useful information that they generate.
There is, therefore, a need for automating the pedestrian monitoring task. To investigate and demonstrate this were the main aims of the project CROMATICA (Crowd Management through Telematic Imaging and Communications Assistance), funded under the European Union's (EU) Framework IV Research and Development Programme. This paper provides an overview of the methods used and the results obtained by the authors within that project. Their work dealt with the detection of potentially dangerous situations involving people in underground metropolitan railway stations. The term "potentially dangerous" refers to situations that require the attention of a human operator to prevent a possible uncontrollable condition. These results formed the basis for the follow-up projects ADVISOR and PRISMATICA, also funded by the European Union (under its Framework V Research and Development Programme), in which the methods outlined here were integrated as part of an advanced multi-camera, multi-sensor (video, audio, smart cards and wireless cameras) system. Note that, because the aim here is to give a sufficient description of results and a global appreciation of the work, space limitations prevent an in-depth description and characterisation of the algorithms. Full details are given in Boghossian (2001). The reader is also directed, for example, to Velastin et al. (2004) and Lo et al. (2003) for complementary work carried out as part of the PRISMATICA project.
The starting point was a worldwide survey (Langlais, 1996) carried out among public transport operators that identified their monitoring priorities. This was followed up by detailed informal 2-h interviews with the three shifts of control room operators in one of the busiest stations in the London Underground network, for the researchers to gain sufficient understanding of their manual observation methods and the areas most likely to benefit from automatic detection assistance. Further details are given in Sanchez-Svensson et al. (2005). It became clear that a number of situations are detected manually using motion cues. Those rated by operators as being most critical, and then considered amenable to detection through image processing, are:
1. Overcrowding, defined as the presence of too many people in a given area (i.e. where density has exceeded a pre-defined safe threshold). An interesting situation occurs when overcrowding is associated with lack of movement (i.e. congestion), where from an operational point of view the thresholds of acceptability tend to be lower than those applied to moving crowds.
2. Forbidden or unexpected directions of motion (e.g. counter-flow in a one-way corridor), defined as a significant amount of motion within a range of possible directions.
3. Stationary individuals or objects, defined as the consistent non-moving presence of people or objects (over a minimum size) exceeding a pre-determined safe (typical) time threshold.
The work reported here dealt with the detection of these situations in main circulation areas (ticket halls and corridors). For subsequent work on estimating congestion on underground platforms please see Lo and Velastin (2001).
This paper is organised as follows. Section 2 reviews some of the relevant previous work, while Section 3 describes the main hardware and algorithmic components of the developed system. Section 4 then shows how detection mechanisms have been built from these components and Section 5 presents the experimental results with a representative set of real-world data. The paper ends with Section 6, which presents the main conclusions and suggestions for further work.
2. Relevant work
2.1. Obtaining motion vectors
The changes in illumination that occur from one image to another can be represented by so-called motion vectors. At any given position in the image, the change is measured by a displacement and a direction. Given the large amounts of data involved (for example in images of 512 × 512 pixels each), this process is computationally intensive but needs to be robust over a range of small and large displacements. The net result is a (motion) vector field in image space that correlates with changes in the scene arising from object movements, illumination changes or camera movements. This is useful information for a computer-based video analysis system. A popular method of computing image motion is the so-called block-matching algorithm, whereby for any given rectangular block in an image, a similar block is found in the subsequent image. When such a block is found, the difference in position between the two blocks corresponds to the motion vector (i.e. it indicates by how much the original block has moved). The use of the block-matching technique was first proposed in Velastin et al. (1993) for estimating the general trends in motion of crowds by considering regions in histograms of velocity directions. The use of a block-matching technique to detect the direction of motion of crowds was also considered by Bouchafa et al. (1997), while a more detailed study by Yin (1996) showed that sufficiently accurate estimation of crowd movements can be obtained through appropriate settings of the operating parameters (size of block, size of search window).
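As an illustration of the technique, the following is a minimal exhaustive block-matching sketch, not the authors' implementation: the sum-of-absolute-differences (SAD) cost is an assumption, and the parameter values (8 × 8 blocks with a ±8-pixel search radius, i.e. the 24 × 24 search window quoted above) follow the settings reported later in the paper.

```python
# Minimal exhaustive block-matching sketch (one vector per block).
import numpy as np

def block_match(prev: np.ndarray, curr: np.ndarray,
                block: int = 8, radius: int = 8) -> np.ndarray:
    """Return one (dy, dx) motion vector per non-overlapping block."""
    h, w = prev.shape
    by, bx = h // block, w // block
    vectors = np.zeros((by, bx, 2), dtype=np.int32)
    for i in range(by):
        for j in range(bx):
            y0, x0 = i * block, j * block
            ref = prev[y0:y0 + block, x0:x0 + block].astype(np.int32)
            best, best_v = None, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    if y1 < 0 or x1 < 0 or y1 + block > h or x1 + block > w:
                        continue
                    cand = curr[y1:y1 + block, x1:x1 + block].astype(np.int32)
                    sad = np.abs(ref - cand).sum()  # matching cost
                    if best is None or sad < best:
                        best, best_v = sad, (dy, dx)
            vectors[i, j] = best_v
    return vectors
```

The dedicated hardware described in Section 3.1 performs the equivalent search at full video rate; in software, the cost of the nested search is what limits the frame rates discussed there.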
2.2. Estimating the background
A major component of a visual surveillance system is the process that separates the (irrelevant) background from the (relevant) foreground. In this context, background corresponds to the fixed environment (floors, walls, pillars, ticket offices and so on) and foreground to the generally transient objects (staff, passengers and their belongings). As human beings, we have the ability (the mechanisms of which are yet to be fully understood) to recognise objects. We know what a table or a floor looks like and, more importantly, what its purpose is in a given environment, so that when we see a scene (directly or through a television screen) we can automatically concentrate on the foreground even if there are short-term or long-term changes such as sudden or gradual variations in lighting. The applications discussed here clearly require robust continuous unattended operation, and a system that cannot adapt to environmental changes cannot be considered. There has been much effort in the computer vision research community to emulate this ability, at least at the very low level of pixel intensity (as opposed to objects and their contextual meaning). At its most basic level, it is possible to work on differences of images (known as interframe images) so as to eliminate longer-term variability. These are effective for detecting intrusion into sterile zones (and hence somewhat misleadingly called "motion detectors" in the surveillance industry) but suffer from noise (inherent in differential operation). Some early work (e.g. Ridder et al., 1995; Wren et al., 1997; Tsuchikawa et al., 1995) dealt with methods to model and filter out illumination variability, e.g. through the use of Kalman filters. A more interesting approach is that first proposed by Stauffer et al. (2000), where the temporal properties of pixel illumination (and colour) are regarded as stochastic and approximated by a mixture of Gaussian distributions. A given pixel at a given point in time is considered to be either background or foreground depending on its probability of belonging to the calculated distributions. At the core of this approach is the assumption that background pixels occur more often. This is not necessarily the case in crowded conditions. Also, an object that remains in the same position for some time eventually becomes part of the background and disappears. When the object moves off again, it leaves behind a "ghost" image (of the background it was previously covering). So, these approaches have serious limitations for the applications considered here. The novelty of our approach is to qualify the background estimation process (and hence that of event detection) with motion vector information.
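For concreteness, the following is a simplified single-pixel, grey-level-only sketch of the mixture-of-Gaussians idea attributed above to Stauffer et al. (2000); the number of components, learning rate and thresholds are illustrative assumptions, not values from that work or from this system.

```python
# Simplified per-pixel mixture-of-Gaussians update (grey levels only).
import numpy as np

K, ALPHA, MATCH_SIGMAS, BG_WEIGHT = 3, 0.01, 2.5, 0.7  # assumed parameters

def update_pixel(x, means, variances, weights):
    """Update one pixel's K-component mixture; return True if x is background."""
    d = np.abs(x - means)
    matched = d < MATCH_SIGMAS * np.sqrt(variances)
    if matched.any():
        k = int(np.argmax(matched))          # first matching component
        means[k] += ALPHA * (x - means[k])
        variances[k] += ALPHA * ((x - means[k]) ** 2 - variances[k])
        weights[:] = (1 - ALPHA) * weights
        weights[k] += ALPHA
    else:
        k = int(np.argmin(weights))          # replace the weakest component
        means[k], variances[k], weights[k] = x, 900.0, 0.05
    weights /= weights.sum()
    # components holding most of the weight are deemed background
    order = np.argsort(-weights)
    cum = np.cumsum(weights[order])
    bg = order[:np.searchsorted(cum, BG_WEIGHT) + 1]
    return matched.any() and int(np.argmax(matched)) in bg
```

The limitations noted above follow directly from this structure: a pixel class only becomes background by accumulating weight, which fails when crowds dominate and produces "ghosts" when long-stationary objects finally move.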
2.3. Measuring crowding levels
The estimation of crowding levels in public places gained significant interest in the early literature. See for example Velastin et al. (1993, 1994), Ivanov et al. (1998), Regazzoni et al. (1993), Regazzoni and Tesei (1994), Schofield et al. (1995), Coianiz et al. (1996), Ottonello et al. (1992) and Marana et al. (1997). This is because the measurement of crowding levels plays an important role in ensuring public safety and in measuring levels of service. One approach is to establish a direct relationship between the number of feasible image features (e.g. edge pixels, vertical edges, foreground pixels, circles, blobs, etc.) and the crowding levels (or the number of people in the scene). In this paper we follow the work reported by Davies et al. (1995) and Tsuchikawa et al. (1995) by using foreground blocks as features to estimate the number of people in the scene. Moreover, we address the problem of perspective distortion and its effect on the accuracy of the estimation results. We also present a perspective-distortion correction method derived from automatic camera calibration.
2.4. Detecting stationarity
The detection of stationary objects or people in complex (cluttered) environments has been addressed in the past mainly through three approaches: temporal filtering as in Takatoo et al. (1996), frequency domain methods as in Davies et al. (1995) and motion estimation as in Bouchafa et al. (1997). The typical problems associated with the detection of stationarity in complex scenes are:

• Frequent occlusion of the stationary object by moving pedestrians.
• Occlusion of the stationary object by moving pedestrians wearing shades of colour similar to the background.
• Continuous change in the pose and position of human subjects suspiciously waiting in public places.

In Sections 3.3 and 4.2 we show how some of these challenges can be addressed using the foreground and motion information extracted from images.
2.5. Perspective distortion
The data obtained from a single fixed camera corresponds to a projection, on the imaging plane, of the reflection from objects in the real world. So, there is inevitably a distortion that results from perspective (e.g. objects nearer to the camera appear bigger than those further away). If the distortion is not taken into account, measurements based on image features (such as crowding level estimation) will be biased, especially if what is observed is not distributed homogeneously in the images. There are well established means of compensating for this distortion by establishing the correspondence between the image plane and the (real) ground plane. This is typically done through a process of careful camera calibration (see for example Seitz and Dyer, 1995). When contemplating the deployment of this type of system over hundreds and perhaps thousands of cameras (which are inevitably moved from time to time), such manual methods become an obstacle. Methods to automatically estimate the scene structure in similar cases have tended to follow one of the following approaches:

• Scene structure from controlled or uncontrolled camera movement as in Vieville and Faugeras (1995).
• Depth from defocus (Nayar et al., 1996).
• Stereo and multi-camera imaging (Rander et al., 1996).
• Range sensors (Indyk and Velastin, 1994).
• Surface orientation from repetitive texture and pattern analysis (Schaffalitzky and Zisserman, 1998).

Here we propose a method based on using motion information and exploiting the observation that, on the whole, people move within a narrow range of speeds.
3. Base system components
The prototype system (Fig. 1) consisted of a Pentium PC fitted with a monochrome (256-level grey scale) video digitiser and a specific-purpose motion detection board developed by the authors. The digitiser converts analogue images (direct from cameras or pre-recorded on video tape) at 25 frames per second (fps) and a resolution of 512 × 512 pixels. The PC feeds these images to the motion detector (described in the next section), which then returns motion vectors for further processing, namely noise reduction, foreground extraction, perspective correction and event detection. In this section we concentrate on the preliminary processes that take place before event detection.
3.1. Block-matching motion detector
A block-matching motion detection approach has been used because it preserves motion discontinuities and allows an efficient implementation through a hardware systolic array. The work reported in Yin (1996) showed that accurate results can be obtained with a block size of 8 × 8 pixels and a search window of 24 × 24 pixels. When using non-overlapping blocks, a 512 × 512 frame would be processed in about 500 ms (2 fps) on a 2.4 GHz Pentium processor. Although this seems fast at first sight, we need to bear in mind that this is only the first processing step and further processing is required to reduce noise, estimate background/foreground, correct perspective and detect events. We also need to consider that the cost of a general-purpose computer can still be a significant overhead when compared to the cost of a surveillance camera.

Fig. 1. System architecture: camera, PX500 video digitiser, STi3220 motion detector and Pentium PC image processor, with an alarm output to alert the operators.

In this system, real-time block-matching motion detection is performed by specialised hardware that operates on images of 512 × 512 pixels. Further details are given in Boghossian and Velastin (1998). Motion vectors are calculated at full video rate (25 fps) from consecutive pairs of images, using non-overlapping blocks of 8 × 8 pixels (software-selectable). A typical set of output motion vectors (raw: before any further processing) is shown in Fig. 2. Including the additional processing of data carried out by the PC leading to event detection, the system performed at rates between 5.5 and 16 fps (depending on the complexity of feature extraction and detection), which is well within the requirements of potential end-users (Langlais, 1996).

Fig. 2. Typical raw motion vectors (shown in white superimposed on the input image).
3.2. Motion noise reduction
The raw outcome of the block-matching motion detection stage is inevitably sensitive to image noise. The most significant noise sources experienced are camera and mains frequency interference, digitisation and recording media noise. Therefore, a pre-processing stage is necessary to eliminate outliers, using a 3 × 3 mean filter followed by a 3 × 3 median filter. Throughout the processing chain, a user-defined Region of Interest (ROI) is used to disregard areas where analysis is not needed (e.g. ceilings and far-away regions). A typical result is shown in Fig. 3.
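A minimal sketch of this pre-processing stage, assuming the motion field is held as a (rows, cols, 2) array and using standard SciPy filters as stand-ins for the original implementation:

```python
# Noise reduction on the block motion field: 3x3 mean then 3x3 median,
# applied per vector component, followed by a region-of-interest mask.
import numpy as np
from scipy.ndimage import uniform_filter, median_filter

def clean_motion_field(vectors: np.ndarray, roi: np.ndarray) -> np.ndarray:
    """vectors: (rows, cols, 2) block motion field; roi: boolean mask."""
    out = vectors.astype(np.float32)
    for c in (0, 1):                                      # dy and dx separately
        out[..., c] = uniform_filter(out[..., c], size=3)  # 3x3 mean
        out[..., c] = median_filter(out[..., c], size=3)   # 3x3 median
    out[~roi] = 0.0                                       # outside the ROI
    return out
```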
3.3. Background estimation
Motion features are used to control a statistical estimation of the background image, whereby only stationary blocks are considered to be background updating candidates. In this context, stationary blocks are those determined as not being foreground (explained later), having a filtered block-matching motion vector of zero and having an interframe difference of zero (an example of an interframe image is shown in Fig. 4). Starting from a fast rough estimation of the background (based on instantaneous detection of non-moving blocks), the process adaptively reduces ambiguities in the motion vector map. Then, stationary scene patterns are kept in a multi-layer temporal (history) array that is updated continuously to hold the most frequently repeated patterns at the top layer. Patterns that persist, within a narrow range of intensities (to account for slight variations of illumination, mainly due to shadows), over a sufficiently long period of time are then considered to correspond to the scene background. This period of time is selected on the basis of the expected reaction time of operators upon the detection of a stationary person or object (typically of the order of 5 min).
Fig. 5 shows a typical image at the start of this process and Fig. 6 shows the corresponding result of background estimation.
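The following sketch illustrates one plausible reading of this motion-qualified, history-based update, with the multi-layer temporal array realised as a per-block intensity histogram; the bin width and persistence threshold are illustrative assumptions rather than the authors' parameter values.

```python
# Motion-qualified background update: only blocks flagged stationary
# (non-foreground, zero filtered motion, zero interframe difference)
# vote into a per-block intensity histogram; the most persistent
# intensity becomes the background estimate for that block.
import numpy as np

BIN = 4  # intensity tolerance, absorbing slight shadow variations (assumed)

def update_background(block_mean, stationary, counts, background, persist=75):
    """block_mean: (rows, cols) mean grey level per 8x8 block.
    stationary:  boolean mask of candidate blocks.
    counts:      (rows, cols, 256 // BIN) persistence histogram.
    background:  current (rows, cols) estimate; persist: required votes."""
    bins = (block_mean // BIN).astype(np.int64)
    r, c = np.nonzero(stationary)
    counts[r, c, bins[r, c]] += 1            # vote for the observed pattern
    top = counts.argmax(axis=2)              # most frequently repeated layer
    ready = counts.max(axis=2) >= persist    # persisted long enough
    background[ready] = top[ready] * BIN + BIN // 2
    return background
```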


Fig. 3. Typical motion field after mean and median filters.

Fig. 4. Example of an interframe image: (a) image at time t, (b) image at time t + 1 and (c) interframe image (white: no motion).

Fig. 5. Before background estimation.


Fig. 6. After background estimation.

3.4. Foreground extraction


Background estimation consists of identifying the most likely illumination of each 8 × 8 image block (as this is the resolution of the block-motion detector). Once information on the background becomes available, it can be used to determine which image blocks represent foreground data (i.e. those whose illumination is different from that of the corresponding area of background). The labelling of each image block as either background or foreground is also used to filter out motion vectors (as explained earlier). So what might appear deceptively simple is in fact a feedback reinforcing mechanism that is often missing from similar work. A typical result showing foreground-filtered motion vectors is given in Fig. 7.
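A minimal sketch of this labelling step, assuming per-block mean grey levels and an illustrative tolerance value:

```python
# Foreground labelling: a block is foreground when its mean grey level
# differs from the estimated background by more than a tolerance.
import numpy as np

def label_foreground(block_mean: np.ndarray, background: np.ndarray,
                     tol: float = 12.0) -> np.ndarray:
    """Boolean mask of foreground blocks; this mask feeds back into the
    motion filtering and background-candidate selection described above."""
    return np.abs(block_mean.astype(np.float32) - background) > tol
```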
3.5. Correction of perspective distortion
Methods that attempt to assess the overall level of crowding in a scene based on image features will have problems if they assume a homogeneous pedestrian distribution and do not take into account the effect of distortions in image object size due to the perspective projection. To compensate for this effect, an appropriate spatial correction procedure is required to scale the number of detected features according to where they occur in the image. As explained earlier, it is impractical to manually measure camera parameters on site (and almost impossible when working only from pre-recorded examples). We describe here a method to extract scene geometry parameters directly from observations of motion values.

Fig. 7. Typical pre-processed motion vectors (foreground-filtered).


Fig. 8 shows the imaging model and describes the perspective distortion problem. It also shows a linear correction curve as a function of the vertical position in the image. A linear correction curve is used to compensate for the perspective distortion effect because the latter is proportional to the inverse of the object depth. Considering the bottom image row as the base for the linear correction curve projection, a weight of one is assigned to it. Based on the geometry of the scene and the camera (Yin et al., 1995), a linear increase in the weights is introduced along the ground plane rows, with a slope proportional to the maximum perspective distortion expected {R}. Finally, the rows above the projected ground plane are assigned a constant weight because they lie at the same depth. Two variables are involved in the estimation of the correction curve, namely: the maximum projection distortion {R} and the extent of the ground plane {H}.
The fact that pedestrians circulate in planes perpendicular to the ground plane and at different depths in the scene enforces the need to use a single correction factor throughout the body of each pedestrian. This factor corresponds to the ground plane projection of the body position, which would require complex and time-consuming segmentation techniques to define the area occupied by each pedestrian in the scene. Alternatively, it is possible to encode the average pedestrian height variation into the correction curve in a way that allows the integral of the correction curve over the pedestrian's body height to be equal to the correction factor at the feet. Fig. 9 shows a typical updated correction curve for one of the ticket halls in one of the stations used for experimental purposes (shown in Fig. 7).

Fig. 8. Imaging model and linear perspective distortion correction: scene and camera, projection of the ground plane onto the image plane, maximum distortion R = H1/H2.

Fig. 9. Original (linear) and updated perspective correction curves for one of the ticket halls, plotted against vertical position in the image.

Fig. 10. Pinhole camera model.

Fig. 10 shows a pinhole camera scene-projection model for a monocular imaging system with a fixed camera. Hence, the transformation of world co-ordinates to image-plane co-ordinates is given by Eq. (1) and the velocity component projections are given by Eq. (2):

$$(x', y') = \left( \frac{f x}{f - z}, \; \frac{f y}{f - z} \right) \qquad (1)$$

$$(v_{x'}, v_{y'}) = \left( \frac{f v_x}{f - z}, \; \frac{l f v_y}{z (f - z)} \right) \qquad (2)$$

where (x, y, z) and (x', y') are world and image co-ordinates respectively, (vx, vy, vz) and (vx', vy') are world and image velocities respectively, f is the imaging system focal length and l is the camera height.
Assuming a constant world velocity and constant imaging parameters (f, l), the horizontal component of the projected object velocity (vx') is inversely proportional to the object depth (z), whereas the vertical component is inversely proportional to the square of (z). Consequently, the object depth, and hence the scene structure, can be estimated from these cues.
Fig. 11 shows the scene model used to estimate the two variables involved in the derivation of the distortion correction curve of Fig. 9. The velocity at each row in the image is estimated via a temporal averaging filter and the velocity curve (plotted on the right) is used directly to estimate {R} and {H}. The maximum projection distortion {R} is defined as the ratio of maximum to minimum experienced velocities, as these correspond to the velocities of the closest and furthest pedestrians respectively. The average height of the closest pedestrian {H2} is estimated as the length of the region with maximum velocity. This can be used to estimate the average height of the furthest pedestrian by dividing by {R}. Then, the ground plane border is derived from {H2} and the length of the motion-free region.
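The following reduced sketch shows how {R} and {H} could be read off a temporally averaged per-row speed profile, under the stated assumption that pedestrians walk at roughly uniform real-world speed; the motion threshold is illustrative:

```python
# Scene structure from motion: estimate the maximum projection
# distortion R and the ground-plane extent H from per-row speeds.
import numpy as np

def estimate_R_H(row_speed: np.ndarray, min_speed: float = 0.25):
    """row_speed: temporally averaged |v| per image row (0 = top row)."""
    moving = row_speed > min_speed       # rows where pedestrians appear
    if not moving.any():
        return None                      # no motion: cannot converge
    v = row_speed[moving]
    R = float(v.max() / v.min())         # nearest vs. furthest walker
    H = int(moving.sum())                # extent of the moving region
    return R, H
```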

Fig. 11. Scene geometry model: per-row average velocity profile and the correction curve derived from it.


4. Detection of potentially dangerous situations


This section outlines the algorithms developed by the authors to deal with the detection of the three situations pointed out in Section 1.
4.1. Overcrowding
The approach is to use the perspective calibration and correction method presented in Section 3.5 to compute a weighted sum of the number of foreground blocks. This sum correlates with the number of pedestrians in the scene. In fact, what is computed is the (normalised) ratio between the perspective-weighted number of foreground blocks and the maximum perspective-weighted number of blocks in the Region of Interest chosen for detection. This gives a normalised indication of density. In a first method, we only use the instantaneous value of this measure. This makes no distinction between moving and stationary blocks and hence provides a measure of instantaneous density. When this value exceeds a user-defined threshold (e.g. 80%), then "overcrowding" is said to have been detected. Interaction with human operators revealed that high densities are regarded as a problem primarily when they involve noticeable reductions in flow (to the point where there is no flow). In other words, a large crowd that can move does not require the same attention as a smaller crowd that is brought to a halt, possibly due to congestion elsewhere upstream. Therefore we implemented a second method where only foreground blocks that have remained stationary for a short period of time (30 s) are taken into account and aggregated in the same way as before (ratio of the sum of perspective-weighted blocks to the sum of all possible perspective-weighted blocks in the same Region of Interest). When this value exceeds a user-defined threshold (e.g. 60%), then "congestion" is said to have been detected. In the first case, the complete process executes at a rate of 16 fps, while in the second case this is 5.5 fps.
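A compact sketch of both measures, assuming the per-row weights from the correction curve of Section 3.5 and boolean block masks; the threshold values follow the examples quoted above:

```python
# Perspective-weighted density measures for overcrowding and congestion.
import numpy as np

def density(mask: np.ndarray, weights: np.ndarray, roi: np.ndarray) -> float:
    """Normalised perspective-weighted occupancy of the ROI (0..1).
    mask, roi: (rows, cols) booleans; weights: per-row correction weights."""
    w = weights[:, None] * roi            # weight map restricted to the ROI
    return float((mask * w).sum() / w.sum())

def detect(fg, stat, weights, roi, t_over=0.8, t_cong=0.6):
    """fg: foreground blocks; stat: blocks stationary for >= 30 s."""
    overcrowded = density(fg, weights, roi) > t_over
    congested = density(fg & stat, weights, roi) > t_cong
    return overcrowded, congested
```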
4.2. Unusual or forbidden directions of motion
Previous work, e.g. Velastin et al. (1993) and Yin (1996), has shown that a simple aggregation of the measured directions of motion vectors can detect forbidden trends of motion. This technique works well for scenes with a good perspective view and horizontal motion paths. However, in cases of poor perspective due to a low-mounted camera (e.g. at the entrance or exit of a one-way corridor), the main cause of false alarms for motion-based algorithms is the oscillatory up-down motion that people make while walking. We have in fact observed that this is exploited by human operators who, under very crowded conditions, know that the presence of such oscillations indicates that people are indeed moving. The phenomenon was studied in detail by manually measuring these movements. The average frequency of oscillation was found to be 2.2 Hz, practically independent of camera position but dependent on individual behaviour. In a first approach, the rate of operation of the motion detection process was reduced below 4.4 Hz. This attempts to eliminate the effects of the oscillations by under-sampling, at the expense of accuracy in matching resulting from the longer interval between successive images. In a second, more successful approach, the behaviour of the motion vectors associated with a complete oscillation cycle was studied, confirming that the components in the main direction of motion are more significant than the components due to body oscillations. Region-growing segmentation then groups the motion vectors based on direction and magnitude. Thus, the primary movements of each individual can be captured and small groups of vectors related to undesired movements are eliminated. The detection of an abnormal direction of movement is then triggered when the number of the resulting motion vectors, within an angular arc of user-defined directions, exceeds a user-defined value. A typical example is shown in Fig. 12.
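A reduced sketch of the final test (counting vectors inside a user-defined forbidden arc); the region-growing grouping stage is omitted for brevity and all parameter values are illustrative:

```python
# Forbidden-direction test on a (filtered, grouped) motion field.
import numpy as np

def forbidden_motion(vectors, arc_centre_deg, arc_width_deg=45.0,
                     min_mag=1.0, min_count=20):
    """vectors: (rows, cols, 2) field of (dy, dx) motion vectors.
    Returns True when enough significant vectors point into the arc."""
    vy, vx = vectors[..., 0], vectors[..., 1]
    mag = np.hypot(vx, vy)
    ang = np.degrees(np.arctan2(vy, vx))
    diff = (ang - arc_centre_deg + 180.0) % 360.0 - 180.0  # wrap to +/-180
    hits = (mag >= min_mag) & (np.abs(diff) <= arc_width_deg / 2)
    return int(hits.sum()) >= min_count
```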
4.3. Stationary individuals, objects or crowds
The study reported in Langlais (1996) suggests that the normal maximum period for individuals to remain stationary in underground stations is around 2 min. If this is exceeded, it should be considered an abnormal situation. A similar reasoning exists for the detection of possibly abandoned packages. The ultimate aim is that automatic detection coupled with a digital video recording (DVR) capability would allow operators to make better informed decisions as to the level of threat posed by an abandoned package: by reviewing the digital video they could quickly see who left the package and make a judgement on their intentions. Without this confirmatory mechanism, stations are being unnecessarily evacuated and closed (a daily occurrence in a network such as London's), with the subsequent loss in revenue and inconvenience to passengers.

Fig. 12. Abnormal direction of motion, shown by white vectors (Paris Metro).

We define what we call a scene information array that holds the number of samples (images) during which the corresponding image block (8 × 8 pixels) has been stationary. An image block is first set as a candidate for a stationary area once it satisfies two conditions: it does not belong to the background and it experiences no motion. Subsequently, cells in the information array corresponding to candidate blocks are incremented on each new sample unless they belong to the background and there is no motion. These two sets of conditions, which control the starting and resetting of the information array counters, provide immunity against occlusion, including cases of moving people with shades of grey similar to the background. A region-growing algorithm is used to update the information array cells as candidate individuals change position or pose. Image blocks removed from the information array due to sudden changes in position are reintroduced to the array at the new positions by this algorithm, allowing slow or overlapping changes to be recovered within a few seconds (typically 3.25 s). A final process clusters neighbouring blocks that have remained stationary for a period longer than a user-defined value (typically the 2 min period mentioned earlier). The presence of one or more of such clusters triggers the detection of this type of abnormal situation. Figs. 13 and 14 show typical examples.

An abnormal stationarity event (of an object, a person or a group of people and so on, as we are not concerned here with classifying different types of objects or identifying people) is then said to have occurred if there is a region of a (perspective-corrected) size that exceeds a user-defined value over a period of time also determined by the user.
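A sketch of the information-array update implied by the two sets of conditions above; the region-growing and clustering stages are omitted, and the alarm threshold, expressed in frames, depends on the processing rate and is therefore an assumption here:

```python
# Scene information array for stationarity detection: a per-block counter
# that starts when a block is non-background and motionless, survives
# occlusion by moving pedestrians, and resets only when the background
# is visible again with no motion.
import numpy as np

def update_stationarity(counts, foreground, moving, alarm_frames):
    """counts: (rows, cols) int array of stationary samples per block.
    foreground, moving: boolean masks for the current sample."""
    reset = ~foreground & ~moving        # background visible, no motion
    new = foreground & ~moving           # fresh stationary candidates
    counting = (counts > 0) | new        # existing candidates persist
    counts[counting & ~reset] += 1       # keeps counting through occlusion
    counts[reset] = 0
    return counts >= alarm_frames        # blocks exceeding the time limit
```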

Fig. 13. Dealing with occlusion from moving pedestrians (images magnified for clarity): (a) stationary person detected, (b) stationary person is occluded and (c) stationary person still detected.


Fig. 14. Dealing with changes in position and pose (images magnified for clarity): (a) stationary person detected, (b) person moves to the right and (c) person re-detected after 3 s.

5. Experimental results and discussion


An important aspect of the work described here was the emphasis on a realistic verification of the performance of the developed systems and algorithms with a representative set of data. The system described here was tested under operating conditions in a major station of the London Underground network. The chosen station is the fourth busiest in this network in terms of number of passengers. It deals with commuter traffic to/from one of the biggest financial centres in Europe and connects with main railways and buses. There are more than seventy cameras in this station, covering approximately 80% of its total area. The experimental set-up is shown in Fig. 15.

To study the performance of the algorithms in detail, video tape recordings (500 h over a period of one year) were made, and manually analysed, in three major networks: London Underground, Paris Metro and Milan Metro. Records obtained through manual inspection of the processed video sequences are taken as ground truth for the assessment of automated detection performance. The results obtained are summarised in the following sections. It should be noted that a VTR (video tape recorder) source has a worse signal-to-noise ratio (SNR) than a live camera (we have typically measured a reduction in SNR of 9 dB at playback). Consequently, evaluation using recorded images provides a realistic worst-case scenario.

Fig. 15. Experimental set-up (control room, detection system prototype).
5.1. Background estimation
The performance of the proposed background estimation algorithm (the motion-controlled statistical background model described in Section 3.3) is measured as a function of the time period (Estimation Time) required to generate a complete reference image and of the Mean Absolute Error (MAE) between the estimated and the manually captured reference image. The MAE criterion is used because the foreground/background labelling operation is based on absolute similarity. Table 1 shows the performance measures for six methods operating on the same video sequence (the last one corresponds to the work presented here).

Table 1
Performance figures for various background reference-image estimation methods

Method                               Estimation time (s)   MAE
Running average (pixel intensity)    40                    30.97
Motion (interframe)                  22                     9.22
Kalman filter (pixel intensity)      18                    21.39
Kalman + Motion (interframe)         20                     6.40
Statistical model                    14                     6.34
Statistical + Motion                 13                     6.20

The Estimation Time is an important factor when continuous adaptation to variations in the scene background is important, e.g. as in Ivanov et al. (1998). In such cases a simple statistical model is fast and good enough. However, the additional use of motion information introduces robustness against situations where the background is occluded by moving crowds for a long period, and therefore achieves better results in shorter periods.
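For reference, the MAE used in Table 1 is simply the mean absolute pixel difference between the estimated and the manually captured reference background:

```python
# Mean absolute error between estimated and reference background images.
import numpy as np

def mae(estimated: np.ndarray, reference: np.ndarray) -> float:
    return float(np.mean(np.abs(estimated.astype(np.float64) -
                                reference.astype(np.float64))))
```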
5.2. Estimation of crowd motion direction
For ground-truthing purposes, a pedestrian entering and leaving the camera's field of view is defined as an event. The system's estimation is then compared with the manual observation records to compute performance figures. The under-sampling approach to overcome the up-down head and body movements discussed in Section 4.2 proved ineffective, whereas the segmentation approach proved more successful in avoiding false alarms. Table 2 shows the performance assessment figures of the latter method.

Table 2
Motion direction estimation performance (for 250 events)

Walking direction: Up 68%, Down 32%
True positive 99.6%    False positive 0.8%    True negative 0.4%
5.3. Stationarity
The algorithm presented in Section 4.3 has been evaluated to verify its robustness against:

1. 100% occlusion. Complete occlusion by moving or standing pedestrians.
2. Occlusion with the same colour as the background. Occlusion by moving pedestrians wearing grey shades similar to the background shades (note that only eight cases were considered due to lack of data).
3. Pose and position variations. Movement of limbs and torso, or a shift in standing location with at least 1% overlap with the original position, with an updating period of 3.25 s.

Moreover, the accuracy of the detection delay is assessed for each of the above tests. Consequently, the evaluation metrics are defined as stability under occlusion, stability under occlusion with background shades, accuracy of detection delay and accuracy in updating the stationary area. In this evaluation process, an event is defined as the case of a pedestrian standing within the area of interest for more than the user-defined period (2 min). Table 3 shows the performance figures for the tests mentioned above.

Table 3
Performance figures for stationary object detection (detection percentages)

Test                                      True positive   False positive   True negative
Normal occlusion                          97.9            0                2.1
Occlusion with background colour          87.5            0                12.5
Detection delay accuracy (2 min ± 5 s)    100             0                0
Position updating in 3.25 s               100             4                0
5.4. Correction of perspective distortion
The effect of the perspective distortion correction approach has been evaluated over the range of crowding levels present in our dataset to estimate its performance. Fig. 16 shows the automated and manual estimation figures. It is obvious that the effect of occlusion at high crowding levels is significant. That effect is embedded in the adopted approach, where the features used to estimate the number of pedestrians in the scene (non-background image blocks) are not immune to occlusion. The non-linearity in the relation between automated and manual measurements can be ignored if operation is within the linear region. However, setting high crowding alarm-triggering values causes the system response to become slower and the true negative rate to increase. On the other hand, the performance of the proposed non-linear correction procedure has been tested against the conventional (linear) approach, showing that the latter overestimates the crowding levels when pedestrians are close to the camera, whereas the former gives more accurate results.

Fig. 16. Estimated number of people for the automated and manual detection, plotted against the number of people in the scene.
5.5. Scene structure from motion
The algorithm has been tested on locations with different geometries, pedestrian movement paths, obstacle distributions (queues, columns, etc.) and crowding levels. The estimated scene parameters have been compared with manually generated ones to measure performance. By analysing the experimental results in Table 4 it can be seen that in many of the cases studied (especially those involving horizontal motion) the approach gives results comparable to human performance (itself estimated to have an error of around 3%). However, we can observe the following:

• The structure parameters for scenes with dominant vertical paths are poorly estimated, because the vertical component of the image object velocity vanishes rapidly with depth, causing reduced accuracy in the estimation.
• Queues and obstacles have a significant effect on the world velocities of objects (pedestrians), therefore causing errors in scene parameters.
• Very low-mounted cameras (less than 2.5 m) increase the effect of occlusion, causing uncertainty and thus errors in estimation.

• The period required to estimate the scene parameters is completely dependent on the available motion in the scene.
• The algorithm does not converge in cases where large permanent obstacles restrict pedestrian movements. This is the case in test 3, where the camera gives a side view of underground ticket barriers.

Table 4
Scene structure parameter estimation for eight different scenes (each test scene is further characterised in terms of obstacles, queues, a low-mounted camera and the dominant path orientation: horizontal, vertical or diagonal)

Test   Manual measurement                        Automatic estimation
       Perspective distortion   Ground plane     Perspective distortion   Ground plane
                                extent                                    extent
1      2.20                     40               2.33                     40
2      n/a                      n/a              n/a                      n/a
3      n/a                      n/a              n/a                      n/a
4      3.90                     49               4.00                     50
5      3.30                     43               3.33                     44
6      2.80                     39               2.60                     38
7      2.28                     46               2.00                     46
8      2.30                     45               2.00                     43

5.6. Detection of overcrowding/congestion


The measures of performance and reliability are defined as the amount of delay in the automatic detection of overcrowded situations and the rate of false alarms. After testing the two algorithms described earlier on the same video sequences, we can discuss the performance as follows. The first algorithm (overcrowding) detects high density situations immediately, but some false alarms are produced due to bursts of large, loosely distributed crowd flows. On the other hand, the second algorithm (congestion) overcomes this deficiency but introduces a delay in detection. Based on the experiments performed on both algorithms, the performance figures for a test spanning 3 h of recorded data are given in Table 5.

Table 5
Performance figures for the overcrowding and congestion estimates (detection percentages)

Method          True positive   False positive   True negative
Overcrowding    95.62           4.00             0.37
Congestion      98.51           0.28             1.21
6. Conclusions
Motion is a powerful cue used by human operators to detect situations that might lead to incidents if they are not controlled in a timely fashion. Surprisingly, the use of motion in the sense of direction and magnitude is not generally exploited in visual surveillance systems (although there are references to motion in the sense of the detection of (magnitude) changes from one image to another). In particular, the integral use of motion for extracting foreground (even when objects stop) still remains a relatively novel solution. The paper has outlined a number of techniques that have been evaluated using real footage from underground railway stations. The performance obtained has been shown to be promising in the context of the expectations of the managers and operators of these sites. We have to bear in mind that having simultaneous on-line access to only 5–10% of the existing cameras in a site is quite common, so that any support in bringing to the attention of an operator situations that might require control can have a significant effect on the levels of personal and asset security in these environments (as well as improving the perception of security by the travelling public). The work reported here necessarily concentrated on a small number of events to assess the feasibility of such systems. Although much progress is being made in visual surveillance, we are still far away from being able to emulate the human ability to assess, based on experience and a complex set of cues and contextual information, whether a given situation is likely to develop into a problem. A significant amount of work still needs to be done to be able, for example, to track people in cluttered conditions and from one place to another (with spatial and temporal gaps in visibility), to interpret the interaction between people and between people and the environment, to make sense of posture and gesture cues and, above all, to carry out such interpretation in a robust manner (e.g. one that uses different sources of data to reinforce situational assessment and that degrades gracefully (or at least can self-assess degradation and ask for assistance) as environmental conditions worsen).
Acknowledgements
The work described in this paper was mainly carried out as part of the EC project TR-1016 CROMATICA. Other partners in this project included UCL (University College London), RATP (Paris public transport operator), LUL (London Underground), Politecnico di Milano, ATM (Milan public transport operator), Molynx Ltd. (UK), INRETS (French Transport Research Laboratory), USTL (University of Lille) and CEA-LETI (French Atomic Energy Authority). The authors are particularly grateful to Mr. Gary Trimmer, Group Station Manager, for his cooperation in allowing access to a major London Underground station and its staff. B.A. Boghossian is now based at Ipsotek Ltd. (UK).
References
Boghossian, B.A., 2001. Motion-based image processing algorithms applied to crowd monitoring systems. Ph.D. Thesis, Department of Electronic Engineering, King's College London.
Boghossian, B.A., Velastin, S.A., 1998. Real-time motion detection of crowds in video signals. In: IEE Colloquium on High Performance Architecture for Real-Time Image Processing, London, UK, February 1998, pp. 12/1–12/6.
Bouchafa, S., Aubert, D., Bouzar, S., 1997. Crowd motion estimation and motionless detection in subway corridors by image processing. In: IEEE Conference on Intelligent Transportation Systems (ITSC'97), pp. 332–337.
Coianiz, T., Boninsegna, M., Caprile, B., 1996. A fuzzy classifier for visual crowding estimates. In: IEEE International Conference on Neural Networks, 3–6 June 1996, vol. 2, pp. 1174–1178.
Davies, A.C., Yin, J.H., Velastin, S.A., 1995. Crowd monitoring using image processing. Electronics and Communication Engineering Journal 7 (1), 37–47.
Indyk, D., Velastin, S.A., 1994. Survey of range vision systems. Mechatronics 4 (4), 417–449.
Ivanov, Y., Bobick, A., Liu, J., 1998. Fast lighting independent background subtraction. In: IEEE Workshop on Visual Surveillance, pp. 49–55.
Langlais, A., 1996. Deliverable D2: User Needs Analysis. CROMATICA TR-1016 (CEC Framework IV Telematics Applications Programme), November 1996 (available on request from the authors).
Lo, B.P.L., Velastin, S.A., 2001. Automatic congestion detection system for underground platforms. In: International Symposium on Intelligent Multimedia, Video and Speech Processing, IEEE, Hong Kong, 2–4 May 2001, pp. 159–161.
Lo, B.P.L., Sun, J., Velastin, S.A., 2003. Fusing visual and audio information in a distributed intelligent surveillance system for public transport systems. Acta Automatica Sinica 29 (3), 393–407.
Marana, N., Velastin, S.A., Costa, L.F., Lotufo, R., 1997. Estimation of crowd density using image processing. In: IEE Colloquium on Image Processing for Security Applications, London, UK, 1997, pp. 11/1–11/8.
Nayar, S.K., Watanabe, M., Noguchi, M., 1996. Real-time focus range sensor. IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (12), 1186–1198.
Ottonello, C., Peri, M., Regazzoni, C.S., Tesei, A., 1992. Integration of multisensor data for overcrowding estimation. In: IEEE International Conference on Systems, Man and Cybernetics, 1992, pp. 791–796.
Rander, P.W., Narayanan, P.J., Kanade, T., 1996. Recovery of dynamic scene structure from multiple image sequences. In: IEEE/SICE/RSJ International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI'96), 1996, pp. 305–312.
Regazzoni, C.S., Tesei, A., 1994. Density evaluation and tracking of multiple objects from image sequences. In: IEEE International Conference on Image Processing (ICIP-94), pp. 545–549.
Regazzoni, C.S., Tesei, A., Murino, V., 1993. A real-time vision system for crowding monitoring. In: International Conference on Industrial Electronics (IECON'93), pp. 1860–1864.
Ridder, C., Munkelt, O., Kirchner, H., 1995. Adaptive background estimation and foreground detection using Kalman-filtering. In: International Conference on Recent Advances in Mechatronics (ICRAM 1995), pp. 193–199.
Sanchez-Svensson, M., Heath, C., Hindmarsh, J., Luff, P., Vicencio-Silva, M.A., Allsop, R.E., Tyler, N., 2001. Deliverable D4: Report on Requirements for Project Tools and Processes. Part I: Operational Requirements; Part II: Empirical Studies of the Perception of Key Stakeholders, PRISMATICA Project (GRD1-2000-10601), European Commission, Brussels, September 2001.
Sanchez-Svensson, M., Heath, C., Luff, P., 2005. Monitoring practice: event detection and system design. In: Velastin, S.A., Remagnino, P. (Eds.), Intelligent Distributed Surveillance Systems. The Institution of Electrical Engineers (IEE), ISBN 0-86341-504-0, pp. 31–54.
Schaffalitzky, F., Zisserman, A., 1998. Geometric grouping of repeated elements within images. In: British Machine Vision Conference (BMVC 1998), pp. 13–22.
Schofield, A.J., Stonham, T.J., Mehta, P.A., 1995. A RAM based neural network approach to people counting. In: Fifth International Conference on Image Processing and its Applications, 4–6 July 1995. IEE, pp. 652–656, Conference Publication No. 410.
Seitz, S.M., Dyer, C.R., 1995. Complete scene structure from four point correspondences. In: Fifth International Conference on Computer Vision, 1995, pp. 330–337.
Stauffer, C., Grimson, W.E.L., 2000. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8), 747–757.
Takatoo, M., Onuma, C., Kobayashi, Y., 1996. Detection of objects including persons using image processing. In: 13th IEEE International Conference on Pattern Recognition (ICPR'96), pp. 466–472.
Tsuchikawa, M., Sato, A., Koike, H., Tomono, A., 1995. A moving-object extraction method robust against illumination level changes for a pedestrian counting system. In: Fifth International Symposium on Computer Vision, pp. 563–568.
Velastin, S.A., Davies, A.C., Yin, J.H., Vicencio-Silva, M.A., Allsop, R.E., Penn, A., 1993. Analysis of crowd movements and densities in built-up environments using image processing. In: IEE Colloquium on Image Processing for Transport Applications, London, UK, 1993, pp. 8/1–8/6.
Velastin, S.A., Davies, A.C., Yin, J.H., Vicencio-Silva, M.A., Allsop, R.E., Penn, A., 1994. Automated measurement of crowd density and motion using image processing. In: Seventh International Conference on Road Traffic Monitoring and Control, London, UK, pp. 127–132.
Velastin, S.A., Lo, B.P.L., Sun, J., 2004. A flexible communications protocol for a distributed surveillance system. Journal of Network and Computer Applications 27 (4), 221–253.
Vieville, T., Faugeras, O.D., 1995. Motion analysis with a camera with unknown, and possibly varying, intrinsic parameters. In: Fifth International Conference on Computer Vision, pp. 750–756.
Wallace, E., Diffey, C., 1998. CCTV Control Room Ergonomics. Police Scientific Development Branch (PSDB), UK Home Office, Publication No. 14/98.
Wren, C., Azarbayejani, A., Darrell, T., Pentland, A., 1997. Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7), 780–785.
Yin, J.H., 1996. Automation of crowd data-acquisition and monitoring in confined areas using image processing. Ph.D. Thesis, King's College London.
Yin, J.H., Velastin, S.A., Davies, A.C., 1995. Image processing techniques for crowd density estimation using a reference image. In: Second Asian Conference on Computer Vision, Singapore, 5–8 December 1995, vol. III, pp. 6–10.
