
visual communication

ARTICLE

Tracking visual segmentation: connecting semiotic and cognitive perspectives

MORTEN BOERIIS
University of Southern Denmark, Odense, Denmark

JANA HOLSANOVA
Lund University, Sweden

ABSTRACT
This article introduces a new methodology for deriving the dynamics of visual
segmentation in relation to the underlying cognitive processes involved. The
method combines social semiotics approaches to visual segmentation with
eye-tracking studies on authentic image viewing and simultaneous image
description. The authors' thesis is that visual segmentation suggested by
the social semiotic approach is traceable in the behaviour of the viewers
who perceive images while creating meaning. From this perspective, visual
zooming is seen as perceptually, cognitively, grammatically and analyti-
cally relevant. The interdisciplinary approach developed in the article pres-
ents new perspectives on the ways images are segmented and interpreted.

KEYWORDS
cognition · dynamic rank scale · eye tracking · image description · image
viewing · social semiotics · visual segmentation

1 INTRODUCTION
The notion of visual segmentation, image segmentation or scene segmenta-
tion has been discussed in various research areas. This has been done in visual
perception (in terms of natural partitioning of a given image into meaningful
regions and objects conducted by the viewer's visual and cognitive system; see
Henderson, 2007; Holsanova, 2001, 2008), in computer vision and image pro-
cessing (in terms of an automated partitioning of a given image into mean-
ingful regions and objects conducted by feature extraction, edge detection,
and region extraction; see DeLiang Wang, 2003) and in social semiotics (in
terms of a contextually and functionally based description and partitioning of
images as a communicative resource for meaning-making; see Boeriis, 2009,
2012; O'Toole, 2011[1994]).
The human visual system extracts and groups similar features and
separates dissimilar ones, segments the scene into coherent patterns, identifies
objects, distinguishes between figure and ground, perceives object properties
and parts, and recognizes perceptual organization (Palmer, 1999). Our ability
to perceive figures and meaningful wholes instead of collections of lines has
been described in terms of gestalt principles of (a) proximity, (b) similarity,
(c) closure, (d) continuation, and (e) symmetry (Köhler, 1947). For instance,
the proximity principle implies that features and objects placed close to each
other appear as groups rather than a random cluster. The similarity principle
means that there is a tendency for elements of the same shape, brightness
or colour to be seen as belonging together. The principle of closure implies
that elements tend to be grouped together if they are parts of a closed figure.
The continuation principle means that units that are aligned appear as inte-
grated perceptual wholes. Finally, the symmetry principle implies that regions
bounded by symmetrical borders tend to be perceived as coherent figures.
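For readers who want a computational intuition, the proximity principle in particular lends itself to a small sketch. The following Python fragment is our own toy illustration, not part of either framework discussed in this article: it groups hypothetical 2D elements transitively whenever a chain of neighbours links them within a chosen distance threshold.

```python
from math import dist

def group_by_proximity(points, threshold):
    """Group 2D points transitively: two points share a group if some
    chain of points links them with steps shorter than `threshold`.
    A toy analogue of the gestalt proximity principle."""
    groups = []
    for p in points:
        # collect existing groups that p is close to, then merge them
        near = [g for g in groups if any(dist(p, q) < threshold for q in g)]
        merged = [p]
        for g in near:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return groups

# Hypothetical layout: the three left points and the two right points
# are read as two perceptual groups.
elements = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
print(group_by_proximity(elements, threshold=3.0))
```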
In the following, we will focus on visual segmentation in social semi-
otics and in visual perception and cognition. The aim of our case study is to
compare the social semiotics approach to visual segmentation (Boeriis, 2009,
2012) to eye-tracking studies of authentic image viewing and simultaneous
image description (Holsanova, 2001). In particular, we will first apply cate-
gories and rank mechanisms from the social semiotic framework to a com-
plex image. We will then compare the results of the semiotic analysis with the
results of the eye-tracking study of authentic viewing and description of the
same image. Our thesis is that visual segmentation suggested by the social
semiotic approach is traceable in the behaviour of the viewers who perceive
visuals while creating meaning. It becomes visible in visual fixation patterns
accompanied by a simultaneous image description. A comparison of semi-
otic analysis with eye-tracking measurements is very useful (Holsanova et al.,
2006). By combining these two perspectives and methodologies, we hope to
come to a better understanding of the dynamics of visual segmentation.

2 BACKGROUND
2.1 Visual segmentation and the social semiotic
approach
The social semiotic approach describes communicative resources as inter-subjec-
tive emergent rather than normative rule-governed phenomena. The resource
systems emerge from social use and, as such, are regulated by socially conven-
tionalized communicative aptness (Kress, 2010: 55). The individual visual text
is the result of intentional strategic choices from the resource systems regu-
lated by the given context. Consequently, an image is seen as a communicative



product (multimodal text) constructed with an ideal model viewer/reader in
mind. The situational and cultural context is seen as playing a crucial role in the
meaning-making regulating the communicative consequences of the grammat-
ical choices realized by certain structural elements (Baldry and Thibault, 2006;
Halliday and Matthiessen, 2004; OToole, 2011[1994]; Van Leeuwen, 2005).
Consequently, from a social semiotic perspective, any grouping in an image is
suggested by the image itself in context, even the more top-down conceptual
groupings discussed below, because the way the individual image is arranged as
a communicative product constrains what groups may be conceived of in that
particular image. An image may lend itself to several readings but the overall
(model reader) meaning is a complex intertwining of all of them.
Preliminary studies show that there seems to be a socially shared
understanding of the segmentation of visual texts and of the general role
played by segmentation in visual communication (Boeriis, 2012). We assume
that segmentation plays an important role in visual communication as part of
the socially shared communicative resources employed both by text producers
and by text receivers (Kress, 2010: 44–5).
One of O'Toole's major contributions to visual social semiotics is his
segmentation of images into four general rank scale levels called Member,
Figure, Episode and Work (O'Toole, 2011[1994]). This is inspired by
Halliday's (2004: 32) segmentation of verbal language into morpheme, word,
group and clause. O'Toole proceeds from visual art and defines Figure as
what we recognize as one (whole) depicted entity in the image. Episodes are
groupings of Figures that are involved in shared processes of different kinds,
and Members are elements of the Figures that play important roles in the over-
all meaning. Work is the overall level of the entire piece. He argues that when
viewing an image we tend to home in on configurations of Members, Figures
and Episodes and then a kind of shuttling process takes place between the
individual parts and the whole image (O'Toole, 2011[1994]: 12). According to
Baldry and Thibault (2006: 26), the reading of the fixated structure in images
(they term it 'cluster hopping') is based on a number of mechanisms such as
periodicity and variation. Their cluster theory builds on a multi-variable rank
scale with the spatial grouping of elements as the major factor. As we shall
argue below, we find that there is more to the grouping of elements than the
fact that they merely occupy the same area of space in the pictorial frame.
In their ambitious description of visual semiotic resources, Kress
and Van Leeuwen (2001, 2006[1996]) do little explicit sub-segmentation of
the visual text. Kress (2010) operates with a general multimodal distinction
between 'text', 'module' and 'sign', in what he calls a 'top-down' approach,
where the elements are shaped by the contingent circumstances of those who
make the text in its social setting (p. 148, original emphasis). Kress focuses on
rank scale levels as situated semiotic functions within visual text rather than
as a priori scalar levels. We will elaborate the idea of a more dynamically con-
ceptualized rank scale in section 3.1.



Whole: The Whole is what functions as an overall whole in the visual text,
usually consisting of several Groups. More often than not the Whole
corresponds to the totality of what is considered the visual multimodal text.

Group: Groups are Units functioning as a unified entity. Relations within
Groups (e.g. classificatory processes) may be the result of a number of
ranking mechanisms (e.g. proximity, segregation, framing and joint process
involvement) that are elaborated below.

Unit: Units are elements that function as complete entities within the (world
of the) visual text. Units may be any whole entity within a visual text, such
as, for instance, a person, a table, a house, a car or a tree. Units cannot
be divided without breaking them into parts.

Component: Components are elements functioning as parts of a Unit and may
play an important role in the overall meaning-making in a visual text; for
instance, when certain features of a person's look (intensive process) or
clothing (possessive process) make it possible to recognize him or her as
somebody known (identificational process).

Figure 1 The dynamic rank scale.

2.2 Visual segmentation and the cognitive approach


Within the cognitive approach, researchers in the field of scene percep-
tion investigate where and when people look in a scene, and why they do
so (Henderson, 2007). In order to assess these processes in detail, they use
eye-tracking methodology and controlled studies with a careful experi-
mental design. Eye fixations have been considered to constitute a boundary
between perception and cognition since they overtly indicate that informa-
tion was acquired. Therefore, eye movements provide an unobtrusive, sensi-
tive, real-time behavioural index of ongoing visual and cognitive processing
(Henderson and Ferreira, 2004: 18).
Where we look in a scene is partly determined by the scene constraints
and partly by the viewer's goal, interest and expectation. Current theory
suggests two different mechanisms for eye movement guidance: bottom-up
(driven by low-level image features, such as luminance, contrast, edge den-
sity, colour and motion; see Itti and Koch, 2000) and top-down (driven by
high-level cognitive factors, such as task, goals, prior knowledge and expecta-
tions of the viewer). This issue is problematic, however, since fixated regions
(for example, faces) often contain both low-level factors of visual saliency and
high-level factors of potential semantic importance. The influence and inter-
action between bottom-up and top-down factors is the subject of a current



debate in the field (Foulsham and Underwood, 2007; Harding and Bloj, 2010;
Nyström and Holmqvist, 2008).
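Bottom-up models of the Itti and Koch (2000) type derive a saliency map from low-level image features. As a rough illustration of the general idea only, far simpler than the actual Itti and Koch architecture, the sketch below uses local luminance contrast as a saliency proxy; the image and the neighbourhood size are hypothetical.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def contrast_saliency(gray, size=15):
    """Toy bottom-up saliency proxy: absolute difference between each
    pixel and the mean luminance of its size x size neighbourhood.
    Not the Itti-Koch model, just a local-contrast stand-in."""
    gray = gray.astype(float)
    local_mean = uniform_filter(gray, size=size)
    return np.abs(gray - local_mean)

# Hypothetical image: a dark field with one bright patch.
img = np.zeros((100, 100))
img[40:50, 40:50] = 1.0
sal = contrast_saliency(img)
peak = np.unravel_index(np.argmax(sal), sal.shape)
print(peak)  # a location on the bright patch, where local contrast peaks
```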
What is the time-course in scene perception? Can we find recurring pat-
terns and phases? First of all, there is agreement on our ability to get the gist of
the scene very early in the viewing process. Secondly, several researchers have
found regularities concerning the time course of image viewing, characterized
by specific patterns and phases of visual exploration. Back in the 1930s, George
T. Buswell identified two general eye-movement patterns: an initial global and
subsequent local phase of image exploration. The first consisted of a general
survey in which the eyes moved with a series of relatively short pauses over the
main portions of the image. The second consisted of a series of long xations,
concentrated over small areas of the image, evidencing detailed examination
of those sections (Buswell, 1935: 142). Similar patterns have been identified by
other researchers, where the phases have been given alternative names accord-
ing to the inferred activity: evaluating–orienting phase, focal–ambient phase,
exploration–inspection phase (Unema et al., 2005).
Recent studies show a revival of so-called scanpaths, spatial and chron-
ological sequences of fixations, which were introduced as a concept by Noton
and Stark (1971). Scanpaths offer more information than single fixations and
saccades since they encompass a whole range of oculomotor data into one
construct, which reveals visual processing over time and in space (Dewhurst
et al., 2012).
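MultiMatch itself is vector-based; an older and simpler family of scanpath comparisons codes each fixation by the area of interest (AOI) it lands in and compares the resulting label strings by edit distance. The sketch below illustrates that simpler string-edit idea with hypothetical AOI labels; it is not an implementation of MultiMatch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two AOI-label sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# Hypothetical scanpaths coded by fixated region: T = tree, B1..B3 = birds.
viewer_1 = ["T", "T", "B2", "T", "B2", "B1", "B3"]
viewer_2 = ["T", "B2", "B1", "B3", "B2"]
print(edit_distance(viewer_1, viewer_2))  # small distance = similar scanpaths
```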
Which units do viewers perceive? How do they segment the content
of the scene? A visual fixation on an image element does not indicate what
aspects and properties of the image element have been focused on, or at what
level of abstraction. Scanpaths contain more information than single fixations
but they do not reveal which concepts were associated with them, or what the viewer
had in mind. We still need some kind of referential framework in order to
infer the ideas and thoughts to which these fixations and scanpaths corre-
spond. In order to capture the dynamics of perception and meaning-making,
we need 'two windows to the mind', in the form of eye movement protocols
and (simultaneous) verbal protocols (Holsanova, 2001, 2008). This method is
described and illustrated in section 3.2.

2.3 Differences and commonalities between the two approaches
One difference between the two approaches discussed above is that the semi-
otic approach considers visual segmentation to be a result of intentional,
socially typical choices that are made to achieve the optimally desired com-
municative effect on a hypothetical model reader/viewer, whereas the cogni-
tive approach focuses on the dynamic process of viewers' actual interaction
with the visuals from a reception perspective. While the semiotic approach
emphasizes resources as socially shared knowledge based on conventions and



consensus, the cognitive approach investigates general and individual patterns
in the unfolding scene perception and segmentation. In particular, research-
ers study factors that may influence the perception and segmentation process,
such as task or instruction, properties of various types of scenes and images,
viewers' expert knowledge, etc. The relation between expertise and the evolv-
ing socially shared semiotic system has been discussed extensively in recent
social semiotics (e.g. Kress, 2010; Kress and Van Leeuwen, 2001) but the theo-
retical focus lies on typical ways of doing things communicatively and there-
fore less on individual preferences and style. The social semiotics description
of a fully evolved semiotic system will always be from an expert's point of view.
Both approaches recognize principles of perceptual organization. In
particular, the more structural gestalt principles based on proximity, similar-
ity and other coherent patterns are crucial both from the social semiotic per-
spective and the cognitive perspective. As both approaches share these basic
assumptions about structuring and perception of visual elements, we would
at least expect to find evidence of the textual (structural) ranking mechanisms
in the eye-tracking data.

3 METHOD
Firstly, we will describe visual segmentation based on a social semiotic
approach (Baldry and Thibault, 2006; Boeriis, 2012; O'Toole, 2011[1994]). Secondly,
we will present the cognitive approach and introduce the method of combin-
ing eye-movement analysis with simultaneous image descriptions (Holsanova,
2001).

3.1 Social semiotic method: the dynamic rank scale


The fact that an image may function as the overall work in one context and
as part of a text in another poses a challenge for a static rank scale. As parts
of overall wholes, images do not lose any unique features in the embedding
and therefore they are at the same time a whole and a part. We find that
a good way to address this issue may be to adopt a dynamic approach to
wholeness in visual texts. In the analytical zoom approach (Boeriis, 2012),
we zoom in or out between separate parts and more overall wholeness, and
this dynamic approach renders it possible to analyse complex structures of
parts and embedded wholes in an overall visual text. The analytical zoom is
the analytical equivalent of the perceptual shuttling/cluster hopping discussed
above. The hypothesis is that images are perceived as conceptual zooms back
and forth from detail to wholes.
We operate with Boeriis's (2012) four basic rank scale functions:
Whole, Group, Unit and Component (see Figure 1). Although comparable
to O'Toole's rank levels, these functions are not distinct, static a priori levels,
but rather dynamic text mechanisms in which certain parts or groups of ele-
ments may perform one of these functions.



An image may function as a Whole framed on the wall, as a Unit in
an article and as a webpage background. When analysing an article with an
embedded image, it is possible and fruitful to zoom in and analyse the image
itself, and then relate to its function as Unit as we zoom all the way out. The
analysis will be a process of motivated zooming in and out between different
rank scale levels. The responsibility for the identifiable rank scale levels rests
with the instantiated ranking mechanisms.
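The point that rank is a function relative to the current analytical zoom, rather than a fixed property of an element, can be made concrete in code. This is our own minimal sketch with hypothetical element names, not a formalization endorsed by the framework itself.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    parts: list = field(default_factory=list)

def rank_function(element, zoom_root):
    """Rank is relational, not intrinsic: the same element functions as
    'Whole' when it is the root of the current analysis, and as 'Unit'
    (or 'Component') when the analyst zooms out to a larger text."""
    if element is zoom_root:
        return "Whole"
    return "Unit" if element.parts else "Component"

# Hypothetical example: an image embedded in an article.
image = Element("image", parts=[Element("tree"), Element("bird")])
article = Element("article",
                  parts=[Element("headline"), image, Element("body")])

print(rank_function(image, zoom_root=image))    # Whole: analysing the image itself
print(rank_function(image, zoom_root=article))  # Unit: zoomed out to the article
```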
The segmentation of visual texts in social semiotics is typically based
on gestalt theory laws of proximity and prägnanz (e.g. Baldry and Thibault,
2006; OToole, 2011[1994]), but other mechanisms have influence on the per-
ceived grouping of elements. For instance, the gestalt laws of closure, symme-
try and similarity are also relevant (Boeriis, 2009: 147). From a visual social
semiotic viewpoint, we would expect to find a number of grouping mecha-
nisms (rank-defining choices) that stem from different metafunctional real-
izations. To avoid resorting to running commentaries (Bateman, 2008: 13), it
is important to outline the important basic mechanisms defining the ranking
in visual texts. Visual elements may play different united functional roles in
the rank scale. Elements functioning as Components may join in certain ways
and elements functioning as Units may group in certain other ways. These
ways of coming together we call ranking mechanisms.
Visual social semiotics inherits the concept of metafunctionality from
Halliday (Kress and Van Leeuwen, 2006[1996]: 20), which is the idea that three
types of meaning are always simultaneously conveyed in communication,
namely ideational, interpersonal and textual meaning. Ideational meaning is the
functional representation of the world around us, and it is often divided into two
separate metafunctions, named the experiential and the logical metafunction.
The experiential metafunction is visually evident through processes (e.g. action),
participants (e.g. the actor or the goal) and circumstantials (e.g. the setting). The
logical metafunction is about the complex relations between several processes
and participants in the same image. This metafunction has not been discussed
much in visual social semiotics, except in relation to moving images (Boeriis,
2009: 194–200; O'Halloran, 2004: 125). The interpersonal metafunction is made
up of interpersonal relations between the communicative participants as they
are realized in the text. In visual communication, this is realized through differ-
ent viewpoint and modality systems (Boeriis, 2009). The textual metafunction is
the structural meaning of an image in the composition where elements are inte-
grated into a meaningful whole (Kress and Van Leeuwen, 2006[1996]: 176). The
meaning is realized through systems such as information value, salience, framing,
frame dimension and frame shape (Kress and Van Leeuwen, 2006[1996]: 201;
O'Halloran, 2004: 120).
The ranking mechanisms presented by Boeriis (2012) are based on the
descriptions of grammatical realizations of metafunctional meaning in visual
social semiotics. They describe the mechanisms by which elements are made
to function at different rank scale levels by joining Components into com-



Metafunction: Textual
  Segregation: framing devices such as concrete graphic lines or more
  abstract lines grouping elements as contingent (typically Figures into
  Groups).
  Separation: separation by means of emptiness or unused space creating high
  distance between elements in the visual plane (typically Figures into
  Groups).
  Proximity: elements being relatively closer to each other compared to the
  rest of the visual text (both Components into Figures and Figures into
  Groups).
  Rhyme: elements having features in common such as similarity in hue,
  saturation, brightness, shape and size (both Components into Figures and
  Figures into Groups).
  Contrast: elements that share the fact that they are different (salient),
  but not necessarily in the same way (typically Figures into Groups).
  Matrix: elements that are part of the same visual pattern (typically
  Figures into Groups).
  Slant: elements having the same angle in the graphical arrangement (both
  Components into Figures and Figures into Groups).

Metafunction: Interpersonal
  Viewpoint: elements depicted as oriented in the same direction, hence
  viewed from the same horizontal and vertical viewpoint (both Components
  into Figures and Figures into Groups).
  Modality convergence: elements or areas sharing the same or similar
  modality profiles in colour, lightness and detail (typically Figures into
  Groups and Groups into larger Groups).

Metafunction: Experiential
  Process involvement: elements functioning as participants are involved in
  the same process (typically Figures into Groups).
  Actional coincidence: elements functioning as participants in similar
  parallel material processes (typically Figures into Groups).
  Reactional coincidence: elements functioning as participants in similar
  parallel reaction processes (typically Figures into Groups).
  Relational coincidence: elements functioning as participants join into
  classificatory, attributive or referential processes (typically Components
  into Figures).

Metafunction: Logical
  Process fusion: elements functioning as parts in processes that function
  as parts of other processes (both Components into Figures and Figures into
  Groups).

Figure 2 Visual ranking mechanisms based on visual social semiotics.



We assume that it is possible to detect some of these rank-defining
mechanisms in the eye-tracking and verbal protocols to be discussed below.
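For annotation work, the inventory in Figure 2 could be encoded as data so that each observed grouping is tagged with the mechanisms that license it. A minimal sketch: the mechanism list is transcribed from Figure 2, while the annotation example is hypothetical.

```python
# Mechanism inventory transcribed from Figure 2, keyed by metafunction.
RANKING_MECHANISMS = {
    "textual": ["segregation", "separation", "proximity", "rhyme",
                "contrast", "matrix", "slant"],
    "interpersonal": ["viewpoint", "modality convergence"],
    "experiential": ["process involvement", "actional coincidence",
                     "reactional coincidence", "relational coincidence"],
    "logical": ["process fusion"],
}

def metafunction_of(mechanism):
    """Look up which metafunction a ranking mechanism realizes."""
    for metafunction, mechanisms in RANKING_MECHANISMS.items():
        if mechanism in mechanisms:
            return metafunction
    raise ValueError(f"unknown mechanism: {mechanism}")

# Hypothetical annotation of one grouping in the Pettson image:
grouping = {"members": ["Pettson 2", "Pettson 3", "Pettson 4"],
            "mechanisms": ["proximity", "actional coincidence"]}
for m in grouping["mechanisms"]:
    print(m, "->", metafunction_of(m))
```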

3.2 Cognitive method: image viewing and image description
How do viewers perceive and segment complex images? What units do they
identify and at what level of abstraction? What does the temporal and seman-
tic build-up of the visual examination look like? In order to answer these ques-
tions, we will use a dynamic, sequential method, combining the analysis of two
sources, eye movement data and simultaneous verbal descriptions (Holsanova,
2001). This section presents the method and shows how these two kinds of
data can give us distinct hints about the dynamics of the underlying cognitive
processes. Our starting point is a complex picture (Figure 3).

Figure 3 Complex picture: the motif comes from a children's book by Sven
Nordqvist (1990). Reproduced by kind permission from Sven Nordqvist.

Figure 4 Scanpath: image discovery of one participant during 7 seconds
(Holsanova, 2008: 132).
The visual fixation pattern in Figure 4 illustrates the path of image dis-
covery. It shows: (a) which objects and areas of the visual scene have been
fixated by the viewer; (b) in what order; and (c) for how long. The circles indi-
cate the position and duration of the fixations, the diameter of each circle
being proportional to its duration. The lines connecting fixations represent
saccades. The white circle in the lower right-hand corner is a reference point:
it represents the size of a one-second fixation. This scanpath comes as a result
of 7 seconds of image viewing.
However, the image content can be perceived and described on dif-
ferent levels and the fixation itself does not indicate what properties of an
object in a scene have been acquired. Thus, in order to study the process of
meaning-making and visual segmentation, we still need some kind of refer-
ential framework, to infer the ideas and thoughts to which these fixations and
scanpaths correspond. Therefore, we need to combine visual data with the
(simultaneous) verbal descriptions. The verbal description, uttered during the
visual examination illustrated above, is shown below:

in the middle is a tree/
with one with three birds doing different things//
one is sitting on its eggs/
the other is singing/
and the third female bird is beating a rug or something//

Each line in the transcript represents a new verbal focus expressing the con-
tent of active consciousness. Verbal focus is usually a phrase or a short clause,
delimited by prosodic and acoustic features: it has one primary accent, a
coherent intonation contour, and is usually preceded by a pause or hesitation
(Holsanova, 2001: 15ff.). It implies that one new idea is formulated at a time
and active information is replaced by other, partially different information
at approximately two-second intervals (Chafe, 1994). Several verbal foci are
clustered into superfoci (for example, a summarizing superfocus or a list of
items in the above example, delimited by lines). A verbal superfocus is a coher-
ent chunk of speech, typically a longer sentence, consisting of several foci con-
nected by the same thematic aspect and having a sentence-final prosodic pat-
tern. Superfoci can be conceived of as thresholds into a new complex unit of
thought (Holsanova, 2008: 8ff.).
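Given the transcription convention used above, where a single slash closes a verbal focus and a double slash closes a superfocus, segmentation of a transcript string can be sketched mechanically. The parser below is our simplified reading of that convention, not Holsanova's actual transcription tooling.

```python
def segment_transcript(text):
    """Split a transcript into superfoci (closed by '//') and, within
    each, verbal foci (closed by '/'). A simplified reading of the
    transcription convention used in the article."""
    superfoci = []
    for chunk in text.split("//"):
        foci = [f.strip() for f in chunk.split("/") if f.strip()]
        if foci:
            superfoci.append(foci)
    return superfoci

transcript = ("in the middle is a tree/ "
              "with one with three birds doing different things// "
              "one is sitting on its eggs/ the other is singing/ "
              "and the third female bird is beating a rug or something//")
for i, sf in enumerate(segment_transcript(transcript), 1):
    print(f"superfocus {i}:", sf)
```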
Both the visual scanpath (Figure 4) and the verbal transcript illus-
trate the result of 7 seconds of viewing and description. They can be seen as a
functionally delimited segmentation unit. If we want to look at the process of
image viewing and image description in detail, however, we need to use a
different visualization, known as multimodal score sheets (Holsanova, 2001).

Figure 5 Multimodal score sheet (Holsanova, 2008: 111): the VISUAL line shows
the sequence of fixated objects (tree foliage, tree, tree foliage, bird 2,
tree, bird 2, bird 1, bird 3, bird 2, bird 1, bird 3), synchronized over time
with the VERBAL line ('in the middle is a tree with one with three birds
doing different things').

A multimodal score sheet (Figure 5) enables us to synchronize visual and verbal
behaviour, follow and compare the content of the attentional spotlight and
extract clusters in the visual and verbal flow. With its help, we can examine the
relationship between what is looked at and what is said at a particular point
in time.
The score sheet contains two different streams: it shows visual behav-
iour (objects fixated visually during description on line 1; thin box = short
fixation; thick box = long fixation) and verbal behaviour (verbal idea units
on line 2), synchronized over time. Simple bars mark the borders of verbal
foci (expressing the conscious focus of attention) and double bars mark the
borders of verbal superfoci (thematic clusters of foci that form more complex
units of thought).
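The alignment the score sheet performs can be thought of as interval overlap: each fixation and each verbal focus occupies a time interval, and the sheet pairs whatever overlaps. A minimal sketch with hypothetical timestamps, loosely inspired by Figure 5:

```python
def overlaps(a, b):
    """True if time intervals a and b, given as (start, end), overlap."""
    return a[0] < b[1] and b[0] < a[1]

def score_sheet(fixations, foci):
    """Pair each verbal focus with the objects fixated while it was
    spoken, mimicking what the multimodal score sheet shows visually.
    Both inputs are lists of (label, start_ms, end_ms)."""
    rows = []
    for focus, f_start, f_end in foci:
        looked_at = [obj for obj, s, e in fixations
                     if overlaps((s, e), (f_start, f_end))]
        rows.append((focus, looked_at))
    return rows

# Hypothetical timestamps for the episode shown in Figure 5.
fixations = [("tree", 0, 400), ("foliage", 400, 700), ("bird 2", 700, 1500),
             ("bird 1", 1500, 2100), ("bird 3", 2100, 2600)]
foci = [("in the middle is a tree", 0, 900),
        ("with one with three birds doing different things", 900, 2600)]
for focus, objects in score_sheet(fixations, foci):
    print(f"{focus!r}: {objects}")
```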
The scene is viewed, described and interpreted stepwise, in terms of
sub-scenes. One portion of the image is in the focus of an attentional spotlight
at a time. Thus, verbal data combined with visual data can be used as two win-
dows to the mind. They contain the content of the attentional spotlight and
reflect the functional visual segmentation of the image. It is not only detection
and recognition of objects that matter but also how the perception process
unfolds. When looking at an image and describing it verbally, viewers not only
report what they see but also how the image appears to them. In other words,
they are involved in perceptual, categorizing and interpreting activities.
To sum up, the combination of visual and verbal data showed that
objects were conceptualized on different levels of specificity. We have wit-
nessed a process of stepwise specification, evaluation, interpretation and even
reconceptualization of image elements and the image as a whole. Informants
started by looking at scene-inherent objects, units and gestalts. As their image
viewing progressed, they tended to create mental units independently of the
concrete image elements. They made large saccades, picking up information
from different locations to support concepts that were distributed across the



image. With the increasing cognitive involvement, observers and describers
tended to return to certain areas, changed their perspective and reformulated
or recategorized the scene. Their perception of the image changed over time.
Active mental groupings were created on the basis of similar traits, symmetry
and common activity. The process of mental zooming in and out could be
documented, whereby concrete objects were refixated and viewed on another
level of specificity or with another concept in mind (Holsanova, 2001, 2008,
2011; see examples in 4.2).

4 EMPIRICAL EXAMPLES
In this section, we will show examples of how the categories from the social
semiotic framework can be applied to the complex image that served as a
stimulus in the cognitive eye-tracking study (see Figure 3). We will also show
examples of units created and delimited by informants during authentic image
viewing and image description (Holsanova, 2001, 2008). This section will be
concluded by a comparison of the social semiotic and cognitive approaches.

4.1 Application of the social semiotic analytical method to an image
The overall Pettson image (see Figure 3) is easily identified as the Whole since
there are no marked variations in modality profile in the picture or any overall
graphical framing devices. The image employs several ranking mechanisms
and the ranking hierarchy they create is pivotal to the understanding of the
image. Salience plays an important role in selecting the most prominent ele-
ments in a picture. The prevailing notion of salience in social semiotics is
inherited from Kress and Van Leeuwen's (2006[1996]) work and applied in
the salience hierarchy of a (multimodal) text (e.g. Baldry and Thibault, 2006;
Bateman, 2008; Boeriis, 2008, 2009). In this tradition, salience is based both on
genre-predefined mechanisms and perception-based mechanisms. The four
men (Pettsons) are the most salient elements in the hierarchy, closely followed
by the cats. The birds and flying insects have a more circumstantial function.
The repetitive representation of a participant with the exact same
intensive and possessive attributes identifies him as one and the same person
in the same circumstantial setting. This short-circuits the spatial–temporal
logic and makes the image a so-called simultanbild. Although complex and
at times contradictory, the rank structure seems to quite clearly support an
understanding of the image as a temporal progression from left to right: rather
static on the left side, while on the right the narrative is more dynamic with the
action processes and the compositional complexity.



Segregation: As a whole, the picture is divided into two sides by the
abstract vertical line created by the tree (and grass) in the middle. The
elements on each side of the line are grouped as contingent in some way.

Separation & Proximity: The man on the left side is separated from the rest
of the text by empty space (Separation) since there are no other close
primary participants. The three men on the right side are closer to each
other (Proximity), making them functionally related as well.

Actional coincidence (Process involvement): The four Pettsons are all holding
or handling gardening tools (Actional coincidence). The three Pettsons on the
right side are all involved in action processes with the same goal, namely
the soil. Four unit instantiations of the same person take part in an
unfolding work process of tilling the ground in the vegetable garden (Process
involvement).

Relational coincidence: The Pettson participants constitute a covert taxonomy
of Pettsons because the intensive and possessive attributive processes of
each man join into one referential relational process of identifying Pettson.

Rhyme: The similar colour and shape of components of the Pettson participants
(e.g. the hat) also contribute to the grouping of the four Pettson
participants. In the same way, the green colour rhyme of the grass and plants
groups them as functionally coherent (circumstantials).

Process fusion: The left Pettson participant performs a unit scrutinizing
process which is a fusion of component processes such as standing, bent
posture, holding a hand up, looking at something and facial expression. The
three to the right all perform actional processes of holding a gardening tool
and activating it in the soil, and reactional processes of being attentive to
the soil. The joined overall unit-group process is gardening.

Reactional coincidence: The Pettson participants on the right side share a
grave and focused facial expression, which makes them functionally closer to
each other than to the Pettson participant to the left, who has a more
uncertain look (raised eyebrow).

Contrast: The giant daffodils and the tiny potted plant in the tree are units
that do not share anything besides being out of place size-wise in the
generally realistic diegetic universe, but function as a group in this
contrast. This group of scattered surreal elements adds a magical feel to the
picture in conjunction with the personified animals.

Viewpoint: The Pettson participant to the left seems to be slightly closer
and is also seen from a slightly higher viewpoint than the others. The three
to the right share a more similar viewpoint and are therefore more closely
related to each other.

Figure 6 Examples of ranking mechanisms in the Pettson image.



4.2 Results of the cognitive multimodal analysis
How did the viewers perceive and segment the image? What units did they
identify and at what level of abstraction? This section presents results from
a dynamic sequential analysis of eye-movement data and simultaneous ver-
bal protocol data (Holsanova, 2001, 2008, 2011). Figure 7 illustrates examples
of meaningful units created during the process of image viewing and image
description by five viewers.

Single objects: Viewers inspect and describe single objects, central to the
scene content ('there is a tree in the middle').

Attributes: When inspecting single objects, viewers often perceive and
comment on their attributes ('one bird looks very human').

Locations: The viewed objects sometimes stand for locations. The cows are
looked at during both the localization and the object description: 'behind
the tree/ some cows are grazing'.

Activity: When uttering 'Pettson is digging', the viewer's eye movements
depict the relationships among parts of the image and mimic the activity in a
repetitive rhythmic pattern. The viewer is extracting an activity from a
static image by filling in links between functionally relevant parts of
objects (face–hand–tool; hand–tool; hand–tool–soil), according to a schema
for an action.

Groupings, spatial proximity: Viewers often inspect and mention groups of
objects that are based on the image-inherent spatial proximity ('there are
three birds').

Groupings, categorical similarity: However, viewers also perceive and
describe clusters of objects that are not spatially close. These clusters
consist of groupings of multiple similar objects distributed over the whole
scene ('four cats', 'four versions of Pettson'). This type of cluster is
based on categorical similarity.

Multiple objects, simultanbild: Some viewers notice repetitive figures in the
scene and describe them as four varieties of Pettson and Findus, whereas
others focus on the event instead and describe one person and one cat
involved in various activities.

Grouping, composition: Another type of cluster perceived as a meaningful unit
in the scene is 'hills on the horizon'. The viewer's eyes follow the
horizontal line, filling in links between objects. This cluster seems to be a
compositionally guided grouping.

Grouping, similar traits: This is an example of a mental grouping of objects
based on an extraction of similar traits and activities ('flying objects').
Despite the fact that the objects are distributed across the whole scene,
they are perceived as a unit because of an identified common denominator. The
viewer is mentally zooming out and creating a unit that is relatively
independent of the meaningful units suggested by the composition of the
scene.

Zooming out, reconceptualization: The perception of an image unfolds
dynamically and changes over time. Here, the viewer is rescanning several
earlier identified objects. The verbal data reveal that the refixated objects
are perceived differently on different occasions: 'when I think about it, it
seems as if there were in fact two different fields'. By mentally zooming
out, the viewer discovers an inferential boundary between portions of the
image that he or she has not perceived before. The scene that was originally
perceived in terms of one field has become two fields as the viewer becomes
more and more acquainted with the image.

Abstract scenario: This is another example of how concrete objects can be
viewed differently on different occasions as a result of our mental zooming
in and out. The viewer is verbalizing his or her global impression of the
image content: 'it looks like early summer'. The scanpath shows large
saccades across the whole image. This cluster is based on a mental coupling
of concrete objects, their parts and attributes (such as flowers, foliage,
plants, leaves, colours) on a higher level of abstraction. Visual scanning is
guided by top-down factors and the objects are perceived as concrete indices
of a complex, abstract scenario (season of the year).

Figure 7 Results of the cognitive multimodal analysis.

4.3 Comparison between the cognitive and semiotic approach
As hypothesized, the data analysis confirms that semiotic and cognitive
approaches give rise to common units based on perceptual organization,
compositional characteristics and ideational aspects. We found parallels
between compositional grouping and the mechanisms of segregation, sepa-
ration, proximity and rhyme; between taxonomic grouping and the mecha-
nisms of relational coincidence, classification and typification. Groupings
based on similar traits and common activity were in line with the mechanism
of relational (attributive) coincidence, process involvement and actional
coincidence. Finally, mental zooming in and out documented in the eye-
tracking study has been described as local ranking mechanisms at the various
zoom steps of the dynamic rank scale but without attaching importance to
the temporal unfolding in the perception.
The interpersonal ranking mechanisms of the semiotic model, how-
ever, were not found explicitly in the empirical study. This may be because the
image has a homogeneous modality profile and only very subtle variations
in viewpoint/perspective. The social semiotic analysis factors in the divergent



perspectives as one mechanism among others in the division into the 1+3
structure. Even though a viewer does not explicitly detect the variations, these
may have a subliminal impact which can be very difficult to verify with eye-
tracking and verbal protocols.

5 RELEVANCE
The results of this explorative interdisciplinary study of visual segmentation
are relevant for further investigation of the rank scale in visual social semiot-
ics in general and for multimodal rank scale theory in particular. They are of
importance for further investigations of visual segmentation in scene percep-
tion and give new perspectives on the discussion of conceptual categories
and on the role of bottom-up and top-down processes. The results can also
contribute to future research on grammatical/structural functions that may
not as yet be empirically verifiable, as well as to research on circumstantial
meaning.
The ranking mechanisms and methods presented here can be applied,
for instance, to figurative images, photographs, illustrations, graphics and lay-
out on a two-dimensional visual canvas. Even theories of moving images may
benefit from the dynamic approach. Other modalities such as in three-dimen-
sional or auditory communication, however, will probably employ other sys-
tems, rank scales and ranking mechanisms.
The combination of the social semiotic and cognitive approaches
is beneficial in several ways. Empirical data can verify or refute theoretical
assumptions and categories deduced from visual grammar and perhaps sug-
gest new issues that have not yet been addressed. This may prompt reconsider-
ation of the aptness of certain grammatical categories within the theoretical
frameworks from which they originally stem.

6 LIMITATIONS
This is merely an explorative case study and it is not based on hundreds of
viewers or hundreds of images. The Pettson image cannot be considered rep-
resentative of images in general, and therefore all tendencies revealed by this
investigation have to be tested with a whole range of other images and with
larger groups of viewers. Also, there is a need for carefully designed controlled
experiments that would investigate in a systematic way the role of factors such
as individual differences and expertise, the role of the task or instruction and
the role of bottom-up and top-down processes for visual segmentation.
The fact that the social semiotic dynamic visual rank scale applied
here is only a first tentative proposition is also a limitation.
Moreover, the dynamic rank scale needs further discussion and development.
Another potential limitation may be that the eye-tracking and verbal proto-
cols are restricted by limitations in the respondents' knowledge and awareness.



The respondents can only express what they have concepts and words for, and
implicit factors that may subliminally impact the meaning-making will not be
mentioned. This may to a certain degree be compensated for by the combination with
the social/grammatical approach. Also, the verbalization process itself has an
impact on the perceptual process.
Due to limited space, a number of issues emerging from combining
the two approaches are not pursued in this article. Also, a number of nuances
which are significant in each approach could not be elaborated in detail
because the focus is directed towards the integration of the two perspectives,
rather than an individual examination of each approach. This is always a chal-
lenge in explorative interdisciplinary studies.

7 DISCUSSION
This study compares the social semiotic model of visual segmentation with
eye-tracking studies of image viewing and simultaneous image description.
Our main thesis was verified as we found that visual segmentation was trace-
able in the behaviour of viewers who perceive visuals while creating meaning.
We found the concept of the visual zoom applicable in both a social semiotic
and a perceptual cognitive approach. Also, we found quite coincident seg-
mentation categories (rank scale levels) and ranking mechanisms in the two
approaches, as many of the semiotic categories were demonstrated by eye-
tracking and verbal protocols. The shared inspiration from gestalt theory was one
clear common denominator, which facilitated uniting the cognitive and social
semiotic approaches to visual segmentation.
The different approaches revealed different perspectives on the same
phenomenon and there were of course discrepancies in what was emphasized
by the two approaches. Certain distinct aspects that appear important in the
empirical study of image perception were not as accentuated in the social
semiotic approach (and vice versa). Among these discrepancies were: (1) the
temporal aspects of image perception and the dynamics of visual segmenta-
tion; (2) the role of individual differences and expertise; (3) the role of the
context, task, instruction or goal; (4) the role of an implicit model reader; (5)
the role of interpersonal segmentation; (6) the role of paradigmatic relations
within system resources; and (7) the understanding of salience. Space con-
straints prohibit us from further elaboration of these discrepancies, but they
would all be very interesting areas for future investigation.
The semiotic and the reception focus yielded interesting perspec-
tives on each other. From both perspectives, and from their combination, we
found that rank scale segmentation plays a very important role in visual
meaning-making. This indicates that the two approaches can indeed sup-
port each other, and in combining the two perspectives and methodologies,
we came to a better understanding of the dynamics of visual segmentation
and the underlying cognitive processes. Even though this is merely a first



explorative study in an area that needs much more research, we find it plau-
sible to suggest that taking similar interdisciplinary approaches to this as
well as to other multimodal phenomena could be very fruitful.

ACKNOWLEDGEMENTS
We would like to thank Kay O'Halloran, Anders Björkvall, Roger Johansson
and the Eye Tracking Group at Lund Humanities Lab for their comments on
previous versions of the manuscript. The work has been supported by the
Linnaeus Center for Thinking in Time: Cognition, Communication and
Learning (CCL) at Lund University, funded by the Swedish Research Council
(grant no. 349-2007-8695). The work has also been supported by the Faculty of
Humanities and the Institute of Language and Communication at University
of Southern Denmark.

REFERENCES
Baldry, A.P. and Thibault, P.J. (2006) Multimodal Transcription and Text
Analysis. London: Equinox.
Bateman, J.A. (2008) Multimodality and Genre: A Foundation for the Systematic
Analysis of Multimodal Documents. London: Palgrave Macmillan.
Boeriis, M. (2008) 'Mastering Multimodal Complexity', in N. Nørgaard (ed.)
Systemic Functional Linguistics in Use. Odense Working Papers in
Language and Communication, Vol. 29, pp. 219–36. Odense: University
of Southern Denmark.
Boeriis, M. (2009) 'Multimodal Socialsemiotik og Levende Billeder', PhD
thesis, University of Southern Denmark, Odense.
Boeriis, M. (2012) 'Tekstzoom – om en dynamisk funktionel rangstruktur
i visuelle tekster', in T. Andersen and M. Boeriis (eds) Nordisk
Socialsemiotik – multimodale, pædagogiske og sprogvidenskabelige
landvindinger, pp. 131–153. Odense: University Press of Southern
Denmark.
Buswell, G.T. (1935) How People Look at Pictures: A Study of the Psychology of
Perception in Art. Chicago: University of Chicago Press.
Chafe, W.L. (1994) Discourse, Consciousness, and Time: The Flow and
Displacement of Conscious Experience in Speaking and Writing. Chicago:
University of Chicago Press.
DeLiang Wang (2003) 'Visual Scene Segmentation', in M.A. Arbib (ed.) The
Handbook of Brain Theory and Neural Networks, 2nd edn, pp. 1215–19.
Cambridge, MA: MIT Press.
Dewhurst, R. et al. (2012) 'It Depends on How You Look at It: Scanpath
Comparison in Multiple Dimensions with MultiMatch, a Vector-Based
Approach', Behavior Research Methods. Published online 31 May 2012.
DOI 10.3758/s13428-012-0212-2.
Foulsham, T. and Underwood, G. (2007) 'How Does the Purpose of Inspection
Influence the Potency of Visual Saliency in Scene Perception?',
Perception 36: 1123–38.
Halliday, M.A.K. and Matthiessen, C.M.I.M. (2004) An Introduction to
Functional Grammar, 3rd edn. London: Arnold.
Harding, G. and Bloj, M. (2010) 'Real and Predicted Influence of Image
Manipulations on Eye Movements during Scene Recognition', Journal
of Vision 10(2), 11 February: 8.
Henderson, J.M. (2007) 'Regarding Scenes', Current Directions in Psychological
Science 16(4): 219–22.
Henderson, J.M. and Ferreira, F. (eds) (2004) The Integration of Language,
Vision, and Action: Eye Movements and the Visual World. New York:
Psychology Press.
Holsanova, J. (2001) 'Picture Viewing and Picture Description: Two
Windows on the Mind', doctoral dissertation, Lund University
Cognitive Studies 83.
Holsanova, J. (2008) Discourse, Vision, and Cognition. Human Cognitive
Processes 23. Amsterdam: John Benjamins Publishing Company.
Holsanova, J. (2011) 'How We Focus Attention in Image Viewing, Image
Description, and during Mental Imagery', in K. Sachs-Hombach and
R. Totzke (eds) Bilder, Sehen, Denken, pp. 291–313. Cologne: Herbert
von Halem Verlag.
Holsanova, J., Rahm, H. and Holmqvist, K. (2006) 'Entry Points and Reading
Paths on the Newspaper Spread: Comparing Semiotic Analysis with
Eye-Tracking Measurements', Visual Communication 5(1): 65–93.
Itti, L. and Koch, C. (2000) 'A Saliency-Based Search Mechanism for Overt and
Covert Shifts of Visual Attention', Vision Research 40: 1489–1506.
Köhler, W. (1947) Gestalt Psychology: An Introduction to New Concepts in
Modern Psychology. New York: Liveright Publishing Corporation.
Kress, G. (2010) Multimodality: A Social Semiotic Approach to Contemporary
Communication. New York: Routledge.
Kress, G. and Van Leeuwen, T. (2001) Multimodal Discourse: The Modes and
Media of Contemporary Communication. London: Arnold.
Kress, G. and Van Leeuwen, T. (2006[1996]) Reading Images: The Grammar of
Visual Design. London: Routledge.
Nordqvist, S. (1990) Kackel i trädgårdslandet. Bromma, Sweden: Opal.
Noton, D. and Stark, L. (1971) 'Scanpaths in Saccadic Eye Movements while
Viewing and Recognizing Patterns', Vision Research 11: 929–42.
Nyström, M. and Holmqvist, K. (2008) 'Semantic Override of Low-level
Features in Image Viewing – Both Initially and Overall', Journal of Eye
Movement Research 2(2): 2, 1–11.
O'Halloran, K.L. (2004) 'Visual Semiosis in Film', in K.L. O'Halloran (ed.)
Multimodal Discourse Analysis, pp. 109–30. London: Continuum.
O'Toole, M. (2011[1994]) The Language of Displayed Art, 2nd edn. London:
Routledge.
Palmer, S.E. (1999) Vision Science: Photons to Phenomenology. Cambridge,
MA: MIT Press.
Unema, P.J.A. et al. (2005) 'Time Course of Information Processing during
Scene Perception', Visual Cognition 12(3): 473–494.
Van Leeuwen, T. (2005) Introducing Social Semiotics. London: Routledge.
Yarbus, A.L. (1967) Eye Movements and Vision (1st Russian edn 1965). New
York: Plenum.

BIOGRAPHICAL NOTES
MORTEN BOERIIS is an Assistant Professor at the Institute of Language
and Communication at the University of Southern Denmark. He specializes
in multimodality, visual communication, moving images and business com-
munication, and teaches various courses at the Department of International
Business Communication.
Address: Institute of Language and Communication, University of Southern
Denmark, Campusvej 55, DK-5230 Odense M, Denmark. [email: boeriis@
language.sdu.dk]

JANA HOLSANOVA is Associate Professor in Cognitive Science at


Lund University, Senior Researcher in the Linnaeus environment Cognition,
Communication, Learning (CCL) and Project Leader at the Humanities Laboratory
in Lund. She has been elected the Vice Chair/Chair Elect of the Visual
Communication Division, International Communication Association (2011–
2014). She has been using eye-tracking methodology to study image per-
ception, the interplay between language and images, the role of images for
learning, visual thinking and interaction with various media. Her publica-
tions include Discourse, Vision and Cognition (Benjamins, 2008) and Myths
and Facts about Reading: On the Interplay between Language and Pictures in
Various Media (Norstedts, 2010).
Address: Cognitive Science Department, Lund University, Kungshuset,
Lundagård, S-222 22 Lund, Sweden. [email: jana.holsanova@lucs.lu.se]

