
POLITECNICO DI TORINO

DOCTORAL SCHOOL
PhD Programme in Computer and Automation Engineering (Ingegneria Informatica e dell'automazione) – XVIII cycle

PhD Thesis

Architectures and Algorithms for


Intelligent Web Applications
How to bring more intelligence to the web and beyond

Dario Bonino

Advisor: prof. Fulvio Corno
PhD Programme Coordinator: prof. Pietro Laface

December 2005
Acknowledgements

Many people deserve a very grateful acknowledgment for their role in supporting me
during these long and exciting three years. First of all I would like to thank my adviser, Fulvio Corno,
who always supported and guided me toward the best decisions and solutions.
Together with Fulvio I also want to warmly thank Laura Farinetti: she was there
any time I needed her help, for both insightful discussions and stupid questions.
Thanks to all my colleagues in the e-Lite research group for their constant support,
for their kindness and their ability to ignore my bad moments. Particular thanks
go to Alessio for being not only my best colleague and competitor, but also one of
the best friends I've ever had. The same goes for Paolo: our railway discussions have
been so interesting and useful!
Thanks to Mike, who introduced me to many Linux secrets, to Franco for being
the calmest person I've ever known, and to Alessandro “the Eye tracker”, the best
surfer I've ever met.
Thanks to my parents Ercolino and Laura and to my sister Serena: they have
always been my springboard and my unbreakable backbone.
Thank you to all the people I have met in these years who are not cited here:
I am very glad to have been with you, even if only for a few moments. Thank you!

Contents

Acknowledgements I

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 The Semantic Web vision 8


2.1 Semantic Web Technologies . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Explicit Metadata . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Semantic Web Languages . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 RDF and RDF-Schema . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 OWL languages . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 OWL in a nutshell . . . . . . . . . . . . . . . . . . . . . . . . 20

3 Web applications for Information Management 25


3.1 The knowledge sharing and life cycle model . . . . . . . . . . . . . . 26
3.2 Software tools for knowledge management . . . . . . . . . . . . . . . 27
3.2.1 Content Management Systems (CMS) . . . . . . . . . . . . . . 28
3.2.2 Information Retrieval systems . . . . . . . . . . . . . . . . . . 34
3.2.3 e-Learning systems . . . . . . . . . . . . . . . . . . . . . . . . 41

4 Requirements for Semantic Web Applications 49


4.1 Functional requirements . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Use Cases for Functional Requirements . . . . . . . . . . . . . . . . . 53
4.2.1 The “Semantic what’s related” . . . . . . . . . . . . . . . . . 53
4.2.2 The “Directory search” . . . . . . . . . . . . . . . . . . . . . . 53

4.2.3 The “Semi-automatic classification” . . . . . . . . . . . . . . . 54
4.3 Non-functional requirements . . . . . . . . . . . . . . . . . . . . . . . 56

5 The H-DOSE platform: logical architecture 59


5.1 The basic components of the H-DOSE semantic platform . . . . . . . 60
5.1.1 Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1.2 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.3 Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Principles of Semantic Resource Retrieval . . . . . . . . . . . . . . . . 69
5.2.1 Searching for instances . . . . . . . . . . . . . . . . . . . . . . 69
5.2.2 Dealing with annotations . . . . . . . . . . . . . . . . . . . . . 70
5.2.3 Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.4 Searching by conceptual spectra . . . . . . . . . . . . . . . . . 73
5.3 Bridging the gap between syntax and semantics . . . . . . . . . . . . 74
5.3.1 Focus-based synset expansion . . . . . . . . . . . . . . . . . . 75
5.3.2 Statistical integration . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Experimental evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.1 Multilingual approach results . . . . . . . . . . . . . . . . . . 79
5.4.2 Conceptual spectra experiments . . . . . . . . . . . . . . . . . 81
5.4.3 Automatic learning of text-to-concept mappings . . . . . . . . 83

6 The H-DOSE platform 85


6.1 A layered view of H-DOSE . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.1 Service Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.2 Kernel Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.3 Data-access layer . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.1.4 Management and maintenance sub-system . . . . . . . . . . . 98
6.2 Application scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.2 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Implementation issues . . . . . . . . . . . . . . . . . . . . . . . . . . 105

7 Case studies 107


7.1 The Passepartout case study . . . . . . . . . . . . . . . . . . . . . . . 107
7.1.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2 The Moodle case study . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 The CABLE case study . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3.1 System architecture . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3.2 mH-DOSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.4 The Shortbread case study . . . . . . . . . . . . . . . . . . . . . . . . 119
7.4.1 System architecture . . . . . . . . . . . . . . . . . . . . . . . . 120

7.4.2 Typical Operation Scenario . . . . . . . . . . . . . . . . . . . 122
7.4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 123

8 H-DOSE related tools and utilities 125


8.1 Genetic refinement of semantic annotations . . . . . . . . . . . . . . . 126
8.1.1 Semantics powered annotation refinement . . . . . . . . . . . 128
8.1.2 Evolutionary refiner . . . . . . . . . . . . . . . . . . . . . . . . 131
8.2 OntoSphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.2.1 Proposed approach . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2.2 Implementation and preliminary results . . . . . . . . . . . . . 145

9 Semantics beyond the Web 149


9.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.1.1 Application Interfaces, Hardware and Appliances . . . . . . . 152
9.1.2 Device Drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.3 Communication Layer . . . . . . . . . . . . . . . . . . . . . . 154
9.1.4 Event Handler . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
9.1.5 House Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
9.1.6 Domotic Intelligence System . . . . . . . . . . . . . . . . . . . 156
9.1.7 Event Logger . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.1.8 Rule Miner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.1.9 Run-Time Engine . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.1.10 User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.2 Testing environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.2.1 BTicino MyHome System . . . . . . . . . . . . . . . . . . . . 158
9.2.2 Parallel port and LEDs . . . . . . . . . . . . . . . . . . . . . . 158
9.2.3 Music Server (MServ) . . . . . . . . . . . . . . . . . . . . . . 159
9.2.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 159
9.3 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

10 Related Works 161


10.1 Automatic annotation . . . . . . . . . . . . . . . . . . . . . . . . . . 162
10.2 Multilingual issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
10.2.1 Term-related issues . . . . . . . . . . . . . . . . . . . . . . . . 164
10.3 Semantic search and retrieval . . . . . . . . . . . . . . . . . . . . . . 165
10.4 Ontology visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 166
10.5 Domotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

11 Conclusions and Future works 172


11.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
11.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

Bibliography 178

A Publications 186

Chapter 1

Introduction

The Semantic Web will be an extension of the current Web in which
information is given well-defined meaning, better enabling computers
and people to work in cooperation.

Tim Berners-Lee

The new generation of the Web will enable humans to gain wisdom of
living, working, playing, and learning, in addition to information search
and knowledge queries.

Ning Zhong

1.1 Motivation
Over the last decade the World Wide Web has gained great momentum, rapidly
becoming a fundamental part of our everyday life. In personal communication, as well
as in business, the impact of the global network has completely changed the way
people interact with each other and with machines. This revolution touches all
aspects of people's lives and is gradually pushing the world toward a “Knowledge
society” where the most valuable resources will no longer be material but informational.
The way we think of computers has also been influenced by this development:
we are in fact evolving from thinking of computers as “computing engines” to consid-
ering them “gateways”, or entry points, to the newly available information
highways.
The popularity of the Web has led to the exponential growth of published pages
and services observed in these years. Companies now offer web pages to advertise
and sell their products. Learning institutions present teaching material
and on-line training facilities. Governments provide web-accessible administrative
services to ease citizens' lives. Users build up communities to exchange any kind
of information and/or to form more powerful market actors able to survive in this
global ecosystem.
This stunning success is also the curse of the current web: most of today's Web
content is only suitable for human consumption, but the huge amount of available
information makes it increasingly difficult for users to find and access the infor-
mation they need. Under these conditions, keyword-based search engines, such as AltaVista,
Yahoo, and Google, are the main tools for using today's Web. However, there are
serious problems associated with their use:

• High recall, low precision: relevant pages are buried among thousands
of pages of low interest.

• Low recall: although rarer, it sometimes happens that queries get no answer
because they are formulated with the wrong words.

• Results are sensitive to vocabularies: for example, the adoption of
different synonyms of the same keyword may lead to different results.

• Searches are for documents and not for information.

Even if the search process is successful, the result is only a “relevant” set of web
pages that the user must scan to find the required information. In a sense, the
term used to classify the involved technologies, Information Retrieval, is in this case
rather misleading, and Location Retrieval might be a better name.
The critical point is that, at present, machines are not able, without heuristics
and tricks, to understand documents published on the web and to extract only the
relevant information from pages. Of course there are tools that can retrieve text,
split phrases, count words, etc. But when it comes to interpreting and extracting
useful data for the users, the capabilities of current software are still limited.
One solution to this problem consists in keeping information as it currently
is and in developing sophisticated tools that use artificial intelligence techniques
and computational linguistics to “understand” what is written in web pages. This
approach has been pursued for a while but, at present, still appears too ambitious.
Another approach is to define the web in a more machine-understandable fashion
and to use intelligent techniques to take advantage of this representation.
This plan of revolutionizing the web is usually referred to as the Semantic Web
initiative and is only a single aspect of the next evolution of the web, the Wisdom
Web.
It is important to notice that the Semantic Web does not aim at being parallel to the
World Wide Web; instead, it aims at evolving the Web into a new knowledge-centric,
global network. Such a new network will be populated by intelligent web agents able
to act on behalf of their human counterparts, taking into account the semantics
(meaning) of information. Users will once more be at the center of the Web, but they will
be able to communicate and to use information through a more human-like interaction,
and they will also be provided with ubiquitous access to such information.

1.2 Domain: the long road from today's Web to the Wisdom Web
Starting from the current web, the ongoing evolution aims at transforming today's
syntactic World Wide Web into the future Wisdom Web. The fundamental
capabilities of this new network include:

• Autonomic Web support, allowing the design of self-regulating systems able to
cooperate autonomously with other available applications/information sources.

• Problem Solving capabilities, for specifying, identifying and solving roles,
settings and relationships between services.

• Semantics, ensuring the “right” understanding of involved concepts and the
right context for service interactions.

• Meta-Knowledge, for defining and/or addressing the spatial and temporal
constraints or conflicts in planning and executing services.

• Planning, for enabling services or agents to autonomously reach their goals
and subgoals.

• Personalization, for understanding recent encounters and for relating different
episodes together.

• A sense of humor, so that services on the Wisdom Web will be able to interact
with users on a personal level.

These capabilities stem from several active research fields, including the Multi-
Agent community, the Semantic Web community and the Ubiquitous Computing com-
munity, and have an impact on, or make use of, technologies developed for databases,
computational grids, social networks, etc. The Wisdom Web, in a sense, is the place where
most of the currently available technologies and their evolutions will join in a single
scenario with a “devastating” impact on human society.
Many steps, however, still separate us from this future web, the Semantic Web
initiative being the first serious and world-wide attempt to build the necessary
infrastructure.


Semantics is one of the cornerstones of the Wisdom Web. It is founded on
the formal definition of the “concepts” involved in web pages and in
services available on the web. Such a formal definition needs two critical compo-
nents: knowledge models and associations between knowledge and resources. The
former are known as Ontologies while the latter are usually referred to as Semantic
Annotations.
There are several issues related to the introduction of semantics on the web: how
should knowledge be modeled? How should knowledge be associated with real-world entities
or with real-world information?
The Semantic Web initiative builds upon the evidence that creating a common,
monolithic, all-encompassing knowledge model is infeasible and, instead, assumes
that each actor playing a role on the web shall be able to define its own model
according to its own view of the world. In the SW vision, the global knowledge
model is the result of a shared effort in building and linking the single models
developed all around the world, much like what happened for the current Web.
Of course, in such a process conflicts will arise, and they will be solved by proper tools
for mapping “conceptually similar” entities in the different models.
The definition of knowledge models alone does not introduce semantics on the
web; to obtain such a result another step must be taken: resources must be
“associated” with the models. The current trend is to perform this association by
means of Semantic Annotations. A Semantic Annotation is basically a link between
a resource on the web, be it a web page, a video or a music piece, and one or
more concepts defined by an ontology. So, for example, the pages of a news site can
be associated with the concept news in a given ontology by means of an annotation,
in the form of a triple: about(site, news).
Semantic annotations are not only for documents or web pages but can be used
to associate semantics with nearly all kinds of informative, physical and meta-physical
entities in the world.
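Expressed in the RDF notation introduced in Chapter 2, such an annotation might look like the following minimal sketch; the namespaces, the about property and the URLs are purely illustrative and not taken from any specific ontology:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:ann="http://www.example.org/annotation#">
  <!-- links the news site (the resource) to the concept "news" in an ontology -->
  <rdf:Description rdf:about="http://www.example-news-site.com/">
    <ann:about rdf:resource="http://www.example.org/ontology#news"/>
  </rdf:Description>
</rdf:RDF>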
Many issues are involved in the annotation process. Just to mention some
of them: at which granularity shall annotations be defined? How can we decide
whether a given annotation is trusted or not? Who shall annotate resources: the
creator, or anyone on the web?
To answer these and the previous questions, standard languages and practices
for semantic markup are needed, together with a formal logic for reasoning about
the available knowledge and for turning implicit information into explicit facts.
Alongside them, database engines for storing semantics-rich data, and search engines
offering new question-answering interfaces, constitute the informatics backbone of
the new information highway so defined.
Other pillars of the forthcoming Wisdom Web, which are more “technical”, in-
clude autonomic systems, planning and problem solving, etc. For them, great im-
provements are currently being provided by the ever active community of Intelligent
and Multi-Agent systems. In this case the problems involved are slightly different
from the ones cited above, although semantics integration can be interesting in
many respects. The main research stream is in fact about designing machines able
to think and act rationally. In simple terms, the main concern in this field is to
define systems able to autonomously pursue goals and subgoals defined either by
humans or by other systems.
Meta-knowledge plays a crucial role in such a process, providing means for mod-
eling the spatial and temporal constraints, or conflicts, that may arise between agents'
goals, and it can be, in turn, strongly based on semantics. Knowledge modeling can, in
fact, support the definition and discovery of similarities and relationships between
constraints, in a way that is independent of the dialects with which each single
agent composes its world understanding.
Personalization, finally, is not a new discipline: it is historically concerned
with interaction modalities between users and machines, defining methodologies and
instruments to design usable interfaces for all people, be they “normal” users or
diversely able people. It has an impact on, or interacts with, many research fields,
starting from Human Computer Interaction and encompassing Statistical Analysis
of user preferences and Prediction Systems. Personalization is a key factor for the
exploitation of the Wisdom Web: until usable and efficient interfaces become available,
this new web, in which the available information will be many orders of magnitude
wider than in the current one, will not be adopted.
The good news is that all the cited issues can be solved without requiring
revolutionary scientific progress. We can in fact reasonably claim that the challenge
is only in engineering and technological adoption, as partial solutions to all the
relevant parts of the scenario already exist. At present, the greatest needs
appear to be in the areas of integration, standardization, development of tools and
adoption by users.

1.3 Contribution
In this Thesis, methodologies and techniques for paving the way that starts from
today's web applications and leads to the Wisdom Web have been studied, with
a particular focus on information retrieval systems, content management systems
and e-Learning systems. A new platform for supporting the easy integration of
semantic information into today's systems has been designed and developed, and
has been applied to several case studies: a custom-made CMS, a publicly available
e-Learning system (Moodle [1]), an intelligent proxy for web navigation (Muffin [2])
and a life-long learning system developed in the context of the CABLE project [3]
(an EU-funded Minerva project).
In addition, some extensions of the proposed system to environments that share
with the Web the underlying infrastructure and the communication and interaction
paradigms have been studied. A case study is provided for domotic systems.
Several contributions to the state of the art in semantic systems can be found in the
components of the platform, including: an extension of the T.R. Gruber ontology
definition, which allows multilingual knowledge domains to be supported transparently; a
new annotation “expansion” system that leverages the information encoded
in ontologies to extend semantic annotations; and a new “conceptual” search
paradigm based on a compact representation of semantic annotations called Con-
ceptual Spectrum. The semantic platform discussed in this thesis is named H-DOSE
(Holistic Distributed Semantic Elaboration Platform) and is currently available as
an Open Source project on Sourceforge: http://dose.sourceforge.net.
H-DOSE has been entirely developed in Java to allow better interoperability
with existing web systems and is currently deployed as a set of web services
running on the Apache Tomcat servlet container. It is currently available in two
different forms: one intended for micro enterprises, characterized by a small footprint
on the server on which it runs, and one for small and medium enterprises, which
integrates the ability to distribute jobs across different machines by means of agents,
and which includes principles of autonomic computing for keeping the underlying
knowledge base constantly up to date. Rather than being an isolated attempt at
semantics integration in the current web, H-DOSE is still a very active project and is
undergoing several improvements and refinements to better support the indexing
and retrieval of non-textual information such as video clips, audio pieces, etc. There
is also some ongoing work on the integration of H-DOSE into competitive intelligence
systems, as done by IntelliSemantic: a start-up of the Politecnico di Torino that builds
its business plan on the adoption of semantic techniques, and in particular of the
H-DOSE platform, for patent discovery services.
Finally, several side issues related to semantics handling and deployment in
web applications have been addressed during the H-DOSE design; some of them will
also be presented in this thesis. A newly designed ontology visualization tool based
on multi-dimensional information spaces is an example.

1.4 Structure of the Thesis


The remainder of this thesis is organized as follows:
Chapter 2 introduces the vision of the Semantic Web and discusses the data-model,
standards, and technologies used to bring this vision into being. These building
blocks are used in the design of H-DOSE, trying to maximize the reuse of already
available and well-tested technologies and thus avoiding reinventing the wheel.
Chapter 3 moves in parallel with the preceding chapter, introducing an overview
of currently available web applications with a particular focus on systems for infor-
mation management such as Content Management Systems, indexing and retrieval
systems, and e-Learning systems. For every category of application, the points in which
semantics can give substantial improvements, either in effectiveness (performance)
or in user experience, are highlighted.
Chapter 4 defines the requirements for the H-DOSE semantic platform, as they
emerge from interviews with web actors such as content publishers, site administra-
tors and so on.
Chapter 5 introduces the H-DOSE logical architecture, and uses such architec-
ture as a guide for discussing the basic principles and assumptions on which the
platform is built. For every innovative principle, the strengths are highlighted
together with the weaknesses that emerged either during the presentation of such ele-
ments at international conferences and workshops or during the H-DOSE design and
development process.
Chapter 6 describes the H-DOSE platform in detail, focusing on the role
and the interactions of every single component of the platform. The main
concern of this chapter is to provide a complete view of the platform, in its more
specific aspects, discussing the adopted solutions from a “software engineering” point
of view.
Chapter 7 presents the case studies that constituted the benchmark of the
H-DOSE platform. Each case study is addressed separately starting from a brief
description of requirements and going through the integration design process, the
deployment of the H-DOSE platform and the phase of results gathering and analysis.
Chapter 8 is about the H-DOSE related tools developed during the platform
design and implementation. They include a new ontology visualization tool and a
genetic algorithm for semantic annotations refinement.
Chapter 9 discusses the extension of H-DOSE principles and techniques to
non-Web scenarios, with a particular focus on domotics. An ongoing project on
semantics-rich house gateways is described, highlighting how the lessons learned
in the design and development of H-DOSE can be applied in a completely different
scenario while still retaining their value.
Chapter 10 presents the related works in the field of both Semantic Web and
Web Intelligence, with a particular focus on semantic platforms and semantics inte-
gration on the Web.
Chapter 11 finally concludes the thesis and provides an overview of possible
future works.

Chapter 2

The Semantic Web vision

This chapter introduces the vision of the Semantic Web and discusses the
data-model, standards, and technologies used to bring this vision into
being. These building blocks are used in the design of H-DOSE, trying to
maximize the reuse of already available and well-tested technologies and
thus avoiding reinventing the wheel.

The Semantic Web is developed layer by layer; the pragmatic justification for such
a procedure is that it is easier to achieve consensus on small steps, whereas it is
much harder to make everyone agree on very wide proposals. In fact there are many
research groups that are exploring different and sometimes conflicting solutions.
After all, competition is one of the major driving forces of scientific development.
Such competition makes it very hard to reach agreement on wide steps, and often
only a partial consensus can be achieved. The Semantic Web builds upon the steps
for which consensus can be reached, instead of waiting to see which alternative
research line will be successful in the end.
The Semantic Web requires that companies, research groups and users build
tools, add content and use that content. It is certainly myopic to wait until the full
vision materializes: it may take another ten years to realize the full extent of
the SW, and many more years for the Wisdom Web.
In evolving from one layer to another, two principles are usually followed:

• Downward compatibility: applications, or agents, fully compliant with a
layer shall also be aware of the lower layers, i.e., they shall be able to interpret
and use information coming from those layers. As an example we can consider
an application able to understand the OWL semantics. The same application
shall also take full advantage of information encoded in RDF and RDF-S [4].

• Partial upward understanding: agents fully aware of a given layer should
take at least partial advantage of information at higher levels. So, an RDF-
aware agent should also be able to use information encoded in OWL [5], ig-
noring those elements that go beyond RDF and RDF Schema.

Figure 2.1. The Semantic Web “cake”.

The layered cake of the Semantic Web is shown in Figure 2.1 and describes the
main components involved in the realization of the Semantic Web vision (due to
Tim Berners-Lee). At the bottom lies XML (eXtensible Markup Language),
a language for writing well-structured documents according to a user-defined vocab-
ulary. XML is a “de facto” standard for the exchange of information over the World
Wide Web. On top of XML sits the RDF layer.
RDF is a simple data model for writing statements about Web objects. RDF is
not XML, however it has a XML-based syntax, so it is located, in the cake, over the
XML layer.
RDF-Schema defines the vocabulary used in RDF data models. It can be seen
as a very primitive language for defining ontologies, as it provides the basic building
blocks for organizing Web objects into hierarchies. Supported constructs include:
classes and properties, the subClass and subProperty relations and the domain and
range restrictions. RDF-Schema uses an RDF syntax.
The Logic layer is used to further enhance the ontology support offered by
RDF-Schema, allowing application-specific declarative knowledge to be modeled.
The Proof layer, instead, involves the process of deductive reasoning as well as
the process of providing and representing proofs in Web languages. Applications
lying at the proof level shall be able to reason about the knowledge data defined in
the lower layers and to provide conclusions together with “explanations” (proofs)
about the deductive process leading to them.


The Trust layer, finally, will emerge through the adoption of digital signatures
and other kinds of knowledge, based on recommendations by trusted agents, by
rating and certification agencies or even by consumer organizations. The expression
“Web of Trust” means that trust over the Web will be organized in the same
distributed and sometimes chaotic way as the WWW itself. Trust is crucial for the
final exploitation of the Semantic Web vision: until users have trust in its
operations (security) and in the quality of the information provided (relevance), the SW will
not reach its full potential.

2.1 Semantic Web Technologies


The Semantic Web cake depicted above builds upon the so-called Semantic Web
Technologies. These technologies empower the foundational components of the SW,
which are introduced separately in the following subsections.

2.1.1 Explicit Metadata


At present, the World Wide Web is mainly formatted for human users rather than
for programs. Pages, either static or dynamically built using information stored
in databases, are written in HTML or XHTML. A typical web page of an ICT
consultancy agency can look like this:

<html>
<head></head>
<body>
<h1> SpiderNet internet consultancy,
network applications and more </h1>
<p> Welcome to the SpiderNet web site, we offer
a wide variety of ICT services related to the net.
<br/> Adam Jenkins, our graphics designer has designed many
of the most famous web sites as you can see in
<a href="gallery.html">the gallery</a>.
Matt Kirkpatrick is our Java guru and is able to develop
any new kind of functionalities you may need.
<br/> If you are seeking a great new opportunity
for your business on the web, contact us at the
following e-mails:
<ul>
<li>jenkins@spidernet.net</li>
<li>kirkpatrick@spidernet.net</li>
</ul>
Or you may visit us in the following opening hours
<ul>
<li>Mon 11am - 7pm</li>
<li>Tue 11am - 2pm</li>
<li>Wed 11am - 2pm</li>
<li>Thu 11am - 2pm</li>
<li>Fri 2pm - 9pm</li>
</ul>
Please note that we are closed every weekend and every festivity.
</p>
</body>
</html>

For people, the provided information is presented in a rather satisfactory way,
but for machines this document is nearly incomprehensible. Keyword-based
techniques might be able to identify the words web site, graphics designer and Java.
An intelligent agent could identify the email addresses and the personnel of the
agency, and with a little bit of heuristics it might associate each employee with the
correct e-mail address. But it would have trouble distinguishing who is the graphics
designer and who is the Java developer, and even more difficulty capturing the
opening hours (for which the agent would have to understand which festivities are
celebrated during the year, and on which days, depending on the location of the
agency, which in turn is not explicitly available in the web page). The Semantic
Web does not try to address these issues by developing super-intelligent agents able to
understand information as humans do. Instead it acts on the HTML side, trying to
replace this language with more appropriate languages so that web pages can carry
their content in a machine-processable form, while still remaining visually appealing to
users. In addition to formatting information for human users, these new web
pages will also carry information about their content, such as:

<company type="consultancy">
<service>Web Consultancy</service>
<products> Web pages, Web applications </products>
<staff>
<graphicsDesigner>Adam Jenkins</graphicsDesigner>
<javaDeveloper>Matt Kirkpatrick</javaDeveloper>
</staff>
</company>


This representation is much easier for machines to understand and is usually
known as metadata, which means: data about data. Metadata encodes, in a sense,
the meaning of data, thus defining the semantics of a web document (hence the term
Semantic Web).

2.1.2 Ontologies
The term ontology stems from philosophy. In that context, it is used to name a
subfield of philosophy, namely the study of the nature of existence (from the Greek
ὀντολογία), the branch of metaphysics concerned with identifying, in general terms,
the kinds of things that actually exist, and how to describe them. For example, the
observation that the world is made up of specific entities that can be grouped into
abstract classes based on shared properties is a typical ontological commitment.
For what concerns today's technologies, the term ontology has been given a specific
meaning that is quite different from the original one. For the purposes of this thesis,
T.R. Gruber's definition, later refined by R. Studer, can be adopted: An ontology
is an explicit and formal specification of a conceptualization.
In other words, an ontology formally describes a knowledge domain. Typically,
an ontology is composed of a finite list of terms and the relationships between these
terms. The terms denote important concepts (classes of objects) of the domain.
Relationships include, among others, hierarchies of classes. A hierarchy
specifies a class C to be a subclass of another class C′ if every object in C is also
included in C′. Apart from the subclass relationship (also known as the “is a” relation),
ontologies may include information such as:
• properties (X makes Y )
• value restrictions (only smiths can make iron tools)
• disjointness statements (teachers and secretary staff are disjoint)
• specification of logical relationships between objects
In the context of the web, ontologies provide a shared understanding of a domain.
Such an understanding is necessary to overcome differences in terminology. As an
example a web application may use the term “ZIP” for the same information that in
another one is denoted as “area code”. Another problem is when two applications
use the same term with different meanings. Such differences can be overcome by
associating a particular terminology with a shared ontology, and/or by defining
mappings between different ontologies. In both cases, it is easy to notice that
ontologies support semantic interoperability.
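As a small, hedged sketch of such a mapping (the namespaces and property names are invented for illustration), the OWL language presented later in this chapter offers an owl:equivalentProperty construct that can state that the two terms denote the same information:

<owl:DatatypeProperty rdf:about="http://shopA.example.org/schema#ZIP">
  <!-- the two vocabularies use different names for the same information -->
  <owl:equivalentProperty
      rdf:resource="http://shopB.example.org/schema#areaCode"/>
</owl:DatatypeProperty>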
Ontologies are also useful for improving the results of Web searches. The search
engine can look for pages that refer to a precise concept, or set of concepts, in
an ontology instead of collecting all pages in which certain, possibly ambiguous,
keywords occur. In the same way as above, ontologies make it possible to overcome
differences in terminology between Web pages and queries. In addition, when performing
ontology-based searches it is possible to exploit generalization and specialization
information. If a query fails to find any relevant documents (or provides too many
results), the search engine can suggest to the user a more general (specific) query
[6]. It is even conceivable that the search engine runs such queries proactively, in
order to reduce the reaction time in case the user adopts such suggestion.
Ontologies can even be used to better organize Web sites and their navigation.
Many of today's sites offer, on the left-hand side of their pages, the top levels
of a concept hierarchy of terms. The user may click on them to expand the sub-
categories and finally reach new pages on the same site.
In the Semantic Web layered approach, ontologies are located in between the
third layer of RDF and RDF-S and the fourth level of abstraction where the Web
Ontology Language (OWL) resides.

2.2 Logic
Logic is the discipline that studies the principles of reasoning; in general, it offers
formal languages for expressing knowledge and well-understood formal semantics.
Logic usually works with so-called declarative knowledge, which describes what
holds without caring about how it can be deduced.
Deduction can be performed by automated reasoners: software entities that have
been extensively studied in Artificial Intelligence. Logical deduction (inference) makes it possible
to transform implicit knowledge defined in a domain model (ontology) into explicit
knowledge. For example, if a knowledge base contains the following axioms in pred-
icate logic,

human(X) → mammal(X)
PhDstudent(X) → human(X)
PhDstudent(Dario)

an automated inferencing engine can easily deduce that

human(Dario)
mammal(Dario)
PhDstudent(X) → mammal(X)
Logic can therefore be used to uncover ontological knowledge that is implicitly given
and, by doing so, it can help reveal unexpected relationships and inconsistencies.


But logic is more general than ontologies and can also be used by agents for
making decisions and selecting courses of action, for example.
Generally there is a trade-off between expressive power and computational ef-
ficiency. The more expressive a logic is, the more computationally expensive it
becomes for drawing conclusions. And drawing conclusions can sometimes be impos-
sible when non-computability barriers are encountered. Fortunately, a considerable
part of the knowledge relevant to the Semantic Web seems to be of a relatively re-
stricted form, and the required subset of logics is almost tractable, and is supported
by efficient reasoning tools.
Another important aspect of logic, especially in the context of the Semantic Web,
is the ability to provide explanations (proofs) for the conclusions: the series of infer-
ences can be retraced. Moreover, AI researchers have developed ways of presenting
proofs in a human-friendly fashion, by organizing them as natural deductions and
by grouping, in a single element, a number of small inference steps that a person
would typically consider a single proof step.
Explanations are important for the Semantic Web because they increase the
users' confidence in Semantic Web agents. Tim Berners-Lee even speaks of an “Oh
yeah?” button that would ask for an explanation.
Of course, for logic to be useful on the Web, it must be usable in conjunction with
other data, and it must be machine processable as well. From these requirements
stem today's research efforts on representing logical knowledge and proofs in
Web languages. Initial approaches work at the XML level, but in the future rules
and proofs will need to be represented at the level of ontology languages such as
OWL.

2.2.1 Agents
Agents are software entities that work autonomously and proactively. Conceptually
they evolved out of the concepts of object-oriented programming and of component-
based software development.
According to Tim Berners-Lee's article [7], a Semantic Web agent shall be
able to receive some tasks and preferences from the user, seek information from Web
sources, communicate with other agents, compare information about user require-
ments and preferences, select certain choices, and give answers back to the user.
Agents will not replace human users on the Semantic Web, nor will they necessarily
make decisions. In most cases their role will be to collect and organize information,
and present choices for the users to select from.
Semantic Web agents will make use of all the outlined technologies, in particular:

• Metadata will be used to identify and extract information from Web Sources.


• Ontologies will be used to assist in Web searches, to interpret retrieved infor-
mation, and to communicate with other agents.

• Logic will be used for processing retrieved information and for drawing con-
clusions.

2.3 Semantic Web Languages


2.3.1 RDF and RDF-Schema
RDF is essentially a data-model and its basic building block is an object-attribute-
value triple, called a statement. An example of a statement is: Kimba is a Lion.
This abstract data-model needs a concrete syntax to be represented and ex-
changed, and RDF has been given an XML syntax. As a result, it inherits the
advantages of the XML language. However, it is important to notice that other
representations of RDF, not in XML syntax, are possible; N3 is an example.
RDF is, by itself, domain independent: no assumptions about a particular application
domain are made. It is up to each user to define the terminology to be used
in his/her RDF data-model using a schema language called RDF-Schema (RDF-S).
RDF-Schema defines the terms that can be used in an RDF data-model. In RDF-S
we can specify which objects exist, which properties can be applied to them, and
what values they can take. We can also describe the relationships between objects,
so, for example, we can write: The lion is a carnivore.
This sentence means that all lions are carnivores. Clearly there is an intended
meaning for the “is a” relation. It is not up to applications to interpret the “is a”
term; its intended meaning shall be respected by all RDF processing software. By
fixing the meaning of some elements, RDF-Schema enables developers to model
specific knowledge domains.
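A minimal RDF-Schema fragment capturing the lion/carnivore example above might look as follows (the class names are chosen for illustration):

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdfs:Class rdf:ID="Carnivore"/>
  <!-- "The lion is a carnivore": every Lion is also a Carnivore -->
  <rdfs:Class rdf:ID="Lion">
    <rdfs:subClassOf rdf:resource="#Carnivore"/>
  </rdfs:Class>
</rdf:RDF>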
The principal elements of the RDF data-model are: resources, properties and
statements.
Resources are the objects we want to talk about. Resources may be authors,
cities, hotels, places, people, etc. Every resource is identified by a sort of identity
ID, called a URI. URI stands for Uniform Resource Identifier and provides a means to
uniquely identify a resource, be it available on the web or not. URIs do not imply
the actual accessibility of a resource and are therefore suitable not only for web
resources but also for printed books, phone numbers, people and so on.
Properties are a special kind of resource that describe relations between the
objects of the RDF data-model, for example: “written by”, “eats”, “lives”, “title”,
“color”, “age”, and so on. Properties are also identified by URIs. This choice allows,
on the one hand, the adoption of a global, worldwide naming scheme and, on the other hand,
the writing of statements having a property either as subject or as object. URIs also make it possible
to solve the homonym problem that has been the plague of distributed data representation
until now.
Statements assert the properties of resources. They are object-attribute-value
triples consisting respectively of a resource, a property and a value. Values can
either be resources or literals. Literals are atomic values (strings) that can have a
specific XSD type, xsd:double for example. A typical example of a statement is:

the H-DOSE website is hosted by www.sourceforge.net.

This statement can be rewritten in a triple form:

("H-DOSE web site", "hosted by", "www.sourceforge.net")

and in RDF it can be modeled as:

<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:mydomain="http://www.mydomain.net/my-rdf-ns">
<rdf:Description rdf:about="http://dose.sourceforge.net">
<mydomain:hostedBy>
http://www.sourceforge.net
</mydomain:hostedBy>
</rdf:Description>
</rdf:RDF>

One of the major strengths of RDF is so-called reification: in RDF it
is possible to make statements about statements, such as:

Mike thinks that Joy has stolen his diary

This kind of statement makes it possible to model belief in, or trust of, other statements, which
is important for some kinds of applications. In addition, reification allows non-binary
relations to be modeled using triples. The key idea, since RDF only supports triples,
i.e., binary relationships, is to introduce an auxiliary object and relate it to each of
the parts of the non-binary relation through the properties subject, predicate and
object.
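For instance, the statement about Mike and Joy above could be reified as in the following sketch. Here rdf:Statement, rdf:subject, rdf:predicate and rdf:object are the standard RDF reification vocabulary, while the ex: prefix, the thinks and hasStolen properties and the resource names are illustrative:

<!-- the reified statement: "Joy has stolen Mike's diary" -->
<rdf:Description rdf:ID="theTheft">
  <rdf:type
      rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement"/>
  <rdf:subject rdf:resource="#Joy"/>
  <rdf:predicate rdf:resource="#hasStolen"/>
  <rdf:object rdf:resource="#MikesDiary"/>
</rdf:Description>
<!-- Mike's belief is then an ordinary statement about the reified one -->
<rdf:Description rdf:about="#Mike">
  <ex:thinks rdf:resource="#theTheft"/>
</rdf:Description>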
So, for example, suppose we want to represent the tertiary relationship referee(X,Y,Z)
having the following well-defined meaning:

X is the referee in a tennis game between players Y and Z.


Figure 2.2. Representation of a tertiary predicate.

we have to break it into three binary relations, adding an auxiliary resource called
tennisGame, as in Figure 2.2.
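One possible RDF encoding of this decomposition is sketched below; the property names (referee, player) and the resources are illustrative and may differ from the exact labels used in Figure 2.2:

<!-- auxiliary resource standing for the whole game -->
<rdf:Description rdf:ID="tennisGame1">
  <ex:referee rdf:resource="#John"/> <!-- X -->
  <ex:player rdf:resource="#Tim"/>   <!-- Y -->
  <ex:player rdf:resource="#Rod"/>   <!-- Z -->
</rdf:Description>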

RDF critical view

As already mentioned, RDF only uses binary properties. This restriction could be a quite
limiting factor, since we usually adopt predicates with more than two arguments.
Fortunately, reification allows this issue to be overcome. However, some critical aspects
arise from the adoption of the reification mechanism. First, although the solution
is sound, the problem remains that non-binary predicates are more natural with
more arguments. Secondly, reification is a quite complex and powerful technique,
which may appear misplaced in a basic layer of the Semantic Web; it would
have appeared more natural to include it in more powerful layers that provide richer
representational capabilities.
In addition, the XML syntax of RDF is quite verbose and can easily become too
cumbersome to be managed directly by users, especially for huge data-models. Hence
the adoption of user-friendly tools that automatically translate higher level
representations into RDF.
Finally, RDF is a standard format, therefore the benefits of drafting data in
RDF can be seen as similar to those of drafting information in HTML in the early days of
the Web.

From RDF/RDF-S to OWL

The expressiveness of RDF and RDF-Schema (described above) is very limited (and
this is a deliberate choice): RDF is roughly limited to modeling binary relationships, and
RDF-S is limited to sub-class hierarchies and property hierarchies, with restrictions
on the domain and range of the latter.


However, a number of research groups have identified different characteristic use-
cases for the Semantic Web that would require much more expressiveness than RDF
and RDF-S offer. Initiatives from both Europe and the United States came up with
proposals for richer languages, respectively named OIL and DAML-ONT, whose
merger, DAML+OIL, was taken by the W3C as the starting point for the Web
Ontology Language OWL.
Ontology languages must allow users to write explicit, formal conceptualizations
of domain knowledge; the main requirements are therefore:

• a well defined syntax,

• a formal semantics,

• an efficient reasoning support,

• a sufficient expressive power,

• a convenience of expression.

The importance of a well-defined syntax is clear, and known from the area of pro-
gramming languages: it is a necessary condition for “machine understandability”
and thus for machine processing of information. Both RDF/RDF-S and OWL have
this kind of syntax. A formal semantics makes it possible to describe the meaning of knowledge
precisely. Precisely means that the semantics does not refer to subjective intuitions and
is not open to different interpretations by different people (or different machines).
The importance of a formal semantics is well known, for example, in the domain of
mathematical logic. Formal semantics is needed to allow people to reason about
knowledge. This, for ontologies, means that we may reason about:

• Class membership. If x is an instance of a class C, and C is a subclass of D,
we can infer that x is also an instance of D.

• Equivalence of classes. If a class A is equivalent to a class B, and B is equiv-
alent to C, then A is equivalent to C, too.

• Consistency. Let x be an instance of A, and suppose that A is a subclass of
B ∩ C and of D. Now suppose that B and D are disjoint. There is a clear
inconsistency in our model because A should be empty but has the instance
x. Inconsistencies like this indicate errors in the ontology definition.

• Classification. If we have declared that certain property-value pairs are suffi-
cient conditions for membership in a class A, then if an individual (instance)
x satisfies such conditions, we can conclude that x must be an instance of A.


Semantics is a prerequisite for reasoning support. Derivations such as the preced-
ing ones can be made by machines instead of being made by hand. Reasoning is
important because it allows one to:

• check the consistency of the ontology and of the knowledge model,

• check for unintended relationships between classes,

• automatically classify instances.

Automatic reasoning allows many more cases to be checked than could be checked man-
ually. Such checks become critical when developing large ontologies, where multiple
authors are involved, as well as when integrating and sharing ontologies from various
sources.
Formal semantics is obtained by defining an explicit mapping between an ontol-
ogy language and a known logic formalism, and by using automated reasoners that
already exist for that formalism. OWL, for instance, is (partially) mapped on de-
scription logic, and makes use of existing reasoners such as Fact, Pellet and RACER.
Description logics are a subset of predicate logic for which efficient reasoning support
is possible.

2.3.2 OWL languages


The full set of requirements for an ontology language is: efficient reasoning support
and convenience of expression, for a language as powerful as the combination of
RDF-Schema with full logic. These requirements have been the main motivation
for the W3C Ontology Group to split OWL into three different sub-languages, each
targeted at different aspects of the full set of requirements.

OWL Full
The entire Web Ontology Language is called OWL Full and uses all the OWL lan-
guage primitives. It allows, in addition, the combination of these primitives in
arbitrary ways with RDF and RDF-Schema. This includes the possibility (already
present in RDF) of changing the meaning of the predefined (RDF and OWL) prim-
itives, by applying the language primitives to each other. For example, in OWL Full
it is possible to impose a cardinality constraint on the class of all classes, essentially
limiting the number of classes that can be described in an ontology.
The advantage of OWL Full is that it is fully upward-compatible with RDF,
both syntactically and semantically: any legal RDF document is also a legal OWL
Full document, and any valid RDF/RDF-S conclusion is also a valid OWL Full
conclusion. The disadvantage of OWL Full is that the language has become
so powerful as to be undecidable, dashing any hope of complete (or efficient)
reasoning support.

OWL DL
In order to regain computational efficiency, OWL DL (DL stands for Description
Logic) is a sub-language of OWL Full that restricts how the constructors from
RDF and OWL may be used: application of OWL's constructors to each other is
prohibited, so as to ensure that the language corresponds to a well studied description
logic.
The advantage of this is that it permits efficient reasoning support. The dis-
advantage is the loss of full compatibility with RDF: an RDF document will, in
general, have to be extended in some ways and restricted in others before becoming
a legal OWL DL document. Every OWL DL document is, in turn, a legal RDF
document.

OWL Lite
An even further restriction limits OWL DL to a subset of the language constructors.
For example, OWL Lite excludes enumerated classes, disjointness statements, and
arbitrary cardinality.
The advantage of this is a language that is both easier to grasp for users and easier
to implement for developers. The disadvantage is of course a restricted expressivity.

2.3.3 OWL in a nutshell


Header
OWL documents are usually called OWL ontologies and they are RDF documents.
The root element of an ontology is an rdf:RDF element, which specifies a number
of namespaces:

<rdf:RDF
xmlns:owl = "http://www.w3.org/2002/07/owl#"
xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd = "http://www.w3.org/2001/XMLSchema#">

An OWL ontology can start with a set of assertions for housekeeping purposes. These
assertions are grouped under an owl:Ontology element, which contains comments,
version control, and inclusion of other ontologies.


<owl:Ontology rdf:about="">
<rdfs:comment>A simple OWL ontology</rdfs:comment>
<owl:priorVersion
rdf:resource="http://www.domain.net/ontologyold"/>
<owl:imports
rdf:resource="http://www.domain2.org/savanna"/>
<rdfs:label>Africa animals ontology</rdfs:label>
</owl:Ontology>

The most important of the above assertions is owl:imports, which lists other
ontologies whose content is assumed to be part of the current ontology. It is impor-
tant to be aware that owl:imports is a transitive property: if ontology A
imports ontology B, and ontology B imports ontology C, then A also
imports C.

Classes
Classes are defined using the owl:Class element and can be organized in hierarchies
by means of the rdfs:subClassOf construct.

<owl:Class rdf:ID="Lion">
<rdfs:subClassOf rdf:resource="#Carnivore"/>
</owl:Class>

It is also possible to indicate that two classes are completely disjoint, such as the
herbivores and the carnivores, using the owl:disjointWith construct.

<owl:Class rdf:about="#carnivore">
<owl:disjointWith rdf:resource="#herbivore"/>
<owl:disjointWith rdf:resource="#omnivore"/>
</owl:Class>

Equivalence of classes may be defined using the owl:equivalentClass element.
Finally, there are two predefined classes, owl:Thing and owl:Nothing, which,
respectively, indicate the most general class, containing everything in an OWL doc-
ument, and the empty class. As a consequence, every owl:Class is a subclass of
owl:Thing and a superclass of owl:Nothing.
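As a small sketch (the class names are illustrative), two classes modeling the same animal in different vocabularies could be declared equivalent as:

<owl:Class rdf:about="#Lion">
  <owl:equivalentClass rdf:resource="#PantheraLeo"/>
</owl:Class>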

Properties
OWL defines two kinds of properties:


• Object properties, which relate objects to other objects. An example, in the
savanna ontology, is the relation eats.
• Datatype properties, which relate objects to datatype values. Examples are
age, name, and so on. OWL does not have any predefined data types, nor does it
provide special definition facilities. Instead, it allows the use of XML-Schema
datatypes, making use of the layered architecture of the Semantic Web.
Here are two examples, the first for a datatype property and the second
for an object property:

<owl:DatatypeProperty rdf:ID="age">
<rdfs:range rdf:resource="&xsd;#nonNegativeInteger"/>
</owl:DatatypeProperty>

<owl:ObjectProperty rdf:ID="eats">
<rdfs:domain rdf:resource="#animal"/>
</owl:ObjectProperty>

More than one domain and range can be declared; in such a case the intersection
of the domains (ranges) is taken. OWL makes it possible to identify “inverse properties”:
for them a specific OWL element exists (owl:inverseOf), which relates a
property to its inverse by interchanging the domain and range definitions.
<owl:ObjectProperty rdf:ID="eatenBy">
<owl:inverseOf rdf:resource="#eats"/>
</owl:ObjectProperty>
Finally, equivalence of properties can be defined through the owl:equivalentProperty
element.
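A minimal sketch (the property name feedsOn is illustrative):

<owl:ObjectProperty rdf:ID="feedsOn">
  <owl:equivalentProperty rdf:resource="#eats"/>
</owl:ObjectProperty>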

Restrictions on properties
In RDFS it is possible to declare a class C as a subclass of a class C′; then every
instance of C will also be an instance of C′. OWL additionally allows the specification
of classes C′ that satisfy some precise conditions, i.e., classes all of whose instances
satisfy the conditions. One can then state that all instances of C satisfy those
conditions by defining C as a subclass of the class C′ which collects all the objects
satisfying them; in general, C′ remains anonymous. In OWL there are three specific
elements for defining classes based on restrictions: owl:allValuesFrom,
owl:someValuesFrom and owl:hasValue, and they are always nested inside an
owl:Restriction element. The owl:allValuesFrom element specifies a universal
quantification (∀), while owl:someValuesFrom defines an existential quantification (∃).


<owl:Class rdf:about="#firstYearCourse">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#isTaughtBy"/>
      <owl:allValuesFrom
        rdf:resource="#Professor"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

This example requires every person who teaches an instance of “firstYearCourse”,
e.g., a first year subject, to be a professor (universal quantification).

<owl:Class rdf:about="#academicStaffMember">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#teaches"/>
      <owl:someValuesFrom
        rdf:resource="#undergraduateCourse"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

This second example, instead, requires that each academic staff member teach at
least one undergraduate course (existential quantification).
In general, an owl:Restriction element contains an owl:onProperty element
and one or more restriction declarations. Restrictions for defining the cardinality
of a given class are also supported through the elements:

• owl:minCardinality,

• owl:maxCardinality,

• owl:cardinality.

The latter is a shortcut for a cardinality definition in which owl:minCardinality
and owl:maxCardinality assume the same value.
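As an illustrative sketch, reusing the firstYearCourse and isTaughtBy names from the examples above, a cardinality restriction stating that such a course is taught by exactly one entity could be written as:

<owl:Class rdf:about="#firstYearCourse">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#isTaughtBy"/>
      <owl:cardinality
        rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger"
      >1</owl:cardinality>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>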

Special properties
Some characteristics of a property can be defined directly:


• owl:TransitiveProperty defines a transitive property, such as “has better
grade than”, “is older than”, etc.
• owl:SymmetricProperty defines a symmetric property, such as “has same
grade as” or “is sibling of”.
• owl:FunctionalProperty defines a property that has at most one value for
each object, such as “age”, “height”, “directSupervisor”, etc.
• owl:InverseFunctionalProperty defines a property for which two different
objects cannot have the same value, for example “is identity ID for”.
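A minimal sketch of how such a characteristic can be attached to a property (the property name isOlderThan is illustrative):

<owl:TransitiveProperty rdf:ID="isOlderThan">
  <rdfs:domain rdf:resource="#animal"/>
  <rdfs:range rdf:resource="#animal"/>
</owl:TransitiveProperty>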

Instances
Instances of classes, in OWL, are declared as in RDF:
<rdf:Description rdf:ID="Kimba">
  <rdf:type rdf:resource="#Lion"/>
</rdf:Description>
OWL, unlike typical database systems, does not adopt a unique-names assumption;
therefore, two instances that have different names are not necessarily two different
individuals. Thus, to ensure that different individuals are recognized as such by
automated reasoners, inequality must be explicitly asserted.
<lecturer rdf:ID="91145">
  <owl:differentFrom rdf:resource="#98760"/>
</lecturer>
Because such inequality statements occur frequently, and the required number of
statements would explode when stating the inequality of a large number of individuals,
OWL provides a shorthand notation to assert the pairwise inequality of all the
individuals in a list: owl:AllDifferent.
<owl:AllDifferent>
  <owl:distinctMembers rdf:parseType="Collection">
    <lecturer rdf:about="#91345"/>
    <lecturer rdf:about="#91247"/>
    <lecturer rdf:about="#95647"/>
    <lecturer rdf:about="#98920"/>
  </owl:distinctMembers>
</owl:AllDifferent>
Note that owl:distinctMembers can only be used in combination with the
owl:AllDifferent element.

Chapter 3

Web applications for Information Management

This chapter provides an overview of currently available web applications,
with a particular focus on systems for information management such as
Content Management Systems, indexing and retrieval systems, and e-Learning
systems. For every category of applications, the points where semantics can
give substantial improvements, either in effectiveness (performance) or in
user experience, are highlighted.

Many businesses and activities, either on the web or not, are human and knowledge
intensive. Examples include consulting, advertising, media, high-tech, pharmaceuti-
cal, law, software development, etc. Knowledge intensive organizations have already
found that a large number of problems can be attributed to un-captured and un-
shared product and process knowledge, as well as to the lack of “who knows what”
information, to the need to capture lessons learned and best practices, and to the
need for more effective distance collaboration.
These realizations are leading to a growing call for knowledge management.
Knowledge capture and learning can happen ad-hoc (e.g. knowledge sharing around
the coffee maker or problem discussions around the water cooler). Sharing, however,
is more efficient when organized; moreover, knowledge must be captured, stored and
organized according to the context of each company, or organization, in order to be
useful and efficiently disseminated.
The knowledge items that an organization usually needs to manage can have
different forms and contents. They include manuals, correspondence with vendors
and customers, news, competitor intelligence, and knowledge derived from work
processes (e.g. documentation, proposals, project plans, etc.), possibly in different
formats (text, pictures, video). The amount of information and knowledge that a
modern organization must capture, store and share, the geographic distribution of


sources and consumers, and the dynamic evolution of information make the use of
technology nearly mandatory.

3.1 The knowledge sharing and life cycle model


One of the most widely adopted knowledge sharing models was developed by Nonaka
and Takeuchi in 1995 [8] and is called the “tacit-explicit model” (see Figure 3.1).
Referring to this model, tacit knowledge is knowledge that rests with the employee,

Figure 3.1. The Tacit-Explicit Model.

and explicit knowledge is knowledge that resides in the knowledge base. Conversion
of knowledge from one form to another often leads to the creation of new knowledge;
such conversion may follow four different patterns:

• Explicit-to-explicit knowledge conversion, or “Combination”, is the reconfiguration
of explicit knowledge through sorting, adding, combining and categorizing.
Often this process leads to new knowledge discovery.

• Explicit-to-tacit knowledge conversion, or “Internalization”, takes place when
one assimilates knowledge acquired from knowledge items. This internalization
contributes to the user’s tacit knowledge and helps him/her in making future
decisions.


• Tacit-to-explicit knowledge conversion, or “Externalization”, involves transforming
context-based facts into context-free knowledge, with the help of analogies.
Tacit knowledge is usually personal and depends on the person’s experiences in
various conditions. As a consequence, it is a strongly contextualized resource.
Once explicit, it will not retain much value unless context information is somehow
preserved. Externalization can take two forms: recorded or unrecorded knowledge.

• Tacit-to-tacit knowledge conversion, or “Socialization”, occurs by sharing experiences,
by working together in a team, and, more in general, by direct exchange of
knowledge. Knowledge exchange at places where people socialize, such as around
the coffeemaker or the water cooler, leads to tacit-to-tacit conversion.

The “knowledge life cycle” takes the path of knowledge creation/acquisition,
knowledge organization and storage, distribution of knowledge, application and
reuse, and finally ends up in creation/acquisition again, in a sort of spiral-shaped
process (Figure 3.1).
Tacit knowledge has to be made explicit in order to be captured and made
available to all the actors of a given organization. This is accomplished with the aid
of knowledge acquisition or knowledge creation tools. Knowledge acquisition builds
and evolves the knowledge bases of organizations. Knowledge organization/storage
takes place through activities by which knowledge is organized, classified and stored
in repositories. Explicit knowledge needs to be organized and indexed for easy
browsing and retrieving. It must be stored efficiently to minimize the required
storage space.
Knowledge can be distributed through various channels such as training pro-
grams, automatic knowledge distribution systems and knowledge-based expert sys-
tems. Regardless of the way knowledge is distributed, making the knowledge
base of a given organization available to its users, i.e., delivering the right infor-
mation at the right place and time, is one of the most critical assets for a modern
company.

3.2 Software tools for knowledge management


Knowledge management (KM) shall be supported by a collection of technologies for
authoring, indexing, classifying, storing, contextualizing and retrieving information,
as well as for collaboration and application of knowledge. A friendly front-end and
a robust back-end are the basis for KM software tools. Involved elements include,
among others, content management systems for efficiently publishing and sharing
knowledge and data sources, indexing, classification and retrieval systems to ease


access to information stored in knowledge bases, and e-Learning systems for allowing users
to perform “Internalization”, possibly in a personalized way.
These technologies are still far from being completely semantic, i.e., based on
context- and domain-aware elements able to fully support knowledge operations at
a higher level, similar to what is usually done by humans while, at the same time,
remaining accessible to machines.
The following sub-sections take as a reference three widely adopted technologies,
namely content management systems (CMS), search engines and e-Learning systems,
and for each of them discuss the basic principles and “how and where” semantics
can improve their performance. Performance is evaluated both from the functional
point of view and from the knowledge transfer point of view.
For each technology, a separate subsection is therefore provided, with a brief
introduction, a discussion of currently available solutions and some considerations
about the integration of semantic functionalities. Finally, the shortcomings
and solutions identified will define the requirements for a general-purpose platform
able to provide semantics integration on the web with minimal effort.

3.2.1 Content Management Systems (CMS)


In terms of knowledge management, the documents that an organization produces,
and publishes, represent its explicit knowledge. New knowledge can be created by
efficiently managing document production and classification: for example, de-facto
experts can be identified based on authorship of documents. Document and, more in
general, content management systems (CMS) enable explicit-to-explicit knowledge
conversion.
A CMS is, in simple terms, a software system designed to manage a complete web site
and the related information. It keeps track of changes, by recording who
changed what, and when, and can allow notes to be added to each managed resource.
A writer can create a new page (with standard navigation bars and images on each
page) and submit it, without using HTML-specific software. In the same way, an
editor can receive all the proposed changes and approve them, or he/she can send
them back to the writer to be corrected. Many other functionalities are supported,
such as sending e-mails, contacting other users of the CMS, setting up forums or
chats, etc. In a sense, a CMS is the central point for “presentation independent”
exchange of information on the web and/or on organization intranets.
The separation between presentation and creation of published resources is, indeed,
one of the added values of CMS adoption. A CMS allows editors to concentrate
their efforts on content production, while the graphical presentation issues are carried
out by the system itself, so that the resulting web site is both coherent and
homogeneously organized. The same separation allows a better organization of the web
site navigation and the definition of the cognitive path that users are expected to follow


when browsing the available pages.

CMS common features


Almost all CMSs help organizations to achieve the following goals:

• Streamline and automate content administration. Historically, Web content
has consisted of static pages/files of HTML, requiring HTML programming ex-
perience and manual updating of content and design: clearly a time-consuming
and labor-intensive process. In contrast, CMSs significantly reduce this over-
head by hiding the complexities of HTML and by automating the management
of content.

• Implement Web-forms-based content administration. In an ideal CMS, all con-
tent administration is performed through Web forms using a Web browser.
Proprietary software and specialized expertise (such as HTML) are not re-
quired for content managers. Users simply copy and paste existing content or
fill in the blanks on a form.

• Distribute content management and control. The Web manager has often been
a critical bottleneck in the timely publication and ongoing maintenance of Web
content. CMSs remove that bottleneck by distributing content management
responsibilities to individuals throughout the organization. Those individuals
who are responsible for content, now have the authority and tools to maintain
the content themselves, without any knowledge of HTML, graphic design, or
Web publishing.

• Separate content from layout and design. In a CMS, content is stored sepa-
rately from its publication format. Content managers enter the content only
once, but it can appear in many different places, formatted using very differ-
ent layouts and graphic designs. All the pages immediately reflect approved
content changes.

• Create reusable content repositories. CMSs allow for reuse of content. Objects
such as templates, graphics, images, and content are created and entered once
and then reused as needed throughout the Web site.

• Implement central graphic design management. Graphic design in a CMS
becomes template-driven and centrally managed. Templates are the structures
that format and display content following a request from a user for a particular
Web page. Templates ensure a consistent, professional look and feel for all
content on the site. They also allow for (relatively) easy and simultaneous
modification of an entire site graphic design.


• Automate workflow management. Good CMSs enable good workflow pro-
cesses. In the most complex workflow systems, at least three different in-
dividuals create, approve, and publish a piece of content, working separately
and independently. A good workflow system expedites the timely publica-
tion of content by alerting the next person in the chain when an action is
required. It also ensures that content is adequately reviewed and approved
before publication.

• Build sophisticated content access and security. Good CMSs allow for sophis-
ticated control of content access, both for content managers who create and
maintain content and for users who view and use it. Web managers should be
able to define who has access to different types of information and what type
of access each person has.

• Make content administration database-driven. In a CMS, static, flat HTML
pages no longer exist. Instead, the system places most content in a rela-
tional database capable of storing a variety of text and binary materials. The
database then becomes the central repository for contents, templates, graphics,
users and metadata.

• Include structures to collect and store metadata. Because data is stored sepa-
rately from both layout and design, the database also stores metadata describ-
ing and defining the data, usually including author, creation date, publication
and expiration dates, content descriptions and indexing information, cate-
gories information, revision history, security and access information, and a
range of other content-related data.

• Allow for customization and integration with legacy systems. CMSs allow cus-
tomization of the site functionality through programming. They can expose
their functionalities through an application programming interface (API) and
they can coexist with, and integrate, already deployed legacy systems.

• Allow for archiving and version control. High-end CMSs usually provide mech-
anisms for storing and managing revisions to content. As changes are made,
the system stores archives of the content and allows reversion of any page to
an earlier version. The system also provides means for pruning the archived
content periodically, preferably on the basis of criteria including age, location,
number of versions, etc.

Logical architecture
A typical CMS is organized as in Figure 3.2.


Figure 3.2. A typical CMS architecture.

It is composed of five macro-components: the Editing front-end, the Site organization
module, the Review system, the Theme management module and the
Publication system.
The Editing front-end is the starting point of the publishing chain implemented
by the CMS. This component usually allows “journalists”, i.e., persons that produce
new contents, to submit their writings. After submission, the new content is adapted
for efficient storage and, if a classification module exists, it is indexed (either
automatically or manually) in order to later allow effective retrieval of stored
resources.
The Review system constitutes the other side of the submission process, and it is
designed to support the work flow of successive reviews that occur between the first
submission of a document and the final publication. Users allowed to interact with
this component are usually more expert users, which have the ability/authority to
review the contents submitted by journalists. They can approve the submitted data
or they can send it back to the journalists for further modifications. The approval
process is differently deployed depending both on the CMS class (high-end, middle
level or entry-level) and on the redaction paradigm adopted: either based on a single
review or on multiple reviews.
The Site organization module provides the interface for organizing the published
information into a coherent and possibly usable web site. This module is specifically
targeted at the definition of the navigation patterns to be proposed to the final users,
i.e. to the definition of the site map. Depending on the class of the CMS system, the


site organization is either designed by journalists or it is proposed by journalists and
subsequently reviewed by editors, possibly in a complete iterative cycle. Since many
conflicting site maps may arise, even with few journalists, today's content
management systems usually adopt the latter publication paradigm.
The Theme management component is in charge of the complete definition of the
site appearance. Graphic designers interact with this module by loading pictures
and graphical decorations, by defining publication styles, by creating dynamic and
interactive elements (menus for example), etc. The graphical aspect of a site is a
critical asset both for the site success (for sites published on the Web) and for the site
effectiveness in terms of user experience (usability and accessibility issues); therefore,
care must be paid when developing themes and decorations. Usually the theme creation
is not moderated by the CMS. In addition, many of the currently available systems
do not allow on-line editing of graphical presentations. Instead, each designer shall
develop its own theme (or its part of a theme) and shall upload such theme on the
CMS as a whole element. The final publication is subject to the editor approval,
however, there are usually no means, for an editor, to set up an iterative refinement
cycle similar to the document publication process.
The Publication system is both the most customizable and the most visible com-
ponent of a CMS. Its main duty is to pick-up and publish resources from the CMS
document base. Only approved documents can be published while the resources un-
der review should remain invisible to the users (except to journalists and editors).
The publishing system adopts the site map defined by means of the organization
module, and stored in the CMS database, to organize the information to be pub-
lished. Links between resources can either be defined at publication time or can be
automatically defined at runtime according to some (complex) rules. The graphical
presentation of pages depends on the theme defined by graphic designers and ap-
proved by the editors. Depending on company needs and mission, pages can turn out
to be completely accessible, even to people with disabilities (this is mandatory for web
sites providing services to people, and is desirable for every site on the Web), or
completely inaccessible.
In addition to the publication of documents edited by the journalists, the system
can offer many more services to the final viewers, depending on the supported/installed
sub-modules. So, for example, a typical CMS is able to offer mul-
tiple, topic-centered forums, mailing lists, instant messaging, white boards between
on-line users, cooperation systems such as Wikis or Blogs, etc.

Semantics integration

As shown in the aforementioned paragraphs, CMSs are, at the same time, effec-
tive and critical components for knowledge exploitation, especially for explicit to


explicit conversion (Combination) and for explicit to tacit conversion (Internalization).
They often offer some kind of metadata-managing functions, allowing to keep
track of authors of published data, of creation, publication and expiration dates of
documents, and of information for indexing and categorizing the document base.
This information is only semantically consistent with the internal CMS database,
i.e., it roughly corresponds to the fields of the CMS database. As shown by many
years of research in database systems, this is actually a form of semantics, however it
is neither related to external resources nor to explicitly available models. Stored
meta-information, although meaningful inside the CMS-related applications,
will thus not be understandable by external applications, making the whole system
less interoperable.
A Semantic Web system, as well as a future Wisdom Web system, relates, instead,
its internal knowledge to well-known models where possible, such as the
Dublin Core for authorship of documents. Even when sufficiently detailed models
are not available and must be developed “from scratch”, the way metadata is
formalized follows a well-defined standard and is understood by all “semantic-aware”
software and architectures. Given that all Semantic Web applications shall manage
at least RDF/S and the related semantics, interoperability is automatically granted.
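As a purely illustrative sketch (the document URI and property values are hypothetical), authorship metadata for a CMS resource could be expressed in RDF using Dublin Core elements as follows:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/cms/docs/report42">
    <dc:creator>Jane Doe</dc:creator>
    <dc:date>2005-06-15</dc:date>
    <dc:subject>savanna animals</dc:subject>
  </rdf:Description>
</rdf:RDF>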
Indexing and searching the CMS base can also take advantage of semantic infor-
mation. As for metadata (which is in turn strongly related to indexing and retrieval)
current systems already provide several, advanced facilities for allowing users to store
and retrieve their data, in an effective way. Unfortunately, the adopted technologies
are essentially keyword-based.
A keyword based information retrieval sub-system is based on the occurrence of
specific words (called keywords) inside the indexed documents. The most general
words such as articles, conjunctions, and the like, are usually discarded as they are
not useful for distinguishing documents. The remaining terms, instead, are collected
and inversely related to each resource in the CMS base. So, in the end, for each
word a list of the documents in which it occurs is compiled.
Whenever a user performs a query, the query terms are matched against the term
list stored by the retrieval subsystem and the corresponding resources are retrieved
according to properly defined ranking mechanisms. Regardless of the accuracy and
efficiency of the different systems available today, they all share a common and still
problematic issue: they are vocabulary dependent. Being based on keywords found
in the managed document base, these engines are not able to retrieve anything if the
same keywords do not occur both in the user query and in the stored documents.
A more extensive discussion about these topics is available in the Information
Retrieval systems sub-section, however it is easy to notice that semantics integration
can alleviate, if not completely address, these issues by abstracting both queries and
document descriptions to “semantic models”. The matching operation is, in this
case, performed at the level of conceptual models, and if the conversion between


either the query or the document content, and the corresponding conceptual model
has been effectively performed, matches can be found independently of vocabularies.
This interesting capability of finding resources independently of vocabularies
can be leveraged by “Semantic Web” CMSs to offer functionalities much more advanced
than today's. Language independence, for example, can be easily achieved. So
a user can write a query in a given language, which will likely be his/her mother
tongue, and can ask the system to retrieve data both in the query language and
in other languages that he/she can understand. Matching queries and documents
at the conceptual level makes this process fairly easy. The currently available sys-
tems, instead, can usually provide results only in the same language as the query (in
the rather fortunate case in which the query language corresponds to one of the
languages adopted for keyword-based indexing).
Other retrieval related functionalities include the contextual retrieval of “seman-
tically related” pages during user navigation. When a user requires a given page of
the site managed by a semantic CMS, the page conceptual description is picked up
and used to retrieve links to pages, in the site, that have conceptual descriptions
similar to the one of the required page. This allows for example to browse the pub-
lished site by similarity of pages rather than by following the predefined interaction
scenario fixed by the site map.
The impact of semantics on the CMS technology is not limited to storage, classi-
fication and retrieval of resources. Semantics can also be extremely useful in defining
the organization of published sites and in defining the expected work flow for re-
sources produced and reviewed within the CMS environment. There are several
attempts to provide the first CMS implementations in which content is automat-
ically organized and published according to a given ontology. In the same way,
ontologies are used for defining the complex interactions that characterize the docu-
ment review process and the expected steps required for a document to be published
by the CMS.
In conclusion, the introduction of semantics handling in content management
systems can provide several advantages both for what concerns document storage
and retrieval and for what concerns the site navigation and the site publication work
flow.

3.2.2 Information Retrieval systems


Information Retrieval has been one of the most active research streams during the
past decade. It still permeates almost all web-related applications providing either
means, methodologies or techniques for easily accessing resources, be they human
understandable resources, database records or whatever. Information retrieval deals
with the problem of storing, classifying and effectively retrieving resources, i.e.,
information, in computer systems. The find utility in Unix or Microsoft's little search


dog are very simple examples of information retrieval systems. More qualified and
probably more widespread examples are also available, Google [9] above all.
A simple information retrieval system works on the concepts of document in-
dexing, classification and retrieval. These three processes are at the basis of every
computer-based search system. For each of the three, several techniques have been
studied, ranging from the early heuristic-based approaches to today's
statistical and probabilistic methods. The logical architecture of a typical informa-
tion retrieval system is shown in Figure 3.3.

Figure 3.3. The Logical Architecture of a typical Information Retrieval system.

Many blocks can be identified: the Text Operations block performs all the operations
required for adapting the text of documents to the indexing process. As an example
in this block stop words are removed and the remaining words are stemmed. The
Indexing block basically constructs an inverted index of word-to-document pointers.
The searching block retrieves all the documents that contain a given query token
from the inverted index. The ranking block, instead, ranks all the retrieved docu-
ments according to a similarity measure which evaluates how much documents are
similar to queries. The user interface allows users to perform queries and to view
results. Sometimes it also supports some relevance feedback which allows users to
improve the search performances of the IR system by explicitly stating which re-
sults are relevant and which not. In the end, the query operations block transforms
the user query to improve the IR system performances. For example a standard
thesaurus can be used for expanding the user query by adding new relevant terms,
or the query can be transformed by taking into account users’ suggestions coming
from a relevance feedback.
Describing in detail a significant part of all available approaches to information
retrieval and their variants, would require more than one thesis alone. In addition,


the scope of this section is not to be exhaustive with respect to available technologies,
solutions, etc. Instead, the main goal is to provide a rough description of how an
information retrieval system works and a glimpse of what advantages can be implied
by semantics adoption in Information Retrieval. For readers more interested in this
topic, the bibliography section reports several interesting works that can constitute
a good starting point for investigation. Of course, the web is the most viable means
for gathering further resources.
For the sake of simplicity, this section proceeds by adopting the tf · idf weighting
scheme and the vector space model as guiding methodology, and tries to generalize
the provided considerations whenever possible.

Indexing
In the indexing process each searchable resource is analyzed for extracting a suitable
description. This description will be, in turn, used by the classification process and
by the retrieval process. For now we restrict the description of indexing to text-
based documents, i.e., documents which mainly contain human-understandable
terms. In this case, indexing intuitively means taking into account, in some way,
the information conveyed by the words contained in the document to be indexed.
As humans can understand textual documents, information is indeed contained into
them, in a somewhat encoded form. The indexing goal is to extract this information
and to store it in a machine processable form. In performing this extraction two
main approaches are usually adopted: the first one tries to mimic what humans
do and leads to the wide and complex study of Natural Language Processing. The
second, instead, uses information which is much easier for machines to understand,
such as statistical correlation between occurring words, term frequency and so on.
This last solution is actually the one adopted by today's retrieval systems, while
the former finds application only in more restricted search fields where specific
and rather well-defined sub-languages can be found. The tf · idf indexing scheme is
a typical example of “machine level” resource indexing.
The base assumption of tf · idf , and of other more sophisticated methods, is that
information in textual resources is encoded in the adopted terms. The more specific
a term is, the more easily the subject of a document can be inferred. So the main
indexing operations deal with words occurring in resources being analyzed, trying
to extract only the relevant information and to discard all the redundancies typical
of a written language.
In the tf · idf case, the approach works by inspecting the document terms. First,
all the words that usually convey little or no information, such as conjunctions,
articles, adverbs, etc. are removed. They are the so-called stop words and typically
depend on the language in which the given document is written. Removing the
stop words allows frequency-based methods to be adopted without the data being polluted


by non-significant information uniformly occurring in all the documents.


Once documents have been purged of stop words, the tf · idf method evaluates the
frequency of each term occurring in the document, i.e., the number of times that
the word occurs inside the document, with respect to the most frequent term. In
the simplest implementation of tf · idf a vocabulary L defines the words for which
this operation has to be performed.
Let t_i be the i-th term of the vocabulary L. The term frequency tf_i(d) of the
term t_i in a document d is defined as:

tf_i(d) = \frac{freq(t_i, d)}{\max_{t_j \in L} freq(t_j, d)}

where freq(t, d) denotes the number of occurrences of the term t in d.

The term frequency alone is clearly too simplistic a feature for characterizing a
textual resource. Term frequency, in fact, is only a relative measure of how
important (statistically speaking) a word is in a document. However, no information
is provided on the ability of the given word to discriminate the analyzed document
from the others. Therefore a weighting scheme shall be adopted, which takes into
account the frequency with which the same term occurs in the document base. This
weighting scheme is materialized by the inverse document frequency term idf . The
inverse document frequency takes into account the relative frequency of the term ti
with respect to the documents already indexed. Formally:

idf_{t_i} = \log\left(\frac{|D|}{\sum_{d_k \in D} tf_i(d_k)}\right)

where D denotes the set of documents already indexed.

The two terms, i.e., the tf and the idf values are combined into a single value
called tf · idf which describes the ability of the term ti to discriminate the document
di from the others.
tf \cdot idf_{t_i}(d) = tf_i(d) \cdot idf_{t_i}

The set of tf · idf values, computed for each term of the vocabulary L, defines the
representation of a document d inside the Information Retrieval system.
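As a purely illustrative computation with made-up counts (and assuming base-10 logarithms together with the formulas as reconstructed above): if the term lion occurs 6 times in a document d whose most frequent term occurs 12 times, and the term frequencies of lion summed over a collection D of 100 documents amount to 5, then

tf_{lion}(d) = \frac{6}{12} = 0.5, \qquad idf_{lion} = \log_{10}\frac{100}{5} \approx 1.3, \qquad tf \cdot idf_{lion}(d) \approx 0.5 \times 1.3 = 0.65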
It must be noted that this indexing process is strongly vocabulary dependent:
words not occurring in L are not recognized and if they are used in queries they do
not lead to results. The same holds for more complex methods where L is built from
the indexed documents or from a training set of documents: words not occurring
in the set of documents analyzed are not taken into account. So, for example, if in
a set of textual resources only the word horse occurs, a query for stallion will not
provide results, even if horse and stallion can be used as synonyms.
NLP is expected to solve these problems; however, its adoption in information
retrieval systems still appears immature.


Classification and Retrieval

In this section classification and retrieval will be described in parallel. Although
they are quite different processes, the latter can be seen as a particular case of the
former where the category definition is given at runtime, by the user query. It shall
be stressed that this section, as the preceding one, does not aim at being complete
and exhaustive, instead, it aims at making clear some shortcomings of the retrieval
process which are in common with classification and which can be improved by
semantics adoption. According to the previous subsection, the tf · idf method and
the Vector Space model [10] are adopted as reference implementations.
After the indexing process, each document, in the knowledge base managed by
an IR system, has an associated set of features describing its content. In classification,
these features are compared to a class definition (either predefined or learned through
clustering) to evaluate whether documents belong or not to the class. In retrieval,
instead, the same set is compared against a set of features specified by a user in
form of query.
The retrieval (classification) process defines how the comparison shall be per-
formed. In doing so, a similarity measure shall be defined allowing to quantitatively
measure the distance between document descriptions and user queries or category
definitions.
The similarity Sim(di ,dj ) defines the distance, in terms of features, between
resources in a given representation space. Such a measure is usually normalized:
resources having the same description in terms of modeled features get a similarity
score of 1, while resources completely dissimilar receive a similarity score of 0. Please
note that a similarity measure of 1 does not mean that the compared resources are
exactly equal to each other. The similarity measure, in fact, works only on the
resources' features, and two resources can have the same features without being
equal. However, the underlying assumption is that, although diverse, resources
with similar features are “about” the same theme, from a human point of view.
Therefore, the higher the similarity between two resources, the higher the
probability that they have something in common.
The Vector Space model is one of the most widespread retrieval and classification
models. It works on a vectorial space defined by the document features extracted
during the indexing process. In the Vector Space model, the words belonging to the
vocabulary L are considered as the basis of the vectorial space of documents d and
queries q. Documents and queries are in fact expressed in terms of words ti ∈ L and
can therefore be represented in the same space (Figure 3.4). Representing documents
and queries (or class definitions) in the same vectorial space allows the similarity
between these resources to be evaluated in a quite straightforward manner, since the
classical cosine similarity measure can be adopted. In the Vector Space model the
similarity is, in other words, evaluated as the cosine of the hyper-angle between the


Figure 3.4. The Vector Space Model.

vector of features representing a given document and the vector representing a given
query. Similarity between documents and classes can be evaluated in the same way.
Formally, the cosine similarity is defined as:

Sim(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{\|\vec{d_i}\| \cdot \|\vec{d_j}\|}

where \vec{d_i} and \vec{d_j} are the feature vectors of d_i and d_j, respectively. As demonstrated
by the successful application of this model to many real-world case studies, the
Vector Space model is quite an effective solution to the problem of classifying and retrieving
resources. However, at least two shortcomings can be identified. First, the model
works under the assumption that the terms in L compose an orthogonal basis for the space of
documents and queries.
This assumption is clearly not true since words usually appear in groups, depend-
ing on the document (query) type and domain. Secondly, the approach is strongly
influenced by the features extracted from documents, and since these are in most
cases simple, vocabulary-dependent, syntactic features, it also becomes syntactic
and vocabulary dependent. As an example, suppose that a wrong vocabulary Lw
contains the two words horse and stallion and suppose that they are not identified
as synonyms (but actually they are, in the analyzed knowledge domain). If a doc-
ument is composed of the single term horse and another document of the single
term stallion, they are recognized as completely different. In case a user specifies


horse as query keyword, only the first document is retrieved, while the second is
completely missed, and vice-versa.

Semantics Integration
Semantics can play a great role in information retrieval systems, allowing, on one
side, all the issues related to vocabularies and differing terminologies to be addressed
and, on the other side, enabling users to perform “conceptual” queries. Conceptual queries
specify the users’ information need with high level “concept descriptions” and do
not use mere keywords, which can be imprecise and sometimes misleading (try
searching for “jaguar” on Google: will you find cars or animals?).
With respect to the topics described in the previous sections semantics can be
integrated in the indexing process as well as in the retrieval process. In the indexing
process, semantics can be adopted by mapping, i.e., by classifying resources with
respect to a formal ontology.
Clearly the problem of term dependence still remains: resources are indexed as
before and, in addition, a classification task is performed. However this dependence
is somehow mitigated because the ontology acts as a bridge and as a merging point
for the different vocabularies adopted by the IR system. Whatever language is used
and whatever domain-specific vocabulary is adopted, the features resulting from in-
dexing are always concepts, which are in turn language and vocabulary independent.
The same holds for keywords in user queries: using an ontology as a semantic backbone,
synonyms can easily be taken into account, for example by associating many
keywords with each ontology concept (a lexicon).
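For instance, synonym keywords in different languages could be attached to an ontology concept through rdfs:label annotations (a minimal sketch; the class and the labels are purely illustrative):

<owl:Class rdf:ID="Jaguar">
  <rdfs:subClassOf rdf:resource="#Carnivore"/>
  <rdfs:label xml:lang="en">jaguar</rdfs:label>
  <rdfs:label xml:lang="it">giaguaro</rdfs:label>
</owl:Class>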
In the retrieval process, semantics can bypass the synonym-related issues and
can also provide new kinds of searches, which are in principle similar to the well-known
category search. The user, in fact, can directly select the concepts modeled
in the IR ontology as a query. Such a query is then easily converted into retrieved
resources, without vocabulary-related and, most importantly, without domain-related prob-
lems. So, if a user is browsing a naturalistic web site, and performs a search selecting
the concept “jaguar”, only references to animals will be retrieved, the context being
fixed by the site ontology, which is about nature.
Finally, semantics adoption can easily support queries for related documents
and query refinement processes. Queries for related documents start from a sample
page, possibly retrieved by the IR system in a previous query, and require simi-
lar pages that are stored in the IR system knowledge base. The ontology is again
the cornerstone of the process, allowing the system to find resources which have conceptual
descriptions similar to the one of the sample page. The process is completely vo-
cabulary independent and transparently supports cross-lingual operations.
In query refinement, instead, the user specifies a query that provides poor or
insufficiently relevant results. The user can then refine his/her query by selecting new


terms, either more specific or more general. Ontologies can be of aid also in this case: the
semantic relationships occurring between ontology concepts offer, in fact, a built-in
method for query refinement. A semantic IR system can therefore easily allow users
to widen or restrict their queries to find more relevant results. It can even run
queries proactively, in order to respond to user demands in a timely manner.

3.2.3 e-Learning systems


By its name, e-learning can best be understood as any type of learning delivered
electronically. Defined broadly, this can encompass learning products delivered by
computer, intranet, internet, satellite, or other remote technologies. Brandon Hall, a
noted e-learning researcher, defines e-learning as “instruction delivered electronically
wholly by a web browser, through the Internet or an intranet, or through CD-ROM
or DVD multimedia platforms”.
E-learning is sometimes classified as synchronous or asynchronous. Both terms
refer to “the extent to which a course is bound by place and/or time”, according
to The Distance Learner’s Guide (Prentice Hall: 1998). Synchronous simply means
that two or more events occur at the same time, while Asynchronous means that two
or more events occur “not at the same time”. For example, when someone attends
live training, like in a class or workshop, then the event is synchronous, because the
event and the learning occur simultaneously, or at the same time. Asynchronous
learning occurs when somebody takes an on-line course in which he/she completes
events at different times, and when communication occurs via time-delayed email or
in discussion list postings.
Today, many applications of e-learning principles and systems are emerging and
are slowly pervading all the aspects of organization lives, be they big industries or
small enterprises, or even educational institutes. E-learning is currently adopted for
a variety of different situations. For example, it is used to:

• Deliver introductory training to employees, customers, or other personnel

• Offer refresher or remedial training

• Offer training for credentialing, certification, licensing, or advancement

• Offer academic or educational credit via college and university on-line learning

• Promote and inform an audience about products, policies, and services

• Support organizational initiatives by increasing motivation through easily accessible learning

• Offer orientation to geographically dispersed personnel


• Create a variety of essential and nonessential learning opportunities for personnel

• Provide coaching and mentoring through on-line instruction and collaboration

• Build communities of practice using distributed on-line training and communication

• Standardize common training through fixed content accessible to all users

As with any learning medium, the use of e-learning offers benefits as yet not realized
in traditional training, while also presenting new risks to both producers and users.
On the positive side, e-learning products:

• Energize content with illustrations, animations, and other media effects

• Offer increased fidelity to real-world application through scenarios and simulations

• Enable just-in-time, personalized, adaptive, user-centric learning

• Offer flexibility and accessibility

• Engage inexpensive distribution capabilities to reach a potentially worldwide audience

• Create stability and consistency of content due to the ease in which revisions
can be made

• Standardize content by centralizing knowledge and information in one format

• Cross multiple platforms of web browsing software

• Are less expensive to produce and distribute on a large scale than traditional training

• Eliminate travel and lodging expenses required for traditional, in-person train-
ing

• Encourage self-paced instruction by users

• Support increased retention and improved comprehension of content

• Lend themselves to streamlined, easily scalable management and administration of courses and users


Nevertheless, like any other training format, it also has disadvantages and risks
associated with its production and use. Before committing to e-learning, one must
consider the following:

• Access sometimes varies based on user capabilities

• Internet bandwidth limitations and slow connection speeds sometimes hamper performance

• User reaction and participation often depend on the level of individual com-
puter literacy

• Development costs can exceed initial estimates unless clear production goals
are established

• Not all content is suitable for delivery via e-learning

• The loss of human instructor contact may be disconcerting to users

• Industry standards for development and delivery are still emerging

• Implementation is challenging if not well-planned in advance of development

Careful thought and planning must go into a decision to purchase, implement, and
utilize e-Learning products, whether those bought off-the-shelf or those customized for
specific purposes. This is often referred to as the “build or buy” decision. In either
case, organizations considering e-Learning should conduct a comprehensive analysis
of their needs, goals, education or training plans, and their current infrastructure to
determine if e-Learning is a suitable pursuit.

e-Learning standards
All the major features of e-Learning, i.e., the ability to customize courses, to track
progress, to offer “just-in-time” learning opportunities, are only feasible if the ba-
sic infrastructure of an e-Learning component is designed to be interoperable and
communicates with components from a variety of sources. These elemental units are
usually called “learning objects” and are the basis for the standardization movement.
There are numerous definitions of a learning object, but it is basically a small
“chunk” of learning content that focuses on a specific learning objective. These
learning objects can contain one or many components, or “information objects”,
including text, images, video, or the like. Reusability shall be supported both at the
learning object and at the information object levels, and by standardizing the way
in which these objects are built and indexed, both learning objects and information
objects shall be easy to find and use.


Standardization, currently occurring within the IEEE LTSC working groups, is
acting on 5 areas: data and metadata, content-related issues, learning management
systems and applications, learner-related issues and technical standards.
Data and Metadata: metadata is defined as “information about information”. In
e-Learning, metadata describes learning objects including attributes such as author,
subject, date, etc. Metadata is expected to enable learning objects to be more easily
indexed and stored in content repositories, and subsequently found and retrieved.
The standards in this area:
• Specify the syntax and semantics of learning object metadata, so that the
objects can be described in a standardized format that is independent of the
content itself. The standards also specify the fields to be used when describing
learning objects;

• Facilitate the translation of human languages (to re-purpose the content for
use in different cultures);

• Define a semantic framework that allows the integration of legacy systems and
the development of data exchange formats;

• Provide a common lightweight protocol for exchanging data among clients,
servers and peers.
Content-related issues: standards related to content used in learning exist basically
to inform users of what they are getting, how they are getting it, and the best
way to use it. It is possible to think of these standards as the “How-to” manuals
for learning content. The specific standards focus on:
• The language used to describe and reference the various media components
(e.g., audio, video, animations). This will be useful in establishing the means
for the portability of the components from one system, or tool, to another;

• A mechanism for managing and adapting the presentation of lessons according
to the needs of the learner in order to dynamically create customized instruc-
tional experiences;

• The packaging of learning content for allowing simple transmission and acti-
vation of learning objects.
Learning management systems and applications: Learning management sys-
tems play an important role in the facilitation of a learning object strategy. The
management system serves as a type of gateway where content enters, is assembled
into meaningful lessons based on the learner’s profile, and is presented to the learner,
whose progress is then tracked by the management system. It is crucial, therefore,


that the system is able to operate with content and tools from multiple sources. The
standards developed in this category:

• Allow lessons and courses to move from one computer managed instruction
system (CMI) to another, while maintaining its ease of use and functionality;

• Make it easier for learning technologies to be implemented on various types of
browsers and operating systems;

• Establish a protocol that aids in the communication between the software tools
that a learner is using (e.g. text editors and spreadsheets) and the instructional
software agents that provide guidance to the learner.

Learner-related issues: the main purpose of developing standards is to create a
more effective and efficient way for people to learn using technology. The learner-
related standards focus on creating a connection between the learner and the tech-
nology (and its developers). Groups involved in this effort are working to create
standards that aid in characterizing, identifying, and tracking learners, and in profil-
ing their competencies. The availability of this information enables the development
of more appropriate instruction for the learner. More specifically, the standards deal
with:

• The language used in a “Learner Model” that will maintain a characterization
of the learner, including such attributes as knowledge, skills, abilities, learning
styles, records, and personal information. This creates a sort of electronic
learning portfolio that can be used by the learner throughout his lifetime to
enhance learning experiences;

• A means for identifying learners for sign-on and record-keeping purposes;

• The components of a user-centered system that aid in the process of managing
life-long learning. It will help with goal-setting, planning, execution, tracking,
and documentation in order to provide learners with guidance that help them
achieve independence in reaching goals, as well as provide documentation of
achieved competences.

Technical standards: the standard currently adopted for the exchange of e-Learning
information over the Internet is XML, whereas HTML is the preferred
language for telling a system how to present (format) content on a page. Many
innovations are currently under study, in particular concerning the learning
object characterization and transmission, and they are somewhat related to the
standardization efforts just cited.


Semantics Integration

In order to understand what contribution semantics can provide to e-Learning
applications, a “near future” scenario can be analyzed, extracting the important points
in which the availability of explicit semantics makes the difference with respect to
today's solutions.
Imagine that you are studying Taylor expansions in mathematics. Your teacher
has not yet provided the relevant links to the involved concept in your semantics-
enabled learning framework, so you first enter “Taylor expansions” in the classical
search form provided by the system. The result list shows that Taylor expansions
occurs in several contexts of mathematics, and you decide to have a look at Taylor
expansions in an approximation context, which seems most appropriate for your
current studies.
After having dwelled a while on the different kinds of approximations, you decide
you want to see if there are any appropriate learning resources. Simply listing the
associated resources turns out to return too many results, so you quickly draw a
query for “mathematical resources in Italian related to Taylor expansions that are
on the university level and part of a course in calculus at an Italian university”.
Finding too many resources again, you add the requirement that a professor at
your university must have given a good review of the resource. You find then some
interesting animations provided as part of a similar course at a different university,
where it has been annotated in the personal portfolio of a professor at your university,
and start out with a great QuickTime animation of Taylor expansions in three
dimensions.
The movie player notes that you have a red-green color blindness and adjusts
the animation according to a specification of the color properties of the movie which
was found together with the other descriptions of the movie. After a while you are
getting curious. What, more precisely, are the mechanisms underlying these curves
and surfaces? You decide you need to more interactively manipulate the expansions.
So you take your animation, and drag it to your graphing calculator program, which
retrieves the relevant semantic information from the learning object via the applica-
tion framework, and goes into the Web looking for mathematical descriptions of the
animation. The university, it turns out, never provided the MathML formulas de-
scribing the animations, but the program finds formulas describing a related Taylor
expansion at the MIT OKI site. So it retrieves the formulas, opens an interactive
manipulation window, and lets you experiment.
Your questions concerning Taylor expansions multiply, and you feel the need for some deeper answers. Asking the learning system for knowledge sources at your
own university that have announced interest in helping out with advanced Calculus
matters, you find a fellow student and a few math teachers. Deciding that you want
some input from the student before talking to the teachers, you send him/her some


questions and order your calendaring agent to make an appointment with one of the
teachers in a few days.
A week later you feel confident enough to change the learning objective status
for Taylor expansions in your portfolio from ’active, questions pending’ to ’resting,
but not fully explored’. You also mark your exploration sequence, the conceptual
overviews you produced in discussions with the student and some annotations, as
public in the portfolio. You conclude by registering yourself as a resource on the
level ’beginner’ with a scope restricting the visibility to students at your university
only.
In this scenario some points can be identified where semantics integration plays
a crucial role:

• Distributed material and distributed searches. Here semantics helps by eliminating vocabulary-related issues as well as interoperability issues. Describing learning objects through well known standard languages such as XML, RDF/S and OWL allows rapid exchange of information between different applications. Moreover, the ability to share a common model of the domain knowledge, or the ability to link each application model (ontology) to a well known and shared ontology, enables easy interoperation between learning systems and frameworks.

• Combination of metadata schemas, for example personal information and content descriptions. In a fully semantic learning framework the information about the user's profile and preferences can be integrated as semantic filters for the retrieval process, making it possible to restrict the search results to the subset that satisfies both the user's information need and the user's fruition model.

• Machine-understandable semantics of metadata, so that a machine is able to correctly interpret constraints (calendaring information, for example) and to find the correct resources among the available learning objects and learning information.

• Human-understandable classification of metadata. The user is able to directly specify the context of searches (by clicking the involved concepts) and the persons with whom he/she wants to interact, and is able to understand the classification of available resources, thus being able to select the best suited results.

• Interoperability between tools. As the basic semantics of RDF/S and OWL must be understood by every application able to manipulate these representations, interoperability is automatically granted.

• Distributed annotation of any resource by anyone, in this case using digital portfolios. Every resource in RDF/S and OWL has a unique identifier, therefore annotations can be about whatever is needed. Attributes can define the annotation author, the annotator's trust level, experience, etc., offering the basic infrastructure for building a potentially world-wide collaboration framework (a minimal sketch of such an annotation record is given after this list).

• Personalization of tools, queries and interfaces, affecting the experience in several ways. Semantic metadata is not only focused on describing the content of learning materials, it can also be used to describe the physical features of the same data. So, for example, a video can be recognized as problematic for red-green blindness, and its color parameters can be adjusted according to a user profile in order to avoid vision problems.

• Competency declaration and discovery for personal contacts. A complete semantic characterization of e-Learning environments also includes information about levels of competence and trust mechanisms. Users are therefore able to design their own interactions by selecting the profiles of other users/teachers on the basis of human values such as competence, trust, sympathy, kindness, etc.
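To make the distributed-annotation point more concrete, a minimal sketch is reported below, written in Python with the rdflib library, of an annotation record about a non-owned resource carrying author, competence and trust attributes. All namespaces, property names and values, apart from Dublin Core, are hypothetical and serve illustration purposes only; they are not part of any standard vocabulary nor of a specific e-Learning platform.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# Hypothetical annotation vocabulary ("ANN"); DC is the real Dublin Core namespace.
ANN = Namespace("http://example.org/annotation#")
DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
g.bind("ann", ANN)
g.bind("dc", DC)

annotation = URIRef("http://example.org/portfolio/ann-42")
g.add((annotation, RDF.type, ANN.Annotation))
# The annotated resource and the ontology concept are plain URIs, so anyone
# can annotate any resource, whether owned or not.
g.add((annotation, ANN.about, URIRef("http://another.univ.example/calculus/taylor-anim")))
g.add((annotation, ANN.topic, URIRef("http://example.org/math-onto#TaylorExpansion")))
# Attributes describing the annotator: author, declared competence, trust level.
g.add((annotation, DC.creator, Literal("prof. Rossi")))
g.add((annotation, ANN.competence, Literal("expert")))
g.add((annotation, ANN.trust, Literal(0.9, datatype=XSD.double)))

print(g.serialize(format="turtle"))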

Chapter 4

Requirements for Semantic Web Applications

The integration of semantic functionalities in real world web applications requires a careful analysis of the requirements that such functionalities should satisfy, finding the best trade-off between user requirements, site publisher requirements and developer requirements. Requirements are usually categorized as functional or non-functional. The former express actions that a system should perform, defining both the stimulus and the expected response (input and output); in other words, they identify the things that the system shall do. Non-functional requirements, instead, address the different facets of system deployment such as performance, usability, robustness, security, hardware and so on.
In order to be able to design a successful integration of semantic functionalities
into web applications, requirements must be gathered by involving several groups
of people: developers, end users, site publishers, content editors, etc. Requirements are, in fact, a contract between these diverse groups of people, and a proper representation of the involved parties is important.

4.1 Functional requirements


Requirements for semantic-aware web sites and applications have been gathered, in
this thesis, by interviewing the different actors involved in the site business logic:
publishers, authors, developers. These requirements have then been prioritized ac-
cording to user’s and publisher’s needs. A separation has been derived between
functionalities actually needed and extensions or innovations desired/expected from
the availability of semantic processing tools in the standard site development and
publication work flow. The output of the entire process was as follows:
1. The system shall evaluate the relatedness of two resources on the basis of their content, in terms of ontology concepts. A page about wheel-chairs and a page about physical barriers shall be related, since the presence of physical barriers prevents access to a given facility for people using wheel-chairs.

2. The system shall serve different users, of different nationalities: all required
services must support the use of different languages.

3. The system shall enable cross-lingual queries, i.e. queries written in a given
language that provide results in a different language(s). As an example, a user
might specify a query in English and require results in English, Italian and
French.

4. The system shall provide a directory-like view of the ontology, for the knowl-
edge domain in which it works.

5. When a user selects a category label, a semantic web application shall provide the resources classified as “pertaining to” the category (ontology concept).

6. A semantic site shall provide semantic what's related functionalities: whenever a user browses a page, the publication system shall provide a selection of pages (no more than 10) that are related to the viewed page.

7. The system shall provide a classical search (textual search) functionality based
on resource contents. That is to say, it shall provide a search engine able to
work on synonyms and to provide results even if the words in the query do
not occur in the indexed resources.

8. The system shall allow for manual classification of resources with respect to
a given conceptual domain. The content editor (journalist) must therefore be
enabled to easily specify the concepts for which a resource is relevant.

9. For each selectable concept, the system shall provide a label and an extended
description, localized in the user language. This facilitates the user compre-
hension of the domain model, thus reducing misclassified resources.

10. The system shall facilitate manual classification of resources by providing the
available concepts for classification (the old keywords, in a sense) and by only
allowing selection of concepts actually present in the ontology.

11. The system shall support semi-automatic classification of resources, by providing suggestions for the possible classification of a given resource.

12. The system should support automatic classification of resources.


13. The system shall be able to classify both owned and non-owned resources.
Therefore a site using the system should be able to offer conceptual searches
both on its internal resources and on resources of other related sites.
14. In response to a specific user setting, the system should provide semantic and
transparent search functionalities. In other words, the system should perform
searches in background, while the user is surfing the site, taking advantage
of the information coming from the user navigation to generate meaningful
queries.
15. In the transparent search mode, the system should provide additional infor-
mation, i.e. retrieved pages, if and only if the relevance of results, as perceived
by the system, is over a reasonable threshold. Such a threshold should be set
by the user and can be modified by the user during the site navigation (in a
way like the “Google Personalized” system).
16. The system can provide a relevance feedback plug-in for the most diffused web
browsers (Mozilla and Internet Explorer at least) where the user can view the
classification of visited pages as deduced by the system and can correct them.
17. The system should guide the logical organization of a site, by proposing a
suitable location for a new page in the site map, depending on the conceptual
classification of that resource. For example, a new page about municipality
aids for people with disabilities should be easily accessible from the pages
about the municipality services and from the pages about disability aids.
The highest priority needs resulting from the gathering phase concern the search-related part of web sites, mainly because of the poor performance of current syntactic search engines. Performance is perceived as poor despite the high precision and recall values of such engines, because of the syntactic nature of this technology: syntactic engines are not able to retrieve resources on the basis of their conceptual content, but can only address retrieval through the occurrence of a finite set of keywords, which may not appear in a user query.
The first requirement states that the similarity between two resources and the
similarity between queries and resources must be evaluated on the basis of resource
content in a semantics rich way. The immediately following requirement is, surpris-
ingly, not directly concerned with the search task but deals with multilingualism.
In the era of global services the ability to handle service requests and responses
in different languages is perceived as a critical factor. Moreover the capability to
perform cross-language operations, i.e. operations whose trigger is in one language and whose result is in another, is an added value that many web operators would like to offer to their users. Multilingualism is, in a sense, completely independent from the adoption of semantics in web applications; however, it is much more


simple to obtain when the main business of searching, classifying and matching is
performed at a conceptual level, which is language independent.

Requirements 4 to 7 are again related to search functionalities: in a few words, they state that the limits of syntactic search engines can be overcome by adding semantics to the resource classification. These new semantic-based search engines must adopt new, powerful interfaces able to exploit their full potential. Such interfaces must not be too dissimilar from the traditional ones, in order to ease user interaction. The envisioned interaction models therefore include: the classical keyword-based query interface, the directory search and only one “new” interaction paradigm, called semantic what's related, in which, for each page requested by a given user, a set of links to related pages is provided, where the relation between pages is defined on the basis of their conceptual descriptions.

The other side of the coin, i.e. the classification task, is addressed by requirements 8 to 13, which refer to three different degrees of automation in the semantic classification of resources. The first of these requirements assumes that, to provide results relevant to a human user, the classification must be performed by a human: he/she manually annotates the published resources by creating associations between such resources and a model of the conceptual domain into which the web application is deployed (the ontology).

The requirements that immediately follow build on the assumption, stated by requirement 8, that human classification is actually the only way to ensure the provision of “meaningful” results. However, this task can sometimes be overwhelming, especially if the conceptual domain is vast and complex. In such cases, intelligent systems must be of aid by doing the hard work and by involving humans only for approval, modification or rejection of automatically extracted classifications.

Requirements 12 and 13 tackle those situations in which the content publisher cannot carry out the classification task, for example because he is not the author of the resources. In such cases, the system is entrusted with automatically categorizing the resources, providing classification results similar to the ones that a human would provide for the same resources.

Finally, requirements 14 to 17 refer to functionalities that would constitute an added value for semantic web applications but that are not critical, i.e. that are not compulsory for the successful integration of semantic functionalities into web sites. Such requirements have not been addressed in the work presented in this document; however, they are the object of the future work directions reported in the final section of the thesis.


4.2 Use Cases for Functional Requirements


In this section three interaction scenarios between a semantic web application and a user (in a broad sense) are reported, addressing in more detail the “semantic what's related”, the “directory search” and the “semi-automatic classification” tasks. The widely adopted formalism of use cases is employed.

4.2.1 The “Semantic what’s related”


The use case (Figure 4.1) unfolds as follows: a user browses a given page on the web site under examination; the publication system detects the requested page and then extracts the conceptual description (index) associated to the page.
Before providing the page to the user's browser, the system queries its internal knowledge base for resources tagged with conceptual descriptions similar to the one of the page being requested. It then evaluates the relevance of the retrieved results and filters out the ones below a given confidence threshold. In the end, it ranks the remaining pages using a semantic similarity measure and appends to the page requested by the user a set of links to conceptually similar resources.
The number of provided links shall be designed not to impact the page fruition process too much: the provided information must not divert the user's attention from the page content. Moreover, the number of provided links must be small enough to be managed by the user, say no more than ten, according to standard usability rules.
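As a purely illustrative sketch of the pipeline just described, the following Python fragment shows one possible way of implementing the what's related step. The cosine measure, the 0.5 threshold and the dictionary-based conceptual index are placeholder assumptions, not the actual similarity measure or data structures adopted by the platform.

from typing import Dict, List, Tuple

# A conceptual description (index) maps ontology concept IDs to weights.
ConceptIndex = Dict[str, float]

def cosine_similarity(a: ConceptIndex, b: ConceptIndex) -> float:
    """One possible semantic similarity measure between two conceptual indexes."""
    common = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in common)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def whats_related(page_index: ConceptIndex,
                  knowledge_base: Dict[str, ConceptIndex],
                  threshold: float = 0.5,
                  max_links: int = 10) -> List[Tuple[str, float]]:
    """Return at most `max_links` pages whose conceptual description is similar
    to the requested page, filtering out low-confidence matches first."""
    scored = [(url, cosine_similarity(page_index, idx))
              for url, idx in knowledge_base.items()]
    relevant = [(url, score) for url, score in scored if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return relevant[:max_links]

A publication system would invoke whats_related with the index of the page being served and append the returned links to the generated page.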

4.2.2 The “Directory search”


The directory search is an almost standard and well known way of performing
searches on the web: the majority of web search engines such as Google [9], Ya-
hoo [11] and Altavista [12] adopt this interaction paradigm for providing thematic
access to indexed resources.
A semantic directory is similar to a classical directory: from the user's point of view the changes are negligible; however, in this case, resources belong to categories in a more dynamic way. The directory is, in fact, a tree representation of an ontology, obtained by taking into account only hierarchical relationships. Since the association between resources and ontology concepts is “fuzzy”, i.e. resources can refer to different ontology concepts, possibly with different association strengths, the same resource can belong to different directory branches depending on its conceptual description, also taking into account the non-hierarchical relationships defined in the ontology.
Figure 4.1. The “What’s related” use case.

The use case (Figure 4.2) unfolds as follows: a user searches the semantics-aware web site for pages about “disability”. In order to perform this task, he/she selects the directory service of the site, where resources are subdivided into homo-
geneous thematic sets. He/she searches for “disability” and subsequently selects
the corresponding directory entry for retrieving resources. The system receives the
request, maps the directory entry to the corresponding ontology concepts and ex-
tracts the most relevant resources with respect to that query. In other words, the
system identifies the involved ontology concepts and searches its knowledge base for
resources annotated as “about” those concepts. Then it ranks the available data
according to the semantic similarity with the user query (the selected concepts) and
finally it provides the results as a set of hyperlinks.
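Under the same illustrative assumptions as the previous sketch (hypothetical names, weighted triples standing in for the actual knowledge base), the directory search step could be sketched in Python as follows.

from typing import Dict, List, Tuple

def directory_search(directory_entry: str,
                     entry_to_concepts: Dict[str, List[str]],
                     annotations: List[Tuple[str, str, float]]) -> List[Tuple[str, float]]:
    """Map a directory entry (e.g. "disability") to its ontology concepts,
    collect the resources annotated as being "about" those concepts and
    rank them by cumulative annotation weight.

    `annotations` is a list of (resource_url, concept_id, weight) triples."""
    concepts = set(entry_to_concepts.get(directory_entry, []))
    scores: Dict[str, float] = {}
    for url, concept, weight in annotations:
        if concept in concepts:
            scores[url] = scores.get(url, 0.0) + weight
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)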

4.2.3 The “Semi-automatic classification”


In this use case the actor is the content redactor of a semantic web site. The redactor
accomplishes the daily tasks of writing new content, reviewing pages that are in the waiting list for publication, and establishing the next steps in the site editorial activity.

Figure 4.2. The “Category search” use case.

This use case (Figure 4.3) is focused on the process of creating new content and is specifically devoted to clarifying the content classification phase.
When a redactor has completed the editing of a new resource (article), he needs
to classify the new resource, taking into account the site knowledge base, in order
to allow users to retrieve and navigate the site pages according to their conceptual
content (see the previous use cases).
The classification is a manual process in this use case: the redactor has access
to the ontology and is required to select the concepts that are relevant with respect
to the newly created content, possibly specifying a measure of “relatedness” in the
range between 0% and 100%. To perform such a task, he can navigate a tree-like ontology representation and select the relevant concepts.
For each concept, a simple interface allows the redactor to specify the degree of correlation between the article being classified and the concept, providing four choices for the relation strength: low (25%), middle (50%), high (75%) and very high (100%).
As the ontology could contain hundreds or thousands of different and complexly
related concepts, the classification task easily becomes infeasible both in terms of
time consumption and of cognitive overload on the redactor. The system must
therefore provide some aids to the classifier, possibly retaining as much as possible
the quality of the classification.
This can be done by quickly scanning the resource under examination and by
providing suggestions to the redactor. The system does not need to perform a full
and accurate classification of the resource being edited but needs only to extract the
most relevant conceptual areas that seem to be related to that resource. It also has the ability to modify the ontology navigation interface by highlighting the suggested concepts, thus allowing its human counterpart to refine the proposed annotations or even to ignore them and proceed with the usual, manual, classification.
Once all the relevant ontology concepts have been selected, the redactor submits the new article to the system, which stores both the content and the classification, allowing users to semantically query the site for relevant resources, including the one just loaded.
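The following Python sketch illustrates, under stated assumptions, the two sides of this use case: the mapping of the four relation-strength choices to numeric weights, and a deliberately naive keyword-matching heuristic that stands in for the platform's actual suggestion mechanism. All function and field names are hypothetical.

from typing import Dict, List, Tuple

# The four relation strengths offered to the redactor.
STRENGTH = {"low": 0.25, "middle": 0.50, "high": 0.75, "very high": 1.00}

def suggest_concepts(text: str,
                     synsets: Dict[str, List[str]],
                     max_suggestions: int = 10) -> List[Tuple[str, int]]:
    """Quickly scan the article and return the concepts whose synset words
    occur in it, ordered by number of (substring) matches, so that the
    interface can highlight them for the redactor to confirm, refine or ignore."""
    text_lower = text.lower()
    hits = []
    for concept, synset in synsets.items():
        count = sum(text_lower.count(word.lower()) for word in synset)
        if count:
            hits.append((concept, count))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:max_suggestions]

def manual_annotation(resource_id: str, concept: str, strength_label: str) -> dict:
    """Build the annotation record produced when the redactor picks a concept."""
    return {"resource": resource_id, "concept": concept,
            "weight": STRENGTH[strength_label], "author": "redactor"}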

4.3 Non-functional requirements


Non-functional requirements are a fundamental part of the design process, in any application, since they guide the design by defining the physical constraints that fix the boundaries of the application with respect to its deployment in the final working scenario. They are usually subdivided into different homogeneous areas such as usability, performance, security, robustness, etc.
In this thesis, the main focus is on usability and performance requirements, because one of the goals of this work is to provide a system for easily adding semantics to already deployed web sites. Clearly, security issues and system robustness are important as well; however, the scope of this work is much more focused on providing ready-to-use technology rather than absolutely “safe” solutions. The design priority, in other words, is to provide a system for immediate use, trying to overcome, albeit in a limited scenario, the “semantic exploitation” problem.
Gathered requirements reflect exactly this choice and are subdivided into “usability” requirements and “performance” requirements. The former are:

1. The system deployment should virtually not require any downtime of the publication system that is being integrated, or should at least constrain the downtime, in normal operational conditions, to a few hours.

Figure 4.3. The “Semi-automatic classification” use case.

2. The system should allow the integration of semantic functionalities while reducing as much as possible the additional effort required for specifying semantic descriptions and, more generally, reducing the changes in the publication workflow from the user/redactor point of view.

3. The system should be integrable into different site publication systems, independently of the server-side technology adopted for content publication.

4. The system should allow the reuse of already existing technologies: databases,
servers, etc.

5. The system should be platform independent.

6. The system should allow the manipulation of different data formats, at least
HTML, XHTML and plain text.

7. The system should be extensible for incorporating new functionalities and for
allowing the handling of different resource types such as multimedia resources.


The latter, instead, include the following:

1. The system should scale up from sites with a few pages (around one hundred) to really big sites with thousands of pages.

2. The system should be time effective, i.e., it should provide results in a reasonable time, even when it is overloaded by several concurrent requests. The response time is effective if contained within the user attention time-frame, which for this kind of application is around ten seconds.

The above reported requirements for semantics integration in today's web applications can essentially be summed up by the words “easy integration”. The highest priority requirements state, in fact: a semantics-aware system shall be deployable without requiring extra downtime of the site publication system that it integrates; a semantic system should be transparent, i.e. its presence should be unnoticeable for both editors/publishers and end users; a semantic system shall not conflict with already deployed technologies and shall be accessible from any publication framework.
These requirements fall in the “usability” class; in the same class there are also some low priority requirements that represent further evolutions or specifications of what is intended by the high priority ones.
From these, in fact, emerge the need to reuse existing technologies such as databases, web servers, etc. as much as possible (see requirement 4) and the need to design platform independent systems (requirement 5). Finally, some requirements tackle the ability of a semantic system to handle many media. People working on the web (content publishers, editors, redactors, users) are in fact more and more aware of the impact that new media have on clients and end users. As a consequence, while traditional technologies such as HTML pages and their evolutions (DHTML, XHTML, ...) are naturally included in semantic elaboration, new information means such as audio streams, videos, DVDs and multimedia in general shall also be taken into account and possibly supported.
Performance issues also deserve some attention in the design process: in fact, once the other requirements are fixed, performance can considerably affect the effectiveness of a semantic system and can strongly influence its adoption in real world applications. Scalability and timely responses (requirements 1 and 2 under performance) are therefore high priority requirements that must be satisfied to fill the gap that still persists between academic applications and real world solutions.

Chapter 5

The H-DOSE platform: logical architecture

This chapter introduces the H-DOSE logical architecture, and uses such architecture as a guide for discussing the basic principles and assumptions on which the platform is built. For every innovative principle, the strong points are highlighted together with the weaknesses that emerged either during the presentation of such elements at international conferences and workshops or during the H-DOSE design and development process.

The requirements analyzed in the previous chapter are at the basis of the design
and implementation of a semantic web platform specifically targeted at offering low-
cost semantics integration for already deployed web sites and applications. Such
an integration is specifically oriented to information-handling applications such as
CMSs, Information Retrieval systems and e-Learning systems.
Designing a complete semantic framework involves roughly two levels of abstrac-
tion, the first one being concerned with the so-called logical design while the second
is more focused on practical implementation issues and is called deployment design.
In this chapter the logical design of the H-DOSE platform is addressed, while the deployment design is tackled in Chapter 6. H-DOSE stands for Holistic Distributed Semantic Elaboration Platform; the name comes from “Holistic DOSE”, since the platform integrates and reconciles different points of view (Semantic Web, web services, multi-agent systems, etc.), and it is commonly pronounced “High-Dose”. The reasons for calling it holistic will become clear in the following sections.


5.1 The basic components of the H-DOSE semantic platform
Describing a semantic platform in a sound, complete and unique way is nearly impossible, since there are many opinions and ideas on what services a semantic platform should provide. In this thesis, a semantic platform is not considered
as a general purpose framework but as a solution strongly oriented to information
management. In particular, it is assumed that the main scope of the described
platform is to provide support for indexing, classification and retrieval of web pages,
either written in HTML, XHTML or in plain text.
Under such quite restrictive conditions a semantic platform can be depicted
as composed of three main elements: one or more ontologies, a set of semantic
descriptions (or annotations) and a set of textual resources (Figure 5.1).

Figure 5.1. The logical architecture of the H-DOSE semantic platform.

Recalling the definition given in Chapter 2, an ontology can be defined as “an explicit and formal specification of a conceptualization”. It is composed of concepts (or classes) from a given knowledge domain, and of relationships that relate the ontology classes to each other. The whole combination of concepts and relationships models a knowledge domain and defines the set of topics that a semantic platform can manipulate.


The first difference between a syntactic information management application and a semantic one is the scope of the managed resources. While the former can manage virtually any resource, provided that appropriate keywords exist, the latter cannot. In fact, the ontology that provides the means for abstracting information processes from syntax, achieving terminology independence, also fixes the limits within which such abstraction can work. Ontology-based applications are therefore domain specific, while syntactic applications can be all-encompassing. However, in the scope of this thesis, this is not a limitation, since CMSs, e-Learning systems and the related Information Retrieval systems are naturally domain specific. After all, the on-line learning facilities of a university, for example, provide data corresponding to the courses offered by the same institution, which are always well known and limited to a single, even if broad, knowledge domain.
Resources are, in the most general sense, “all things about which someone wants to tell something”. In a semantic platform this general definition is restricted to “all the resources, in a given knowledge domain, about which something can be said by someone”. So the first restriction is again on the domain, which must be specific, as defined by the platform ontology. In addition, in this thesis a more restrictive assumption is made: resources are “textual resources, either written in HTML, XHTML or in plain text, that shall be published on a web site or that shall be used by a web application”. For the rest of this document, then, resources will be texts, unless otherwise specified.
Finally, descriptions, or annotations, are the “semantic bridges” between syntactic resources, i.e. texts, and semantic entities, i.e. ontology concepts. Annotations can be expressed as simple triples in the form “the text A is about the concept C”, or they can include more complex information such as, for example, the strength of the “about” relationship. Annotations, in the working scenario of this dissertation, are defined either by journalists/redactors or by the semantic platform. They define the mutual relations and similarities between resources, and between resources and queries. Final users of the semantic web application are, in general, not allowed to define annotations. This assumption makes it possible, in the subsequent sections, to leave aside trust-related issues and the like.

5.1.1 Ontology
The H-DOSE semantic platform considers multilingualism a critical issue that must be addressed, as stated by functional requirements 2 and 3 reported in Chapter 4, Section 4.1.
Multilingualism issues in ontology-based applications are still active research
topics in the Semantic Web community. For solving them two main approaches
are currently under investigation: the first is based on the integration of language-
specific ontologies via ontology merging techniques, while the second assumes that

61
5 – The H-DOSE platform: logical architecture

a common set of concepts exists, that could be shared by different languages. Ac-
cording to the first approach, people speaking different languages will model a given
domain area by defining different ontologies that will be merged to provide a multi-
lingual semantic environment. The resource requirements for language management
will be comparable, in term of human and hardware resources, to that of ontology
merging, actually being the same task.
Moreover, as many aspects of the predefined knowledge area will be common to all languages, many redundant “synonymous” relationships between language-specific ontologies will be defined, increasing resource waste and complexity. Developing a multilingual semantic framework using the methodology exposed above therefore involves the risk of getting an unmanageable entity as an outcome, in which great care is required to define relationships between “equivalent ontologies” and to track changes and coherently update those relations.
The second methodology addresses multilingualism issues by using a holistic approach: when a common knowledge domain is modeled, it is likely that most domain experts, even working in different languages, can identify a common “core” set of concepts. A single, language independent ontology could therefore be used to model such an area, sharing concepts between different languages.
In the initial phase, this new stream of research was affected by a common
misunderstanding, which was strictly related to concept naming. Many times, in
fact, the concept name and/or its definition is considered to be the concept itself, while the latter is actually an abstract entity to which people refer by using words.
A good definition of concept can be: “a well defined sequence of mental pro-
cesses”. In fact, words are only used as a trigger to the mental association that lets
us identify and instantiate concepts. In other words, a generic concept, a dog for
example, can be correctly identified using any alphanumeric string without any loss
of generality. On the other hand, ontology designers usually define concepts using
words, or sequences of words, in order to easily identify the meaning of the described
entity. Naming concepts with language-specific descriptions is just a usability trick rather than a strict requirement, and should have no semantic implications.
An ontology is by definition language independent, while its instantiation in a specific idiom is effectively achieved by adopting a proper set of textual descriptions for each concept, in each supported language. Such an approach is inherently scalable, as new languages can easily be supported by integrating new lexical entities and definitions, and possibly by slightly restructuring a small number of ontology nodes. Moreover, special purpose relationships could be defined (e.g., links to Wordnet [13] entities), providing ground for sophisticated functionalities.
The H-DOSE approach, first presented at SAC 2004 (the ACM Symposium on Applied Computing, held in Cyprus), uses a language independent ontology where concepts


are defined as high-level entities for which language dependent definitions are spec-
ified. Such semantic entities are linked to a set of different definitions, one for each
supported language, and to a set of words called “synset”. Operationally they can
be defined as:
concept ::= concept ID, lex
lex ::= (lang ID, description, synset)+
synset ::= (word)+
A concept definition is a short, human readable, text that identifies the concept
meaning as clearly as possible, and that is expressed in a specific language. A synset,
instead, is composed of a set of near-synonymous words that humans usually adopt
to identify the concept.
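A minimal Python rendering of this operational definition is reported below; the class and field names are hypothetical, and the example concept with its two lexicalizations is invented purely for illustration.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Lexicalization:
    """Language-dependent layer attached to a language-independent concept."""
    description: str   # short human-readable definition in that language
    synset: List[str]  # near-synonymous words used to denote the concept

@dataclass
class Concept:
    concept_id: str                                               # language-independent identifier
    lex: Dict[str, Lexicalization] = field(default_factory=dict)  # keyed by language ID

# Example: a "general" concept, lexicalized in two languages.
taylor = Concept("concept:taylor_expansion")
taylor.lex["en"] = Lexicalization(
    "Approximation of a function by a polynomial built from its derivatives",
    ["Taylor expansion", "Taylor series"])
taylor.lex["it"] = Lexicalization(
    "Approssimazione di una funzione con un polinomio costruito dalle sue derivate",
    ["sviluppo di Taylor", "serie di Taylor"])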
The process of linking ontology concepts and lexical entities can be deployed
according to three different approaches: integration, annotation and hybrid. In the
integration approach, lexical entities are included in the ontology as new semantic
entities, a sort of special instances defined for each ontology concept. This approach
makes automated reasoning on lexical entities much more easier. The disadvantage
is that, whenever a new lexical entity must be added or an already existing term
must be modified, the entire ontology is involved.
The annotation approach solves this shortcoming by keeping separate the ontol-
ogy and the lexical entities. Lexical entities, which can be, for example, the senses
of a Wordnet-like lexical network, are then tagged as “related” to a given ontology
concept. Tagging makes it possible to modify, update or even delete lexical entities without
modifying the ontology.
The latter hybrid approach mixes the first two approaches by keeping separate
lexical entities and ontology concepts, and by restructuring the networks between
lexical entities (Wordnet) in order to better reflect the ontology view of the given
knowledge domain. This approach is a little more flexible than the integration
approach, since modifications in lexical entities do not necessarily impact on the
ontology definition. However, the allowed changes can only be relatively small,
otherwise the structure of lexical entities would likely require a revision in order to
keep reflecting the domain view imposed by the ontology.
H-DOSE adopts the annotation approach by keeping the ontology physically
distinct from definitions and synsets. This allows a separate management of concepts
and language-specific information, and a complete isolation of the semantic and the
textual layers. Language-specific semantic gaps are supported by including, for some concepts, the definition and synsets in the relevant languages only. This assumption
guarantees sufficient expressive power to model conceptual entities typical of each
language and, at the same time, reduces redundancy by collapsing all common
concepts into a single multilingual entity. The final resource occupation is, to a great extent, comparable to that of a monolingual framework, thanks to redundancy elimination, while the expressive power is as effective as needed.
Synsets and textual definitions should be created by human experts through an
iterative refinement process. A multilingual team works on concept definitions by
comparing ideas and intentions, aided by domain experts with linguistic skills for at
least two different languages, and formalizes topics in a mutual learning cycle. This
interaction cycle produces, at the end, two sets of concepts: general concepts and
language-specific concepts (Figure 5.2).

Figure 5.2. The H-DOSE approach to multilingualism.

The two sets are modeled in the same way inside the ontology. However, concepts
belonging to the first category will be linked to definitions and synsets expressed in each
supported language, while those belonging to the second set will be linked to smaller
subsets of languages.
The complex interaction between ontology designers, users and domain experts
required by this approach at design time must build upon the availability of an inter-
national network in which people cooperate to model a defined knowledge domain.
Such kinds of networks have already been proposed, for example in the EU Socrates
Minerva CABLE project [3]. CABLE involves a group of partners with proven skills
in learning and education and promotes cooperation to define learning materials
and case studies for continuous education of social workers. In CABLE, teams of
experts in social sciences and education cooperate, with the support of “multilin-
gual” domain experts, in defining case studies, teaching procedures and identifying
the related semantics in a so-called “virtuous cycle”. The CABLE project can ef-
fectively apply the H-DOSE approach to multilingualism, leveraging its “virtuous


cycles” and defining a multilingual ontology for education in social sciences. The
resulting ontology, definitions and synsets, can constitute an effective core for the
implementation of multilingual, semantic, e-Learning environments.

Formal definitions
An H-DOSE ontology is formally defined as a set O of concepts c ∈ C and relations
r ∈ R, together with a set of labels L, descriptions E and words S associated to the
concepts:
O : {C, R, χ, L, E, ψ}
where
C : concepts c ∈ C
R : relations r ∈ R, with R ⊆ C × C
χ : R → [0,1]
W_lang : words of a language “lang”, w ∈ W_lang
S : sets of near-synonymous words s ∈ S, with s ⊆ C × W_lang
ψ : S → [0,1]
L : labels of concepts l ∈ L, with l ⊆ C × W_lang
E : descriptions of concepts e ∈ E, with e ⊆ C × W_lang
C is the set of concepts c in the ontology O, while R is the set of relations r relating ontology concepts to each other with a given strength χ ranging between 0 and 1. W_lang is, instead, the set of all possible words w in a given language; S is the set of synsets of near-synonymous words associated to each ontology concept c, with an association strength evaluated by ψ in the range between 0 and 1. L is the set of labels l associated to a given ontology concept c in a given language, and E is the set of descriptions e associated to such concepts in the same, given, language.
Ontologies, in H-DOSE, shall conform to this approach; however, they are allowed to refer or to link to other already existing ontologies. In such a case, multilingualism is supported only for the internal ontology, or rather, only for the ontologies adopting the above format. Multilingualism support for external ontologies depends on how such a concern has been addressed by the designers of the external models.


5.1.2 Resources
Resources in the H-DOSE platform are considered to be texts, written either in HTML, XHTML or plain text. The main motivation for such an assumption is that the platform has been designed for supplying semantic services to today's web applications, and texts are currently the most widespread resources on the Web. However, in order to take into account the 7th non-functional requirement for a semantic platform, resource support is designed to be easily extensible to the multimedia case. In H-DOSE, therefore, a resource is “something, in the platform knowledge domain, about which some information can be provided”, where the term “something” is assumed to be a document, either textual or, in a future extension, multimedia. Documents can be entire web pages but they can also be simple, homogeneous chunks of text (or video, in the near future). In this last case, support for defining relationships between fragments is provided, making it possible to specify which fragment belongs to which page.
Semantic annotation and retrieval of fragments is one of the innovative points of the H-DOSE platform. Fragments, in fact, make it possible to take into account different levels of granularity in the classification and retrieval of resources. This allows, for example, the extraction and retrieval of only those document pieces that are similar to a given user query, eliminating all the disturbing content that usually surrounds relevant information, such as banners, links, navigation menus, etc. Moreover, the ability to specify the mutual relationships between fragments, and between fragments and pages, makes it possible to reduce redundancy as much as possible, while still maintaining the relevance of the provided results.
So, if many fragments of a given web page are relevant with respect to a user query, the entire page is retrieved rather than each of its components. In other cases, if a document about the life of the jaguar is well organized into sections, including, for example, an introduction, some more detailed sections and a conclusion, the retrieved result may vary depending on the level of granularity of the user query. Therefore, if a user asks for “documents about the life of the jaguar”, only the introduction may be retrieved, while if the query is more specific, e.g. “documents about jaguar nutrition in the Bangladesh jungle”, or if the user chooses to deepen the previous query, the H-DOSE platform can retrieve the internal sections of the document, better adapting the results to the user query.
Besides the granularity with which resources can be manipulated by H-DOSE, the storage policy also deserves some attention. In H-DOSE, documents are not managed directly by the platform: that is to say, H-DOSE does not store indexed documents in its internal database. This design choice makes it possible, on one side, to leverage the already existing and probably more efficient storage facilities of CMSs, learning systems and IR systems. On the other side, it allows the semantic description of resources that do not belong to the site in which the platform is deployed. H-DOSE does not make any assumption on the ownership of annotated resources: they can be owned by whoever deploys the platform as well as by other actors. In any case they can be indexed, classified and retrieved by the platform through the adoption of an external annotation scheme in which resources are identified by means of URLs and XPointers.
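A minimal sketch of such an external resource reference is given below, in Python, with hypothetical field names and an illustrative XPointer expression; it is not the actual H-DOSE annotation format.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ResourceReference:
    """External identification of an indexed resource: the platform never
    stores the document itself, only a pointer to it."""
    url: str                          # where the page lives (possibly on another site)
    xpointer: Optional[str] = None    # fragment address inside the page, if any
    parent_url: Optional[str] = None  # page the fragment belongs to

# A whole page and one of its fragments, both annotatable without owning them.
page = ResourceReference(url="http://other.site.example/jaguar/life.html")
intro = ResourceReference(
    url="http://other.site.example/jaguar/life.html",
    xpointer="xpointer(/html/body/div[@id='intro'])",
    parent_url=page.url)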
With respect to the ontology formalization of the previous section, resources in
H-DOSE are formally described as follows:

D : documents d ∈ D

be they entire pages or fragments.

5.1.3 Annotations
Annotations are the most important component of a semantic platform, since they describe the resources from a conceptual point of view and since they constitute the means for providing concept-based functionalities such as semantic search.
H-DOSE, in accordance with functional requirement 13, adopts an approach that keeps annotations and resources well separated. With this approach several critical issues can be addressed: first, one can annotate non-owned resources, i.e. resources for which the classifier has no editing rights; secondly, annotations are only loosely coupled with the resource they annotate. This allows a page to change its formatting or slightly change its content while the corresponding annotation stays unchanged. On the other hand, if a resource becomes obsolete and is retired, the system can address the issue by simply deleting the corresponding descriptions. Annotations in H-DOSE are formally defined as:

A : annotations a ∈ A

A ⊆ C × D
ρ : A → [0,1]
Where D is the set of all resources d suitable for annotations, A is the set of semantic
annotations relating the resources in D with the ontology concepts in O and ρ is the
association weight between resources and ontology concepts. The weight function
ρ allows to specify different degrees of “relatedness” to ontology concepts, in a
way similar to what does the Vector Space model for classical information retrieval
systems, thus obtaining a flexible way of representing knowledge and of tackling
resource ambiguity. In RDF notation, a H-DOSE annotation looks like:

<hdose:annotation rdf:ID="15643">
  <!-- annotated concept and document, plus the association weight -->
  <hdose:topic rdf:about="#jaguar"/>
  <hdose:document rdf:about="#doc123"/>
  <hdose:weight rdf:datatype="&xsd;double">0.233</hdose:weight>
  <!-- provenance: who created the annotation and how -->
  <dc:author>H-DOSE</dc:author>
  <hdose:type>auto</hdose:type>
</hdose:annotation>

Every resource semantically classified by the platform is pointed to by many annotations, each relating the document to a given concept in the platform ontology with a certain weight. Clearly, the number of annotations per document can easily become very high and difficult to manage, as ontology concepts may be numerous and documents can span a great number of different, related topics. Therefore, in this work, a new knowledge representation is introduced that retains the ability to include information from the ontology structure, and from knowledge discovery processes such as logical inference, through the definition of an expansion operator. Such a representation, first presented at ICTAI 2004 (the International IEEE Conference on Tools with Artificial Intelligence, Boca Raton, Florida), allows all the annotations referring to a resource to be collapsed into a single “Conceptual Spectrum”.
A conceptual spectrum is formally a function mapping a given concept c to
a positive real number σ(c) expressing the relevance weight of such concept with
respect to a given resource:
\sigma : C \to \mathbb{R}^{+}

Together with the ability to merge fuzziness of documents with the crispness of the
ontology specification, spectra are also useful for performing some visual inspection
of knowledge bases. In fact, they can be visualized using the ontology concepts as
the x-axis and the σ(c) values as the corresponding y-values (Figure 5.3).
Since concepts do not possess an implicit ordering relation, the x-axis can be defined
in several ways, allowing the analysis of different aspects of a knowledge base. Depth-
first navigation of the ontology using the “subclassOf” relationship, as an example,
orders similar concepts, at the same granularity, into nearby positions, allowing good
discrimination of the ontology sub-graphs involved in a document annotation.
Breadth-first navigation, instead, allows the detection of the abstraction level of
the indexed resources by putting together concepts lying at the same level in the
ontology. In any case, the resulting graphs have exactly the same capabilities in terms
of expressive power and matching properties, being different only in their visual
interpretation.
Figure 5.3. A “raw” conceptual spectrum (as obtained by simply composing annotations).

A conceptual spectrum σd associated to a document d is a spectrum measuring how strong the association between ontology concepts and the document is, taking into account the contribution of semantic relationships involved in the

knowledge domain. Formally:



\sigma_d : \quad \sigma_d(c) = \sum_{(c',d') \in A,\; c'=c,\; d'=d} \rho(c',d')

i.e., for each ontology concept c, the document conceptual spectrum value is defined
as the sum of ρ contributions extracted from all the annotations associating the
document d with the concept c.
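A small Python sketch of this composition step is reported below; the triple-based representation of annotations is an illustrative assumption and not the platform's actual storage format.

from collections import defaultdict
from typing import Dict, List, Tuple

Annotation = Tuple[str, str, float]   # (concept_id, document_id, rho weight)

def raw_spectrum(document_id: str,
                 annotations: List[Annotation]) -> Dict[str, float]:
    """Collapse all the annotations pointing at `document_id` into a single
    conceptual spectrum: for each concept, sum the rho weights of the
    annotations relating that concept to the document."""
    spectrum: Dict[str, float] = defaultdict(float)
    for concept, document, rho in annotations:
        if document == document_id:
            spectrum[concept] += rho
    return dict(spectrum)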

5.2 Principles of Semantic Resource Retrieval


Several concept-based search services proposed in the Semantic Web community
rely, to a certain degree, on logic inference to extract resources from a given domain
in response to a user query. Basically, they use a reasoning engine to verify an
input clause built from the user query and they subsequently rank retrieved results.
However, some problems arise when dealing with document retrieval where resources
are annotated through a “speak about” relationship rather than being instances of
specific ontology concepts.

5.2.1 Searching for instances


Conceptual search engines provide information to users by working on the concepts involved in the user query and occurring in the indexed resources. Their action is organized in two phases: the first uses inference for extracting conceptually relevant
instances, while the second tries to discriminate retrieved resources, with respect to
the user query or to the kind of retrieval process, assigning a different relevance value

69
5 – The H-DOSE platform: logical architecture

to each of them. The resulting resource set is ranked according to the computed
resource relevance and proposed to the user.
The first phase of a conceptual search includes logic reasoning, or logic inference. The user query is represented as a clause, or as a set of clauses, to be logically satisfied by the facts and axioms defined in the domain ontology, and the knowledge base is subsequently traversed to find suitable instances. In other words, instances and ontology concepts are merged into a cumulative graph that is traversed by the reasoning engine to find a match between the modeled knowledge and the user query. If no match is found, nothing can be deduced except that the knowledge base does not model the given domain with enough information to answer the user query. Otherwise, the set of facts and axioms satisfying the query is provided, allowing the identification of relevant resources.
It is important to notice that all provided results are equally relevant with respect
to the user query since they are all able to satisfy the query logical clauses. However,
discrimination should be performed, according to some external measures, in order
to provide a small, highly relevant set of results to the final user; this issue is
addressed in the second phase.
In that phase an ordering function is defined among the set of resources obtained
through logical inference. As an example, we might want to query our knowledge
base for lovely cats living in our city. The inference process would provide all cat
instances for which the properties “lives in (city)” and “is lovely” hold. However, there is no way to evaluate how lovely a cat is from the logical point of view. Some more information can therefore be taken into account for ranking results according to user needs: the amount of “loveliness”, for instance, expressed as the percentage of persons judging a given cat lovely.
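The two phases can be sketched in a few lines of Python; the toy cat knowledge base, its field names and the loveliness score are invented solely to mirror the example above.

from typing import Callable, Dict, List

def conceptual_instance_search(instances: List[Dict],
                               satisfies_query: Callable[[Dict], bool],
                               relevance: Callable[[Dict], float]) -> List[Dict]:
    """Two-phase search over instances: first keep only the instances that
    logically satisfy the query clauses (all equally 'correct'), then order
    them with an external relevance measure."""
    matching = [inst for inst in instances if satisfies_query(inst)]
    return sorted(matching, key=relevance, reverse=True)

# Toy knowledge base for the "lovely cats living in our city" example.
cats = [
    {"name": "Felix",  "lives_in": "Torino", "is_lovely": True,  "loveliness": 0.8},
    {"name": "Tom",    "lives_in": "Milano", "is_lovely": True,  "loveliness": 0.9},
    {"name": "Grumpy", "lives_in": "Torino", "is_lovely": False, "loveliness": 0.1},
]
query = lambda cat: cat["lives_in"] == "Torino" and cat["is_lovely"]
ranked = conceptual_instance_search(cats, query, lambda cat: cat["loveliness"])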

5.2.2 Dealing with annotations


Although the inference process is effective for instance search, there are some issues that should be addressed when dealing with documents and semantic annotations. Annotations can be defined either by humans, by carefully analyzing the contents of resources with respect to the ontology, or by machines, using information extraction algorithms. In the first case annotations possess a great degree of trustworthiness, since they are defined by “experts”, but they are relatively few, since they are very expensive to create. On the other hand, machine-extracted annotations would be far more numerous, but it is likely that they would be less precise than the human-generated ones.
A relevance weight specified in each annotation predicate therefore makes it possible to take into account those different degrees of trustworthiness and reliability, supporting the definition of different association strengths between documents and the corresponding concepts in the ontology. In other words, since documents generally span topics broader than a single concept, they will be pointed to by a considerable amount of weighted annotations, each relating the document to a given ontology concept.
Annotations, as opposed to instances, deal with knowledge sources that retain a certain degree of fuzziness, which is addressed through the definition of the annotation relevance; however, ontology constraints are still valid, and the domain model of concepts and relationships still needs to be taken into account in order to provide semantics-rich services. In particular, inference is still useful to enable systems to discover previously unknown knowledge; the only further requirement is that the fuzziness of resources and the crispness of logic inference merge together in a common environment.
In H-DOSE the means for such a merging is provided by the “Expansion Operator” defined for conceptual spectra. Considering the spectrum definition provided in Section 5.1.3, it is easy to notice that spectra are simply a way of collapsing many annotations into a single object. They do not take into account the ontology struc-
ture, i.e. the conceptual specification of domain semantics provided by concepts
and relationships in the ontology. Moreover, when annotations are automatically
extracted, they retain a considerable noise component that affects the resource spec-
trum, adding uncertainty to the correctness of the conceptual representation: there
can be wrong annotations, missing annotations, etc.
In order to overcome such an issue it is mandatory to exploit the ontology model
including all available information, with a particular focus on semantic relationships
between modeled concepts. In particular, sub-symbolic information, expressed as
the strength of relationships, must be taken into account in order to correctly eval-
uate conceptual spectra, reducing the sensitivity to the annotation noise. Relation
weights, in fact, make it possible to take into account how much a concept is related to another concept in the ontology, adding quantitative information that is complementary to the logical constraints on ontology navigation defined by the relation semantics
(transitivity, inheritance,...). Such values are therefore critical for an information-
related application since they establish how much a relation correlates two different
concepts and how much such a relation should contribute to the definition of re-
source semantics (spectra). Relation relevance weights must be specified during the
iterative ontology design process and must be validated by domain experts in order
to assess the conceptual coherence of the resulting knowledge base.
To gain a better focus on this issue we could think about conceptual spectrum
components as concept clouds in the ontology. Strongly related concepts are grouped
together by means of relationships, and annotations can be seen as the starting
seed of these groups. Even if the related concepts do not appear in the original
annotation set, due to wrong or missing mappings, they should take part in the
document's conceptual specification, as they are conceptually related to the spectrum.
Therefore a new spectrum operator should be defined to discover the set of clouds
relevant for each specific component of the conceptual spectrum associated to a
given resource, taking into account both non-explicit and sub-symbolic knowledge
such as inferred relationships and relation strengths.
Such an operator is called the Spectrum Expansion operator X, formally defined
as:
X : (C → ℝ⁺) → (C → ℝ⁺)

(Xσ)(c) = σ(c) + Σ_{(c,c′) ∈ R*} χ*(c,c′) · σ(c′)

The expansion operator X thus provides a new conceptual spectrum Xσ as output,
whose value, for each ontology concept c, is defined as the sum of the original
spectrum value σ(c) and of the overall contribution from the concepts c′ related to
c. This last value, in particular, is computed by taking the original spectrum value
for each related concept c′ and by multiplying it by the strength χ* of the relationships
between c and c′ in the transitive closure R* of the space of relationships R.

R* : transitive closure of R

P(c,c′) : path from c to c′ in R

∀(c,c′) ∈ R*,   χ*(c,c′) = max over P(c,c′) of  ∏_{(cᵢ,cⱼ) ∈ P(c,c′)} χ(cᵢ,cⱼ)

The transitive closure R* is accompanied by the definition of the χ* function as the
maximum-strength path between c and c′, where the path strength is computed by
multiplying all the χ values of the relationships included in the path.
In other words, the Spectrum expansion process takes a raw conceptual spec-
trum σ as input, i.e., a spectrum as it comes from the manual or the automatic
annotation process. Then, it processes such a spectrum by analyzing the ontology
and by propagating the relevance weights of each spectrum component, through
both hierarchical and non-hierarchical relationships. X consists, therefore, of graph
navigation on the ontology, where each relation has an associated weight that as-
sesses the conceptual distance between linked concepts, and an orientation that
identifies the directions in which navigation can be performed. The navigation result
is an enhanced spectrum Xσ in which original topics, together with their relevance
weights, appear surrounded by clouds of topics extracted by means of the expansion
operator.
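As an illustration, the following Java sketch shows one possible implementation of the expansion (it is not the H-DOSE code): spectra are assumed to be sparse concept-to-weight maps, relation strengths χ are assumed to lie in (0,1] and to be stored along the directions in which navigation is allowed, and χ* is computed with a Dijkstra-like maximum-product search over the relation graph.

import java.util.*;

/** Sketch of the spectrum expansion operator X; data structures are assumptions. */
public class SpectrumExpander {

    // relation strengths: from-concept -> (to-concept -> chi in (0,1])
    private final Map<String, Map<String, Double>> strengths;

    public SpectrumExpander(Map<String, Map<String, Double>> strengths) {
        this.strengths = strengths;
    }

    /** (Xσ)(c) = σ(c) + sum over related c' of χ*(c,c') · σ(c'). */
    public Map<String, Double> expand(Map<String, Double> sigma) {
        Map<String, Double> expanded = new HashMap<>(sigma);
        for (Map.Entry<String, Double> annotated : sigma.entrySet()) {
            // propagate the contribution of each annotated concept c' along best paths
            Map<String, Double> chiStar = maxProductPaths(annotated.getKey());
            for (Map.Entry<String, Double> reached : chiStar.entrySet()) {
                expanded.merge(reached.getKey(),
                        reached.getValue() * annotated.getValue(), Double::sum);
            }
        }
        return expanded;
    }

    /** χ*(·, source): maximum-product path strength from source to every reachable concept. */
    private Map<String, Double> maxProductPaths(String source) {
        Map<String, Double> best = new HashMap<>();
        PriorityQueue<Map.Entry<String, Double>> queue =
                new PriorityQueue<>((a, b) -> Double.compare(b.getValue(), a.getValue()));
        queue.add(Map.entry(source, 1.0));
        Set<String> done = new HashSet<>();
        while (!queue.isEmpty()) {
            Map.Entry<String, Double> current = queue.poll();
            if (!done.add(current.getKey())) continue;
            for (Map.Entry<String, Double> edge
                    : strengths.getOrDefault(current.getKey(), Map.of()).entrySet()) {
                double strength = current.getValue() * edge.getValue(); // product along the path
                if (strength > best.getOrDefault(edge.getKey(), 0.0)) {
                    best.put(edge.getKey(), strength);
                    queue.add(Map.entry(edge.getKey(), strength));
                }
            }
        }
        best.remove(source); // the source contributes through σ(c) itself, not through χ*
        return best;
    }
}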
Expanded spectra such as the one in Figure 5.4 are much more useful for search tasks,
since they cover every possible nuance of document semantics according to a given
ontology. This makes it possible to retrieve both documents in which the required
concepts directly occur and other, related documents, even if the original query does
not exactly match their raw (non-expanded) semantic classification.

72
5.2 – Principles of Semantic Resource Retrieval

Figure 5.4. The spectrum of Figure 5.3 after the expansion.

5.2.3 Queries
One of the valuable properties of conceptual spectra is that they can be used to
represent documents as well as queries. A query is formally defined as follows:

Q : set of queries, q ∈ Q

Q ⊆ 2^W

where 2^W is the set of all possible combinations of words w in W. Given that for
each ontology concept c a synset s ∈ S is specified, a query conceptual spectrum can
be extracted as reported below:

σ_q(c) = Σ_{(c′,w′) ∈ S ∧ c′=c ∧ w′ ∈ q} ψ(c′,w′)

For each concept c in the ontology, a query spectrum is defined as the sum of
contributions of all query terms w ∈ W modulated by the ψ function associated to
the relation S.
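A minimal Java sketch of how a query spectrum could be built from a keyword query, assuming that each ontology concept carries a weighted synset (word → ψ weight); the data structures and names are illustrative assumptions, not the H-DOSE API.

import java.util.*;

/** Sketch: σ_q(c) as the sum of ψ(c,w) over the query words w found in the synset of c. */
public class QuerySpectrumBuilder {

    // conceptUri -> (word -> ψ weight of that word for the concept)
    private final Map<String, Map<String, Double>> synsets;

    public QuerySpectrumBuilder(Map<String, Map<String, Double>> synsets) {
        this.synsets = synsets;
    }

    public Map<String, Double> build(Collection<String> queryWords) {
        Map<String, Double> sigmaQ = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> concept : synsets.entrySet()) {
            for (String word : queryWords) {
                Double psi = concept.getValue().get(word.toLowerCase());
                if (psi != null) {
                    // accumulate the contribution of each matching query word
                    sigmaQ.merge(concept.getKey(), psi, Double::sum);
                }
            }
        }
        return sigmaQ;
    }
}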

5.2.4 Searching by conceptual spectra


H-DOSE adopts a semantic search paradigm based on the notion of conceptual
spectrum. The search process is organized as follows: firstly the entire annotation
base is expanded into expanded conceptual spectra that form the system knowledge
base. This operation needs to be performed only once and can be done entirely off-line.
Secondly, a series of operations is performed that, starting from a user query,
provides a ranked set of URIs.


When a query is received, it is mapped into ontology concepts, obtaining the
spectrum σ_q, according to the query spectrum defined in section 5.2.3. The resulting
spectrum is then expanded in order to get an expanded spectrum comparable to
those stored in the operating knowledge base.
A spectrum matching algorithm subsequently extracts, from the knowledge base,
the most relevant associations, identifying a set of documents which are relevant to
the user query. Retrieved URIs are finally ranked according to a similarity measure
which takes into account the form difference between the spectra of query and of
retrieved documents, and the top URIs are provided as result.

Query matching
Document spectra are expanded at indexing time and stored into the annotation
base that will be used for query matching and relevant document retrieval. Con-
versely, queries are translated into expanded spectra at runtime, before searching
the annotation base for a match. However, from a search engine point of view, they
are both spectra in a common, homogeneous space, and the retrieval task simply
corresponds to resource matching in such a space.
In order to perform spectra matching, a similarity function can be defined ex-
tending the Vector Space model for information retrieval. This extension interprets
the two spectra to be compared as two vectors into an n-dimensional space, where
ontology concepts c represent dimensions and the correspondent relevance weights
σ(c), computed by means of the expansion operator X, are the vector components.
Searching for a match in terms of shape is, in that space, equivalent to searching
for vectors having the minimum angular distance between them, i.e., to searching for
vectors with similar directions. Therefore, the similarity Sim(σ_q,σ_d) can simply be
computed by evaluating the cosine of the angle between the two spectra.
Sim(σ_q, σ_d) = cos(φ_{q,d}) = (σ_q · σ_d) / (|σ_q| · |σ_d|)
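The following Java sketch computes this cosine similarity over sparse spectra represented as concept-to-weight maps (an illustrative representation, not the H-DOSE one).

import java.util.*;

/** Sketch: cosine similarity between two (expanded) conceptual spectra. */
public final class SpectrumSimilarity {

    public static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> component : q.entrySet()) {
            Double weight = d.get(component.getKey());
            if (weight != null) dot += component.getValue() * weight; // only shared concepts contribute
        }
        double normQ = norm(q);
        double normD = norm(d);
        return (normQ == 0.0 || normD == 0.0) ? 0.0 : dot / (normQ * normD);
    }

    private static double norm(Map<String, Double> spectrum) {
        double sum = 0.0;
        for (double v : spectrum.values()) sum += v * v;
        return Math.sqrt(sum);
    }
}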

5.3 Bridging the gap between syntax and semantics
Until now, methods for representing conceptual descriptions of resources and queries,
and methods for matching them in the same vectorial space have been explained.
However nothing has been said on how to extract these descriptions from real world
data such as HTML or XHTML documents.
The ontology definition adopted by H-DOSE already provides means for a simple
“bag of words” classification paradigm. This paradigm simply works by searching
documents for words occurring in synsets. For each occurrence found, a score is
computed, for example by using a tf · idf weighting scheme. The contributions
of all the words in a synset S associated to a concept c are combined into the value
of a single spectrum component relative to the concept c, thus defining the final
semantic characterization of a given resource.
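A minimal Java sketch of such a “bag of words” scoring step (an illustration under stated assumptions, not the H-DOSE implementation): synsets are assumed to be plain word sets per concept, and idf values are assumed to be pre-computed over the indexed collection.

import java.util.*;

/** Sketch: the spectrum component of a concept is the sum of tf·idf scores of its synset words. */
public class BagOfWordsClassifier {

    public static Map<String, Double> classify(List<String> documentWords,
                                               Map<String, Set<String>> synsets, // concept -> words
                                               Map<String, Double> idf) {
        // term frequencies in the document
        Map<String, Integer> tf = new HashMap<>();
        for (String w : documentWords) tf.merge(w.toLowerCase(), 1, Integer::sum);

        Map<String, Double> spectrum = new HashMap<>();
        for (Map.Entry<String, Set<String>> concept : synsets.entrySet()) {
            double score = 0.0;
            for (String word : concept.getValue()) {
                int frequency = tf.getOrDefault(word, 0);
                if (frequency > 0) score += frequency * idf.getOrDefault(word, 1.0); // tf · idf
            }
            if (score > 0) spectrum.put(concept.getKey(), score);
        }
        return spectrum;
    }
}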
The “bag of words” is a widespread, although very simplistic, method for
classifying resources into classes (in the H-DOSE case, concepts in an ontology).
Other, more sophisticated methods can be applied: for example, the latest version of
the H-DOSE platform uses SVM (Support Vector Machine) classifiers for associating
documents to ontology concepts. Rather than discussing the effectiveness of classification
methods, which have been extensively studied in the Information Retrieval research
field, as the TREC conference series [14] demonstrates, the main focus here is on
filling the gap between purely syntactic classification and semantics. In such a case,
the “bag of words”, although very simple, is a quite powerful method, since words in
synsets are defined by experts and discriminate resources quite precisely, by using
tacit human knowledge, i.e. the skills of domain experts, to ensure the correctness
of mappings. In addition, synsets so defined can also be used as seminal training
information for more complex classifiers such as SVMs. Unfortunately these expert-
defined synsets, while being assumed error-free, are clearly not exhaustive, nor are
they sufficient for performing effective enough classification. Nevertheless they can
represent seed information for automatically widening the lexical coverage of the
semantic platform, thus making it possible to better recognize the context of indexed
resources while avoiding errors related to term ambiguity.
In H-DOSE, before applying any kind of classification method, the expert-defined
synsets are therefore expanded to a more usable size (around 10–15 terms for
each ontology concept). Two methods, firstly presented at SAC 2005, the ACM
Symposium on Applied Computing, Santa Fe, New Mexico, have been developed for
performing this expansion, and they can work in cooperation to maximize the expansion
effect while minimizing the errors that the process inevitably introduces. Both methods
leverage Wordnet-like lexical networks for extracting relevant terms: the first rec-
ognizes the sense of terms by using the ontology structure, while the second uses
statistical information on term co-occurrence.

5.3.1 Focus-based synset expansion


The simplest synset expansion is the retrieval of a group of words in a synonymy
relationship with the existing ones. Such a step could easily be performed using lexical
nets, starting from a minimal set of words defined by experts and applying a transi-
tive closure of the synonymy relationship. However, this would neglect the sense
parameter, and it does not guarantee that the expanded synset is self-consistent
and wholly relevant with respect to the concept specification. For example, the concept
“Business broker” could be represented by the expert-defined word “Agent”.
Unfortunately “Agent” in the computer science context is a software entity with
no connection at all with financial markets. Therefore simply retrieving synonyms
from lexical nets would produce an inconsistently expanded synset in which both
software and finance terms appear. With these premises, for every word to be used
in the expansion process it is necessary to discriminate between its different senses
to identify the synonyms according to the relevant sense only, and to avoid adding
misleading results to the knowledge base. This action in the literature is typically
called sense disambiguation. To perform sense disambiguation, H-DOSE uses a tech-
nique inspired by the focus-based approach defined by Bouquet et al. [15]. In such an
approach the focus is defined as a concept hierarchy containing the original node,
all its ancestors, and their children. Thus the focus of a concept is the part of the
ontology necessary to understand its context. Let us consider an example to better
clarify the definition: referring to the ontology in Figure 5.5, we want to expand the
synset of the concept “Agent”.

Figure 5.5. The focus of “Agent”.

Concepts surrounded by dashed lines compose the focus for the concept “Agent”.
If a search for “Agent” synonyms is performed in WordNet, as an example, one
of the provided terms is “broker”. However, in the above ontology, agents have
nothing in common with businessmen and are only proactive software entities. The
focus of the concept would easily allow rejecting the term “broker” since it is not
related to software, software entities and so on. Such an approach, although
quite effective in sense disambiguation, has some limitations in the definition given
in [15]. The original approach relies on label names to contextualize a concept,
and this brings some extra constraints for ontology designers. Beyond that, the
approach does not scale to multilingual environments in which concept labels are
meaningless. Finally, relying on a single word per concept to understand the context
increases the chances of misunderstanding the correct sense. For this reason, the
technique depicted here, while inspired by the approach described
and in particular by the focus of a concept, can disambiguate senses while maintaining
full multilingual support and making full use of the H-DOSE ontology definition.
Instead of concept labels, synsets defined by experts are used, obtaining, on one side,
a greater number of candidate terms, since the starting set is wider than a single word
(the label), and, on the other side, improving the sense detection capability, because
more words concur in the focus definition, allowing better sense disambiguation.
The underlying idea is that, for every word existing in the synset of a concept,
we can detect its correct sense by checking the “similarity” between every synset5
(which depends on the sense parameter) of the lexical net and the synsets of the
focus of the concept (see Figure 5.6 for the method pseudo-code).

Vector synset = getSynset(ontoClass);
Vector expansion = new Vector();
Vector focus = calculateFocus(ontoClass);
Vector focusWords = new Vector();
for-each concept in focus
    focusWords.addAll(concept.getSynset());
end for
for-each word in synset
    Vector senses = SemNet.getSenses(word);
    int counter[senses.size];
    for-each fWord in focusWords
        for-each sense in senses
            counter[sense] += occurrences of fWord in
                              SemNet.getSynonyms(word, sense);
        end for
    end for
    maxSense = sense for which counter[sense] is maximum;
    expansion += SemNet.getSynonyms(word, maxSense);
end for
return expansion;

Figure 5.6. Pseudo-code of the focus-based expansion

5 There is a clash between the term “synset” adopted for denoting the words associated to concepts
and the term “synset” adopted in lexical networks for defining sets of words with similar senses. Here
the term “synset” refers to the latter meaning.
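For completeness, the following is a Java sketch of how the calculateFocus step used in Figure 5.6 could be realized (the concept itself, all its ancestors, and the children of those ancestors); the minimal ontology interface shown here is an assumption for illustration only.

import java.util.*;

/** Sketch of focus computation: the concept, its ancestors, and their direct children. */
public class FocusCalculator {

    interface Ontology {
        Set<String> parentsOf(String concept);   // direct superclasses
        Set<String> childrenOf(String concept);  // direct subclasses
    }

    public static Set<String> calculateFocus(Ontology ontology, String concept) {
        Set<String> focus = new LinkedHashSet<>();
        focus.add(concept);
        // walk up the hierarchy collecting all ancestors...
        Deque<String> toVisit = new ArrayDeque<>(ontology.parentsOf(concept));
        Set<String> ancestors = new LinkedHashSet<>();
        while (!toVisit.isEmpty()) {
            String ancestor = toVisit.pop();
            if (ancestors.add(ancestor)) toVisit.addAll(ontology.parentsOf(ancestor));
        }
        focus.addAll(ancestors);
        // ...and add the direct children of every ancestor
        for (String ancestor : ancestors) focus.addAll(ontology.childrenOf(ancestor));
        return focus;
    }
}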


5.3.2 Statistical integration


The focus-based method has been integrated into a more powerful and general synset
generation architecture, where statistical information on word co-occurrence is also
taken into account. The underlying assumption is that, although the effectiveness
of the focus approach is high, some relevant words could be missed or simply mis-
classified and rejected. To address such an issue, strategies have been designed for
detecting missed relevant words, using information already encoded in synsets.

Vector expansion = new Vector();
Vector allSynonyms = new Vector();
int length = synset.length;
for-each lexicalEntity in synset do
    allSynonyms += all the synonyms of lexicalEntity from
                   SemNet, regardless of their sense;
end for
for-each syn in allSynonyms
    if (occurrences of syn in allSynonyms > threshold * length)
        expansion += syn;
    end if
end for
return expansion;

Figure 5.7. Pseudo-code of the statistical-based expansion

The first step of these refinement strategies takes as input a synset and performs
a search on lexical nets retrieving, for each synset term, all the available synonyms,
regardless of their sense. Then a set of policies is applied, basically searching the
synonym set for co-occurrences of words. This approach is purely statistical: every
term is ranked according to its occurrence frequency and filtered using an adaptive
threshold. Repeated terms are likely to be related to the context the ontology refers
to. The pseudo-code of the method which, having received a synset, returns the
expansion, is presented in Figure 5.7. To achieve better results, it is possible to
integrate this approach with the formerly presented one by using the output of the
focus-based expansion as input for the statistical method. In this way, the statistical
integration, together with the major contribution from the focus-based expansion
process, takes part in the final synset definition, allowing for the creation of a suitable
set of words. The overall word-set precision, with respect to the conceptual
specification, may decrease in the expansion process, while still remaining effective
enough to be used by classification engines. Assuming that the expert-created synsets
have a precision of nearly 100%, the automatic expansion, while bringing new useful
information, may in fact include some misleading terms.


Therefore, the increase in synset size (and conceptual coverage) is usually balanced
by a potential loss of precision.

5.4 Experimental evidence


This section provides the experimental evidence for the principles introduced in
the previous sections, and in particular for the multilingual approach, for the spec-
trum representation and the spectra-based search, and for the synset expansion
techniques. Results in this section have been extracted using the former version of
the H-DOSE platform, simply called DOSE, which supported fragmentation at the
HTML tag level.

5.4.1 Multilingual approach results


Multilingual functionalities in the H-DOSE platform have been tested in a sim-
ple experimental setup, aimed at assessing the feasibility of the adopted approach.
The tests used a disability ontology developed in collaboration with the Passepa-
rtout service of the City of Turin, a public service for disabled people integration
and assistance in Italy. Two HTML pages have been randomly selected from the
Passepartout web site that were available in English and in Italian. The disabil-
ity ontology, originally paired with an Italian lexicon, has been integrated with an
equivalent English lexicon.
The first test aims at showing the architecture support to multilingualism; to
achieve this goal Italian and English documents have been indexed, obtaining 85
annotations for the English version and 58 for the Italian one, automatically recog-
nizing the document language. Ideally, at the semantic level, the platform should
annotate with an equivalent set of concepts all pairs of translated page fragments.
To verify this property, the annotations stored into the repository for both Italian
and English documents have been analyzed and the correlation factor between de-
tected fragments computed, at different levels of fragmentation, i.e., at the Hx
tag level (see Table 5.1) and at the P tag level (see Table 5.2). The adopted
correlation measure is the cosine of the angle between the vectors representing the
document fragments, defined according to the classical vector space model and con-
sidering each concept as an independent dimension. Rows and columns are labeled
according to the source of fragments, using a “language/page/fragment” syntax.
Correlation data, at each fragmentation level, has been grouped into two sets: the
first is composed by elements lying on the table diagonal while the second includes all
remaining elements. Elements lying on the table diagonal are the correlation factors
between the same fragments expressed in different languages. Figures 5.8 and 5.9
depict the correlation distributions for the two sets at the Hx fragmentation level
and at the P level, respectively.


IT \ EN      Pg1/H1[1]  Pg1/H2[1]  Pg2/H1[1]  Pg2/H2[1]  Pg2/H2[2]
Pg1/H1[1]    0.69       0.62       0.28       0.29       0.34
Pg1/H2[1]    0.73       0.74       0.19       0.18       0.27
Pg2/H1[1]    0.30       0.36       0.69       0.46       0.54
Pg2/H2[1]    0.19       0.22       0.54       0.50       0.38
Pg2/H2[2]    0.21       0.25       0.49       0.36       1.00

Table 5.1. Correlation factor between fragments at the Hx fragmentation level.

IT \ EN      Pg1/P[2]  Pg1/P[3]  Pg2/P[1]  Pg2/P[2]
Pg1/P[2]     0.59      0.36      0.21      0.30
Pg1/P[3]     0.00      0.79      0.00      0.00
Pg2/P[1]     0.32      0.00      0.50      0.38
Pg2/P[2]     0.36      0.00      0.36      1.00

Table 5.2. Correlation factor between fragments at the P fragmentation level.

It is easy to notice that the distribution of the
first set and that of the second set are nearly separated: this result conforms to
the initial expectations and ensures that the DOSE platform can work effectively
in a multilingual environment. In fact, it shows that correspondent fragments, in
different languages, are annotated as belonging to a common set of concepts while
non-correspondent fragments are kept distinct by annotating them with reasonably
different concepts.

Figure 5.8. Correlation between fragments at the Hx level.


Figure 5.9. Correlation between fragments at the P level.

5.4.2 Conceptual spectra experiments


Several tests have been set up in order to assess the feasibility of the spectrum
representation together with the effectiveness of the spectra-based search process.
The tests used an ontology developed in collaboration with the Passepartout ser-
vice of the city of Turin. The ontology is composed of about 450 concepts related
by means of either hierarchical or non-hierarchical relationships. Eight types of
non-hierarchical relationships have been defined in the ontology, such as: “Implies”,
“Defined by”, etc. For each relationship a weight has been specified by domain
experts, starting from the “isA” relation that has been considered as the strongest
one. Around 1,000 documents from the Passepartout web site have been indexed,
obtaining around 40,700 semantic annotations. Resources have been split into sev-
eral fragments and each fragment has been separately annotated in order to provide
fine granularity search results.
Starting from such a relatively small annotation base a search test has been
deployed involving 10 people in the evaluation phase. Each evaluator has been
given a set of queries (Table 5.3) to be performed and has been asked to judge
results relevance with respect to queries, assigning, to retrieved documents, values
that range from 0 (not relevant) to 100 (fully relevant). The search interface was a
simple PHP front-end for the DOSE architecture allowing keyword based searches.
Half of the evaluators used a previous, keyword based, search engine already
available in the platform while the second half exploited the proposed spectra-based
engine.
Results have been collected and grouped by query and the corresponding rele-
vance values have been analyzed and used to draw the following relevance charts.


Queries (Italian / English)


Cieco / Blind
Lavoro Disabile Cieco / Job Disabled people Blind
Trasporto Disabile / Disabled people Transportation
Diritto Sordo Lavoro / Rights Deaf Job
Agevolazioni Cieco / Facilitations Blind
Table 5.3. Test Queries

As can easily be noticed in Figure 5.10, the spectra search engine is, on average,
able to provide more relevant results than the keyword-based one, and it is also able
to provide a better ranking of retrieved documents, placing more relevant results
in the first positions of the whole retrieved set.

Figure 5.10. Comparison between the previous search engine and the spectra one
on the query “lavoro disabile cieco” (Job Disabled-people Blind).

Results for the entire evaluation phase have been grouped into a chart (Figure
5.11) that shows the average relevance achieved by both search engines on the five
different queries, according to evaluators’ judgment. Such results have been weighted
by taking into account the ranking position of retrieved documents, therefore the
mean relevance value R on the y axis has been computed as follows:

R = Σ_d  r_d / rank(d)

where r_d is the relevance of the document d and rank(d) its ranking order.
The overall performance of the proposed spectra search engine exceeds that of
the traditional search engine for every query issued. The weighted combination
of better ranking and better precision underlines the feasibility of the approach and
provides an indication of its effectiveness, being able to work at different conceptual
granularities for both queries and indexed documents.

Figure 5.11. Mean relevance scores for both engines over all queries.

5.4.3 Automatic learning of text-to-concept mappings


The synset expansion and integration techniques introduced in section 5.3 have
been tested using the Passepartout ontology adopted in the previously detailed
experiments. Firstly the focus-based expansion was applied, using WordNet as
lexical network, followed by the statistical integration. Starting from an original
set of 1973 terms associated to the ontology concepts, the focus-based technique
extracted 1200 more words, providing a size increase of about 61%. The subsequent
application of the statistical integration led to the extraction of about 700 more terms,
corresponding to 21% of the focus-expanded synsets. The precision of results
was higher for the latter set, while still being valuable for the focus-extracted terms. Table
5.4 reports the corresponding figures.

Synset type                      Size   Precision
Original                         1973   100%
After focus expansion            3221   89%
After statistical integration    3897   91%

Table 5.4. Synset expansion results.

In order to perform a better evaluation, the impact of the expanded lexical
mappings on the process of indexing and retrieving resources has also been tested.
To this end, 80 pages from the Asphi website [15], an on-line collection of resources
and services for disabled people, have been indexed using
a simple “bag of words” indexing powered, respectively, by the original synsets, by
the focus-expanded ones and by the statistically integrated set of terms. Five queries
have been issued for each set and, for each query, only the first 20 retrieved resources
have been evaluated in terms of precision and recall. As shown by Figures 5.12 and
5.13, while the recall figure increases as more terms are adopted, the precision values
stay nearly unchanged, thus showing the effectiveness of the approach.

Figure 5.12. Recall results for 5 queries on the www.asphi.it web site.

Figure 5.13. Precision results for 5 queries on the www.asphi.it web site.

Chapter 6

The H-DOSE platform

This chapter describes the H-DOSE platform in detail, focusing on
the role of, and the interactions between, every single component of the
platform. The main concern of this chapter is to provide a complete
view of the platform, in its more specific aspects, discussing the adopted
solutions from a “software engineering” point of view.

The main motivation for the H-DOSE design is to provide a system actually usable
in today's Web. According to the vision of this work, such a result shall
be reached through the analysis of requirements perceived by the web actors (and
reported in Chapter 4). In order to respond to some of these requirements, non-
functional requirements especially, a so-called holistic approach has therefore been
adopted. The term holistic refers to the integration of different techniques, and
in particular web services and multi-agent systems, into a common platform for
semantic elaboration.
As emerges from the non-functional and usability requirements number 1 and 3,
the H-DOSE platform shall be easily integrable into already existing publication
frameworks and shall be independent of the server-side technologies adopted for
publication.
Web services are the state of the art technology for supporting such functional-
ities. They, in fact, allow system interoperability, by adopting open Internet stan-
dards, by being able to describe their own functionalities and location and by inter-
acting with other Web Services. They are cost-effective since they replace tightly
coupled applications, with related problems of data passing and interface agree-
ments, by offering a loosely coupled architecture in which the business logic is com-
pletely separated from the data layer. Moreover, web services allow reuse of func-
tionalities by adopting standard description formats: WSDL for service logic, UDDI
for advertising and SOAP to communicate. The web service technology is very well
suited for “access-type” services, which are not very computationally expensive and
that usually do not take advantage of replication and distribution amongst differ-
ent locations. Web services are, in fact, more suited to accomplish interface tasks
rather than to implement the internal business logic of applications, which is usually
delegated to more effective processes.
H-DOSE accounts for this feature by adopting web services as standard interfaces
for providing semantic services to SOAP-enabled applications. The complex internal
tasks are, instead, delegated to the platform internal layers where they get
executed through a quite different technology: multi-agent systems. Agents possess,
in fact, characteristics that make them well suited for accomplishing very intensive
tasks exploiting replication, distribution and location-aware computing. They
are “living” software entities located into proper containers that constitute their
ecosystem, and they have an adaptive and autonomous nature, and social capabili-
ties enabling them to coordinate their own actions with others and to cooperate or
negotiate for reaching a specific designed goal.
Since the non-functional requirements, especially the requirements number 1 and
2 under the category “performances”, indicate that scalability is one of the main
concerns for the platform deployment in real world applications, H-DOSE uses agents
to perform computationally intensive tasks such as automatic indexing of resources.
The adoption of agent services, in fact, allows the natural distribution of tasks, and
thus of the associated load, toward the various information sources involved in the
classification. So, for example, if a given web site offers a suitable container
for agents, the indexing part of the H-DOSE architecture can be replicated on that
site, allowing for local indexing and distributing the indexing load across all the
available information sources, depending on resource location.

6.1 A layered view of H-DOSE


The H-DOSE platform, firstly published in the proceedings of ICTAI 2003, Sacramento,
California, and subsequently revised into the version presented in this thesis, published
in the proceedings of SWAP 2004, Semantic Web Applications and Perspectives, Ancona,
Italy, adopts a modular, distributed architecture that is deployed
on three different layers: the Service layer, the Kernel layer and the Data-access
(Wrappers) layer (Figure 6.1).

6.1.1 Service Layer


The Service layer implements the interface between the architecture and the external
applications that integrate the functionalities provided by the platform. In such a
layer, the available services correspond to the high-level functionalities of semantic
classification (indexing) and searching. They are respectively implemented by the
Indexing service and by the Search service.

Figure 6.1. The H-DOSE architecture.

Indexing service

The Indexing service offers a public, SOAP-based interface for performing semantic
classification of textual resources. It is basically a queue manager for the correspond-
ing kernel-level service, implemented as a multi-agent system. Two main operation
types are supported, namely “indexing” and “batch-indexing”.
The indexing operation offers a simple, interactive way for semantically classifying re-
sources. An external application invoking this operation type is basically required to
provide its name or unique identifier, and the URI of the document to be indexed.
Some additional information can be provided if, for example, the document is a
fragment belonging to a more complex resource. In such a case a “partOf” attribute
specifies the URI of the resource which contains the document. The service call, as
in the “batch indexing” case, is asynchronous, and the application can decide
whether to listen for annotation results or not.
Internally, the indexing operation simply enqueues the URI of the document
into the kernel-level agent service, and if required maintains the reference to the
calling application for notifying the indexing result. Such a reference is not a clas-
sical “pointer”, instead it is simply the endpoint of a web service (either SOAP or
XML-RPC) to which the notification shall be propagated. This design choice makes it
possible to keep the indexing service as stateless as possible, thus avoiding storing explicit
references to applications with related problems of session persistence, authentica-


tion, etc. Figure 6.2 shows the indexing operation interface, in pseudo code, where
the parameters denoted with an asterisk are optional.

void index(URI documentURI, URI applicationURI,


URI superDocumentURI*, Endpoint notificationEndpoint*)

Figure 6.2. Public interface of the “indexing” operation.
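A possible shape of the corresponding implementation is sketched below in Java (this is an illustration of the enqueue-and-notify behavior described above, not the actual H-DOSE code; all type and field names are assumptions).

import java.net.URI;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch: the service-level indexing operation as a stateless queue front-end. */
public class IndexingServiceSketch {

    /** One queued indexing request; the notification endpoint is optional. */
    record IndexRequest(URI document, URI application,
                        Optional<URI> superDocument, Optional<URI> notifyEndpoint) { }

    private final BlockingQueue<IndexRequest> queue = new LinkedBlockingQueue<>();

    /** Asynchronous: returns immediately after enqueuing the request for the kernel level. */
    public void index(URI documentUri, URI applicationUri,
                      URI superDocumentUri, URI notificationEndpoint) {
        queue.add(new IndexRequest(documentUri, applicationUri,
                Optional.ofNullable(superDocumentUri),
                Optional.ofNullable(notificationEndpoint)));
    }
}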

The batch-indexing operation is, as the name says, focused on the indexing of
many resources at a time. This functionality is usually required when a consistent set
of resources, e.g., a CMS article base, must be classified. The interface in this case
is quite similar to that of the simple indexing operation; however, the parameters
are stored in arrays, one for the documents to be indexed and one for the optional
“container” resources. The internal implementation simply loads the queue of the
kernel-level service with the whole bunch of documents to be indexed. In addition
an alternative interface is also provided for enabling indexing transactions in which
each simple indexing request is enqueued into a single batch-indexing request (Figure
6.3).

void batchIndex(URI documentURIs[ ], URI applicationURI,


URI superDocumentURIs[ ]*,
Endpoint notificationEndpoint*)

Or

beginBatch(applicationURI)
index(documentURI1, applicationURI,
superDocumentURI1*, notificationEndpoint*)
index(documentURI2, applicationURI,
superDocumentURI2*, notificationEndpoint*)
...
index(documentURIn, applicationURI,
superDocumentURIn*, notificationEndpoint*)
endBatch(applicationURI)

Figure 6.3. Public interface of the “batch-indexing” operation.

Typically, batch operations are accomplished by the same agent in the kernel-
level indexing service, thus making it easy to keep track of the corresponding clas-
sification tasks. Unlike what happens in the simple indexing scenario, the possible
notification refers to the whole ensemble of resources: an indexing failure does
not indicate that every resource in the ensemble failed, but only that
at least one document has not been classified due to errors. It is also important
to notice that the transaction mode violates the goal of maintaining the service
as stateless as possible, however problems related to non-terminated transactions
are easily handled by the maintenance service, which periodically purges pending
notifications, transactions, etc. from the indexing service queue.
Finally, the indexing service also provides a wrapped access to the kernel-level
annotation repository, which allows semantics-aware external applications to directly
store annotations into the H-DOSE knowledge base (Figure 6.4). However, care
must be taken when adopting such a method, since the provided annotations must
be created using the same ontology that powers the platform. If this constraint is
not respected, the platform generates an error and drops the implicated annotations
in order to keep itself in a consistent state.

void store(URI documentURI, URI applicationURI,


URI superDocumentURI*, URI conceptURIS[ ],
Float weights[ ])

Figure 6.4. Public interface of the “annotation-store” operation.

Search service
As stated in the previous chapter, the H-DOSE search services are semantic, i.e., the
relevance of results with respect to application queries is evaluated at a language-
independent conceptual level. Such evaluation uses the methods explained in section
5.2 and provides, as output, a ranked set of resources in the form of a list of URIs.
Two operations are defined: a “search by concept” and a “what’s related” search.
In the “search by concept” functionality, the calling application must specify a
list of relevant concepts, possibly accompanied by a corresponding set of weights,
between 0 and 1, that specify the importance of each concept in the list. These two
pieces of information are automatically combined into a conceptual spectrum, which is
then expanded using the expansion operator defined in section 5.2.2 and implemented by
the Expander service at the Kernel layer. Once expanded, the query spectrum is used
as seed information for the Annotation repository, which retrieves all the document
descriptions similar to the query spectrum. These descriptions are then ranked by
the Search engine according to their similarity with the original query. It must
be noted that descriptions are retrieved using the expanded query spectrum,
which makes it possible to also select relevant resources that are not explicitly correlated
with the initial query. The final ranking is performed by evaluating the similarity of
document descriptions with the original query, thus filtering out the possibly wrong
results due to the expansion process. In other words, the expansion process widens
the potential recall of the search system, while the final ranking tries to keep the
precision of results as high as possible. Figure 6.5 reports the “search by concept” interface
in pseudo-code.

URI[ ] search(URI conceptURIs[ ], Float conceptWeights[ ],


String resultLanguages[ ]*,int maxResults)

Figure 6.5. Public interface of the “search by concept” operation.

By looking at the interface pseudo-code, two more features can be identified.
First, since the platform is actually multilingual, the provided results can be in
any of the supported languages, even in more than one at a time. So, for example, an
application can require results in both English and German. The second observation
is that the query is fully language-independent: it is in fact composed of the URIs
of concepts in the H-DOSE ontology. This, on one side, implies that the application
shall have a means for mapping concepts and user-defined queries, for example by
providing a directory-like search interface where each directory node corresponds to
a concept URI. On the other side, it forces the application to be at least partially
aware of the ontology used by H-DOSE; otherwise the concept URIs cannot be
correctly specified. In order to supply a more traditional access to the search service
a slight variation of the “search by concept” functionality is also available, where
applications can provide keyword-based queries. In such a case the keywords are
converted into a conceptual spectrum by means of a sub-module of the indexing
service (known as Semantic Mapper). Then the search process continues as in the
normal case providing as final result a set of possibly relevant URIs (see Figure 6.6
for the interface specification).

URI[ ] search(String keywords[ ], String resultLanguages[ ]*,


int maxResults)

Figure 6.6. Public interface of the “search by keyword” operation.

The “what’s related” search functionality is specifically designed to respond to
the 6th functional requirement specified in section 4.1. It offers a very simple inter-
face in which a document URI shall be specified together with the maximum number
of required results (Figure 6.7 shows the corresponding interface).
Whenever the search engine service receives such a request, it simply forwards the
received URI to the “reverseSearch” method of the annotation repository, which in
turn provides as result a list of documents whose spectra are similar to the spectrum
of the document identified by the given URI. Then the retrieved results are ranked
according to their similarity with the spectrum of the query document, provided by
the annotation repository, and returned to the calling application. If the number of
available results is bigger than the maximum number of required results, only the
top maxResults URIs are provided.

URI[ ] whatIsRelated(URI documentURI, int maxResults)

Figure 6.7. Public interface of the “what’s related” search operation.
A last remark shall be made about the search engine service: the similarity
function used by this service is exactly the one defined in section 5.2.4, while, in the
annotation repository, similarity is evaluated in a very approximate form, privileging
response time over precision in the evaluation.

6.1.2 Kernel Layer


The Kernel layer is where the most complex and computationally intensive tasks
take place. Such tasks are logically subdivided into classification tasks and retrieval
tasks; modules are separated according to these logical functions. In particular,
the Expander service is much more used in the retrieval process while the Indexing
sub-system, as can easily be noticed, is devoted to the classification process.
The Annotation Repository has a sort of dual nature since it is used in both in-
dexing and retrieval; it is in fact charged with annotation management and persistence.
This dual nature adds a further performance requirement to such a module which
shall be fast enough to effectively handle requests coming from both the indexing
and the search subsystems.

Expander service
The expander service implements, inside the platform, the spectrum expansion op-
erator. It basically accepts as input a “raw”, or unexpanded, spectrum and performs
the spectrum expansion using the H-DOSE ontology. The result is again a concep-
tual spectrum, which takes into account the non-explicit knowledge encoded in the
H-DOSE ontology (see Figure 6.8 for the pseudo-code of the expander interface).

Spectrum expand(Spectrum rawSpectrum)

Figure 6.8. Public interface of the expander service.

The expansion operator can be proficiently applied in the search phase, whatever
the interaction paradigm used, except for the “what’s related” one.


When a user specifies a query, either by directly selecting concepts or by pro-
viding some “relevant” keywords, the resulting spectrum is usually composed of a few
concepts. In this scenario the expander can browse the ontology, according to se-
mantic relationships, and expand the query specification by adding relevant, related
concepts to the initial specification. This operation potentially allows the system
to retrieve resources which are interesting for the user but that, without expansion,
would not have been retrieved because they do not have direct associations with the
concepts originally composing the query.
In classification tasks, the expander could either be used or not. However, while
in the automatic classification task the expansion can be useful by explicitly accounting
for the semantic relationships occurring in the platform conceptual domain, and
by possibly overcoming mis-classifications due to word ambiguity, in the manual
classification task it could deteriorate the annotation quality by adding noise to the
annotations entered by users that, in this case, are likely to be domain experts.
In the platform deployment adopted in this thesis the expansion process takes
place for both search and automatic indexing but not for direct annotation.

Annotation Repository
The annotation repository module is the sole responsible for annotation storage,
retrieval and management. It accepts, on one side, storage requests, and writes the
received annotations into the platform database by means of a proper wrapper. On
the other side, it listens for search requests and has the ability to provide as result
a subset of the stored annotations in which the concepts included into the received
query are contained.
The Annotation repository is the most centralized module of the platform: it
manages all the annotation storage and search requests. For this reason it shall be as
fast as possible, especially in the retrieval phase where the response time is a critical
factor for achieving user satisfaction. For the same reason it may constitute a
bottleneck in the platform information flow.
This risk could easily be avoided by using service replication and, thus, by prop-
erly configuring the platform to work with more than one annotation repository.
To perform this configuration both the indexing sub-system and the service level
modules should be aware of the presence of more than one annotation repository
and should have references to the proper one. Correspondence between service con-
sumers (i.e. the Service layer modules and the indexing sub-system) and service
providers (the Annotation repository copies) shall be fixed at configuration time.
Clearly, care must be taken when deploying the platform in order to correctly config-
ure H-DOSE, choosing a good trade-off between the required performance and the
available computational resources.
From a more operational point of view, the annotation repository exposes four
different interfaces: a “store” method, a “retrieve” function, a “reverse search”
operation and, finally, an “inverse map” procedure. The store method is usually
called by the indexing sub-system (either at the service or at the kernel layer) and
allows to memorize semantic annotations into a persistent storage. Its signature is
reported in Figure 6.9.

int store(Spectrum expandedSpectrum, Spectrum rawSpectrum,


URI documentURI, URI applicationURI,
URI superDocumentURI*)

Figure 6.9. Public interface of the store method of the annotation repository.

The retrieve function is, in a sense, the inverse operator of the store function:
given a set of concepts, i.e., a spectrum, it provides as result a list of URIs which have
spectra somewhat similar to the specified one (see Figure 6.10). The similarity is
simply evaluated by checking the common spectrum components which are not null,
and by ranking results according to the number of co-occurring concepts (i.e., spectrum
components).

URI[ ] retrieve(Spectrum querySpectrum, int maxResults)

Figure 6.10. Public interface of the retrieve method of the annotation repository.
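As a sketch of this approximate ranking (assuming the same sparse concept-to-weight representation used in the previous examples, not the actual repository internals), the score of a stored document can simply be the count of query components it shares:

import java.util.*;

/** Sketch: coarse co-occurrence score used to pre-rank retrieved spectra. */
public final class CoOccurrenceRanker {

    public static int sharedComponents(Map<String, Double> query, Map<String, Double> document) {
        int shared = 0;
        for (String concept : query.keySet()) {
            if (document.getOrDefault(concept, 0.0) != 0.0) shared++; // non-null common component
        }
        return shared;
    }
}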

What’s related searches in H-DOSE are particularly easy to perform thanks to
the availability of the reverse search functionality in the annotation repository. Such
an operation accepts as input a documentURI and provides as result a list of URIs
of documents having similar spectra. Internally, the annotation repository works as
follows: at first the conceptual spectrum corresponding to the input URI is obtained
and then, a normal retrieve operation is performed for extracting the final result.
It must be stressed that for the reverse search as well as for the former retrieve
function, expanded spectra are used so as to maximize the chances of finding
relevant resources. Figure 6.11 shows the reverse search signature.

URI[ ] reverseSearch(URI documentURI, int maxResults)

Figure 6.11. Public interface of the reverse search method of the annotation
repository.
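Conceptually, the reverse search can be seen as the composition of the inverse map functionality described next and the retrieve function above; the following Java interface sketch illustrates this composition (the Spectrum type of the figures is rendered here as a concept-to-weight map, purely as an assumption).

import java.net.URI;
import java.util.Map;

/** Sketch: reverseSearch composed from inverseMap and retrieve (cf. Figures 6.10 and 6.12). */
interface AnnotationRepositorySketch {

    URI[] retrieve(Map<URI, Double> querySpectrum, int maxResults);

    Map<URI, Double> inverseMap(URI documentURI);

    default URI[] reverseSearch(URI documentURI, int maxResults) {
        Map<URI, Double> spectrum = inverseMap(documentURI); // expanded spectrum of the document
        return retrieve(spectrum, maxResults);               // rank other documents against it
    }
}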

The inverse map functionality concludes this section. It basically provides a means
for querying the persistent storage in order to retrieve the conceptual description of
an indexed document. Therefore it needs as input a document URI, while providing
as result a conceptual spectrum (Figure 6.12).


Spectrum inverseMap(URI documentURI)

Figure 6.12. Public interface of the inverse map method of the annotation repos-
itory.

Indexing sub-system
The indexing sub-system is one of the most complex modules of the H-DOSE plat-
form. For this reason firstly a black-box description is provided, and only in a second
instance a more detailed system view is given, discussing each element of the box.
As stated by the name, the indexing sub-system provides the required function-
alities for the automatic indexing of textual resources. Two methods are offered
to external applications, which can either be the service-layer indexing module or
other modules that do not belong to the platform. These methods are respectively
for queuing resources to be indexed and for getting results. The latter is actually a
service callback registration.
Whenever a module needs to perform the automatic indexing of a resource, the
index method of the indexing subsystem is invoked. This method is available in two
variants: a single index function and a group index function. The two only differ
in the amount of data handled at a time. The latter, in particular, is used to index
a considerable group of textual resources in a unique, atomic operation. For both
methods, the corresponding signatures are reported in Figure 6.13.

void singleIndex(URI documentURI, URI applicationURI,


URI superDocumentURI*)

Or

void groupIndex(URI documentURIs[ ], URI applicationURI,


URI superDocumentURIs[ ]*)

Figure 6.13. Public interface of the “index” operation, at the kernel-level.

The indexing operation can be successful, so that the resource conceptual de-
scriptions will be stored into the H-DOSE persistent storage, or it can fail. In both
cases, calling applications may want a notification of the automatic annotation re-
sult. To obtain this notification message, they shall register themselves with the
indexing sub-system. In the H-DOSE platform this task is automatically performed
for the indexing service at the service layer, which is by default registered with the
kernel-level service. However if an external, semantics-aware application needs to
directly access the kernel-level indexing service it must register itself with the service
using the register notify method (Figure 6.14).

94
6.1 – A layered view of H-DOSE

void registerNotify(Endpoint notifyEndpoint)

Figure 6.14. Public interface of the register notify method of the kernel-level
indexing sub-system.

Indexing sub-system insights


The indexing sub-system is implemented as a multi-agent system composed of both
resident and mobile agents. Resident agents either provide coordination services for
the mobile ones, as done by the agency manager, or implement services that shall
remain centralized, as happens for the semantic mapper and the synset manager.
Mobile agents, instead, are directly involved in the active characterization of textual
resources.
As already discussed, the process of extracting knowledge from textual resources,
i.e. of semantic classification, is a quite complex and computationally intensive
process. It requires on one side the ability to distribute as much as possible the
corresponding computation load, thus the adoption of mobile agents, and on the
other side requires a well defined organization of the indexing work in order to
maximize efficiency. Both concerns can be addressed by properly designing
the mobile part of the indexing sub-system. This part of H-DOSE is composed
of colonies of agents which are designed to autonomously accomplish the entire
process, from text analysis to spectrum generation and storage. Such colonies are
called Indexing squads and can be migrated to whatever device offers a suitable
container for the agents to live and operate.
Migration, nevertheless, shall not be performed randomly; instead it shall take
into account the location of information sources in order to reduce as much as
possible the distance between the data and the processing code. The underlying
idea is therefore to move the indexing code toward the data instead of doing the
opposite. Such an operation trades an increase in complexity for a larger decrease in the
amount of information that transits over the web, between the sources and the platform.
In an intra-net setting this design choice can be questionable, however it is the only
way to ensure scalability of the platform in “wild web” environments. Figure 6.15
shows the composition of the indexing sub-system and provides a sort of “zoom”
visualization of an Indexing squad.

Figure 6.15. The H-DOSE indexing sub-system.
Indexing squads are always composed of three agents: the media/language detector,
the filter agent and the annotator agent. Sometimes the squad is complemented by a
custom agent which makes it possible to directly access a given site database. This
agent is called the deep search agent.
A normal indexing task (i.e., without the deep search agent being involved) works
as follows. Every time the agency manager receives an indexing request, it
checks the documentURI(s) to extract the base site from which pages are to be
indexed. If an indexing squad is already deployed on the server publishing the site,
the new indexing task is forwarded to the squad residing on the remote machine.
Otherwise a discovery process starts, trying to find whether the remote host has
an agent container accessible or not. If a container is available and access to it
can be gained, the agency manager packs a new indexing squad and migrates the
agents, together with the indexing request, to the newly found container. Instead,
if no containers are available, the agency manager backs up the request on a set of
“friend” machines. The “friend” machines differ from normal container providers
as they not only allow the deployment of new agents on the container, but also
allow agents to perform tasks which involve other machines on the Web. A typical
example of a friend machine is the host on which the centralized part of the indexing
subsystem works.
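The dispatch policy just described can be summarized by the following Java sketch (the container discovery and migration primitives are assumptions used for illustration, not the real H-DOSE agent API).

import java.net.URI;

/** Sketch of the agency manager dispatch policy for incoming indexing requests. */
public class AgencyManagerDispatch {

    interface AgentPlatform {
        boolean squadDeployedAt(String host);
        boolean containerAvailableAt(String host);
        void forwardToSquad(String host, URI document);
        void migrateNewSquad(String host, URI document);
        void backupOnFriendMachine(URI document);
    }

    private final AgentPlatform platform;

    public AgencyManagerDispatch(AgentPlatform platform) {
        this.platform = platform;
    }

    public void dispatch(URI documentUri) {
        String host = documentUri.getHost();              // the base site publishing the resource
        if (platform.squadDeployedAt(host)) {
            platform.forwardToSquad(host, documentUri);   // reuse the squad already on the site
        } else if (platform.containerAvailableAt(host)) {
            platform.migrateNewSquad(host, documentUri);  // pack and migrate a new indexing squad
        } else {
            platform.backupOnFriendMachine(documentUri);  // fall back to a "friend" machine
        }
    }
}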


At the end of the “migration phase”, the URI(s) of resources to be indexed are
received by the filter agent of an appropriate indexing squad. This agent firstly
contacts the media detector agent for understanding both the language in which the
text is written and the format (HTML, XHTML, plain text) of information. With
this data it can configure its internal parser and code filter, and can then extract
the simple textual information required by the annotation agent. The output of
the filtering process depends on the kind of annotation technique adopted by the
annotator agent. As an example, a simple bag of words annotator may require all
the word stems in the document, while a SVM based classifier can require the set of
distinct words which occur in the textual resource, accompanied by a tdf · if weight
that specifies their ability to discriminate the document from the others already
indexed. Whatever design choice is taken, the filter agent and the annotator agent,
in a squad, shall be designed to work in symbiosis, with the same expected formats
and elaborations.
In the end, the annotator agent takes as input the filtered data and builds up
a conceptual spectrum. Such spectrum is then stored in the H-DOSE knowledge
base through a call to the store method of the annotation repository. At last, the
annotator notifies the indexing result (annotations stored or annotation failed) to
the agency manager, which then propagates the same information to the applications
registered with the kernel-level indexing service.
The deep search agent is involved when some particular agreement comes into
being between people running H-DOSE and an organization managing a given site.
In such a case the organization might allow a restricted access to its database (DB)
tables through a certain interface defined in the agreement. A special purpose agent
can therefore be designed and deployed on the machine running the site. This
special agent, the deep search agent, has the access rights and capabilities required
to directly query the site database, extracting not only published information but
also metadata associated to the table structure of the DB. The deep search agent can
be very useful for highly dynamic sites where information changes very quickly and
semantic search services must be provided, being always up-to-date. Under these
conditions, this agent can be used to constantly track the site database, detecting
changes and triggering new indexing cycles whenever needed.

6.1.3 Data-access layer


The data-access layer is deployed at the deepest level of the H-DOSE architecture
and includes all the utility classes used for accessing and manipulating ontologies,
for managing the platform persistence on different database servers and, for locating
resources. Although the technology adopted at this level is not particularly new,
and the contained innovation is pretty low, this layer is a critical component for
the platform. In fact it performs all those tasks that require little intelligence but
that ensure the platform operations by giving access to the business objects, i.e. the
ontology, the resources and the storage of classification data (annotations). Three
main modules are deployed at this level: the ontology wrapper which provides pro-
grammatic access to ontologies either written in RDF/S, DAML+OIL or OWL, the
annotation database that defines a set of high level primitives for the persistent
storage of conceptual spectra, and a document handling sub-system which manages
document-related issues such as fragmentation (although fragmentation is supported
by the platform design, in the current setting documents are only manipulated as
wholes; fragment-level elaboration will be available in the next platform versions),
pre-processing, etc.

6.1.4 Management and maintenance sub-system


The management and maintenance sub-system of the H-DOSE platform has the duty of constantly monitoring the platform status and the responsibility of taking proper actions whenever failures occur. Rather than being a fully comprehensive solution, it is, in its current configuration, mainly focused on resolving semantics-related issues and problems. In other words, although reliability and recoverability features are critical for the platform deployment in real-world scenarios, the current solution is only designed to keep the annotation base coherent and constantly up-to-date. However, to accomplish this task, innovative agent-based techniques are used, stemming from the Autonomic Systems research field.
The basic assumption on which this subsystem works is that the core asset of a semantic platform is the richness of the database in which semantic annotations are stored: the quality of search results directly depends on the amount and quality of the information stored in this semantic index (the Annotation Repository).
When needed, H-DOSE can automatically keep the annotation base up-to-date by autonomously performing searches on the Web. This type of autonomic maintenance has proved to be quite questionable, as emerged from many discussions that took place at the international conferences where the platform has been presented. Personal opinions aside, having a semantic platform that autonomously performs searches on the Web and adds “non-verified” data to its knowledge base can actually be an unwanted behavior, especially for trust-based or mission-critical sites where all the available information must be certified by well-known entities. In order to accommodate these requirements on one side and, at the same time, to support the author's research interests as well as the needs of more collaborative sites, the maintenance sub-system has been designed so that it can be activated or deactivated in the platform configuration phase.
Supposing that the autonomic features have been activated, they work following two self-management paradigms: “uniform topic coverage” and “user triggered knowledge integration”. The “uniform topic coverage” paradigm aims at maintaining a certain degree of uniformity in the topic coverage of the repository. This implies the automatic triggering of focused indexing processes whenever some topics have a low number of annotations. However, this paradigm can only improve the coverage of topics that are already known, since the Annotation Repository only knows of the existence of concepts for which at least one annotation exists. To cover new topics, enrichment processes shall act at a higher logical level, where the ontology is known. A transparent triggering of the coverage of non-annotated concepts can be achieved by monitoring user requests at the Search Engine and by detecting the knowledge areas modeled by the ontology, i.e., areas for which the system is able to provide a conceptual description, that are not covered in the repository due to the lack of annotations. This second mechanism is called “user triggered knowledge integration”.
Uniform topic coverage has been implemented in two ways. The first one is basically a search for the minimum occurrence of topics over the set of stored annotations: all covered topics in the repository are ordered by annotation occurrence, the lowest ten percent is selected as the “low covered set” and provided to a newly designed enrichment agent, which is in charge of triggering the indexing of new, suitable resources. The second way involves some statistical considerations: basic indexes are computed to evaluate the statistical properties of the topic coverage in the repository, such as the mean occurrence value, the variance and the standard deviation. After the evaluation of these figures, a threshold-based algorithm selects all the topics lying more than a given fraction of the standard deviation below the mean occurrence value and triggers an enrichment cycle. The threshold value has been selected manually by performing different experiments and strongly depends on the shape of the topic occurrence distribution, which is near-uniform only if the amount of stored annotations is reasonably high.
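The statistical variant of the selection can be sketched as follows: the annotation counts per topic are summarized by their mean and standard deviation, and topics falling below the threshold mean - k·σ are flagged for enrichment. The class name and the fraction k are illustrative, not the platform's actual values.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of the statistical selection of poorly covered topics (illustrative only). */
public class CoverageMonitor {

    public static List<String> lowCoveredTopics(Map<String, Integer> occurrences, double k) {
        List<String> lowCovered = new ArrayList<String>();
        if (occurrences.isEmpty()) {
            return lowCovered;
        }
        double mean = 0.0;
        for (int count : occurrences.values()) {
            mean += count;
        }
        mean /= occurrences.size();

        double variance = 0.0;
        for (int count : occurrences.values()) {
            variance += (count - mean) * (count - mean);
        }
        variance /= occurrences.size();
        double stdDev = Math.sqrt(variance);

        // topics lying under the threshold trigger an enrichment cycle
        double threshold = mean - k * stdDev;
        for (Map.Entry<String, Integer> entry : occurrences.entrySet()) {
            if (entry.getValue() < threshold) {
                lowCovered.add(entry.getKey());
            }
        }
        return lowCovered;
    }
}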
Statistically speaking, this technique tends to transform the topic coverage distribution from an unknown, non-uniform one into a uniform distribution in which all topics have the same occurrence value. To implement “user triggered knowledge integration”, the Search Engine service is constantly monitored by the maintenance sub-system, which tries to dynamically follow changes in user habits and interests and to discover modeled topics currently not covered by annotations. When a query is issued to the search service, the results are monitored in terms of relevance. If the retrieved resources have relevance weights under a given threshold or, worse, if no resources can be provided as a result due to a lack of annotations in the Annotation Repository, the enrichment agent, located in the maintenance sub-system, forces a new indexing cycle focused on the uncovered conceptual areas.
Embedding self-monitoring and self-optimization functions into the core services of H-DOSE requires the design and development of a new agent, called the Enrichment Agent, for managing the intelligent update of the Annotation Repository. This new agent shall be able to discover new information for semantic indexing and to understand the specification of a conceptual domain, consequently focusing the resource selection in order to reach a satisfying annotation coverage with respect to that area. In other words, the enrichment agent accepts as input a set of concepts, performs some internal operations, which may involve collaboration with other agents, and provides as output a set of URIs that, when indexed, should generate annotations covering the topics received as input.

Figure 6.16. The sequence diagram of autonomic features in H-DOSE.

In normal operating conditions the enrichment agent stays idle, monitoring the behavior of services such as the Annotation Repository and the Search Engine. If a critical situation is detected, the agent extracts from these services a list of concepts whose coverage should be enhanced and contacts the Synset Manager agent, in the indexing sub-system, to find the lexical entities associated with such topics. It subsequently composes textual queries to be issued to classical, text-based search engines. Once the textual queries have been composed, two concurrent processes start: one interacts with the Agency Manager in order to trigger incremental indexing on already known sites, and the second interacts with search web services (the Google web API [9], as an example) in order to retrieve a list of possibly relevant URIs. At the end of these processes the Enrichment Agent performs some filtering on the retrieved URIs, discarding resources it cannot understand, such as “pdf” and “doc” files; in the current setting H-DOSE can in fact only support HTML, XHTML and plain text. After the filtering process, the agent composes a list of resources to be indexed, identified by their URIs, and sends an indexing request to the indexing sub-system, which subsequently performs the semantic annotation and updates the Annotation Repository; Figure 6.16 shows the corresponding sequence diagram.
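The extension-based filtering step described above can be sketched as follows; the class and method names are illustrative and are not taken from the Enrichment Agent code.

import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of extension-based URI filtering (illustrative only). */
public class UriFilter {

    /** Drops URIs pointing to formats the platform cannot parse (e.g. .pdf, .doc). */
    public static List<String> keepSupported(List<String> uris) {
        List<String> supported = new ArrayList<String>();
        for (String uri : uris) {
            String lower = uri.toLowerCase();
            if (lower.endsWith(".pdf") || lower.endsWith(".doc")) {
                continue; // unsupported binary formats
            }
            supported.add(uri); // HTML, XHTML and plain text are assumed to be parseable
        }
        return supported;
    }
}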

6.2 Application scenarios


This section provides some examples of how the platform works when indexing or search requests are received. The well-known representations of sequence diagrams and interaction diagrams are adopted to explain the various operations involved in the two working scenarios.

6.2.1 Indexing
The indexing scenario is characterized by many interesting aspects because it involves several advanced techniques such as focused semantic indexing, code mobility, deep search and collaboration. Whenever a web resource or a set of resources must be indexed by the platform, the indexing process starts, performing several steps that end with the addition of new knowledge to the platform KB (Figure 6.17).
First, when a set of resources, identified by their URIs, is scheduled for indexing, the agency manager agent divides the URIs by location: all resources published on the same Internet site become sets to be indexed by possibly different indexing squads. After this “by site” separation, the manager agent checks, for each location, whether an existing indexing squad is available. If so, the corresponding set of resources is passed to the remote squad for “in site” indexing; if, instead, the location does not host an indexing squad but offers a suitable agent container, a new squad is created and sent over the Web toward that location. In the case of sites not offering agent containers, squads are migrated to a set of “friend” hosts in order to balance the platform workload.
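The “by site” separation amounts to grouping URIs by host, as in the following sketch; the class name is illustrative and error handling for malformed URIs is omitted.

import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the grouping of URIs by publishing site (illustrative only). */
public class SiteGrouper {

    public static Map<String, List<String>> groupByHost(List<String> uris) {
        Map<String, List<String>> groups = new HashMap<String, List<String>>();
        for (String uri : uris) {
            // each group can then be handed to the squad deployed on that host
            String host = URI.create(uri).getHost();
            List<String> group = groups.get(host);
            if (group == null) {
                group = new ArrayList<String>();
                groups.put(host, group);
            }
            group.add(uri);
        }
        return groups;
    }
}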
Once each indexing squad has been launched, the agents composing the squad start to collaborate in order to perform the semantic classification of the required resources: if the web site under classification uses a known site-wide search engine, the squad directly interfaces with such engine by means of a “deep search” agent, accessing information stored in the site database and thus providing classification of resources lying in the so-called deep web. Otherwise normal indexing is performed.
The results of the semantic classification process are conceptual spectra, which are sent back to the agency manager; this, in turn, calls the Annotation Repository service to persistently store the extracted semantic information.


Figure 6.17. Interaction diagram for the indexing task.

At present, the indexing squads are able to classify every resource whose conceptual domain overlaps the conceptual domain of the H-DOSE ontology. However, some improvements can be foreseen. For example, it is possible to design a new indexing squad able to perform so-called “focused indexing”. In focused indexing, a site is first semantically analyzed to identify which part of the H-DOSE ontology its resources are relevant to; then, a light-weight indexing squad, able to work only on the identified ontology subset, is migrated to that site. The advantage of this operation is to reduce both the computational load imposed on host machines for performing the indexing process and the amount of information that must transit over the network between the agency manager and the indexing squads. This advantage is balanced by a significant increase in the complexity of managing the site semantic characterization and the focused squads, and of designing effectively cooperating agent colonies that avoid code replication on remote sites.

6.2.2 Search
The search scenario (Figures 6.19, 6.20) is organized as follows: an external application requests a search on the platform knowledge base by interfacing with the Search web service and by specifying a proper set of concepts. The Search service first contacts the Expander service to obtain an extended version of the query spectrum. It then communicates with the Annotation Repository service to retrieve relevant annotations (i.e., document spectra). The Annotation Repository takes the query spectrum given by the Search service and searches the database in which annotations are stored for relevant matches. If there are suitable resources, i.e., resources annotated as relevant with respect to the concepts occurring in the query conceptual spectrum, the Annotation Repository service provides the set of retrieved annotations to the Search service. Otherwise, if no resources are available, the Annotation Repository service throws an error that is caught by the Search service.

Figure 6.18. Sequence diagram of the indexing task.

In the former case, i.e., when resources are available, the Search service checks the similarity of the retrieved spectra with the original user query (not the expanded one). Then it ranks the results according to the computed similarity values and returns to the caller application a list of URIs, as long as specified by the application in the search request.
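A similarity check between conceptual spectra can be sketched as below, using cosine similarity over concept weights; the measure is chosen here only as an example, since the thesis does not mandate a specific one, and the class name is illustrative.

import java.util.Map;

/** Illustrative similarity between two conceptual spectra (concept URI -> relevance weight). */
public class SpectrumSimilarity {

    public static double cosine(Map<String, Double> query, Map<String, Double> document) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : query.entrySet()) {
            Double docWeight = document.get(entry.getKey());
            if (docWeight != null) {
                dot += entry.getValue() * docWeight;
            }
        }
        double norms = norm(query) * norm(document);
        // empty spectra are considered unrelated
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Double> spectrum) {
        double sum = 0.0;
        for (double weight : spectrum.values()) {
            sum += weight * weight;
        }
        return Math.sqrt(sum);
    }
}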
In the second case, when no resources are available, the Search service manages the error by providing an empty list to the caller application. At the same time, it notifies the maintenance and management sub-system, and in particular the enrichment agents, that some concepts of the H-DOSE ontology are not covered by annotations. This notification, in turn, triggers the autonomic features of the platform, which start a new indexing cycle that possibly “repairs” the source of the search error. Figure 6.16 shows how the H-DOSE autonomic features achieve this result.


Figure 6.19. The search interaction diagram.

Figure 6.20. The sequence diagram of the search task.


6.3 Implementation issues


The H-DOSE platform has been fully implemented in Java, both for the web services and for the collaborative agents; the latter have been developed by adopting the JADE framework [16].
The JADE software framework has been developed by TILAB (formerly CSELT) and supports the implementation of multi-agent systems through a middleware that claims to be fully compliant with the FIPA specifications [17], in order to inter-operate with other FIPA-compliant systems such as Zeus (British Telecom) [18] and FIPA-OS [19]. JADE agents are implemented as Java threads and live within Agent Containers that provide the runtime support for agent execution and message handling. Containers can be connected via RMI and can be both local and remote; the main container is associated with the RMI registry. Agent activities are modeled through different “behaviors” and their execution is managed by an agent-internal scheduler.
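To ground this description, the following minimal JADE agent registers a cyclic behaviour that waits for ACL messages and answers each with an INFORM reply; it is a generic skeleton illustrating the framework's programming model, not code taken from H-DOSE.

import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;

/** Minimal JADE agent: one cyclic behaviour that echoes incoming messages. */
public class EchoAgent extends Agent {

    protected void setup() {
        addBehaviour(new CyclicBehaviour(this) {
            public void action() {
                ACLMessage msg = myAgent.receive();
                if (msg == null) {
                    block(); // suspend the behaviour until a new message arrives
                    return;
                }
                ACLMessage reply = msg.createReply();
                reply.setPerformative(ACLMessage.INFORM);
                reply.setContent("received: " + msg.getContent());
                myAgent.send(reply);
            }
        });
    }
}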
The execution environment in which the Web Services are deployed and run is the Apache Tomcat servlet container [20], complemented by the Apache Axis SOAP engine [21]. They are both developed in an open and participatory environment, as part of the Jakarta project [22], and released under the Apache Software License. The description language used to publish Web Services is standard WSDL, which is automatically supported by the Axis SOAP engine.
H-DOSE modules are pure-Java services; the newly developed functionalities are released under the LGPL license. Many libraries are used by the platform services:

• the ontology wrapper uses the HP Jena library [23] for accessing RDF/S [4], DAML and OWL [5] ontologies (a minimal access sketch follows this list);

• the persistent storage module uses the PostgreSQL JDBC driver [24] for interfacing with the PostgreSQL database server in which spectra are stored;

• the document-handling service, which is currently under development, uses the Saxon API [25] and the JTidy API [26][27] for performing the pre-processing of textual resources;

• the expander service is powered by the JGrapht API [28], which handles all the issues related to modeling the H-DOSE ontology as a directed, weighted graph;

• the Sun JAX-RPC [29] library is used by all web services to implement communication methods, while all the platform agents use the JADE library to implement their behaviors and communication methods.
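As a minimal illustration of the kind of ontology access performed through Jena, the sketch below loads an OWL model and lists its named classes; the file location is a placeholder and the ontology wrapper's real interface is not reproduced here.

import java.util.Iterator;

import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.rdf.model.ModelFactory;

/** Loads an ontology with Jena and prints the URIs of its named classes. */
public class OntologyDump {

    public static void main(String[] args) {
        OntModel model = ModelFactory.createOntologyModel();
        model.read("file:ontology.owl"); // placeholder ontology location

        for (Iterator it = model.listClasses(); it.hasNext();) {
            OntClass c = (OntClass) it.next();
            if (!c.isAnon()) {
                System.out.println(c.getURI());
            }
        }
    }
}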


H-DOSE is currently at its second beta release, namely “hdose2.1”, and is publicly available on the sourceforge.net site³. The platform is a constantly evolving open source project: at present, two main streams of research are being pursued. The first one concerns the evolution of the H-DOSE platform for multimedia applications, while the second analyzes the requirements and modifications needed for the platform to be applied in the field of competitive intelligence. Some third-party organizations have already expressed their interest in experimenting with the platform; in particular, the IntelliSemantic company is actively working on applying the platform to semantics-aware search systems in the field of patent management and discovery.

³ Sourceforge is one of the most famous open source software repositories available on the web; it can be reached at http://www.sourceforge.net. The H-DOSE web site, on Sourceforge, can be found at http://dose.sourceforge.net

Chapter 7

Case studies

This chapter presents the case studies that constituted the benchmark of the H-DOSE platform. Each case study is addressed separately, starting from a brief description of the requirements and going through the integration design process, the deployment of the H-DOSE platform and the phase of results gathering and analysis.

As part of the work presented in this thesis, the H-DOSE platform has been used to add semantic search functionalities to four already deployed web applications. The first is a legacy publication system written in PHP and owned by the Passepartout service of the city of Turin. The second is a well-known and widely adopted e-Learning environment: Moodle [1]. The third is the case-based e-learning environment developed in the context of the European project CABLE [3], and the last is a transparent search engine built on top of the Muffin intelligent proxy [2]. The following sections address the integration of semantic functionalities into these applications using H-DOSE as the semantic back end.

7.1 The Passepartout case study


The Passepartout service is a service of the Turin municipality that provides information about disability aids, norms and laws in the Turin metropolitan area. To accomplish its informative mission, Passepartout publishes and maintains a web site where disabled and elderly people can find relevant information, allowing them to interact effectively with the public health institutions and to access the benefits and services they are entitled to.
The Passepartout information system is based on the principle of distributed editing and separates the process of content editing, done by journalists, from the review and publication processes, done by redactors. The whole work flow, starting from content creation to the final publication on the web, is managed and supported by a legacy system written in PHP.
In the context of a collaboration between the author's research group and the Passepartout service, this publication system has been integrated with semantic functionalities by means of the H-DOSE platform. Currently, the semantic functionalities supported by the Passepartout web site include the manual classification of published documents, the search by category and the semantic what's related (requirements 4, 5, 6 and 8). The integration has been deployed as follows: first, the legacy system has been extended to support the exchange of SOAP messages with the semantic platform. A previously built module for SOAP communication has been included in the system libraries, and a new PHP module for managing the communication with the semantic platform has been developed. This last module is basically composed of a set of function wrappers for the services offered by H-DOSE and adds, where needed, the intelligence required to perform more complex tasks.
Secondly, the template for publishing pages has been extended to include the semantic what's related functionality: each published page has been integrated with a link pointing to semantically related pages. When a user clicks that link, the integrated system interacts with H-DOSE through the what's related search function, which takes as input a resource URI and provides as result a set of ten other resources that are conceptually similar to the starting one.
As a third step, a semantically populated directory has been built in PHP; such a directory is quite similar to classical directory systems such as Yahoo!'s or the dmoz.org open directory. However, resources belong to categories depending on their conceptual description; as a consequence, they can occur in different category branches at the same time. Categorization is, in fact, directly related to the ontology that defines the knowledge domain in which the system works: disability, in this specific case.
From the implementation point of view, the category tree for this test case has been built off-line and then included in the Passepartout system. This choice is only related to performance issues: building the tree at runtime means that a full navigation of the ontology has to be performed every time the category view is required. Since the ontology is usually large (in this small scenario it already includes more than 80 concepts and 20 different relationships), the page composition time quickly grows too high and does not satisfy usability criteria.
Finally, the Passepartout publication system has been modified to support the conceptual classification of pages. This last modification has little or no impact on the publication process. As the system originally included a site-wide search engine based on manually specified keywords, the publication interface can be fully retained, while the keyword choice is now limited to the set of ontology concepts. This simple solution keeps the interference of the introduction of semantics with the usual work flow negligible, thus reducing the time required for the integration of the new functionalities and, as a consequence, reducing possible complaints about the system complexity and usability. Figure 7.1 shows the final deployment of the integrated system.

Figure 7.1. The passepartout system.

7.1.1 Results
In order to extract relevant information about the effectiveness of the proposed approach to semantics integration, a full test plan has been set up that involves three different test groups. The groups are devoted, respectively, to the evaluation of the conceptual representation of the Passepartout domain and of the semantic functions newly introduced into the publication system; to the evaluation of the performance of the semantic modules in terms of standard measures such as precision, recall and F-measure; and to the evaluation of the effort required to use the new concept-based interfaces, i.e., the usability evaluation.
The first tranche of tests has been called “ontological tests” and involves three different sessions. In the first session, the ontology, developed in collaboration with the Passepartout staff (about assistive services for disabled people; 80 concepts and 20 different semantic relationships), is tested for completeness, i.e., for its ability to cover the Passepartout conceptual domain. In this test at least six different redactors are required to conceptually classify a set of ten pseudo-randomly selected pages; a minimal amount of overlap between the pages tested by different operators is granted in order to obtain suitable and comparable results. For each page the classification is captured on paper sheets and the redactors' opinions are collected too. The aim of the test is to analyze whether or not the ontology is complete enough to cover the contents published by the Passepartout service.
The second session aims at identifying inconsistencies in the ontology model by performing focused classification tests, i.e., classification of resources belonging to the same conceptual area at different granularities. As in the first session, at least six redactors perform the test and are required to provide both the page classifications and opinions about the ability of the ontology to describe resource contents.
The second session differs from the first one in aims and scope: while the first session evaluates whether, given a uniform distribution of contents, the resulting classification is also uniformly distributed, the second session evaluates whether, given a general topic and a uniform distribution of content granularity, the resulting classification is uniformly distributed in depth, i.e., whether the classification results are uniformly distributed among the nodes descending from the given general topic in the ontology hierarchy.
The last session has the objective of identifying the ontology areas that are poorly modeled, i.e., the conceptual areas in which there are multiple collisions of resource classifications. By collision we mean the coexistence of several annotations between indexed resources and a given concept. If the number of collisions significantly exceeds the ontology mean value, it is likely that the ontology has a modeling problem for the concept under examination, for example because multiple, different concepts have been modeled as a single one.
The second group of tests is aimed at evaluating the effectiveness of the semantic search functionalities in terms of precision (p) and recall (r) and of their combination in the F-measure, F = 2·p·r/(p + r).
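As a purely numerical illustration with hypothetical values, p = 0.8 and r = 0.6 give F = 2·0.8·0.6/(0.8 + 0.6) = 0.96/1.4 ≈ 0.69; being a harmonic mean, the F-measure is pulled toward the lower of the two figures rather than sitting at their simple average.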
As in the previous case, three sessions are involved, referring to the “what's related” function for the first session and to the directory search for the last two. In the semantic “what's related” test, the semantic system is checked for both search and classification effectiveness: first, the degree of similarity between the conceptual descriptions of the starting page and of the retrieved pages is evaluated; secondly, the correctness of the page annotations is evaluated.
In this test the operators involved are domain experts, the redactors of the Passepartout site; therefore, although the evaluation still remains subjective, it is supposed to be significant. The test is deployed as follows: each operator is provided with a set of 5 pages extracted pseudo-randomly from a test set composed of 20 different pages. He/she is required to browse the Passepartout site, to reach the pages in the set and to select, for each of them, the “related pages” link that activates the semantic “what's related” functionality. For each page the operator shall compare the starting page and the retrieved pages. The similarity between pages shall then be evaluated and reported on a proper test sheet. On the same test sheet, the operator is also required to report his evaluation of annotation correctness, i.e., to report whether pages are correctly classified according to his knowledge of the domain.
The directory search test is further subdivided into two steps. In the first step, the ability of the system to retrieve all the relevant information, and nothing else, is tested. This case represents a quite unusual operation scenario that is mainly useful for testing the platform functionality: the user can only select a concept in the site directory and the system is required to retrieve all the resources indexed as relevant to that concept (in this test the expander module of the H-DOSE platform is disabled).
Theoretically, assuming that the classification process has been performed well, all relevant resources, and only those, should be retrieved; in terms of precision and recall, both values should be 100%. However, the classification process is never perfect, so the retrieved results will have lower precision and recall values, which will be closer to the maximum the better the manual classification has been performed.
On the other hand, assuming that the classification is “perfect”, since it is provided by domain experts, this test allows possible problems in the platform operations to be detected. In practice, the test involves 10 operators, each required to perform 3 predefined searches (different for each operator). The search results are reported by the test operators on proper test sheets, and the collected results are then elaborated and organized.
In the second step, the directory search is tested to its full potential: the test operators are allowed to select as many concepts in the directory as they like. The initial concept specification is then expanded by the H-DOSE platform using the implicit knowledge modeled by the platform ontology; finally, resources are compared to the resulting conceptual specification, ranked, and provided back to the user.
The test involves ten different operators; each of them is asked to perform a given search task, described by means of a goal statement, and to evaluate the relevance of the retrieved pages. Recall, precision and the F-measure are evaluated on the basis of the evaluation results collected from all the operators.
Finally, the last group of tests is designed to evaluate the additional effort required of content editors (journalists) and redactors to include the semantic information needed for the site operations. This kind of evaluation has been performed as a set of interviews with the Passepartout site crew, who have been required to use the integrated system for a month. It is important to notice that the usual publication load is about four new pages published per day and per redactor, which means a total of around 500 operations in a month.
Unfortunately, as is easily noticeable from the absence of tables and graphs, the envisioned time frame for the evaluations has been widely exceeded and, at present, the first test phase is not yet completed; the other phases are still to be started.


These problems are not related to the platform deployment; in that respect, some preliminary results have been collected which seem to indicate a relatively simple integration. Instead, the main problems are related to communication issues between the research group and the Passepartout crew and to the coincidence of different events (such as the 2006 Winter Olympic Games, held in Turin) that prevented the timely execution of the test program. Nevertheless, results are slowly being collected and, once complete, they will be submitted, as a paper, to an international journal on semantic web technologies and applications.

7.2 The Moodle case study


Moodle is a course management system (CMS): “a free, open source software package designed using sound pedagogical principles, to help educators create effective on-line learning communities”. The design and development of Moodle is guided by a particular philosophy of learning, a way of thinking usually referred to, in shorthand, as a “social constructionist pedagogy”. The courses provided by using Moodle as the e-Learning environment can be about whatever topic a teacher deems valuable and interesting for the students enrolled in the course. However, the availability of a semantic classification of the course contents is really helpful both for teachers, for example by providing support for the automatic definition of different learning paths, personalized for each student according to the learning tasks already accomplished, and for students, who can easily access the courses about the topics they are interested in.
This motivation has fostered the application of the H-DOSE approach to semantics integration to the Moodle CMS, with the intent to demonstrate, on one side, the general applicability of the proposed solutions and, on the other side, the added value that semantics integration can bring to an e-Learning environment such as Moodle. The integration process was deployed as described in the following paragraphs.
First, the design principles lying at the basis of the Moodle system, especially those defining the way Moodle handles course information, have been analyzed in order to understand how semantics could be integrated into the environment. As a consequence, a new PHP page has been introduced into the system, containing all the relevant queries to the Moodle database for extracting the course contents to be subsequently classified using the H-DOSE platform.
Then a new Moodle module allowing the automatic classification of courses has been introduced, following the guidelines provided by the Moodle authors for the development of new modules. Such a module allows, by simply clicking a button, all the resources available for a given course to be indexed with respect to a given knowledge domain. The knowledge domain is the one defined by the ontology used by the H-DOSE platform for semantic classification.
When a course teacher selects the “semantic classification” button on the administrative interface, the Moodle module queries the course database, by means of the aforementioned PHP page, and sends that information, properly formatted as text, to the H-DOSE platform, which performs the final semantic indexing. From this moment on, the course is conceptually classified. Users are then provided with a new search function that integrates the ones already available in Moodle and closely resembles the “search by category” interface of the Passepartout service.
Some qualitative tests have been performed about the integrability of H-DOSE into web applications, evaluating the effort required for such a process in terms of the man hours needed to develop the Moodle semantic module; Table 7.1 shows the results.
Phase                            Man hours
Moodle DB analysis               40 hr
PHP query page                    4 hr
Semantic Classification module    6 hr
Search by category interface      8 hr
Test & debug                     20 hr
TOTAL                            78 hr

Table 7.1. Man hours required for the Moodle semantic integration.

Moreover, the advantages and improvements that semantics inclusion brings to an e-Learning environment are also under evaluation. These tests are only in a very preliminary phase; therefore, they are neither sound nor valuable as a demonstration and are not reported here. However, they are useful for capturing a taste of what services could be offered for learning environments, and for a subsequent phase of requirements gathering and analysis, possibly leading to improvements in the H-DOSE platform.

7.3 The CABLE case study


The primary goal of the CABLE project is to develop methodologies enabling e-learning tools that support educational operators, e.g. learning facilitators, primary and high-school teachers, university professors, schools for social operators, etc. The specific characteristics of this learning group prevent the adoption of classical e-learning environments: educational operators need a learning approach extensively based on growing personal experience (recall and elaboration), knowledge of other parallel experiences (exchange, communication, comparison) as well as the incorporation of theoretical studies and contributions (continuous update, life-long learning).
The e-learning system should facilitate and encourage students in a personal elaboration of the learned material, to achieve a higher degree of awareness in their professional actions. The CABLE project addresses this problem by exploiting a learning approach based on the availability and exploitation of a pre-existing, extensive archive of case studies, which can be termed a community memory, sometimes referred to as an organizational memory. This community memory is, by its nature, extensible and dynamic.
CABLE postulates that personal knowledge and skills (critically, in this domain, including intervention skills) may be improved only through the study and comparison of other people's experiences, interpreted in their specific context. The full acquisition of new knowledge can be strengthened by seeing linguistic and cultural differences as resources rather than barriers.
The learning approach is therefore based on the development of synthetic, composable didactic modules built on a narrative language structure and linked to more analytical in-depth material composed of theoretical contributions, normative references, context descriptions, etc. Each didactic module is associated with a set of real-world, on-going case studies that will be used as teaching examples, as material upon which to develop and sharpen interpretation abilities, as contact points for social interactions, and as a basis for self-assessment and self-evaluation.
The e-learning system supporting the methodology proposed by CABLE is built around two core entities, namely the case studies and the didactic modules. Users of the system are students, authors of didactic modules, and contributors of new case studies. The experiences and implicit knowledge in case studies, and the explicit knowledge contained in the didactic modules, need to be handled in an intelligent way by the system, in order to discover relationships, shared concepts, learning paths, etc. As a consequence, case studies and didactic modules are categorized by formal metadata, resorting to domain-specific ontologies and hierarchical semantic-network conceptualizations. The classification of case studies and didactic modules is a dynamic process, influenced by interactions and feedback with users.
The on-line courses therefore shall adapt themselves automatically to new case studies, to newly available didactic modules, to improved classification or description of existing case studies, and to emerging common practices. Case studies may be textual or multimedia.

7.3.1 System architecture


CABLE is both a project and a framework supporting the learning methodologies experimented with and developed during the project execution. The framework has a well-defined ICT infrastructure, which reuses already available, effective solutions as much as possible, avoiding reinventing the wheel. Three main components compose the basic structure of the CABLE architecture: an e-Learning environment, a repository of case studies and good practice examples, and a semantic module able to leverage the formal metadata associated with both learning objects and case studies for composing and discovering associations between courses and good practice examples. As the domain of application of the CABLE framework requires users and teachers to grow their experience by comparing and sharing similar case studies and solutions, the semantic module has the responsibility of automatically establishing correspondences between new learning paths and existing case studies, as well as the capability to correlate, at runtime, newly added case studies with the already existing cases and learning modules. Figure 7.2 shows the logical organization of the CABLE framework.

Figure 7.2. The CABLE framework logical architecture.

The VLE module is the Bodington [30] learning environment, a cutting-edge, open source e-Learning system widely adopted by UK universities (e.g., UHI [31]); it is entirely developed in Java and runs on top of the Apache Tomcat servlet engine. The case studies repository, instead, is a Java web application developed from scratch during the project execution. Finally, the semantic module is implemented by a customized, minimal version of the formerly introduced H-DOSE platform, named mH-DOSE (minimal H-DOSE).
The basic interaction flow is designed to be twofold, i.e., to support two different information needs: finding case studies from a well-defined learning resource in the VLE, or finding case studies relevant with respect to a significant example (search for related case studies). In Figure 7.2 these two operational paradigms correspond, respectively, to the leftmost and to the rightmost user interfaces. As can easily be noticed, both processes are mediated by the semantic module, which lies in the middle.


The two interaction paradigms can also be described in the form of interaction diagrams. In the first case (Figure 7.3), a user, or a teacher, uses the VLE to participate in some learning activity. At some point in the e-learning process, case studies shall be analyzed to better understand how to tackle a given pedagogical scenario. As the CABLE framework hosts many case studies provided by several entities, in a Europe-wide environment, the resources relevant to the learning module are extracted from the case repository at runtime. Among other advantages, this allows newly added knowledge to be automatically taken into account, in a transparent way. Therefore, following the user request, the VLE asks the semantic module for relevant case studies, providing, at the same time, a conceptual description of the learning object viewed by the user. The semantic module retrieves from the case studies repository all descriptions of case studies that match, at least partially, the VLE specification. Then, by applying ontology navigation techniques, it ranks the retrieved results and provides back to the VLE a list of URLs of good practice examples. The VLE retrieves the case studies and presents them to the user in a convenient way.

Figure 7.3. The “VLE to case studies” interaction diagram.

In the second scenario, instead, the user is already accessing the case studies repository, for example to consult a well-known solution to a specific pedagogical issue. Having read how the solution worked and in which scenario it was applied, the user might want to find out whether the just-learned approach has been successfully applied to other, similar situations. He/she selects the “related case studies” button on the user interface to retrieve similar cases. Pressing the button causes, inside the repository, the retrieval of the conceptual description of the case currently viewed. Such a description is passed to the semantic module which, in turn, extracts from the repository a set of candidate case studies, on the basis of the initial semantic specification, taking into account ontology relationships and concepts. Then, as in the former scenario, the semantic module ranks the retrieved results and provides a list of “relevant” URLs to the repository application. The repository retrieves and organizes the relevant case studies and presents them to the user as a result.

Figure 7.4. The “case study to case studies” interaction diagram.

As can easily be noticed, in both cases there are no predefined matches between case studies, or between case studies and learning objects. Instead, they are discovered at runtime by comparing the respective conceptual descriptions. As the comparison is ontology-driven, non-explicit associations can easily be discovered, thus leveraging the power of semantics to provide conceptually relevant results (which are hopefully more relevant than the ones that would be extracted by applying simple keyword matching techniques).

7.3.2 mH-DOSE

In order to implement the functionalities required by the semantic module of the CABLE framework, the H-DOSE platform has been strongly customized, removing all the functionalities and modules that were not required. This customization resulted in a light-weight platform using only the search engine and the expander module. These two modules have not been modified, since they already provided the required functionalities. In particular, to support the CABLE search processes, the “search by concept” function has been adopted, which uses expanded spectra for resource retrieval and query spectra for resource ranking. What changes, with respect to the complete H-DOSE, is that in CABLE the search engine is directly interfaced with the case studies repository. However, this aspect has not required any modification to the original platform modules either, as the interface provided by the case studies repository has been designed to be perfectly compatible with the one offered by the annotation repository in H-DOSE.

Figure 7.5. The mH-DOSE platform adopted in CABLE.

The overall adaptation process took no more than one day for the adaptation itself and one day for testing, thus demonstrating the easy integrability of the H-DOSE platform into other web applications.


7.4 The Shortbread case study

The system described in this section offers a semantics-based what's related functionality whose presence remains hidden from the user, at least as long as no relevant information can be found. In order to understand whether this goal can be reached and how to effectively tackle the problems related to the semantic retrieval of related resources, the web navigation process has been analyzed. The interaction scenario is as follows: the user surfs the web using a browser. For each page requested by the user, the browser contacts a given server or a given proxy in order to fetch the proper content. Then it interprets the received page and shows the content, properly formatted, to the user (Figure 7.6).

Figure 7.6. Simple access to web resources.

As can easily be noticed, the interaction scenario involving an HTTP proxy (Figure 7.7) is very well suited for introducing transparent functionalities into the user navigation process. The proxy, in fact, intercepts both user requests and server responses, and can be exploited as the access point for the semantic what's related system. For each incoming request, the page returned by the web server is analyzed, semantically classified if necessary, and its semantic description is extracted. The description is, in a sense, a snapshot of the user's wishes in terms of information needs, at the semantic level. Such a snapshot is combined into a user model that, under certain conditions, can drive the retrieval of related pages, i.e., of pages semantically similar to the user needs as modeled by the proxy. Every time a new request from the user's browser is received, the user model is updated and, possibly, the related pages are retrieved. Then, a new page is automatically composed as the sum of the requested page and of a list of suggested resources, and it is finally sent back to the user's browser.
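One possible way of folding page snapshots into the user model is an exponentially weighted combination of conceptual spectra, as in the sketch below; the decay factor and the representation are assumptions made for the example, not the User Profile's actual policy.

import java.util.HashMap;
import java.util.Map;

/** Illustrative user model built by blending the conceptual spectra of visited pages. */
public class UserModel {

    private final Map<String, Double> profile = new HashMap<String, Double>();
    private final double decay = 0.8; // weight kept by the old model at each step (assumed)

    /** Updates the model with the conceptual spectrum of the page just visited. */
    public void update(Map<String, Double> pageSpectrum) {
        // fade the current interests
        for (Map.Entry<String, Double> entry : profile.entrySet()) {
            entry.setValue(entry.getValue() * decay);
        }
        // reinforce the concepts found in the new page
        for (Map.Entry<String, Double> entry : pageSpectrum.entrySet()) {
            Double old = profile.get(entry.getKey());
            double base = old == null ? 0.0 : old;
            profile.put(entry.getKey(), base + (1.0 - decay) * entry.getValue());
        }
    }

    /** Returns a copy of the current model, usable as a query spectrum. */
    public Map<String, Double> snapshot() {
        return new HashMap<String, Double>(profile);
    }
}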
The user can be virtually unaware of the search system located in the proxy and can, in principle, think that the received pages are exactly the ones he/she requested by typing a URL in the browser or by clicking a link on a web page. In such a case the what's related system is actually “transparent”.


Figure 7.7. Proxy-mediated web access.

7.4.1 System architecture


The proposed system is logically organized into two main functional blocks, as can easily be inferred from Figure 7.8: an intelligent proxy and Shortbread, a semantic what's related system. The proxy is a preexisting proxy available under an open source license and does not constitute a novelty point; the novel contribution, instead, can be found in the way the proxy has been used and in the Shortbread system.
In the proposed setting, rather than working as a simple relay, the proxy works as a programmable switch: when Shortbread has no information to propose to the user, the proxy acts nearly as a normal proxy, simply relaying information and spilling out the URLs of the requested pages (3). These URLs are forwarded to the underlying semantic what's related system. Otherwise, if Shortbread is able to find information which is semantically related to the user needs, as extrapolated from the user navigation, the proxy acts like a sort of mixer, adding to the page coming from the remote server (2) a list of possibly relevant resources (4) that the user can choose during his/her navigation.
The core of the proposed approach is clearly Shortbread. This semantic what's related system basically includes four different modules: the User Profile (U.P.), the Semantic Information Retrieval system (S.I.R.), an XML-RPC to SOAP gateway and a semantic platform for managing the knowledge acquired during the web navigation (H-DOSE).
The User Profile extracts and stores information about the user navigation, in order to perform searches for information semantically related to the user needs. Practically speaking, for each page requested by the user through his browser, the U.P. obtains the corresponding URL and tries to extract the semantic characterization of the resource using the S.I.R. If the URL has already been classified (and its description stored in the H-DOSE platform), the S.I.R. provides to the U.P. the semantic description of the resource in the form of a “conceptual spectrum”. Otherwise, the resource is inserted in the classification queue of the H-DOSE platform.
Figure 7.8. The Shortbread architecture.

Whenever a semantic characterization for the current page is available (i.e., the resource description was already present in H-DOSE), the U.P. updates its internal user model by adding the knowledge related to the current page. This process allows the system to incrementally extrapolate the user information needs, while the model becomes more and more accurate at each new navigation step. Clearly, the user model is accurate only if the user navigation is longer than a given number of pages. The U.P. therefore has some policies to prevent the system from making suggestions when the model is poorly accurate, avoiding providing disturbing information to the user. Once the model becomes accurate enough, the User Profile can query the Semantic Information Retrieval system for resources semantically related to the user model. In other words, a snapshot of the user model (i.e., a conceptual spectrum) is passed to the S.I.R. which, in turn, requires H-DOSE to provide the URIs of those resources whose conceptual spectrum is reasonably similar to the received user model. These URIs are then forwarded to the proxy, which mixes the suggested information with the currently elaborated page, thus providing to the user a complete page composed of the original content plus the newly generated links to related pages.
Referring to Figure 7.8, the arrow labeled (4) therefore exists only when the user model is accurate enough and there are resource descriptions available in the H-DOSE platform. Otherwise only arrow (5) exists, meaning that the proxy is actually working as a simple relay.
The Semantic Information Retrieval system is basically a query generator for the H-DOSE semantic platform. On the one hand, it has the ability to compose reverse queries, i.e., queries that, starting from a simple URL, request the semantic description associated with the page identified by that URL. On the other hand, it has the capability to take as input a conceptual spectrum, which encodes the model of user needs, and to perform a search, using H-DOSE, for those resources having similar conceptual descriptions. Finally, the S.I.R. also has the ability to trigger H-DOSE indexing whenever a reverse query fails, meaning that the page currently transiting through the proxy has not yet been classified.
The XML-RPC to SOAP gateway is a protocol translator allowing simple connections between the Semantic Information Retrieval system, which uses XML-RPC, and the H-DOSE platform, which is based on web services and uses SOAP as communication protocol.

7.4.2 Typical Operation Scenario


The Shortbread system has been conceived as a personal proxy: the system is, in other words, designed to work on the user's PC, in the background. This hypothesis has several advantages and also allows some simplifications in the system design. Shortbread has mainly one functionality, i.e., to provide links to resources semantically related to the user navigation, which, in some sense, is assumed to be an indicator of the user information need. To operate in a semantic fashion it needs an ontology that defines the knowledge domain in which the user performs searches. Such an ontology is a foundational block of the H-DOSE platform, allowing the semantic indexing and retrieval of resources.
Designing the what's related system to work on the user's PC has made it possible to completely ignore the issues related to user permissions, authentication, ontology switching and so on. For each user of a given PC, a specific instance of Shortbread can be run, allowing personalized navigation and transparent searches, even on different domains. Moreover, a simple persistence mechanism allows the user model to be stored, so that it has the chance to become more and more accurate at each new navigation session. In addition, since a repeated navigation pattern can sometimes make the user model too specialized, a utility function is also available for resetting the model to a given previous state.
Conversely, this design choice can sometimes be restricting, especially on old PCs, because the H-DOSE processes can be computationally very expensive. However, the H-DOSE component was originally designed to be a distributed, server-side system, and already possesses the functionalities for being used remotely. In particular, it allows multiple, concurrent connections from remote clients requiring different tasks (search or indexing). Therefore, whenever a user deems it necessary not to run H-DOSE on his/her personal machine, Shortbread can be configured to access a remote H-DOSE server and to use that server as the semantic backbone of the system.
Currently there are still some open issues when working remotely with H-DOSE: user authentication, user management and ontology switching functionalities are in fact very preliminary and do not allow an efficient and secure management of user rights and information. In other words, the current version of the H-DOSE platform does not limit users to seeing only the information they are entitled to, but offers the same visibility to all the users connected to a given H-DOSE server. This becomes a critical issue if the user navigation is about confidential topics/resources. The author's research group is currently working on such issues and, in the forthcoming new version of the H-DOSE platform, they will be tackled, allowing for more effective and secure operation. As a last remark, it must be noticed that, due to the modular nature of the approach and to the backward compatibility of H-DOSE, the entire system will be able to work on future versions of the platform without changes to the other modules.

7.4.3 Implementation
The system has been implemented in the Java programming language. This choice is mainly related to the availability of a programmable proxy written in this language, called Muffin [2], and to the fact that the H-DOSE platform has also been developed in Java. In more detail, Muffin is a programmable Java proxy based on the notion of filter. A filter can act either on an entire page, as a whole, or on single chunks (tokens) of the HTML code that represents the page itself. Page-level filters can act both on the path between the user browser and the remote server and on the inverse path; token filters, instead, can only work on the return path.
The proxy component of Shortbread has been implemented as a token filter in Muffin. Basically, it sniffs the URLs of the pages requested by the user while they are on their way back to the browser. Then, if some suggestions are available, it modifies the HTML code of the transiting page by adding a “what's related” section either at the end or at the beginning of the page. The Shortbread core, constituted by the User Profile, the Semantic Information Retrieval system and the XML-RPC to SOAP gateway, has been developed as a set of separate Java classes which are included in the Muffin filter and thus executed as part of the Muffin process.
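The page rewriting performed on the return path can be sketched as plain string manipulation, as below; this example does not use Muffin's actual filter API, and the class name and markup are illustrative.

import java.util.List;

/** Appends a "what's related" section to a transiting HTML page (illustrative only). */
public class RelatedLinksInjector {

    public static String inject(String html, List<String> relatedUris) {
        if (relatedUris.isEmpty()) {
            return html; // nothing to suggest: behave as a plain relay
        }
        StringBuilder section = new StringBuilder("<div class=\"whats-related\"><h4>What's related</h4><ul>");
        for (String uri : relatedUris) {
            section.append("<li><a href=\"").append(uri).append("\">").append(uri).append("</a></li>");
        }
        section.append("</ul></div>");

        int bodyEnd = html.toLowerCase().lastIndexOf("</body>");
        if (bodyEnd < 0) {
            return html + section.toString(); // malformed page: append at the very end
        }
        return html.substring(0, bodyEnd) + section.toString() + html.substring(bodyEnd);
    }
}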
H-DOSE, instead, is completely separated from Shortbread and is based on both web services and intelligent agents. It runs in a proper servlet container (Apache Tomcat) and can be distributed among several hardware devices in order to maximize its performance. In the preliminary experimental setup, Shortbread has been deployed on a Pentium-class PC equipped with 512 MB of memory and 40 GB of disk space. The disk space occupation of the system is very low and the memory load nearly negligible. H-DOSE, instead, has been deployed on a separate machine equipped with an AMD Athlon XP processor (2200+), 1 GB of RAM and 120 GB of disk space. The H-DOSE impact on the PC performance is low in terms of occupied space, around 20 MB, but is quite evident in terms of memory load, which can reach a maximum of 40 MB.

Chapter 8

H-DOSE related tools and utilities

This chapter describes the H-DOSE related tools and methodologies developed
during the platform design and implementation. They include a new ontology
visualization tool, a genetic algorithm for semantic annotation refinement, etc.

During the H-DOSE design and development many side issues, related to the main
stream of research, have been addressed. In particular, two tools have been
developed, addressing respectively the problem of annotation generation/storage
and the problem of ontology creation/visualization.
The former research effort is an outcome of the first platform experiments.
When documents were annotated automatically, the corresponding spectra often turned
out to be very dense, i.e. they usually involved many concepts in the ontology. This
feature is sometimes undesirable, since it retains a great amount of redundancy and
may host several annotation errors, which easily go undetected because they are
buried among many low-relevance annotations. In order to tackle this issue, an
evolutionary refinement method for annotations has been developed, trying to
eliminate as much redundancy as possible while maintaining, at the same time, the
maximum annotation relevance. Evolutionary techniques have been chosen since this is
a classical optimization process in which a trade-off between annotation conciseness
and relevance must be reached, and such techniques are known to be well suited for
this kind of problem.
The latter tool, instead, is related to the H-DOSE experimentation in the
Passepartout and CABLE test cases. One of the most prominent findings of these case
studies is that non-technical people often fail to understand the formal knowledge
model even before they can appreciate the value added by semantic integration. They
are in fact experts in the domain, but not in formalization techniques. As a
consequence, modeling errors often remain undetected, since RDF or OWL are too
cumbersome to be understood by non-experts.


This situation has prompted, within the author's research group, several insightful
discussions on how to make the formalization commitments in ontologies clearer.
Such discussions resulted in a new ontology visualization tool able to represent
conceptual models as enhanced trees in a 3D space. The tool has been demonstrated
at the CABLE final meeting as well as at the SWAP2005 conference and proved to
be quite useful. In almost all cases comments were about the visualized models
and their correctness rather than about the tool interface and navigation paradigms.
This is clearly not a demonstration of effectiveness; however, it provides good
feedback on the tool's ability to ease the process of ontology design and revision,
especially when that process is performed by non-technicians.

8.1 Genetic refinement of semantic annotations


Semantic integration, on the current web, requires the ability to automatically
associate conceptual specifications to web resources, with minimal human intervention.
The huge number of available resources, and the presence of the so-called deep web,
not accessible to user navigation, make manual annotation of the whole web infeasible.
Tools are therefore needed to link web resources with semantic descriptions that
refer to domain ontologies.
Relevant information can usually be extracted from on-line textual resources;
sometimes, however, it is only available in the form of audio and video content, so
such formats should also be considered when an annotation process is designed.
Semantic annotation requires the definition of the sequence of information extraction
and analysis operations that are needed to describe a resource in a conceptual manner.
The process input is composed of resources identified by URIs, which are retrieved
and processed to produce a set of semantic descriptors pointing at them.
Several approaches have been proposed for automatic semantic annotation, stemming
from two main research fields: natural language processing [32] and machine
learning [33]. Natural language processing (NLP) is based on the detection of typical
human constructs in textual information and tries to map known sentence composition
rules to semantics-rich descriptions. As an example, using NLP the sentence
"Peter goes to Rome" is parsed into a tree structure like the one in Figure 8.1.
Such a tree can then be mapped onto a semantics-rich representation by using
concepts and relationships defined in an ontology and by associating tree components
to semantic entities (Figure 8.2).
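As a small illustration of this mapping step (a sketch only: the namespace and the goesTo property are invented for the example, and the Jena API of the period is assumed), the sentence can be turned into the triple of Figure 8.2 as follows:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

/** Builds the single RDF triple corresponding to "Peter goes to Rome". */
public class PeterGoesToRome {
    public static void main(String[] args) {
        String ns = "http://www.example.org/ontology#"; // hypothetical namespace

        Model model = ModelFactory.createDefaultModel();
        Resource peter = model.createResource(ns + "Peter");   // subject (N in the parse tree)
        Property goesTo = model.createProperty(ns + "goesTo"); // predicate (V + PREP)
        Resource rome = model.createResource(ns + "Rome");     // object (N inside the PP)

        peter.addProperty(goesTo, rome);
        model.write(System.out, "RDF/XML-ABBREV"); // serialize the resulting graph
    }
}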
Nowadays search engines can be seen as very simple NLP annotators which extract
pseudo-semantic information from the occurrence of keywords in web resources.
Semantically speaking, the basic form of an NLP-based resource annotation can
therefore be described as a process that, starting from the occurrence of certain
words in web documents, generates the set of concepts to which the document is
related, i.e. the concepts to which such words have been associated by human
experts and/or automated rule extraction.


Figure 8.1. The grammar tree for "Peter goes to Rome". (NP = noun phrase,
N = noun, VP = verbal phrase, V = verb, PP = prepositional phrase, PREP =
preposition)

Figure 8.2. Triple representation of the sentence “Peter goes to Rome”.

The other way to provide semantic descriptors currently under investigation is
machine learning. Basically, machine learning means extracting association rules
and behaviors that allow machines to accomplish specific tasks as well as their
human counterparts do. In the Semantic Web, machine learning principles are applied
in order to learn how human beings classify homogeneous sets of resources and to
imitate such behavior in an unsupervised process. Learning semantic association
rules implies several assumptions: first of all, knowledge domains should be limited
and possibly self-contained in order to be able to provide significant training sets.
Secondly, a significant amount of resources annotated by human experts is needed in
order to extract association rules that are reliable and at the same time expressive
enough, i.e. rules able to ensure the correct annotation of the documents to be
indexed.
With respect to the shallow NLP annotation of traditional search engines, a
simple implementation of machine learning principles can be organized as follows.
A set of domain-specific web resources is manually annotated by human experts
according to a conceptual domain model in the form of an ontology. Then data mining
is applied to extract the most reliable associations between lexical entities and
concepts, thus defining a set of association rules characterized by a certain degree
of confidence. The most reliable associations are finally used to classify the whole
resource
set, producing a set of semantic annotations (this is, for example, the basic
principle of the SVM classification planned for the forthcoming version of H-DOSE).
Regardless of the method used to extract semantic descriptors from syntactic
information, the generated set of semantic annotations has at least three peculiar
characteristics: it possesses an accuracy value, which describes how well the
resource semantics is captured by the annotations; it is composed of many entities,
in a number that is likely not to be manageable by human experts; and it usually
models redundant semantic information, at different granularity levels.
Useful semantic information should be classified as humans do, trying to preserve
the typical features of the human perception of reality while being, at the same
time, understandable by machines. Typical features of annotations created by experts
are conciseness, expressiveness and focus. Human beings cannot handle huge amounts
of information, therefore they have evolved mental processes able to extract the key
elements of an external object or event, building concise models able to guide
decision processes when a given situation occurs. At the same time, since the real
world can present different but similar situations manageable in the same way, a
conceptual model should be general and expressive enough to allow reuse. Finally,
mental models are usually focused on a precise sequence of external stimuli, defining
a sort of domain-specific knowledge similar to the philosophical notion of ontology.
Effective automatic annotation should therefore mimic those characteristics in
order to be useful in the Semantic Web; however, available technologies are still
not able to provide the required level of accuracy, conciseness and expressiveness
of semantic descriptors. Refinement is therefore needed in order to provide effective
semantic descriptors and, from a user perspective, effective applications. Annotation
refinement requires knowledge about domain concepts and about relationships between
concepts, and should take into account the granularity level of the available
descriptors, since they can refer to entire resources as well as to single paragraphs
of a document. Such a process can be modeled as an optimization problem with several
constraints, some of which are not explicit, such as, for example, the relations
between semantic annotations referring to the same resource.
Genetic algorithms can be applied to face such problems, providing at the same
time a highly dynamic system able to react quickly to changes in the initial
annotation base and to effectively cope with changes in user behavior.

8.1.1 Semantics powered annotation refinement


After mapping the syntactic information contained in web resources to semantic
annotations, a huge set of redundant, poorly expressive semantic descriptors is
available. Almost all relevant information is contained in such a set, but a
refinement process is needed in order to synthesize relevant data and to purge the
parts that are not useful.


There are several measures that can be used to assess annotation quality and to
perform a selection of appropriate descriptors. For each semantic annotation a
relevance weight is usually provided, quantifying how strongly a web resource is
linked to a given ontology concept. A first filtering phase could therefore select
the most relevant descriptors from the whole extracted set.
Unfortunately this simple selection method is not able to discriminate between
relevant and irrelevant descriptors, since it relies on inaccurate data. The relevance
value, in fact, is provided by the automatic annotation process and may be affected
by errors in classification, association rule extraction, etc. Even in the simplest
scenario, in which a set of words is associated to each ontology concept, the
relevance value is strongly influenced by that set, and a flaw in the word definition
process can compromise the real semantic relevance of the produced annotations.
On the other hand, manually checking each generated annotation is infeasible,
even for small sets, and it would be undesirable since the entire annotation process
must be unsupervised.
The relevance value associated to each annotation should therefore be handled
with caution, as an indicative value, and should be complemented by other
considerations, taking into account the domain knowledge specification (the ontology)
and the granularity of the annotations. Let us clarify the scenario with an example,
supposing that a set of seven annotations has to be refined. All the annotations
refer to the same web page, about dog care. The web page is composed of three
paragraphs, respectively about dog nutrition, fitness and psychology. The extracted
annotations are shown in Figure 8.3, while the ontology branch to which they point
is shown in Figure 8.4.
In this simple case, it is possible to figure out how an automatic process could
reach a performance similar to human expert classification. First of all, the
automatic process should sort annotations by raw relevance; in the example, this
would produce the following sets: dog care, dog, dog nutrition, nutrition and care
for the first resource, and dog nutrition, dog and care for the second one. Secondly,
it should leverage the ontological relationships between concepts to set up selection
policies based on semantic links between topics and annotations, even if those links
are not explicitly modeled (inference). As an example, it should be able to understand
that dog nutrition and dog care are subclasses of the more general concepts nutrition
and care, and that both topics can be applied to dogs. Finally, the refinement
process should analyze the granularity level of the annotated resources with respect
to the ontology hierarchy, and should change the selection order so that more general
resources are annotated by more general topics.
In the proposed example, the resulting annotation sets would be ordered as follows:
dog, dog care, nutrition, care, dog nutrition for the first resource and dog nutrition,
dog and care for the second resource. In the end, a filtering process selects the most
relevant annotations, providing results that, compared with the ones produced by
human experts, possess a satisfactory level of conciseness and expressiveness.
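The reordering step just described can be sketched as follows (an illustrative fragment only: the Annotation class and the depth/granularity fields are simplifications of the behavior described in the text, not the platform code):

import java.util.Comparator;
import java.util.List;

/** Illustrative reordering of annotations: raw relevance corrected by resource granularity. */
public class AnnotationRanker {

    public static class Annotation {
        public final String topic;
        public final double weight;     // raw relevance produced by the annotator
        public final int conceptDepth;  // depth of the topic in the ontology hierarchy
        public Annotation(String topic, double weight, int conceptDepth) {
            this.topic = topic; this.weight = weight; this.conceptDepth = conceptDepth;
        }
    }

    /**
     * Sorts annotations for a resource: higher weight first; ties are broken by preferring
     * deeper (more specific) concepts for fine-grained resources such as single paragraphs,
     * and shallower (more general) concepts for general resources such as whole pages.
     */
    public static void rank(List<Annotation> annotations, boolean resourceIsFineGrained) {
        annotations.sort(Comparator
                .comparingDouble((Annotation a) -> a.weight).reversed()
                .thenComparingInt(a -> resourceIsFineGrained ? -a.conceptDepth : a.conceptDepth));
    }
}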


<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ex="http://www.example.org/ontology#">
<!-- whole resource -->
<rdf:Description rdf:nodeID="A0">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:dog" />
<ex:weight>1.0</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A1">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:care" />
<ex:weight>0.5</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A2">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:dog care" />
<ex:weight>1.0</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A3">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:nutrition" />
<ex:weight>0.5</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A4">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:dog nutrition" />
<ex:weight>0.8</ex:weight>
</rdf:Description>
<!-- first paragraph -->
<rdf:Description rdf:nodeID="A5">
<ex:uri rdf:resource="http://www.dogs.org/care#p[1]"/>
<ex:topic rdf:resource="ex:dog" />
<ex:weight>0.5</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A6">
<ex:uri rdf:resource="http://www.dogs.org/care#p[1]"/>
<ex:topic rdf:resource="ex:care" />
<ex:weight>0.2</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A7">
<ex:uri rdf:resource="http://www.dogs.org/care#p[1]"/>
<ex:topic rdf:resource="ex:dog nutrition" />
<ex:weight>3.0</ex:weight>
</rdf:Description>
</rdf:RDF>

Figure 8.3. Example of mined annotations.


Figure 8.4. The "dog" ontology.

In addition to the complexity of the optimization problem, some issues arise in
real-world annotation scenarios in which thousands of annotations must be refined.
The annotation base is in fact subject to frequent changes due to user requests,
auto-updating processes and so on. Moreover, in many cases annotations may possess
the same relevance value, making the first selection phase harder.

8.1.2 Evolutionary refiner


The static optimization problem for annotation refinement is quite complex and
basically requires a multi-objective strategy to reach a good compromise between
two opposite goals: annotation expressiveness and annotation conciseness.
Instead of developing a solution based on multi-objective optimization algorithms,
the author designed a traditional evolutionary system (first presented at CEC2004,
the Congress on Evolutionary Computation, Portland, Oregon, USA), based on a dynamic
fitness, in order to fulfill the two opposite goals in a simpler and more effective
manner. Moreover, in the real world the optimization task assumes further dynamic
characteristics, fostering the adoption of appropriate evolutionary solutions. The
data set, in fact, changes continuously, depending on external events, and there is
no way to predict how the optimum will change after interactions with users as well
as with other systems.
Running multi-objective algorithms every time the annotation base changes is
clearly infeasible, because they are computationally expensive and a semantic
platform should support many concurrent accesses as well as concurrent resource
indexing. Evolutionary algorithms are, instead, well suited for incremental
optimization and can run continuously, working to reach a global optimum that takes
into account every single variation in the annotation base. They
are also able to track changes in the global optimum, i.e. changes in the fitness
landscape, by storing time-dependent information in the population state [34], thus
allowing the implementation of effective and reactive annotation refinement systems.
In the evolutionary refinement of semantic annotations, the individual size in
terms of annotated topics fixes the number of relevant annotations allowed for each
indexed resource, and should be tuned to achieve the best compromise between
conciseness and expressiveness, limiting information loss as much as possible.
Mutation introduces innovations into the population, crossover changes the context
of already available, useful information, and selection directs the search toward
better regions of the search space. Acting together, mutation and recombination
explore the annotation space, while selection exploits the information represented
within the population. The balance between exploration and exploitation, or between
the creation of diversity and its reduction by focusing on the fitter individuals,
is what allows the evolutionary algorithm (EA) to achieve reasonable performance.

Design
The goal of the evolutionary algorithm is to evolve a population of semantic
annotation sets, each referring to a specific web resource, using genetic operators
such as mutation and crossover. Individuals are composed of a fixed number of genes,
each identifying a single link between a topic and a web resource (an annotation)
together with the corresponding relevance value (Figure 8.5).

Figure 8.5. An individual of the evolutionary annotation refiner.

The proposed solution uses a steady-state evolutionary paradigm E(µ + λ): at
each generation λ individuals are selected to undergo genetic modifications, the
resulting µ + λ population is evaluated and the best µ individuals are transferred
to the next generation. The selection operator is a tournament selection with a
tournament size of 2, equivalent to a roulette wheel on the linearized fitness.
The tournament size fixes the convergence rate of the algorithm; its value is
intentionally kept low in order to avoid premature convergence on sub-optimal
solutions. Evolutionary annotation refinement is organized into two main phases:
firstly, the annotations available for each indexed resource are collapsed into a
small set, according to relevance weight, topic occurrence in the annotation base,
and structural hierarchy. Secondly, the annotation base is updated with information
from the extracted set, taking into account the effect of the new topic coverage
distribution. At the end of this last phase the process restarts, thus providing a
continuous refinement cycle (Figure 8.6) able to cope with changes in the annotation
base.

Figure 8.6. The evolutionary refinement cycle.
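The refinement cycle can be made more concrete with a compact sketch of the steady-state (µ + λ) loop (an illustrative fragment, not the H-DOSE code: class names are invented, the fitness is a placeholder for the function defined later in this section, and only the substitution mutation is shown here; the uniform crossover is sketched after Table 8.1):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Minimal steady-state (mu + lambda) loop refining the annotation set of a single resource. */
public class SteadyStateRefiner {

    static final int MU = 50, LAMBDA = 20, INDIVIDUAL_SIZE = 3, GENERATIONS = 200;
    static final Random RND = new Random();

    /** A gene is a (topic, relevance weight) pair drawn from the resource-specific repository G(R). */
    record Gene(String topic, double weight) {}

    /** Placeholder fitness: sum of gene weights (the full fitness F(I,R) is defined later). */
    static double fitness(List<Gene> individual) {
        return individual.stream().mapToDouble(Gene::weight).sum();
    }

    /** Builds a random individual; the repository is assumed to hold at least INDIVIDUAL_SIZE genes. */
    static List<Gene> randomIndividual(List<Gene> repository) {
        List<Gene> shuffled = new ArrayList<>(repository);
        Collections.shuffle(shuffled, RND);
        return new ArrayList<>(shuffled.subList(0, INDIVIDUAL_SIZE));
    }

    /** Tournament selection with tournament size 2. */
    static List<Gene> tournament(List<List<Gene>> population) {
        List<Gene> a = population.get(RND.nextInt(population.size()));
        List<Gene> b = population.get(RND.nextInt(population.size()));
        return fitness(a) >= fitness(b) ? a : b;
    }

    /** Substitution mutation: a randomly chosen gene is replaced by a fresh one from G(R). */
    static List<Gene> mutate(List<Gene> parent, List<Gene> repository) {
        List<Gene> child = new ArrayList<>(parent);
        child.set(RND.nextInt(child.size()), repository.get(RND.nextInt(repository.size())));
        return child;
    }

    static List<Gene> refine(List<Gene> repository) {
        List<List<Gene>> population = new ArrayList<>();
        for (int i = 0; i < MU; i++) population.add(randomIndividual(repository));

        for (int gen = 0; gen < GENERATIONS; gen++) {
            List<List<Gene>> offspring = new ArrayList<>();
            for (int i = 0; i < LAMBDA; i++) {
                offspring.add(mutate(tournament(population), repository));
            }
            population.addAll(offspring);                                   // mu + lambda
            population.sort(Comparator.comparingDouble(SteadyStateRefiner::fitness).reversed());
            population = new ArrayList<>(population.subList(0, MU));        // keep the best mu
        }
        return population.get(0); // most concise/expressive annotation set found for the resource
    }
}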

In the first phase, for each indexed resource the available set of annotations is
extracted. Each annotation, together with its relevance value, becomes a gene stored
in a sort of resource-specific gene repository G(R). From this repository individuals
are randomly created, taking as DNA a manually tuned number of genes, which fixes
the conciseness degree of the refined semantic descriptors. Once the initial
population has been generated, the evolutionary process evolves a population of
individuals whose fitness is given by a composition of the relevance weight
contribution and of the annotation base topic coverage contribution, with respect to
the annotation granularity level. The annotation base has in fact a dual nature: on
one side it stores annotations pointing at web resources, thus modeling indexing
relevance (the value associated to each annotation). On the other side, it models
how well the conceptual specification covers the real knowledge domain. Basically,
there is a certain number of annotations pointing at each ontology concept; this
value measures how well the conceptual descriptions fit the domain model: ontology
concepts pointed at by many semantic annotations and located at deep levels of the
concept hierarchy usually identify redundant information, or information for which
the syntax-to-semantics conversion has provided poor results.


Such annotations shall therefore be penalized in order to allow the evolutionary
strategy to improve the semantic expressiveness of the individuals in the population.
The fitness function is defined according to the dual nature of the annotation
base, and is composed of three main elements taking into account, respectively, the
indexing relevance value, the topic coverage factor just discussed and the granularity
level of the resource whose annotations are being refined.
F(I,R) = \sum_{i=0}^{n_g} \frac{W(g_i) \cdot T(g_i)}{F_R(R)} - D(I) \cdot P_h

where n_g is the number of genes composing the DNA of an individual, W(g) is the
relevance weight associated to each annotation (i.e. to each individual gene g),
T(g) is the topic coverage correction factor, F_R(R) is the granularity correction
factor, D(I) is the dependency factor taking into account ontological relationships
and P_h is the penalty factor for ontological dependence between individual genes.
The topic coverage factor T(g) is inversely related to the number of annotations
(i.e. genes) pointing at that concept, according to the principles previously
discussed, while F_R(R) adds information about the granularity level of individual
genes in order to make annotation refinement more effective. The basic assumption is
that a more specific resource, a paragraph for example, should be annotated with
more specific topics, while a general resource, for example the entire web page,
should point to more general concepts.
To achieve this behavior the function F_R(R) assumes higher values for more
specific resources and lower values for more general ones.
T(g) \propto \frac{1}{\#Annotations}

F_R(R) \propto granularity(R)
P_h is a fixed penalty value for inter-gene dependence in the ontology; finally,
D(I) is the dependency correction factor, inversely related to the distance between
the concepts used as individual genes. The formal definitions of these values follow:

D(I) = \sum_{g_i, g_j} d(g_i, g_j) \quad \text{where } g_i, g_j \in G(R)

d(g_i, g_j) \propto \frac{1}{dist(g_i.topic(),\, g_j.topic())}
Although the fitness function is morphological rather than time-dependent, i.e. the
fitness value depends explicitly on factors related to the physical characteristics
of individuals, there is a non-explicit constraint that turns the overall behavior
of the algorithm into a dynamic one.
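A direct transcription of the fitness into code might look as follows (a sketch under stated assumptions: the ontology distance function and the per-topic annotation counts are supplied from outside, and F_R(R) is approximated, as in Table 8.2, by the number of '/' characters in the resource URI):

import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

/** Illustrative computation of F(I,R) = sum_g W(g)*T(g)/F_R(R) - D(I)*P_h. */
public class FitnessSketch {

    record Gene(String topic, double weight) {}

    static final double PENALTY_PH = 1.0; // fixed penalty for inter-gene dependence (Table 8.2)

    /** T(g): inversely proportional to the number of annotations pointing at the gene's topic. */
    static double topicCoverage(Gene g, Map<String, Integer> annotationsPerTopic) {
        return 1.0 / Math.max(1, annotationsPerTopic.getOrDefault(g.topic(), 1));
    }

    /** F_R(R): granularity approximated by the number of '/' characters in the resource URI. */
    static double granularity(String resourceUri) {
        return Math.max(1, resourceUri.chars().filter(c -> c == '/').count());
    }

    /** D(I): sum over gene pairs of 1/dist between their topics in the ontology. */
    static double dependency(List<Gene> individual, ToIntBiFunction<String, String> ontologyDistance) {
        double d = 0.0;
        for (int i = 0; i < individual.size(); i++) {
            for (int j = i + 1; j < individual.size(); j++) {
                int dist = ontologyDistance.applyAsInt(individual.get(i).topic(), individual.get(j).topic());
                if (dist > 0) d += 1.0 / dist;
            }
        }
        return d;
    }

    static double fitness(List<Gene> individual, String resourceUri,
                          Map<String, Integer> annotationsPerTopic,
                          ToIntBiFunction<String, String> ontologyDistance) {
        double sum = 0.0;
        for (Gene g : individual) {
            sum += g.weight() * topicCoverage(g, annotationsPerTopic) / granularity(resourceUri);
        }
        return sum - dependency(individual, ontologyDistance) * PENALTY_PH;
    }
}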


At each refinement cycle the annotation base is, in fact, updated by associating
to the whole annotation set the topic coverage distribution reached in the last
optimization cycle. In the subsequent refinement cycle, individuals having the same
DNA as those in the current cycle will possess a different fitness value, according
to the results of the last optimization. Moreover, indexing cycles may happen during
the refinement phase, adding variability to the fitness landscape that the algorithm
walks to reach the optimum. In this sense, the overall evolutionary refinement cycle
exhibits a highly dynamic behavior.
Currently, only one mutation operator and one crossover operator have been
defined. The mutation operator simply extracts a new gene from the gene repository
associated to a resource and substitutes a randomly chosen gene in the individual
DNA.
Recombination is performed uniformly, by selecting two individuals and alternately
cloning their genes. In other words, since individuals have the same DNA size, each
DNA element is selected with a uniform probability distribution from the individuals
participating in the crossover. Such an operator can produce invalid individuals,
i.e. individuals having one or more duplicated genes. Those individuals are invalid
since a resource referring more than once to the same ontology concept is not as
meaningful as required. The author designed a simple policy to avoid this eventuality:
individuals having duplicated genes are automatically dropped and cannot take part
in the new population. Table 8.1 summarizes the adopted genetic operators.

Operator Functional group
Uniform crossover Recombination
Substitution Mutation
Table 8.1. Genetic operators for annotation refinement.
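The two operators and the duplicate-dropping policy can be sketched as follows (again an illustrative fragment rather than the platform code; the Gene type mirrors the one used in the loop sketch above):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

/** Illustrative genetic operators for the annotation refiner. */
public class RefinerOperators {

    record Gene(String topic, double weight) {}

    /** Substitution mutation: a randomly chosen gene is replaced by a new one from the repository G(R). */
    public static List<Gene> mutate(List<Gene> parent, List<Gene> repository, Random rnd) {
        List<Gene> child = new ArrayList<>(parent);
        child.set(rnd.nextInt(child.size()), repository.get(rnd.nextInt(repository.size())));
        return child;
    }

    /** Uniform crossover: each DNA position is copied from either parent with equal probability. */
    public static List<Gene> uniformCrossover(List<Gene> p1, List<Gene> p2, Random rnd) {
        List<Gene> child = new ArrayList<>(p1.size());
        for (int i = 0; i < p1.size(); i++) {
            child.add(rnd.nextBoolean() ? p1.get(i) : p2.get(i));
        }
        return child;
    }

    /** Duplicate-dropping policy: individuals with repeated genes are invalid and are discarded. */
    public static boolean isValid(List<Gene> individual) {
        return new HashSet<>(individual).size() == individual.size();
    }
}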

A stop criterion, based on population uniformity, has also been defined for the
evolutionary refinement of each resource's annotations: whenever the diversity among
individuals falls below a given threshold, or whenever the fitness of the best
individual stops increasing, the cycle is interrupted and the best set of semantic
descriptors (individual) is selected as a local optimum. The iterated nature of
refinement ensures that the global optimum can eventually be approached, bearing in
mind that there is no absolute recipe to define the optimum from a user point of
view, which is the ultimate goal of refinement.

Implementation
The evolutionary refinement strategy has been implemented as a Java class defined
by a high-level interface, abstracting from implementation details, as has been done
with individuals and genes.


The gene repository has a variable size, depending on the number of annotations
available for each indexed resource, while the individual size is fixed and can be
manually selected at start time. These elements have been integrated into the H-DOSE
platform to assess the feasibility of the approach. A new H-DOSE module called
"Evolutionary refiner" has been developed, which actually performs the evolutionary
refinement and supports SOAP communication, allowing interaction with the existing
H-DOSE modules. The Evolutionary refiner has been integrated into the management
logical area of the platform. Since many parameters can be tuned to optimize the
strategy's performance, the evolutionary annotation refiner is complemented by a
configuration file in which those values are stored (Table 8.2).

Parameter Value
Penalty Ph 1
Dependency d(gi,gj) 1/dist(gi.topic(), gj.topic())
Topic coverage factor T(g) 1/#Annotations
Annotation granularity FR(R) number of '/' characters in the resource URI
Table 8.2. Fitness-specific parameters.

Results
Several test beds have been defined for the evolutionary refinement of semantic
annotations proposed in this section. First of all, the evolutionary algorithm
parameters were set up so as to allow fair comparisons on different annotation bases
(Table 8.3). The experiments used the latest available version of the H-DOSE
platform, which allows semantic indexing of web resources in many languages. The
underlying ontology has been developed by the H-DOSE authors in collaboration with
the Passepartout service of the city of Turin.
The ontology counts nearly 450 concepts organized into 4 main areas; for each
ontology concept a definition and a set of lexical entities have been specified [35],
for a total of over 2500 words, allowing shallow NLP-based text-to-concept mapping.
The first experiment involved information from the Passepartout web site: 50
pages were indexed using the standard methods provided by H-DOSE (a simple bag of
words) and the corresponding annotations were stored in the Annotation Repository.
The newly created module for evolutionary refinement ran on such data and produced
as output a runtime version of the Annotation Repository accessible from the search
engine service. The initial evaluation aimed at assessing the feasibility of the
approach and the validity of its results in a simple static scenario. The annotations
stored in the repository


Parameters Values
Diversity threshold 0
Fitness increase threshold for the best individual 0
Number of individuals (µ) 50
Number of new individuals (λ) 20
Probability of mutation vs. crossover 0.1
Individual size 3
Table 8.3. Evolutionary strategy parameters.

were therefore refined a fixed number of times (10) without allowing modifications
to the original set of semantic descriptors.
A total of 276 annotations corresponding to 28 relevant resources was obtained as
the indexing result; the 22 remaining documents were judged not relevant by H-DOSE.
After the evolutionary refinement, the runtime version of the Annotation Repository
contained 97 annotations referring to a total of 21 resources. The difference in the
number of annotated resources was caused by the fixed-size individuals, which were
not able to model resources for which fewer than 3 (the individual size) annotations
were present. Besides the fact that resources with a very low number of annotations
are likely to be too specific or even wrongly annotated, this problem can be fixed
by propagating into the refined annotation base the descriptors referring to such
resources.
The semantic annotations stored in both repositories (the original and the refined
one) were evaluated by human experts, who specified a relevance value between 0 and
100 for each annotation. The mean relevance was evaluated both at the
single-annotation level and at the resource level; Table 8.4 summarizes the obtained
results.
As can easily be noticed, the two mean relevance values seem to contradict each
other, but this is not the case. In fact, relevance at the single-annotation level
increases since many annotations with low relevance values were judged not relevant
by the experts and purged by the evolutionary refinement, thus increasing the mean
relevance figure. On the other side, at the resource level each annotation contributes
to the total relevance value, therefore it is straightforward that, regardless of
other parameters, the higher the number of annotations for each resource, the higher
the corresponding relevance value (Figure 8.7).
In the best cases, all the annotations purged by the evolutionary system had been
judged completely non-relevant by the human experts, and the mean relevance values
were therefore the same for the original and the refined annotation bases.
The overall performance was interesting, since the total expressiveness reduction
of the annotation base, in terms of relevance at the resource level, was around 36%


Quality figures Original base Refined base
Total annotated resources 28 21
Resources with size ≤ Individual.size() 7 0
Total amount of annotations 276 97
Single annotation relevance (mean) 1.92 3.13
Resource annotation relevance (mean) 9.67 6.19
Expressive density de 0.035 0.063
Table 8.4. Evolutionary refinement results on the Passepartout web site.

Figure 8.7. Human-evaluated relevance at the resource level (Passepartout).

while the repository size reduction was about 65%. The expressive density d_e,
defined as the mean relevance over the number of stored annotations, was respectively
0.035 for the original annotation base and 0.063 for the refined one.

d_e = \frac{mean(relevance)}{\#Annotations}
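For instance, plugging in the resource-level figures of Table 8.4 for the original annotation base gives

d_e = \frac{9.67}{276} \approx 0.035

which matches the reported value.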
The second experiment involved a real-time scenario, in which the H-DOSE platform
works on-line and indexing can be triggered by automatic processes, external
applications and search failures, at any time during the refinement process. Since
the evolutionary refinement works continuously, taking a sort of snapshot of the
Annotation Repository content at each refinement cycle, changes in the original
repository propagate to the refined one with a maximum delay of one cycle.
The experiment involved the indexing of on-line web resources from the "asphi"
web site [36]. Asphi is a private foundation that promotes the adoption of informatics
aids for people with disabilities, by financing several research projects.
A snapshot of the two repositories was taken, and the same evaluation performed on
the static scenario was repeated. Table 8.5 details the corresponding results, while
the chart in Figure 8.8 depicts the corresponding relevance, at the resource level,
for each indexed fragment.

Quality figures Original base Refined base
Total annotated resources 19 16
Resources with size ≤ Individual.size() 3 0
Total amount of annotations 242 63
Single annotation relevance (mean) 0.64 1.33
Resource annotation relevance (mean) 7.47 3.81
Expressive density de 0.031 0.060
Table 8.5. Evolutionary refinement results for the www.asphi.it site.

On-line work and user-related changes in the original Annotation Repository do
not significantly affect the refiner behavior, allowing satisfactory performance to
be reached even on a dynamic fitness landscape. As can easily be seen from the
experimental data, the relevance loss can be estimated at around 50% while the
repository size reduction is about 75%, with a corresponding expressive density of
0.060 for the refined annotation repository and of 0.031 for the original one.

Figure 8.8. Human-evaluated relevance at the resource level (www.asphi.it).

8.2 OntoSphere
In recent years the Semantic Web has been constantly evolving from the vision of a
few people to a tangible presence on the Web, with many tools, portals, ontologies, etc.


Such a great evolution has involved many researchers from different countries and
has been primarily focused on technologies. Nowadays a Web developer can start
seriously considering the opportunity to provide semantically tagged content, as the
needed tools and standards are available. However, the current web panorama shows
very little adoption of semantics. The motivations for such a low adoption can be
various and related to very different aspects: technology immaturity, failing
dissemination, user and developer resistance to change, etc. In the sea of possible
failures and shortcomings, interfaces play a relevant role, often discriminating good
solutions from bad ones. This is particularly true for tools related to knowledge
modeling and visualization, where the involved information can be quite complex
and involve multidimensional issues.
Several attempts aim at providing effective interfaces for knowledge modeling,
i.e. for ontology creation and visualization. Protégé [37] and OntoEdit [38], for
example, are complete IDEs (Integrated Development Environments) that address in
a single application all the aspects related to ontology creation, checking and
visualization (through proper plug-ins). Such tools, although adopting rather
different paradigms for editing and inspecting ontologies, have in common a
bi-dimensional approach to ontology visualization. The bi-dimensional approach can be
variously efficient and there are indeed good solutions available on the web:
GraphViz [39], Jambalaya [40] and OntoViz [41], just to name a few. Nevertheless,
mapping the many dimensions involved in an ontology (the concept hierarchy, the
semantic relationships, the instances and the possible axioms defining a given
knowledge domain) onto only two dimensions can sometimes be too restrictive.
The author, in collaboration with some colleagues (Alessio Bosca, who developed
the 3D visualization panel, and Paolo Pellegrino), proposes OntoSphere (first
presented at SWAP2005, Semantic Web Applications and Perspectives, Trento, Italy), a
new tool for inspecting and, in the near future, for editing ontologies using a more
than 3-dimensional space. The proposed approach visualizes the mere topological
information on a 3D view-port, thus leveraging one more dimension with respect to
current solutions. This allows, at least, a better organization of the visual space
occupied by the represented data. Since the 3-dimensional view is quite natural for
humans, especially as far as navigation is concerned, the proposed approach can be
more effective in browsing, as it involves "manipulation-level" operations such as
zooming, rotating and translating objects.
In addition, many more dimensions are introduced to convey information about
the visualized knowledge model (meta-information). The extension of the subtrees
lying under the currently viewed concepts is, for example, visually rendered by
increasing the size of the visual cues adopted for them. The same approach is
applied to colors, which are used to add insights to the representation: blue spheres,
for example, indicate that the corresponding concepts are terminal nodes in the
ontology. Transparency is used to distinguish inherited, or inferred, relationships
from direct ones; shapes are used to differentiate concepts from instances, and so on.
Together with the ways to convey more information to the users through several
visual dimensions, the proposed work also aims at tackling the space allocation
issues of ontology visual models. In fact, in traditional solutions, big ontologies
can easily lead to overcrowded representations that are difficult to browse and that
can be more confusing than helpful. Some attempts exist to overcome these problems,
as in OntoRama [42], where the nodes being inspected are magnified with respect to
the other nodes in the ontology. However, even these approaches tend to collapse
when visualizing big ontologies such as SUMO [43], counting more than 20,000
concepts. The proposed application, instead, adopts a dynamic collapsing mechanism
and different views, at different granularities, to guarantee constant navigability
of the rendered model.

8.2.1 Proposed approach


Applications for ontology visualization abstract from the formalism of the underlying
data and graphically model the information contained in a given knowledge base
(KB). Each tool presents data according to its own approach, but generally all of
them share the same choice of using a 2D space. The proposed approach, instead,
leverages the use of a 3D space as a means to effectively represent and explore data
through an intuitive interface.
The application objective consists in enhancing the performance of current
solutions in terms of completeness and readability; in fact the OntoSphere
application is able to graphically represent both taxonomic and non-taxonomic links,
as well as to select and present information on the screen at an appropriate level
of detail, according to what is relevant to the user's interest. Furthermore, the
tool provides an intuitive navigation interface, featuring direct manipulation of the
scene (rotation, panning, zoom, object selection, etc.), and is designed to
particularly meet the demands of domain experts who have few technical skills in the
Semantic Web field and therefore particularly rely on graphical interfaces.
The choice of a three-dimensional environment constitutes the starting point in
designing the tool, as a 3D space offers one more dimension than traditional 2D
approaches to represent ontology data, so simplifying its interpretation. In order to
achieve completeness and readability, OntoSphere operates according to two different
principles:

• To increase the number of "dimensions" (colors, shapes, transparency, etc.)
which represent concept features and convey additional information without
adding the burden of further graphical elements, such as labels, to the scene.


• To automatically select which part of a knowledge base has to be displayed,
and the level of detail that has to be used in the process, on the basis of the
user's interaction with the scene.
The latter principle is particularly important for improving the overall system
performance, since the scale factor constitutes a strong issue in visualizing complex
graph-like structures such as ontologies. As the cardinality of elements increases,
the number of items concurrently displayed on the screen worsens the graphical
perception of the scene and makes it harder to spot details. When the amount of
visualization space needed to represent all the information within the KB exceeds
the space available on the screen, a few options remain: scaling down the whole image
to the detriment of readability, presenting on the screen just a portion of it and
allowing its navigation, or summarizing the information in a condensed graph and
providing means for its exploration and expansion. As the effectiveness of these
options depends on the use case involved (consistency checking, domain comprehension,
KB updates), a combined usage of them can offer a better solution than the separate
adoption of one of the three.
OntoSphere tackles ontology visualization by exploiting different scene managers
(respectively the RootFocus, the TreeFocus and the ConceptFocus scene manager) that
present and organize the information on the screen, each according to a differently
detailed perspective. These managers take turns in managing the graphical space as
the user's attention (inferred from user interaction) shifts from one concept to
another.
The proposed user interface is very simple and allows direct manipulation of the
scene through rotation, panning and zoom; it allows browsing the ontology as well as
updating it and adding new concepts and relations. Every concept within a given
scene is clickable, with two different results: a left click performs a focusing
operation, shifting the scene to a more detailed level, while a right click maintains
the current perspective and simply navigates through elements. For example,
left-clicking on a concept in the global scene leads to the visualization of the
related concept tree, while right-clicking on it leads to the visualization of its
children in the same perspective, as explained in more detail in the next
sub-sections.
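This interaction policy can be summarized by a small dispatch sketch (illustrative only: the class, scene and method names are simplifications and do not correspond to the actual OntoSphere code):

/** Illustrative dispatch of mouse clicks to the three OntoSphere scene managers. */
public class SceneDispatcher {

    enum Scene { ROOT_FOCUS, TREE_FOCUS, CONCEPT_FOCUS }
    enum Button { LEFT, RIGHT }

    private Scene current = Scene.ROOT_FOCUS;

    /** Left click focuses (moves to a more detailed perspective); right click navigates in place. */
    public Scene onConceptClicked(String concept, Button button) {
        if (button == Button.LEFT) {
            // Focusing: RootFocus -> TreeFocus -> ConceptFocus
            current = (current == Scene.ROOT_FOCUS) ? Scene.TREE_FOCUS : Scene.CONCEPT_FOCUS;
        }
        // RIGHT: keep the current perspective, just re-center the view on the clicked concept.
        System.out.println("Showing " + concept + " in " + current + " scene");
        return current;
    }
}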

RootFocus scene
This perspective presents a big "Earth-like" sphere bearing on its surface a
collection of concepts represented as small spheres (Figure 8.9). The scene does not
visualize any taxonomic information and only shows direct "semantic" relations
between the elements of the scene, usually a graph that is not completely connected.
Atomic nodes, i.e. the ones without any subclass, are smaller and depicted in blue,
while the others are colored in white and their size is proportional to the number
of elements contained in their own sub-tree.


This view is particularly intended for representing the ontology primitives, i.e.
the root concepts, but can also be used, during the navigation of the ontology, to
visualize the direct children of a given node; a pretty useful option in the case of
heavily sub-classed concepts. The topmost concepts within the ontology and the
relations between them define the conceptual boundaries of the domain and provide
a very good hint to the question: "what is the ontology about?"

Figure 8.9. The OntoSphere “RootFocus” scene.

TreeFocus scene
The scene shows the sub-tree originating from a concept; it displays the
hierarchical structure as well as the semantic relations between classes. Since usage
evidence shows that too many elements on the screen at the same time hinder the
user's attention, the scene fully presents only three expanded levels at a time
and, as the user browses the tree, the system automatically performs expansion and
collapse operations in order to maintain a reasonable scene complexity. The reader
may note in Figure 8.10 how focusing the attention on the concept
"ente pubblico locale", on the left in the figure, causes (with a simple mouse click)
the vanishing of the branches of no interest, which are collapsed into their
respective parents (Figure 8.10, on the right). Collapsed elements are colored in
white and their size is proportional to the number of elements present in their
sub-tree; concepts located at the same depth level within the tree, instead, have
the same color in order to easily spot groups of siblings. Hierarchical relationships
within the scene are displayed in a neutral color (gray) and without labels, whereas
other semantic relations involving two concepts already in the scene are displayed
in red, accompanied by the name of the relation (as in the "RootFocus" perspective).
If an element of the tree is related to a node that is not present on the scene, a
small sphere is added for that node in the proximity of its visual cue, terminating
the end of the arrow: in such cases, incoming relations are represented with a green
arrow, while outgoing links with a red one.

Figure 8.10. The OntoSphere “TreeFocus” scene.

ConceptFocus scene
This perspective depicts all the available information about a single concept, at the
highest possible level of detail; it reports the concept’s children and parent(s), its
ancestor root(s) and its semantic relations, both the ones directly declared for the
given concept and the ones inherited from its ancestors.
Semantic relations are drawn as arrows terminating in a small sphere: red if the
relation is outgoing and green otherwise (Figure 8.11). Direct relations are drawn

close to the concept and with an opaque color, while inherited ones are located a
bit farther from the center and depicted with a fairly transparent color.
This scene is pretty useful during consistency checking operations because it eases
the spotting of inconsistent relations whenever a concept inherits from an ancestor
a property that "conceptually" contrasts with other features of its own.

Figure 8.11. The OntoSphere “ConceptFocus” scene.

8.2.2 Implementation and preliminary results


The work presented in this section has been entirely developed in Java. This choice
is related to the current panorama of ontology editors and tools for ontology creation
and maintenance, which are in the majority of cases developed in this language.
Among its other advantages, Java allows such tools to be used in different operating
environments, from devices with low computational power to high-performance
workstations.
The visualization engine uses the Java 3D API to produce a three-dimensional
interactive representation of ontology concepts and relationships. This API is
directly linked to an underlying OpenGL engine that provides the required graphics
capabilities. The ontology-related part, instead, is based on the well-known Jena
semantic framework from HP [23], which allows ontologies and taxonomies written in
RDF [4], RDFS, DAML, OWL [5] or N-Triples to be easily loaded, managed and modified.
These two main parts are the core modules of the implemented application, combining
in a single working space the capabilities for visualizing and editing ontologies in
various formats.
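For illustration, loading a knowledge base the way the ontology layer does can be sketched with a minimal Jena fragment (the HP Jena 2 API is assumed and the ontology file name is hypothetical):

import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;

/** Minimal Jena-based loader: reads an OWL/RDFS file and lists its classes and subclasses. */
public class OntologyLoader {
    public static void main(String[] args) {
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);
        model.read("file:passepartout.owl"); // hypothetical ontology file

        ExtendedIterator classes = model.listClasses();
        while (classes.hasNext()) {
            OntClass c = (OntClass) classes.next();
            System.out.println("Class: " + c.getLocalName());
            for (ExtendedIterator subs = c.listSubClasses(true); subs.hasNext();) {
                System.out.println("  subclass: " + ((OntClass) subs.next()).getLocalName());
            }
        }
    }
}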


In order to understand whether the proposed approach is valuable and scalable enough,
the authors set up three different test beds. The first one assesses the compliance
of the tool with the initial requirements; the second one evaluates the tool's
utility when applied to real-world cases; and the last one investigates whether the
current deployment is able to manage complex ontologies or not.
In the first experiment the application has been tested according to common
practice for agile development and requirement satisfaction checking. All the modules
composing the platform have been developed starting from a rather precise
specification and have been tested against predefined JUnit [44] tests. After passing
the basic functionality checks, the entire application has been tested against three
different use cases: simple ontology browsing, "conceptual consistency" checking and
ontology development. In the ontology browsing case, a group of 8 users was asked to
load and browse 5 different ontologies. The goal was to guess the domain of the
chosen ontology and to analyze the granularity of the knowledge model. The 5
ontologies used in this experiment are: the well-known Pizza ontology from Protégé,
the SUMO (Suggested Upper Merged Ontology) ontology, a music ontology developed from
scratch by the authors, the CABLE ontology and the Passepartout ontology, developed
by the authors in collaboration with the Passepartout service of the municipality of
Turin. Results for each of the ontologies are reported in Table 8.6.

Ontology Domain Topic and level of detail identification
U1 U2 U3 U4 U5 U6 U7 U8
SUMO (OWL) General  × ×   ×  
MUSIC (RDF/S) Music      ×  ×
CABLE (OWL)        
PASS. (OWL) Disability  ×  × ×   
PIZZA (OWL) Pizzas        
Table 8.6. Results for the ontology browsing use case.

Checking the ontology for "conceptual consistency" is rather different from the
formal consistency checking done by logic reasoners. What has to be checked is not
the consistency of the ontology for reasoning and inference, but whether a user can
detect domain-related inconsistencies introduced during the ontology design process.
For example, a concept may inherit some relationships that are not appropriate for
it, either because of a wrong parent-child relation or because of a previously
undetected error in the domain modeling: in this case the ontology is formally
consistent, but not conceptually.
The ontologies involved in this test were the same as in the previous one, and so
were the users. Some interesting aspects came out of the experimentation: the
detection of "conceptual inconsistencies" through the observation of the ontology
representation appears, in fact, strongly dependent on the size of the knowledge
domain and on the expertise that the user has in that domain. So, for example, by
looking at Table 8.7 it is clearly noticeable that in the SUMO ontology no
inconsistencies were found, as it is huge and well designed, while the involved
testers had a very poor background on the SUMO domain. On the other side, in the
CABLE ontology almost all inconsistencies were detected, since the ontology is small
(80 concepts) and the domain was well known by all the experimenters. In conclusion,
determining whether the proposed OntoSphere application is able to evidence
inconsistencies is very difficult, since the involved factors are diverse and can
interact in complex patterns.

Ontology Known Detected inconsistencies
U1 U2 U3 U4 U5 U6 U7 U8
SUMO (OWL) / 0 0 0 0 0 0 0 0
MUSIC (RDF/S) 6 0 0 4 2 6 0 4 2
CABLE (OWL) 2 2 2 2 2 0 2 0 2
PASS. (OWL) 12 4 8 0 3 10 3 0 7
PIZZA (OWL) 0 0 0 0 0 0 0 0 0
Table 8.7. Results of the “conceptual inconsistencies” checking.

When the proposed application is used for ontology development, the support
provided for detecting conceptual inconsistencies is much more evident. The adoption
of OntoSphere for inspecting the work in progress allows, in fact, modeling errors to
be detected easily. In particular, the most commonly recognized errors concern
relationship propagation along the ontology hierarchy and wrong definitions of
parent-child (isA) relationships.
Although it is quite difficult to fill up a table showing how, and to what extent,
the proposed application supports the process of ontology creation, interviews with
users show that the experimenters are often able to quickly spot modeling errors. In
their opinion, the intuitive visualization and the capability to visually represent
inherited and inferred relationships are the main factors behind the success of their
modeling process. This last experiment actually lies between the functional tests and
real-world test cases. However, to provide a more grounded experimentation (please
note that the results presented here are still very preliminary), the authors
performed a real-world test on the occasion of the final meeting of CABLE, a European
MINERVA project on "CAse Based e-Learning for Educators". In that meeting, a demo of
the OntoSphere application was presented to visualize the ontology developed in the
context of the CABLE project. The exciting result is that, rather than complaining
about the complexity of the
provided interface, or about the appearance or the controls for browsing the
ontology, the first observation was: "No! That relation cannot subsist between those
two concepts!". What happened, surprisingly, is that the application was able to
highlight the inherited relations so that the errors were spotted in a few minutes of
ontology browsing. This is clearly not a scientific result, since experiments should
be conducted in a controlled environment, should have a clear objective and must be
carried out by a significant group of users; nor is the aim of this paragraph to
claim that such a reaction is, by itself, a good result. However, the user reactions
at the CABLE meeting are encouraging signals that the still preliminary OntoSphere
application can be a valuable instrument in ontology design and development.
As a last experiment, a simple scalability test was performed: the goal was to
understand whether OntoSphere is able to load and visualize ontologies with a large
number of concepts and relationships. The entire SUMO ontology was therefore loaded
and browsed: the loading process took around 3.5 seconds, while navigation was
performed in real time. SUMO, the Suggested Upper Merged Ontology, is currently
released under a GPL license and counts about 20,000 concepts related by over 60,000
axioms.
There are still some issues to be fixed when browsing really huge ontologies: the
visualized concepts tend to clash if the number of concepts displayed at the same
time is high; the labels also tend to overlap, making the visualization harder to
manage (as in many other viewers). Moreover, since a human cannot keep track of more
than a reasonable number of objects at a time, huge graphs shall be collapsed and
different ontology navigation patterns and interfaces shall be provided.

Chapter 9

Semantics beyond the Web

This chapter discusses the extension of H-DOSE principles and techniques
to non-Web scenarios, with a particular focus on domotics. An ongoing
project on semantics-rich house gateways is briefly described, highlighting
how the lessons learned in the design and development of H-DOSE can be
applied in a completely different scenario while still retaining their
value.

In this chapter a brief overview of currently ongoing work is provided, describing
the first steps taken by the author, together with some colleagues working in the
same research group, to integrate semantics into domotic applications. In the
presented approach only a seminal, yet interesting, notion of semantics is adopted,
which is rather "encoded" inside the approach itself.
Semantics is seen at two different granularity levels. At the device level, each
object in a house that is capable of communicating with a PC, either directly or
through pre-processing of data, is organized into a taxonomy which describes the
functional knowledge associated to it. So, for example, a dimmer light, i.e. a light
whose intensity can be modulated, is classified as a subclass of the more generic
light device that only supports on and off commands.
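A minimal sketch of such a functional taxonomy, expressed in Java terms (the interface and class names are hypothetical and serve only to illustrate the device-level classification):

/** Functional taxonomy sketch: a dimmer light specializes the generic on/off light device. */
interface LightDevice {
    void on();
    void off();
}

interface DimmerLight extends LightDevice {
    /** Sets the light intensity as a percentage between 0 and 100. */
    void setIntensity(int percent);
}

/** A trivial implementation used only to show how the hierarchy is exercised. */
class SimpleDimmer implements DimmerLight {
    private int intensity = 0;
    public void on() { setIntensity(100); }
    public void off() { setIntensity(0); }
    public void setIntensity(int percent) {
        intensity = Math.max(0, Math.min(100, percent));
        System.out.println("Dimmer intensity set to " + intensity + "%");
    }
}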
At the house logic level, instead, semantics is seen as a composition of rules
(either specified by the householders or automatically learned by observing the
behavior of the house inhabitants) and inferences drawn from general-purpose
knowledge. The pursued vision can be depicted more clearly with an example: let us
imagine that every time darkness approaches, a person closes the house shutters and
switches on the lights in the rooms where he/she is located. During the winter such a
behavior is repeated every day at around 5 p.m. (in Northern Italy), while in the
summer the same happens around 10 p.m., since the daylight lasts longer. A simple
rule-based logic level would probably classify the two situations as different
behaviors and would probably have several difficulties in deciding whether
to close the shutters at 5 or at 10 p.m. A semantically enabled system, instead,
would probably possess some general knowledge about the house environment, indoor
and outdoor. For example, it might have a notion of daylight and, as a consequence,
of dawn and dusk. Therefore, in the same scenario, the semantically enabled domotic
controller would probably infer that every day, at dusk, the shutters shall be closed
and the lights activated in the rooms where householders are present. In such a case
the seasons have no influence on the automated process, since the time at which dusk
occurs changes with them and since the action is triggered by the dusk state rather
than by the hour of the day.
Deciding “a priori” which kind of knowledge shall be encoded in the house, and to what extent learning shall be enabled, is a rather difficult task. However, the main goal of the ongoing research effort is not to provide all-encompassing solutions, or “human-level” intelligence for house automation systems. Instead, the main goal is to understand which advantages and functionalities can be obtained by applying available, off-the-shelf rule-based and semantic technologies to home automation systems, and to what extent such technologies can help ease the everyday life of elderly or disabled people.
The system presented here is only a first attempt at defining a house control architecture able to provide normal automation services as well as rule-based behaviors. Semantics is still implicit in the taxonomic organization of controlled devices (functional level), while general-purpose knowledge and the related inferences still have to be introduced. However, the design effort aims at providing a platform architecture able to support the seamless integration of these functionalities.

9.1 Architecture
This section presents an overview of the Domotic House Gateway (DHG) developed by the author and his colleagues, followed by a more detailed description of the components involved. The proposed system has been designed to support intelligent information exchange among heterogeneous domotic systems which would not otherwise natively cooperate in the same household environment.
The concept of event is used to exchange messages between a device and the DHG. As will be seen, these low-level events are converted into logical events within the DHG, so as to clearly separate the actual physical issues from the semantics that lies behind the devices and their role in the house (functional semantics). In this way, it is also possible to abstract a coherent and human-comprehensible view of the household environment.
The general architecture of the DHG is shown in Figure 9.1 and can be divided into three main layers: the topmost layer involves all the devices that can be connected to the DHG and their software drivers, which adapt the device technologies to the gateway requirements. The central layer is mainly responsible for routing low-level event messages between the various devices and the DHG, and also generates special system events to guarantee platform stability.

Figure 9.1. Domotic House Gateway Architecture

The last layer is the actual core of intelligence of the system. It is devoted to event management at the logical (or semantic) level and, to this purpose, it includes a rule-based engine which can be dynamically adjusted either by the system itself or manually through external interfaces. The rules define the expected reactions to incoming events, which can be generated either by the house (for example, “the door is opening”) or by an application (“open the door”).
Complex scenarios may involve the rule engine: for instance, a rule might generate a “switch on the light in room x” event if two events occur: room x is dark and a sensor has detected the presence of somebody in that room. Additionally, an automatic rule-learning algorithm is under study: the logged events are processed to infer common event patterns that are very likely to be repeated in the future. New rules can first be proposed to the inhabitants and then possibly accepted and added to the existing rule set. Dedicated interfaces make it possible to check, modify, add or delete each of the rules, and different rule sets may also be used.
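To make this mechanism more concrete, the following is a minimal, hypothetical sketch in Java (the language the DHG is implemented in) of a condition-action rule over logical events; the class and interface names (LogicalEvent, Rule, LightOnPresenceRule) are illustrative only and do not correspond to the actual DHG code.

// Minimal, hypothetical sketch of a condition-action rule over logical events.
// Class names (LogicalEvent, Rule, LightOnPresenceRule) are illustrative only.
import java.util.List;

class LogicalEvent {
    final String room;    // e.g. "kitchen"
    final String device;  // e.g. "light"
    final String action;  // e.g. "on", "off", "dark", "presence"

    LogicalEvent(String room, String device, String action) {
        this.room = room;
        this.device = device;
        this.action = action;
    }
}

interface Rule {
    // Returns the events to emit if the rule fires on the recent event history.
    List<LogicalEvent> evaluate(List<LogicalEvent> recentEvents);
}

// "Switch on the light in room x" when the room is dark and presence is detected.
class LightOnPresenceRule implements Rule {
    private final String room;

    LightOnPresenceRule(String room) { this.room = room; }

    public List<LogicalEvent> evaluate(List<LogicalEvent> recentEvents) {
        boolean dark = false, presence = false;
        for (LogicalEvent e : recentEvents) {
            if (!e.room.equals(room)) continue;
            if (e.action.equals("dark")) dark = true;
            if (e.action.equals("presence")) presence = true;
        }
        return (dark && presence)
                ? List.of(new LogicalEvent(room, "light", "on"))
                : List.of();
    }
}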
In the next sub-sections, a detailed description is presented for each block of
Figure 9.1.


9.1.1 Application Interfaces, Hardware and Appliances


In a domotic house, a person interacts with many devices that may be connected to the house gateway, from a simple light switch to a set-top box, from a mobile phone to a computer. The type of connection as well as the configuration of each device may vary and often depends on its use: low-cost wired domotic buses are well suited for controlling simple devices and smart appliances; wireless connections facilitate continuous intercommunication between different locations but may suffer from interference and are not always reliable; Ethernet connections offer a high bandwidth which can be used to transmit video streams and the like.
Therefore, in order to uniformly control all the devices, it is necessary to offer a common point of aggregation, the DHG. Domotic systems usually include a control device which allows managing all the devices connected to the domotic bus; in this case, this single device can be connected to the DHG. Other recent smart appliances and devices (digital TVs, handhelds, etc.) can be accessed by modern domotic installations, but are often also accessible via computer-oriented connections, as they generally embed simple computer systems.
In general, any device may communicate with the DHG using its preferred pro-
tocol, as the information exchange is handled by a specific driver for each type
of device. So, a web application running on a common PC could use the SOAP
protocol; a simple terminal might use raw sockets, and so on.
The configuration of each device is based on a simple and generic structured
(taxonomic) model of the house (which could actually be extended to more general
environments), as depicted in Figure 9.2.

Figure 9.2. Sample device instances in a House Model.

Basically, at the low level each device is associated with the appropriate driver, while at the high (or logical) level the device is assigned a unique identifier and type, as well as a logical placement in a location of the house (e.g., the living room). For instance, a light might be registered as the 11th device within the driver that controls its domotic system (and which is presumably connected to its main control device), while its type is a light, supporting on and off events, and its location is the kitchen (Figure 9.3). Of course, special devices such as application interfaces running on mobile devices can be placed in “virtual” rooms.

<house>
  <room name="kitchen">
    <device name="light" devID="11"
            devType="Light" driver="BTicino" />
  </room>
  <room name="living room">
    <device name="set top box"
            devID="192.168.1.33"
            devType="STB" driver="STB" />
  </room>
</house>

Figure 9.3. Example of device configuration

Configurations are provided by XML files, possibly specifying predefined sets of rules that can enhance the use of the devices. The device drivers are dynamically
plugged into the DHG in order to support the most diverse house configurations,
especially if only temporary (e.g., in case of guest mobile devices) or changing from
time to time (e.g., a new computer has been added).

9.1.2 Device Drivers


The device drivers in the DHG are responsible for translating low-level or hardware states and activities of the devices (a light switch, a door, a software application, etc.) into events. As mentioned above, each device may need to use a specific protocol to communicate with the DHG. Therefore, it is necessary to develop specific drivers for each type of device. To this purpose, some simple guidelines are provided for the development and integration of new drivers (a minimal sketch of a possible driver contract is given after the list): when plugged into the DHG, each new driver must:
• register itself with a unique identifier;
• for each supported device:
– if the device type (class) does not exist, register it as a subclass of an
existing type (e.g., root);
– register the new device with the associated type;
– correctly handle (receive and possibly send) events for each device ac-
cording to its type (and “super-types”).
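As anticipated above, the following minimal Java sketch shows what a driver contract mirroring these guidelines could look like; the interface and method names are hypothetical and are not taken from the actual DHG implementation.

// Hypothetical driver contract mirroring the guidelines above (not the real DHG API).
interface DriverRegistry {
    void registerDriver(String driverId);                        // unique driver identifier
    boolean hasDeviceType(String typeName);
    void registerDeviceType(String typeName, String superType);  // extend an existing type (e.g. "root")
    void registerDevice(String driverId, String deviceId, String typeName);
}

interface DeviceDriver {
    // Called when the driver is plugged into the DHG: register itself and its devices.
    void plugIn(DriverRegistry registry);

    // Low-level event routed by the Communication Layer to one of this driver's devices,
    // to be handled according to the device type (and its super-types).
    void handleEvent(String deviceId, String event);
}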


It should be noted that the registration of a new device type should imply the extension of an existing type (class or concept), whose events are inherited, and may also involve the registration of new events (Figure 9.4).
A number of predefined device types are initially provided, as well as a list of supported events for each device type. This information will become part of the house knowledge base (or House Model), and should facilitate the design of drivers and devices, especially in terms of interoperability.

Figure 9.4. Sample device types and supported events.

9.1.3 Communication Layer


The main task of the Communication Layer consists of routing low-level events to
the correct destination: events coming from a device are sent to the Event Handler,
whereas events from the EH must be sent to the correct device driver for further
processing (e.g., to switch on the correct light). The association between devices
and drivers is created whenever a new device is instantiated by its corresponding
driver. Additionally, a special instance property maintains the “address” (or ID)
which identifies the device within the scope of the driver. For instance, a driver
controlling a domotic system may use a unique number for each of the devices
which are under its control, whereas a driver that interfaces computer applications
may use IP addresses or URLs.
The Communication Layer is also responsible for the management of some driver-related issues, such as loading configurations or handling possible driver errors. For
instance, it may generate system events which are sent to the Event Handler for
further logging or error recovery, thanks to special rules.


9.1.4 Event Handler


The EH translates low-level events into logical events according to the house model, and vice versa. The main task of the EH is the conversion of the device driver addressing (the instance ID) into a high-level name, which relates the device to its function or role in the house. In this way, it is possible to hide specific device-addressing details from the Domotic Intelligence System (logic and/or rules), which allows the DHG to autonomously and automatically perform actions on the connected devices.
The uniqueness of a device instance is guaranteed by its logical name, which includes the location (even if fictitious) and a unique name within that location, consistently with the house model, e.g.: {driver “BTicino”, devID “11”, event “...”} ⇔ {room “kitchen”, devName “light”, event “...”}.
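A minimal sketch of this bidirectional translation, assuming simple string-based keys for the sake of illustration (the real DHG may store this mapping differently), could look as follows in Java.

// Minimal sketch of the low-level/logical address translation (illustrative names only).
import java.util.HashMap;
import java.util.Map;

class EventHandlerMap {
    // driver-scoped address, e.g. "BTicino:11"  <->  logical name, e.g. "kitchen/light"
    private final Map<String, String> toLogical = new HashMap<>();
    private final Map<String, String> toPhysical = new HashMap<>();

    // Populated from the XML house model when a driver registers a device.
    void bind(String driver, String devId, String room, String devName) {
        String physical = driver + ":" + devId;
        String logical = room + "/" + devName;
        toLogical.put(physical, logical);
        toPhysical.put(logical, physical);
    }

    String logicalOf(String driver, String devId) {
        return toLogical.get(driver + ":" + devId);   // e.g. "BTicino", "11" -> "kitchen/light"
    }

    String physicalOf(String room, String devName) {
        return toPhysical.get(room + "/" + devName);  // e.g. "kitchen", "light" -> "BTicino:11"
    }
}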

9.1.5 House Model


As already mentioned, the house model represents, with a structured and logical
scheme, the house devices and the events they support. The house environment is
subdivided into a collection of rooms and, for each room, the corresponding devices
are specified as instances of a supported device type. This representation facilitates
the design and configuration of the various devices, as well as their utilization, even
through the definition of scenarios, which coordinate the control of multiple devices
through a single action (e.g., pressing a button).
Besides the house configuration, the device types are also structured in a taxonomy in order to explicitly correlate devices as specific subclasses of simpler devices: for example, a light dimmer is a specific case of light. Each device type is linked to the events it supports, and these links are automatically inherited by the descendant device subtypes (or subclasses), to guarantee that specific devices can
always be controlled as a simpler ancestor. So, for instance, it should always be
possible to use a dimmer as a simple light bulb, supporting “switch on” and “switch
off” events (Figure 9.4).
The house model is (re)populated whenever a driver is plugged into the DHG,
and new device types may be registered as well as new events that they may support.
A minimal set of device types and of events is provided through a built-in fictitious
device driver, which may be used for testing.
The house model is also used to map the existing devices, room by room, to the correct driver and to a registered device type and events, according to simple XML configuration files. In addition, it may also be exploited by the Domotic Intelligence System to improve its intelligence and dependability, though at the cost
of increased complexity.


9.1.6 Domotic Intelligence System

The DIS makes it possible to generate new (logical) events at run-time, based either on events coming from the house through the Event Handler or on predefined or inferred “rules” that may, for instance, act at a specified time. The current implementation of the DIS is based on a run-time engine that processes rules according to the events received from the Event Handler.
When certain conditions are met, new events may be generated and sent back
to the Event Handler that routes them to the correct devices through the Commu-
nication Layer. The rules can be preloaded or added either manually via external
interfaces, such as a simple console, or through the Rule Miner, which examines
the event log (see the Event Logger) to infer new rules. At present, some rule-control mechanisms are being studied to prevent annoying or dangerous situations.
At the very least, it is possible to save the rules and the status of the DIS at any
time for future reloading.

9.1.7 Event Logger

The event logger receives events from the Event Handler and saves them in a file in
order to facilitate the identification of possible erratic behaviors. Both logical events
and system events can be logged, and through external interfaces it is possible to
specify filters in order to limit the amount of stored data. The Event Logger is
also used by the Rule Miner to (semi-automatically) generate rules for the Domotic
Intelligence System.

9.1.8 Rule Miner

The Rule Miner tries to infer new rules for the Domotic Intelligence System by
reading and analyzing the event log. The idea is to identify common event patterns
so as to forecast and automatically generate events according to such patterns.
However, this is still work in progress, as it is also necessary to take into account dangerous situations as well as possible conflicting actions, which should actually prevent the execution of certain inferred rules. To this purpose the Rule Miner might exploit the House Model and other sources of information as a knowledge base to achieve a more consistent understanding of what the events represent, and therefore of what is happening in the house.
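Purely as an illustration (the actual Rule Miner algorithm is still being defined), a very simple form of pattern mining could count how often one logical event follows another within a short time window; frequent pairs then become candidate “when A then B” rules to be proposed to the inhabitants. The sketch below assumes a time-ordered log and hypothetical types.

// Illustrative sketch only: count how often event B follows event A within a time
// window in the (time-ordered) log; frequent pairs are candidate rules.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

record LoggedEvent(long timestampMillis, String logicalEvent) {}

class PairCounter {
    static Map<String, Integer> countPairs(List<LoggedEvent> log, long windowMillis) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < log.size(); i++) {
            for (int j = i + 1; j < log.size(); j++) {
                long gap = log.get(j).timestampMillis() - log.get(i).timestampMillis();
                if (gap > windowMillis) break;  // later events are outside the window
                String pair = log.get(i).logicalEvent() + " -> " + log.get(j).logicalEvent();
                counts.merge(pair, 1, Integer::sum);
            }
        }
        return counts;  // e.g. "entrance door open -> hall light on" -> 27
    }
}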


9.1.9 Run-Time Engine


The Run-Time Engine receives events at run-time from the Event Handler, parses
them according to a rule-based system and, if such input events meet certain con-
ditions, it generates new events, which are then sent to the Event Handler and
consequently to the Communication Layer for routing to the devices. The event
processing and generation mechanisms could actually be implemented using any
technique (generally related to Artificial Intelligence). However, according to the
authors, techniques like artificial neural networks or genetic algorithms make it rather difficult to control events precisely and safely. Therefore, as mentioned from the beginning, the current implementation of the RTE is based on a rule system, namely the open source Drools framework [45].
In this way, the status of the engine can be easily initialized by loading a con-
figuration file. Additional configuration files or user interfaces (such as a simple
command-line parser or even interfaces based on Natural Language Processing techniques) make it possible to modify, restore or save the RTE status at run-time in order to adjust or fix improper or erratic settings manually.

9.1.10 User Interfaces


A number of different interfaces (a console, a Natural Language Processing interface, other graphical user interfaces) can be provided to permit the manual configuration of the Run-Time Engine, so as to keep the Domotic Intelligence System fully under the user’s control. At present, a simple command-line interface is available, but an NLP-based interface is under study to facilitate interaction with non-expert users.

9.2 Testing environment


In order to test the actual feasibility of the proposed system, which has been developed in Java mainly for portability reasons, a number of different devices and interfaces have been used. They are briefly presented in the following paragraphs and explained in more detail in subsequent sections, followed by a description of the actual experimental setup. The most relevant element is the domotic house near the authors’ laboratories, which is equipped with a home automation system produced by BTicino, a leading Italian manufacturer of household electrical equipment. The house is part of a scientific and technological park maintained by C.E.T.A.D. [46] and dedicated to the promotion, development and diffusion of technologies and innovative services for the rehabilitation and social integration of elderly and disabled people.
To complete the picture of the tested devices, two additional elements are to be cited. The first one is a simple parallel port connected to eight small LEDs and driven by a common PC running Linux. The second and last one is the MServ open source program [47], again running under Linux, capable of remotely choosing and playing music files.

9.2.1 BTicino MyHome System


The MyHome system [48] developed by BTicino is a home automation system able to provide several functionalities, meeting the increasing demand of users for smart and intelligent houses. These functionalities cover several aspects of domotics such as comfort configurations, security issues, energy saving, and remote communication and control. The common framework in which every available device is deployed is based on a proprietary bus called “digital bus” that conveys messages among the connected devices and provides them with the required electrical power.
The most salient characteristic of the BTicino system is what they call the control sub-system, i.e., the ability to supervise and manage a home by using a PC, a standard phone or a mobile phone. The control in the BTicino system can be either local, through a LAN connection as experimented in this work, or remote, through an Internet connection or a telephone network. Through a proprietary protocol, it makes it possible to manage all the devices of the house, e.g. lights, doors, shutters, phones, etc. This component made it possible to interface the BTicino system to the DHG by simply exploiting its features, instead of connecting each device to the DHG. Therefore, a single driver has been prepared to handle the communication between the control and the DHG.
Additionally, a number of specific modules have been provided to manage basic devices such as lights, doors, shutters, and the like. It should also be noted that this driver must poll the control to check the status of the domotic devices and convert the returned information into events for the DHG, since the BTicino system installed in the testing environment does not support events natively.
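The following Java sketch illustrates such a polling loop; the BticinoControl and DhgEventSink types are hypothetical placeholders, since neither the proprietary protocol nor the real driver internals are reproduced here.

// Sketch of a polling loop turning BTicino status changes into DHG events.
// BticinoControl and DhgEventSink are hypothetical placeholders.
import java.util.HashMap;
import java.util.Map;

interface BticinoControl { Map<String, String> readStatuses(); }  // devID -> current status
interface DhgEventSink   { void post(String devId, String event); }

class PollingDriver implements Runnable {
    private final BticinoControl control;
    private final DhgEventSink sink;
    private final Map<String, String> lastSeen = new HashMap<>();

    PollingDriver(BticinoControl control, DhgEventSink sink) {
        this.control = control;
        this.sink = sink;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            for (Map.Entry<String, String> e : control.readStatuses().entrySet()) {
                String previous = lastSeen.put(e.getKey(), e.getValue());
                if (!e.getValue().equals(previous)) {
                    sink.post(e.getKey(), e.getValue());  // emit an event only on change
                }
            }
            try {
                Thread.sleep(500);  // polling period: an assumption, to be tuned
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }
    }
}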

9.2.2 Parallel port and LEDs


Eight Light Emitting Diodes have been wired to the data lines of a standard IEEE
1284 parallel port connected to a PC running Linux. The driver for the DHG is
based on a simple TCP/IP server that drives the parallel port as a generic parallel
device using Linux-specific calls, so that the DHG can control each of the eight lights. They have been used to test the rule system, as well as to demonstrate the flexibility of the proposed work. Legacy or special-purpose devices are in fact
still based on this simple technology (and the standard RS-232 is another example).


9.2.3 Music Server (MServ)


This open source Linux application is basically a music player that can be controlled
remotely, therefore acting as a server. It exposes a TCP/IP connection to receive
commands such as play, next song, stop, etc., and to provide its status (e.g., which
song is being played, as it is normally chosen at random). MServ is normally accessible through a simple telnet application, but HTTP CGI and several GUI applications are also available for easier interaction with the system.
The DHG driver for MServ is therefore rather simple in this case too. In fact, it only needs to exchange simple strings through a TCP/IP connection and to parse them appropriately (some informative messages may be related to status changes caused by other connected clients, etc.). Events like play and stop are accepted as commands, while relevant status messages are sent to the DHG for logging purposes since, at present, no visual interface has been provided.
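A minimal sketch of the string-based exchange underlying such a driver is shown below; the command strings, host and port are assumptions made for illustration and should be checked against the MServ documentation.

// Minimal sketch of a string-based MServ client (command strings are assumptions).
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

class MservClient implements AutoCloseable {
    private final Socket socket;
    private final PrintWriter out;
    private final BufferedReader in;

    MservClient(String host, int port) throws IOException {
        socket = new Socket(host, port);
        out = new PrintWriter(socket.getOutputStream(), true);
        in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
    }

    void send(String command) { out.println(command); }        // e.g. "play", "stop", "next"

    String readStatusLine() throws IOException { return in.readLine(); }  // forwarded to the DHG log

    public void close() throws IOException { socket.close(); }
}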

9.2.4 Experimental Setup


The DHG has been installed on a common PC with an AMD 1800+ processor and 512 MB of RAM, running a Java Virtual Machine (JVM); the DHG is expected to work fine on considerably less powerful PCs. Two other similar PCs have been used for the parallel port and the MServ application, respectively. A simple Ethernet switch served as the physical connection for the three computers and the BTicino control server, which also exposes a standard RJ45 connector and supports TCP/IP connections.
Once all the systems were ready and functional, the three drivers mentioned in
the previous sections have been plugged into the DHG. The house model has been
provided as a simple XML configuration file, specifying the actual device instances
viewable through each of the drivers. So, for instance, each of the LEDs has been identified by a number from 0 to 7, as this is sufficient for its driver, while their type is light, since all of them can only be switched on or off.
Some basic rules have also been added to the DHG, mainly to test the rule system and to understand whether any unexpected issues may arise in practice. So, for example, whenever the entrance door is opened, a rule is activated and generates an event that makes LED 4 switch on. Conversely, when the door is closed, another rule sends a “switch off” event to the same LED.

9.3 Preliminary Results


Two aspects have been considered: the device drivers and the rule engine. In the first case, all the drivers have been created with rather little effort (a few man-hours for each of them, including debugging), and the interaction with the DHG proved to be very stable, as no crashes have ever been registered.


The simple rules inserted into the system were also executed as expected. So, for instance, the light was automatically switched on after all the windows in the same room were shuttered, the shutters being controllable domotic devices. However, particular attention is necessary when inserting rules into the system. In fact, referring to the previous example, if the shutters remain closed one must still be able to switch off the light without causing the system to switch it on again shortly thereafter. These types of issues, as well as cases involving more critical devices such as an oven that may contain improper items, are to be considered with great care, especially when designing an automatic way to infer new rules.

Chapter 10

Related Works

This chapter presents related work in the fields of both the Semantic Web and Web Intelligence, with a particular focus on semantic platforms and semantics integration on the Web.

In May 2001, Tim Berners-Lee published the Semantic Web manifesto in Scientific American; in that article he proposed a new vision of the web: “The Semantic Web (SW) is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” In his view the next generation of the web will be strongly based on semantics in order to allow effective communication between humans and machines, leading to a powerful collaboration between them in accomplishing tasks.
As he said, the Semantic Web will bring structure to the meaningful content of
Web pages, creating an environment where software agents roaming from page to
page can readily carry out sophisticated tasks for users. Such an agent coming to a
clinic’s Web page will know not just that the page has keywords such as “treatment,
medicine, physical, therapy” (as might be encoded today) but also that Dr. Hartman
works at that clinic on Mondays, Wednesdays and Fridays [7].
The ideas formalized by Berners-Lee came out after years of research on Artificial Intelligence and from the relatively recent research on the Web, and gathered many researchers from all over the world promoting further explorations toward the next generation of the web. During the past five years the Semantic Web community has been one of the most active research communities in the world, producing many diverse technologies and applications trying to turn the SW vision into reality. Many milestones have been reached in this endless and exciting process; in particular, a sufficiently large agreement on semantics integration on the web has emerged.
Actually there is no unique recipe for inserting “meaning descriptors” into the existing web, but it is quite clear what the main requirements are for the development of scalable and useful semantic applications. Researchers found that, for an effective inclusion of semantics on the current web, the meaning information should be definable by people or machines potentially different from the content creators, and the commonly agreed way to fulfill such a requirement is the definition of entities called “semantic annotations” pointing at the described resources. Consequently, several works proposed techniques to provide semantic information through independent annotations, offering services for annotation editing, storage and retrieval [49].
As those systems reached a significant diffusion in the academic world, some problems were noticed; in particular, it became clear that the task of manually annotating the whole existing web was not feasible. The subsequent evolution in research therefore involved the design of automatic annotation platforms.

10.1 Automatic annotation


A number of annotation tools for producing semantic tags are currently available.
Protégé 2000 [50] is a tool supporting the creation of ontologies either in RDF/RDF-
S [4] or OWL [5]. Annotea [51] provides RDF-based markup of web pages but it
does not support automatic information extraction and it is not well integrated
in semantic-powered publication frameworks. OntoAnnotate [52] is a framework
allowing manual and semi-automatic annotation of web pages. AeroDAML [53] is a tool that, starting from an ontology and a given web page, produces a semantically tagged page that should be validated by humans.
Several projects aiming at providing automatic annotation tools have also been developed, trying to overcome the heavy burden of manually annotating the whole web, which is clearly infeasible. A. Dingli et al. [33] proposed a methodology for learning to automatically annotate domain-specific information from large repositories, requiring minimal user intervention. Dill, Eiron et al. [54] proposed a combination of two tools, respectively named “SemTag” and “Seeker”, for enabling the initial Semantic Web bootstrap by means of automatic semantic annotation. They applied their platform to a very large corpus composed of approximately 264 million web pages, generating 434 million items of corresponding semantic metadata.
There are also some holistic approaches that try to provide annotation services
in the context of comprehensive platforms designed to support all tasks required to
provide semantically retrievable information.
The KIM platform [55], as an example, is a semantically enhanced information
extraction system which provides automatic semantic annotation with references to
classes in one ontology and to instances. The system has been tested on about 0.5
million news articles and proved to be stable and effective enough. Using the extracted annotations, it also offers semantics-based indexing and retrieval, where users can mix traditional IR (information retrieval) queries and ontology-based ones. Similarly, the Mondeca [56] system provides semantic indexing, annotation and search facilities, together with a semantics-powered publication system.


While the former approaches are only partial attempts at solving the inclusion of semantic information on the web, the latter are more general, since they try to take into account all aspects of semantic information retrieval. However, they differ from the approach proposed in this thesis as they are standalone, monolithic solutions in which the user or the developer is forced to adopt the “producer” tools and approaches for the publication tasks, without the ability to seamlessly integrate the provided functionalities into his existing applications. In a sense, while the former approaches differ from the one proposed in this thesis because they address only specific sub-areas of the semantic information retrieval process, the latter differ from the philosophical point of view, adopting an “application-level” approach rather than a “middleware” approach.

10.2 Multilingual issues


The relationships between ontology development and multilingual support can be
seen from two different points of view: using an ontology to ease cross-lingual cor-
respondences [57], or developing ontologies that are usable with different languages
[58]. The approaches for tackling this problem are various, and Oard [59] classified
them as Controlled Vocabulary or Free Text approaches. Clearly, Internet-scale ap-
plications require Free Text solutions, where semantics can be inferred by textual
and statistical techniques (Corpus-based, in Oard’s terminology) or by semantic ones
(Knowledge-based). Among the semantic approaches, Agnesund [60] distinguishes between purely conceptual ontologies, language-specific ones, and the so-called interlingua ontologies, which should be able to represent all the distinctions that can be made in all (or a subset of) languages. Two well-known efforts to extend the WordNet lexical network [13] to support multilinguality are EuroWordNet [58] and MultiWordNet [61]. These approaches tend to develop language-specific sub-networks, although integrated with a common conceptual top-level hierarchy, enriched by “synonymous” relationships. EuroWordNet [62] explicitly models concepts in different languages, and then builds an “Interlingual Index” composed of
disambiguated subsets of the language-specific concepts that are common to all mod-
eled languages. MultiWordNet explicitly recognizes the presence of “lexical gaps” in
the correspondence between different languages, due to missing direct translations of
some words. These approaches in some sense inherit the general WordNet structure,
and cannot be truly interlingual.
The approach proposed here is simpler: it aims at developing a conceptual ontology and at linking it with lexical representations in different languages. Instead
of trying to model the meaning of words in all languages, shared concepts are defined,
and with them some words (or sentences) to express their meaning.


10.2.1 Term-related issues

The words associated with each ontology concept, for each supported language, play a crucial role in the process of automatic or semi-automatic annotation of web resources. Simple “bag of words” techniques can use these words as a means to automatically annotate a given document: the occurrence of a term in a document, in fact, indicates that the document is probably related to the associated concept. Other types of mapping rely on Natural Language Processing (NLP) [63] [53] and adaptive information extraction [64]. For all of them, a preliminary learning and training phase is needed; this phase depends on the scenario in which the mapping is applied, on the ontology and on the corpus of documents to be classified (i.e. on the significant terms).
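Purely as an illustrative sketch (not the algorithm of any specific system discussed here), a bag-of-words mapping can be reduced to counting, in a document, the occurrences of the terms associated with a concept and annotating the document with the concepts whose score exceeds a threshold:

// Illustrative bag-of-words scoring: not the algorithm of any specific system.
import java.util.Set;

class BagOfWordsAnnotator {
    // Count occurrences, in the document, of the (lower-case) terms associated with a concept.
    static int score(String documentText, Set<String> conceptTerms) {
        int occurrences = 0;
        for (String token : documentText.toLowerCase().split("\\W+")) {
            if (conceptTerms.contains(token)) occurrences++;
        }
        return occurrences;
    }

    // Tentatively annotate the document with the concept if the score is high enough;
    // sense disambiguation (see below) is still needed afterwards.
    static boolean annotate(String documentText, Set<String> conceptTerms, int threshold) {
        return score(documentText, conceptTerms) >= threshold;
    }
}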

Sense disambiguation becomes compulsory in this scenario, as terms can assume different meanings, even in the same document, depending on the context in which they are used. The focus-based approach to synset expansion introduced in Chapter 5 takes this issue into account, being inspired by the work of Bouquet et al. [15] on this theme, and it is only one among many possible approaches. Sense disambiguation, in fact, is strictly related to the evaluation of semantic similarity.

Evaluating semantic similarity corresponds, in the most straightforward solutions, to measuring the distance (in terms of “isA” links) between the nodes corresponding to the concepts being compared. A widely acknowledged problem with this approach, however, is that it relies on the notion that links in the taxonomy represent uniform distances. In real taxonomies this is often false, affecting the similarity evaluation and making results unreliable. Other methods have therefore been proposed to overcome this issue: Resnik [65], as an example, proposed to exploit the information content of the taxonomy and, on the basis of probabilistic evaluations, defined the similarity between two concepts. The limiting factor of his approach is the implicit evaluation of the statistical distribution of the taxonomy concepts, which cannot be calculated easily. Another interesting approach builds on the so-called Conceptual Density (Agirre and Rigau [66]): given a word and a context (represented by other words), the Conceptual Density evaluates the closeness, in the ontology, of the context terms to the given word. Unfortunately, this measure too is based on the evaluation of the distance between words in the taxonomy and, while in the cited solutions it is calculated in a more sophisticated way to avoid some of the problems above, it still lacks precision. Moreover, since every word of the context may participate in different senses and nodes of the ontology, the complexity of the algorithm is high and the possibility of mis-recognizing the correct sense of the word is not negligible.
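For reference, Resnik’s information-content measure mentioned above is commonly formulated as follows (a standard formulation, not reproduced from this thesis): given p(c), the probability of encountering an instance of concept c in a reference corpus, the information content is IC(c) = −log p(c), and the similarity of two concepts is the information content of their most informative common subsumer, i.e. sim(c1, c2) = max over c in S(c1, c2) of IC(c), where S(c1, c2) is the set of concepts subsuming both c1 and c2 in the taxonomy.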


10.3 Semantic search and retrieval


The currently available literature contains several approaches related to the design of semantic search engines. One of the most recent is the work of Rocha, Schwabe et al. [67], who propose a new approach for semantic search combining traditional engines and spreading-activation techniques. They stress the importance of taking into account some sub-symbolic information for improving the search results provided by means of ontology navigation. Therefore they associate a weight with each ontology relation in order to spread throughout the whole ontology the “activation” of the single ontology nodes identified by the user query. The main difference between this solution and the one presented in the previous chapters is the adoption of the ontology-instance framework as the operating environment.
Another semantic search engine has been presented in [68]. It relies on a Semantic Web
infrastructure and aims at improving traditional web searches by integrating relevant
results with data extracted from distributed sources. The user query is also mapped
onto a set of ontology concepts and ontology navigation is performed. However,
such a process only provides other concept instances that are strongly related to the
query ones, by means of a breadth-first search. The issue of performing semantic
inference by means of graph navigation is not addressed and, moreover, the system
does not work on annotations.
The paper by Stojanovic, Studer et al. [69] proposes a new paradigm for ranking query results in the Semantic Web. As they state, traditional IR approaches evaluate the relevance of query results by analyzing the underlying information repository. On the other hand, since semantic search is supported by ontologies,
other relevant resources could be considered for assessing the relevance of results:
the structure of the underlying domain and the characteristics of the search process.
QuizRDF [70] is an interesting system proposed by Davies et al., from BTexact Technologies, which combines traditional keyword querying on web resources with the ability to query against a set of RDF annotations about those resources. Resources, as well as RDF annotations, are indexed by the system, providing means for keyword queries over both. The resulting index thus allows queries against both the full text of documents and the literal values that occur within RDF annotations, along with the ability to browse and query the underlying ontology. As stated by the authors, the approach they propose is “low threshold, high ceiling”, in the sense that where RDF annotations exist they are exploited for improving the information-seeking process, but where they do not exist a simple search capability is still available. Although the approach is powerful, particularly in its ability to combine traditional searches with ontology-based searches, neither inference nor navigation is supported, thus failing to capture the semantic relationships between ontology concepts in the search task.
A motivating work for semantics integration in the search process is represented by the early work on semantic search engines by Heflin and Hendler [71]. In their approach, the user is allowed to perform queries by example, using data from ontologies. Basically, the user is presented with a set of ontologies about different domains. The user selects the ontology related to the domain in which the search is to be performed, and a tree navigation interface is then provided in order to select concepts and instances similar to the desired ones. The user can therefore select resources of interest, and a search template is automatically built using that selection. Finally, the search is issued to the query subsystem, which provides relevant results.

10.4 Ontology visualization


The existing techniques for the visualization of ontologies can be summarized in
four main visual schemes, possibly cooperating in more complex scenarios: network,
tree, neighborhood, and hyperbolic. The network view represents an ontology as a
generic network of connected elements and is usually exploited when the knowledge
elements cannot be conveniently organized in hierarchies. The tree (or hierarchical)
view, instead, is generally used for more structured ontologies. However, the simple
hierarchical representation provided by this view is unable to represent connections
between distinct sub-trees that violate the dominant taxonomic structure. In such
a case, the connections violating the hierarchy are indicated in separate views, thus complicating the navigation of the structure. The most common examples of tree views are based on indentation, as in file system browsers, or on diagrams with nodes and arcs. A tree-map view has also been proposed by Schneiderman [72], at the University of Maryland, which uses nested rectangles to represent sub-classes (Figure 10.1, C).

Figure 10.1. TreeViews: indented (A), nodes and arcs (B), TreeMap (C).

The main advantage of tree views is that they can be displayed with rather little effort in comparison with network-oriented views. More importantly, entire sub-trees can be easily collapsed (i.e., temporarily hidden) to concentrate the attention on the rest of the knowledge base. The next two schemes apply similar principles to network-based structures: in fact, both the neighborhood and the hyperbolic views (Figure 10.2) focus the attention on a chosen node and its nearest neighbors. In the former case only the semantically nearest nodes are displayed, whereas in the latter case the nodes are laid out on a semi-spherical surface, projected onto the visual window, therefore magnifying the central nodes while shrinking the peripheral ones.

Figure 10.2. Neighborhood View (A), Hyperbolic View (B).

The aforementioned representation schemes have been utilized in numerous applications with assorted enhancements.
Protégé [37] is an open source ontology editor providing support for knowledge
acquisition. Its framework natively allows the interactive creation and visualization
of classes in a hierarchical view. Each concept in the tree can be displayed along
with additional information about the related classes, properties, descriptions, etc.,
which can all be quickly edited. Other panels manage class instances, alternative
user interfaces, queries, and possibly other extensions which can be easily added to
the framework as plug-ins. In particular, various plug-ins are available for enhancing the visualization of the ontology and are therefore discussed here.
The OntoViz [41] plug-in displays a Protégé ontology as a graph by exploiting an
open source library optimized for graph visualization (GraphViz [39]). Intuitively,
classes and instances are represented as nodes, while relations are visualized as
oriented arcs. Both nodes and arcs are labeled and laid out in a way that minimizes overlapping, but not the size of the graph. Therefore, the navigation of the graph,
enhanced only by magnification and panning tools, does not provide a good overall
view of the ontology, as the graphical elements easily become indistinguishable.
This problem is less critical in Jambalaya [40, 73], another ontology viewer for Protégé, based on a tree-map scheme, or rather on nested interchangeable views, namely Simple Hierarchical Multi-Perspective (SHriMP). SHriMP is a domain-independent visualization technique designed to enhance how people browse and explore complex information spaces. An animated view of the ontology graph facilitates navigation and browsing at different levels of abstraction and detail, both for classes and relations, while keeping the learning curve low through well-known zooming and hypertext link paradigms. However, text labels and symbols tend to overlap when the ontology grows in complexity, and it becomes difficult to understand the relations among classes or instances.
TGViz [74], similarly to OntoViz, visualizes Protégé ontologies as graphs. In this
case, however, the layout of nodes and arcs is computed using the spring layout
algorithm implemented in the Java TouchGraph library [75]. The main advantage
of this approach is the optimized exploitation of the bi-dimensional space in which
the nodes and arcs are dynamically distributed. However, the level of detail is not
adjusted according to the level of zoom, often resulting in overcrowded pictures.
The ezOWL [76] plug-in, unlike the previous viewers, enhances Protégé with graph-based editing of ontologies, though it reduces the optimizations for the graph organization to a minimum. Even in this case it may be difficult to main-
tain both a good understanding of the overall ontology and a sufficient level of detail
about a chosen sub-graph.
OntoEdit [77] is a commercial Java-based tool that, similarly to Protégé, offers a
graphical environment for the management and development of ontologies, and can
be enhanced with various plug-ins. In particular, the Visualizer plug-in proposes
a bi-dimensional graph-based view of the ontology using colored icons as nodes, accompanied by contextual cues such as colored borders or spots in addition to the usual labels, which unfortunately are often hidden or overlapping.
IsaViz [78] is another graph-based visual editor for RDF models based on the
GraphViz library. In this case, the principal enhancement to the previously men-
tioned approaches based on graphs is the Radar View, which, similarly to Jambalaya,
displays a simplified network overview of the overall ontology in a small window,
highlighting the currently edited region in a rectangle. In addition, icons and colors
are also exploited to concentrate information, while different visualization styles and
layouts are supported through the GSS (Graph Style Sheet) language, derived from the well-known CSS (Cascading Style Sheets) and SVG (Scalable Vector Graphics) W3C recommendations. However, it is still not possible to customize the level of detail for big ontologies.
OntoRama [79] is an ontology browser for RDF models based on a hyperbolic
layout of nodes and arcs. As the nodes in the center are given more space than those near the circumference, they are visualized with a higher level of detail,
while maintaining a reasonable overview of the peripheral nodes. In addition to this
pseudo-3D space, OntoRama also introduces the idea of cloned nodes. Since the browser supports generic ontologies, with properties for classes, multiple relations, sub-classing, and multiple inheritance, certain nodes and their sub-trees are cloned and visualized multiple times so that the number of crossing arcs can be reduced and readability enhanced. The duplicate nodes are displayed using an ad-hoc color in order to avoid confusion. Unfortunately, this application does not support editing and can only manage RDF data. Finally, the approaches and functionalities of each of the mentioned applications are summarized in Table 10.1.

                                  View Scheme
Viewer        Editor   Network   Hierarch.   Neighb.   Hyperb.
Protégé         ✓         ×          ✓          ×         ×
OntoViz         ×         ✓          ×          ×         ×
Jambalaya       ×         ✓          ✓          ✓         ×
TGViz           ×         ✓          ×          ✓         ×
ezOwl           ✓         ✓          ×          ×         ×
OntoEdit        ✓         ×          ✓          ×         ×
Visualizer      ✓         ✓          ×          ✓         ×
IsaViz          ✓         ✓          ×          ×         ×
OntoRama        ×         ✓          ✓          ✓         ✓
Table 10.1. Summary of ontology visualization tools.

10.5 Domotics
In the context of domotics, solutions to device connectivity are driven by local com-
mercial leaders. North America, Europe and Japan are three main areas oriented toward different wiring technologies and protocols, and the picture gets even more confused when looking at the national level. However, three main approaches can be identified: the installation of a specific and separate wired network is a common approach and is either based on proprietary solutions (EIB/KNX [80], LonWorks [81], etc.) or on more widespread technologies (e.g., Ethernet, Firewire), the latter being preferred for computer-related networks and wide-bandwidth requirements. On top of these,
various protocols are implemented (X10 [82], CEBus [83], BatiBus [84], HBS, just
to name a few) and none of them has yet prevailed on a global scale. The Konnex
open standard [85], involving EIB, EHS and BatiBus, is one of the major initiatives
in Europe aiming at global interoperability.
Another common approach is based on the reuse of existing wiring, such as power or phone lines (the former being more frequently used as a carrier, as in EHS, X10, Homeplug [86]). However, these solutions generally present higher noise levels than dedicated wiring and are therefore less versatile. Lastly, wireless technologies are also becoming more and more attractive, mostly based either on infrared technology or on radio links (e.g. IEEE 802.11b or Bluetooth [87]), the latter usually being adopted to guarantee connectivity over longer distances.
As each of these alternatives is better suited to a given context, they are still far from converging into a unique solution, partly because of strong commercial influences. Instead, in a home one can frequently find different technologies, such as Ethernet for multimedia connectivity and Bluetooth for personal connectivity, while special buses reliably support home automation tasks.
However, as processing complexity is increasingly coming at lower cost, technologies that are well established for common PCs are being transferred to numerous devices. It is therefore not surprising that simple computer systems are being used to bridge this interconnection gap, especially as far as ambient intelligence is concerned. In this context, most of the efforts are confined to limited domains: each domotic system vendor usually proposes intelligent approaches to the management of the supported devices, and often remote control is also offered through the Internet or with a phone call. However, the intelligence provided to the system is generally basic, unless enhanced using a more complex and versatile device, such as a computer. Indeed, some programming software can be found, but it is generally oriented only to the control of one specific technology and offers little support for ambient intelligence techniques (ETS [88], commercial; EIBcontrol [89], open source).
Furthermore, smart devices or information appliances (like PDAs, set-top boxes, etc.) are not usually produced by domotic system vendors, and have only recently begun facing the intercommunication and standardization issues related to domotic systems [90]. A “neutral” device capable of interfacing all these parts therefore becomes desirable. In this context, the OSGi Alliance [91] has already proposed a rather complete and complex specification for a residential gateway (and the like). Here, protocol and integration issues, as well as intelligence-related behaviors, are delegated to third-party “bundles” that can be dynamically plugged into the OSGi framework at run-time. An additional effort is therefore required to enhance the system with a compact yet general approach to intelligent and automatic interoperability among the involved devices.
Other research efforts in the context of ambient intelligence include the GAS (Gadgetware Architectural Style) project [92], which proposes a general architecture supporting peer-to-peer networking among computationally enabled everyday objects, and the event-based iRoom Operating System (iROS) [93], which is mainly focused on communication among devices in a room and does not take into particular consideration the possibility of a general and proactive interaction of the environment with the user. The one.world project [94] offers a framework for building pervasive applications in changing environments. Other researchers propose a fuzzy and distributed approach to device interaction, exploiting a special rule-based system [95]: special-purpose agents mutually collect, process and exchange information, although at the expense of higher complexity, especially in terms of the adaptation of device operations to the agent system requirements.

Chapter 11

Conclusions and Future works

This chapter concludes the thesis and provides an overview of possible future work.

11.1 Conclusions
The work presented in this thesis should be seen as an engineering exercise rather than a work on the theoretical aspects of the Semantic Web, which are only partially tackled in the previous sections. The motivation for such a “practical” approach is to demonstrate to what extent the currently available technologies and solutions can promote the adoption of semantic functionalities in today’s web applications. According to this vision, the work presented in these pages is a sort of software engineering project that, starting from the analysis of web applications which more or less explicitly deal with knowledge elaboration, exchange and storage, extracts the requirements for a semantic platform readily usable in those applications, with successful results.
These requirements clearly impose several constraints on the extent to which semantics can be applied in this context. Available solutions are to be gathered and harmonized in order to provide common middleware platforms, easily accessible through the adoption of available off-the-shelf technologies such as SOAP and Web Services. H-DOSE, rather than being a complete, sound solution to this exercise, is a first attempt to provide an answer to the demand for exploitable technologies that seems to arise from the actors of the current Web.
What should be rather clear, after reading the presented work, is that the exercise can be effectively solved and semantic functionalities can be introduced on the current web. H-DOSE, in this sense, has the merit of practically providing a sample solution that allows the aforementioned integration in today’s web applications such as content management systems, e-learning systems and search systems. Results, in terms of performance, are sometimes preliminary or still being gathered; however, performance is not the main goal of the platform. The main goal is to practically demonstrate that research in the Semantic Web field can be easily exploited by today’s Web, although in a relatively low-power version. Performance is important too, as demonstrated by the completed and ongoing experiments; however, Web practitioners are rather aware of the difficulties involved in the technology maturation process and seem to pay much more attention to integration issues than to the performance of a preliminary adoption. In the long run, performance will indeed make the difference between successful and failing solutions; at present, however, integration and ready exploitability are the main concerns and the main goals to be reached for spreading Semantic Web technologies on the real Web.
The approach presented in this thesis moves exactly in this direction and does not aim at providing commercial-level solutions. It is still based on the traditional search paradigm used by current keyword-based engines. Information, therefore, is still seen as encoded into documents, and searches still look for relevant documents instead of relevant knowledge. Applying this paradigm greatly limits the potential of semantics adoption, both in terms of user satisfaction and of the results that can be provided. However, it has the advantage of overcoming the barriers that usually prevent the adoption of new technologies in already deployed applications.
One of the main problems of the Semantic Web is, according to the literature, the absence of a killer application. Nevertheless, as demonstrated by the Web itself, killer applications are not the only way for a technology to be adopted. Permeation is another channel through which the same result can be reached. The key to permeation is ease of integration, the ability to readily use new technologies without changing what is already offered to the users. The result is a sort of silent invasion of the Web by semantic technologies. If permeation takes place, the transition between today’s Web and the full Semantic Web will be nearly unnoticeable and will lead to the materialization of the powerful theories now being developed by the research community. H-DOSE is designed with this idea of permeating the currently available solutions in mind, instead of substituting them as in the case of a killer application. The small experiments presented in the previous chapters show that the first factor for achieving permeation, i.e., ease of integration, can be attained with the available semantic technologies; the next step is then to increase the importance of this silent invasion, favoring the final exploitation of the full-powered Semantic Web.
The approach adopted by H-DOSE, although promoting this idea of permeation, also moves along the path traced by another consideration: “for the Semantic Web, partial solutions will work and even if an agent will not be able to reach a human-level of understanding and thus will not be able to come to all conclusions that a human might draw, the agent will still contribute to a Web far superior than the current Web”.
The functionalities provided by the platform are clearly much less sophisticated than what is now being investigated by SW research initiatives. However, it seems rather unlikely that sophisticated semantic search paradigms and solutions will be able to spread across the Web until simpler semantics has been adopted, except for very specific domains in which solutions and agreements on technological issues can be defined, as in the Multi-Agent system community.
Moreover, even when the full power of logic and inference can be applied, these technologies often prove too rigid to deal with real-world applications where uncertainty and contradictions exist. To tackle this issue, several research efforts attempt to define a common framework for logical reasoning in the presence of contradictory and inconsistent information. However, these solutions, although effective as research
elements, still appear too preliminary to be successfully engineered into off-the-shelf
products.
To conclude this section, some philosophical issues should be addressed, as they still stir the sea of Semantic Web researchers and will probably characterize the various phases of SW exploitation. The ontology-based modeling of real-world entities and situations, and of their relations, seems in fact too rigid and too distant from the way humans approach the representation of the same knowledge. A growing group of skeptics, working either inside or outside the SW initiative, is now involved in a deep discussion of the foundational technologies on which the Semantic Web is, or will be, built. The underlying idea is that representing human knowledge, which is intrinsically fuzzy, uncertain and informal, with a rigid, formal model is probably not a good solution, although at present it seems more feasible than the alternatives.
Moving from this critical current inside the SW, the idea of using “Conceptual Spaces” [96] instead of ontologies is, for example, appealing, since the representation introduced by Peter Gardenfors allows real objects and situations to be modeled in a much more natural way. The notion of conceptual space is based on so-called “quality dimensions”. According to Gardenfors' definition, quality dimensions are the mechanisms that allow the “qualities” of objects to be evaluated; they correspond to the ways different stimuli are judged to be similar or different. Introductory examples of quality dimensions include temperature, weight, brightness, pitch and the three ordinary spatial dimensions: height, width and depth.
Dimensions are seen as the building blocks of the conceptual level. Without going into too much detail (see [96] for further explanations), a conceptual space is defined as a class of quality dimensions D1, ..., Dn, and a point in the space is represented as a vector v = (d1, ..., dn), with one component di for each dimension Di. Each dimension is endowed with a certain topological or metrical structure. As an example, the weight dimension is isomorphic with the half-line of non-negative numbers, while the “taste” quality can be represented in a 4-dimensional space whose components are generated by four different types of sensors: saline, sour, sweet and bitter.
A natural concept in a conceptual space is then defined as a convex region of the space. From this assumption several properties can be derived that justify the adoption of such a theory as a semantic representation of the real world. However, one main factor still prevents a rapid development of applications of conceptual spaces: the lack of knowledge about the relevant quality dimensions. It is almost only for perceptual dimensions that psychophysical research has succeeded in identifying the underlying geometrical and topological structures; for example, we still have only a vague understanding of how we perceive and conceptualize things according to their shapes.
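To make the formalism more concrete, the following minimal Java sketch (not part of H-DOSE, with all names and values invented for the example) encodes points as vectors over a hypothetical four-dimensional taste space and summarizes each concept with a prototype point; classifying a stimulus by its nearest prototype partitions the space into convex (Voronoi) regions, in line with the convexity requirement recalled above.

```java
// Minimal sketch of a Gardenfors-style conceptual space: points are vectors of
// quality-dimension values, each concept is summarized by a prototype point, and
// nearest-prototype classification induces convex (Voronoi) regions of the space.
// All dimensions, concepts and values below are invented for illustration only.
import java.util.HashMap;
import java.util.Map;

public class ConceptualSpaceDemo {

    // Euclidean distance between two points of the space (equal dimensionality assumed).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the name of the concept whose prototype is closest to the given point.
    static String classify(double[] point, Map<String, double[]> prototypes) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : prototypes.entrySet()) {
            double d = distance(point, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical 4-dimensional "taste" space: [saline, sour, sweet, bitter], each in [0,1].
        Map<String, double[]> prototypes = new HashMap<String, double[]>();
        prototypes.put("lemon", new double[] {0.1, 0.9, 0.2, 0.1});
        prototypes.put("chocolate", new double[] {0.0, 0.1, 0.8, 0.3});
        prototypes.put("seawater", new double[] {0.9, 0.1, 0.0, 0.2});

        double[] stimulus = {0.2, 0.7, 0.3, 0.1}; // an observed, unlabeled stimulus
        System.out.println("Closest concept: " + classify(stimulus, prototypes));
    }
}
```

In a real experiment, of course, the dimensions and the prototype coordinates would be derived from psychophysical data or from existing knowledge models rather than being hard-coded.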
In addition to the criticisms about the representation of concepts, the idea of ontologies being developed by every actor playing a role on the Web appears somewhat visionary. Redundancy, in fact, is likely to diverge even with very few ontologies, as can already be seen for the available knowledge models, and the task of merging, linking and sharing the enormous amount of differently modeled knowledge will soon become infeasible. If it is already difficult to reach ontological agreement when only a few people are involved, on a worldwide scale this appears clearly impossible. SW proponents usually counter these assertions by pointing to the current Web as a successful example of this self-organization approach. Nevertheless, it must be said that the current Web is actually a completely non-homogeneous repository of information where the only standard agreements concern the format of published data, not its meaning or metadata. This characteristic is actually the key of the Web's success but, at the same time, it has imposed severe problems of data exchange between organizations and between software applications, as demonstrated by the several initiatives aimed at solving this problem, XML being one example. Whenever the exchanged data must also be understood, especially by machines, the wild reality of the Web proves too unruly to allow effective knowledge manipulation. Agreements can certainly be reached on formats, as happened for today's Web (RDF/S and OWL are examples), but reaching the same agreement on knowledge modeling seems rather visionary. Clearly, in small, specific domains effective solutions can be found, but the point is the scalability of this approach rather than its feasibility in “controlled” scenarios.
Solutions will probably involve some “bottom-up” approach to the problem, mimicking what humans do in their everyday interactions. It is likely that the final solution will include some commonly agreed general model for basic facts (almost all humans can recognize other humans without being specifically instructed, for example) and some domain-specific knowledge defined autonomously by people and shared between interacting parties, as happens, for instance, when two persons of different nationalities meet and must find a common knowledge base on which to build their subsequent interaction. Independently of these future evolutions, which are hard to predict, it is clear that today's “top-down” semantics can be fully exploited only in limited domains where agreements on meaning can be reached and reasoning can deal with inconsistencies, uncertainty and contradictions. A whole Semantic Web, although stimulating as a vision, still seems far from being reached, and research efforts are still needed to walk in this exciting direction.

11.2 Future Works


The work presented in this thesis is not an isolated research effort, confined to the years of the doctorate. Instead, it is well integrated into a broader panorama of research taking place in the e-Lite research group of the Politecnico di Torino. As part of this collaborative environment, the H-DOSE platform does not end with this thesis but remains an active research topic and a supported public platform for semantics integration on the Web.
At present, H-DOSE is being adopted by a rather young company, Intellisemantic s.r.l., which plans to use the semantic functionalities offered by the platform in its business intelligence and patent discovery applications. In coordination with Intellisemantic, the author, as well as his colleagues in the e-Lite group, is working on the next version of the platform, namely “hdose v2.2”, which will likely be released at the end of July 2006. This new release will introduce several improvements aimed, on one side, at supporting the simultaneous adoption of different ontologies as knowledge models for the platform and, on the other side, at better supporting the integration of platform services into already deployed applications, by means of logging, authentication and security mechanisms.
In parallel with the evolution of the platform carried out in collaboration with Intellisemantic, the H-DOSE platform is currently being redesigned to fully support multimedia information at the desired level of detail, from single objects in a movie scene to entire clips. The resulting new platform, named MM-DOSE, will be provided as an open source project at sourceforge.net and will, in its first release, be offered as an alternative to the H-DOSE platform. In a longer-term vision, the two platforms will be condensed into a single semantic framework providing the functionalities offered by both, preserving as much as possible the service interfaces so as to enable an easy migration from older to newer versions.
Besides these software engineering efforts, the research group, and the author, are also actively working on the evolution of the domotic gateway presented in chapter 9. The effort focuses on introducing semantics-powered operations into the currently available DHG, in particular context-aware interpretation of the behavior of households and semantics-based automatic rule generation. In the same scenario, another research effort is starting, aimed at defining an intelligent layer of agents able to translate the interaction between users and domotic homes from the current command-based paradigm to an objective-based one.
As far as theoretical research is concerned, the author and some of his colleagues are now starting to work in the context of the so-called Semantic Desktop initiative. The main objective is to migrate the technologies developed in the wider context of the Semantic Web initiative to the users' computer desktops, enabling a more effective interaction between humans and machines in performing everyday tasks. In particular, the work under way involves a semantic-based cataloging system able to better organize files and directories on the user's machine, allowing for their easier retrieval and modification. This semantics-powered file search system will then be integrated with a semantic-based service composition framework, currently developed by Alessio Bosca, a researcher working in the e-Lite group, and applied to a project developed in collaboration with Tilab.
The service composition will allow, as a vision, complex and coordinated tasks to be accomplished in a genuinely natural manner. For example, to send a fax a user will simply compose a pseudo-natural request such as “I want to send a fax to Julie”. The machine will then provide a text editor for writing the fax content. When the fax is complete, the semantic composition service will check the availability of a fax device on the user's computer. Suppose that no such device exists: in this case the semantic composition service will, for example, look up the Web for a free fax service. Then it will convert the text of the fax into the correct format. While doing this conversion it will look up “Julie” in the user's agenda and notice that two entries named “Julie” appear. It will therefore ask the user for a clarification and eventually finalize the fax sending process.
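Only to make the steps of this scenario concrete, the sketch below outlines the decision flow in Java; every interface and class name is hypothetical and does not correspond to any existing H-DOSE or e-Lite API.

```java
// Purely illustrative sketch of the fax scenario described above; all interfaces
// and names are invented for the example and are not part of any existing platform.
import java.util.List;

interface FaxChannel {
    void send(String recipientNumber, byte[] payload);
}

interface Directory {
    List<Contact> lookup(String name); // may return several matching entries
}

class Contact {
    final String fullName;
    final String faxNumber;
    Contact(String fullName, String faxNumber) {
        this.fullName = fullName;
        this.faxNumber = faxNumber;
    }
}

public class FaxCompositionSketch {

    private final FaxChannel localDevice;   // null when no fax modem is installed
    private final FaxChannel webFaxService; // fallback service found on the Web
    private final Directory agenda;

    public FaxCompositionSketch(FaxChannel localDevice, FaxChannel webFaxService, Directory agenda) {
        this.localDevice = localDevice;
        this.webFaxService = webFaxService;
        this.agenda = agenda;
    }

    // Orchestrates the steps of the "send a fax to Julie" scenario.
    public void sendFax(String recipientName, String text) {
        // 1. Disambiguate the recipient: with several matching entries, the user is asked.
        List<Contact> matches = agenda.lookup(recipientName);
        Contact recipient = (matches.size() == 1) ? matches.get(0) : askUserToChoose(matches);

        // 2. Prefer a local fax device; otherwise fall back to a fax service found on the Web.
        FaxChannel channel = (localDevice != null) ? localDevice : webFaxService;

        // 3. Convert the edited text into the format expected by the chosen channel and send it.
        byte[] payload = text.getBytes(); // placeholder for a real format conversion
        channel.send(recipient.faxNumber, payload);
    }

    private Contact askUserToChoose(List<Contact> candidates) {
        // In the envisioned system this step would open a dialogue with the user;
        // the first candidate is returned here only to keep the sketch self-contained.
        return candidates.get(0);
    }
}
```

The sketch captures only the fallback and disambiguation logic; in the envisioned framework the channels and the agenda would be discovered and composed semantically rather than wired in by hand.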
The last front on which the e-Lite group is moving concerns conceptual spaces. In this context, the author is planning to experiment with a first, very simple implementation of Peter Gardenfors' ideas, to demonstrate the feasibility of the approach. This work, differently from the others introduced above, is still in an ideation phase in which available knowledge and solutions are being gathered, and time for design and implementation has still to be allocated.

Bibliography

[1] The moodle e-learning system. http://moodle.org.


[2] The muffin intelligent proxy. http://muffin.doit.org.
[3] CABLE: CAse based e-learning for educators.
http://elite.polito.it/cable, http://cable.uhi.ac.uk.
[4] O. Lassila and R. Swick. Resource description framework RDF model and
syntax specification. World Wide Web Consortium, 1999.
[5] Deborah L. McGuinness and Frank van Harmelen. Owl web ontology lan-
guage. W3C Proposed Recommendation, 2003.
[6] D. Bonino, F. Corno, L. Farinetti, A. Bosca. Ontology driven semantic search.
WSEAS Transaction on Information Science and Application, 1(6):1597–1605,
2004.
[7] Tim Berners-Lee, James Hendler and Ora Lassila. The semantic web. Scientific American, 284(5), 2001.
[8] I. Nonaka, H. Takeuchi. The knowledge creating company. Oxford University
Press, 1995.
[9] Google. http://www.google.com.
[10] R. Baeza-Yates, B. Ribeiro-Neto. Modern Information retrieval. Addison-
Wesley, 1999.
[11] Yahoo. http://www.yahoo.com.
[12] Altavista. http://www.altavista.com.
[13] WordNet: a lexical database for the English language. http://www.cogsci.princeton.edu/~wn/.
[14] The trec conference series. http://trec.nist.gov/.
[15] P. Bouquet, L. Serafini, S. Zanobini. Semantic coordination: a new approach
and an application. In ISWC03 conference, Sanibel Island, Florida, USA.,
pages 130–145. LNCS, Springer-Verlag, 2003.
[16] Jade. http://sharon.cselt.it/projects/jade.
[17] Foundation for intelligent physical agents. http://www.fipa.org.
[18] Zeus agent toolkit. http://labs.bt.com/projects/agents/zeus.
[19] Fipa-os. http://fipa-os.sourceforge.net.
[20] Apache tomcat. http://tomcat.apache.org.

[21] Apache axis. http://ws.apache.org/axis/.


[22] The apache jakarta project. http://jakarta.apache.org.
[23] McBride B. Jena: a semantic web toolkit. IEEE Internet Computing, 6(6):55–
59, 2002.
[24] Postgresql. http://www.postgresql.org.
[25] The saxon api. http://saxon.sourceforge.net.
[26] D. Ragget et al. HTML Tidy project. http://tidy.sourceforge.net/.
[27] Jtidy. http://jtidy.sourceforge.net.
[28] Jgrapht. http://jgrapht.sourceforge.net.
[29] Sun jax-rpc. http://java.sun.com/webservices/jaxrpc/.
[30] Bodington. http://bodington.org.
[31] The uhi millennium institute. http://www.uhi.ac.uk.
[32] Li J., Yu Y., Zhang L. Learning to generate semantic annotation for domain
specific sentences. In knowledge markup and semantic annotation workshop
in K-CAP 2001., 2001.
[33] Ciravegna F., Dingli A., Wilks Y. Automatic semantic annotation using un-
supervised information extraction and integration. In K-CAP 2003 - Work-
shop on Knowledge Markup and Semantic Annotation, Sanibel Island, Florida,
USA., 2003.
[34] D.Bonino, F. Corno, G. Squillero. Dynamic prediction of web requests. In
CEC03 - 2003 IEEE Congress on Evolutionary Computation, Canberra, Aus-
tralia, pages 2034–2041, 2003.
[35] D. Bonino, F. Corno, L. Farinetti. Dose: a distributed open semantic elab-
oration platform. In ICTAI 2003, Sacramento, California, pages 580–589,
2003.
[36] Asphi web site. http://www.asphi.it.
[37] Holger Knublauch. An ai tool for the real world: Knowledge modeling with
protégé. JavaWorld, 2003.
[38] York Sure et Al. Ontoedit: Collaborative ontology development for the se-
mantic web. In 1st International Semantic Web Conference,Sardinia, Italy,
2002.
[39] Gansner E. R. & North S. C. An open graph visualization system and
its applications to software engineering. Software Practice and Experience,
30(11):1203–1233, 1999.
[40] M.A. Storey et al. Interactive visualization to enhance ontology authoring
and knowledge acquisition in protégé. In Workshop on Interactive Tools for
Knowledge Capture, Victoria, B.C. Canada, 2001.
[41] Ontoviz tab: Visualizing protégé ontologies.
http://protege.stanford.edu/plugins/ontoviz/ontoviz.html.
[42] P.W. Eklund, N. Roberts, S. P. Green. Ontorama: Browsing an rdf ontology
using a hyperbolic-like browser. In The First International Symposium on CyberWorlds (CW2002), Theory and Practices, IEEE press, pages 405–411, 2002.
[43] The suggested upper merged ontology. http://ontology.teknowledge.com/.
[44] Junit. http://www.junit.org.
[45] Drools. http://drools.org/.
[46] C.e.t.a.d. service (italy). http://www.cetad.org/ and
http://www.domoticamica.it/.
[47] Mserv. http://www.mserv.org/.
[48] BTicino MyHome system (italian website).
http://www.myhome-bticino.it/ft/.
[49] KAON ontology and semantic web infrastructure.
http://kaon.semanticweb.org.
[50] N.F. Noy et al. Creating semantic web contents with protégé-2000. IEEE Intelligent Systems, 16(2):60–71, 2001.
[51] J. Kahan et al. Annotea: An open RDF infrastructure for shared web anno-
tations. In WWW10 - International Conference, Hong Kong, 2001.
[52] S. Staab, A. Maedche, and S. Handshuh. An annotation framework for the
semantic web. In 9th international World Wide Web conference, Amsterdam,
the Netherlands, pages 95–103, 2000.
[53] P. Kogut, W. Holmes. AeroDAML: Applying information extraction to gen-
erate daml annotations from web pages. In K-CAP 2001 - Workshop on
Knowledge markup and semantic annotation, Victoria, BC, Canada, 2001.
[54] S. Dill et al. Semtag and seeker: Bootstrapping the semantic web via au-
tomated semantic annotation. In Twelfth international conference on World
Wide Web, Budapest, Hungary, pages 178–186, 2003.
[55] D. Maynard, M. Yankova, N. Aswani, H. Cunningham. Automatic creation
and monitoring of semantic metadata in a dynamic knowledge portal. In
The Eleventh International Conference on Artificial Intelligence: Methodol-
ogy, Systems, Applications - Semantic Web Challenges (AIMSA 2004), Varna,
Bulgaria, 2004.
[56] The mondeca semantic portal. http://www.mondeca.com/.
[57] J. Carbonell et al. Translingual information retrieval: A comparative evalu-
ation. In Fifteenth International Joint Conference on Artificial Intelligence,
1997.
[58] Eurowordnet: A multilingual database with lexical semantic networks. Kluwer
Academic Publishers, Dordrecht, 1998.
[59] D.W. Oard. Alternative approaches for cross-language text retrieval. In AAAI
Symposium on Cross-Language Text and Speech Retrieval, 1997.
[60] M. Agnesund. Supporting multilinguality in ontologies for lexical semantics
an object oriented approach. M.S. Thesis, 1997.

[61] L. Bentivogli, E. Pianta, F. Pianesi. Coping with lexical gaps when building
aligned multilingual wordnets, 2000.
[62] J. Gilarranz, J. Gonzalo, F. Verdejo. Language-independent text retrieval
with the eurowordnet multilingual semantic database. In Second Workshop
on Multilinguality in the Software Industry: The AI Contribution, 1997.
[63] M. Vargas-Vera et al. Mnm: Ontology driven semi-automatic and automatic
support for semantic markup. In 13th International Conference on Knowledge
Engineering and Management (EKAW 2002), 2002.
[64] M. Vargas-Vera et al. Knowledge extraction by using an ontology-based anno-
tation tool. In K-CAP 2001 - Workshop on Knowledge markup and semantic
annotation, Victoria, BC, Canada, 2001.
[65] P. Resnik. Semantic similarity in a taxonomy: an information-based measure
and its application to problems of ambiguity in natural language. Journal of
Artificial Intelligence Research, 11:95–130, 1999.
[66] E. Agirre and G. Rigau. Word sense disambiguation using conceptual density.
In Coling-ACL’96 Workshop, Copenhagen, Denmark, pages 16–22, 1996.
[67] Rocha C., Schwabe D., Poggi de Aragao M. A hybrid approach for searching in
the semantic web. In WWW2004 conference, New York, NY, pages 374–383,
2004.
[68] Guha R., McCool R., and Miller E. Semantic search. In WWW2003, Bu-
dapest, Hungary, pages 700–709, 2003.
[69] Stojanovic N., Studer R., Stojanovic L. An approach for the ranking of query
results in the semantic web. In ISWC2003, Sanibel Island, FL, pages 500–516,
2003.
[70] Davies J., Weeks R., Krohn U. Quizrdf: Search technology for the semantic
web. In WWW2002 workshop on RDF & Semantic Web Applications, Hawaii,
USA, 2002.
[71] Heflin J. and Hendler J. Searching the web with shoe. In Artificial Intelligence
for Web Search. Papers from the AAAI Workshop, pages 35–40, 2000.
[72] Ben Shneiderman. Treemaps for space-constrained visualization of hierarchies.
ACM Transactions on Graphics (TOG), 11(1):92–99, 1992.
[73] Jambalaya.
http://www.thechiselgroup.org/chisel/projects/jambalaya/jambalaya.html.
[74] Tgviztab, a touchgraph visualization tab for protégé 2000.
http://www.ecs.soton.ac.uk/ha/TGVizTab/TGVizTab.htm.
[75] Touchgraph library. http://touchgraph.sourceforge.net/.
[76] ezowl: Visual owl (web ontology language) editor for protégé.
http://iweb.etri.re.kr/ezowl/index.html.
[77] Ontoedit.
http://www.ontoknowledge.org/tools/ontoedit.shtml.

[78] Isaviz: A visual authoring tool for rdf. http://www.w3.org/2001/11/IsaViz/.
[79] Ontorama. http://www.ontorama.com/.
[80] EIB/KNX. http://www.eiba.com/en/eiba/overview.html.
[81] LonWorks. http://www.echelon.com/.
[82] X10 protocol. http://www.x10.com/home2.html.
[83] CEBus. http://www.cebus.org/.
[84] BatiBUS. http://www.batibus.com/anglais/gen/index.htm.
[85] Konnex. http://www.konnex.org/.
[86] Homeplug. http://www.homeplug.com/en/index.asp.
[87] Bluetooth. http://www.bluetooth.org/.
[88] ETS,EIBA software for KNX/EIB systems.
http://www.eiba.com/en/software/index.html.
[89] Open source eib control. http://sourceforge.net/projects/eibcontrol/.
[90] Fellbaum K., Hampicke M. Integration of smart home components into exist-
ing residences. In Proceedings of AAATE, Düsseldorf, 1999.
[91] Osgi alliance. http://www.osgi.org/.
[92] Kameas A. et al. An architecture that treats everyday objects as communi-
cating tangible components. In Proceedings of the First IEEE International
Conference on Pervasive Computing and Communications (PerCom 2003),
pages 115–122, 2003.
[93] Johanson B., Fox A., Winograd T. The interactive workspaces project: experi-
ences with ubiquitous computing rooms. IEEE Pervasive Computing, 1(2):67–
74, June 2002.
[94] Grimm R. et al. System support for pervasive applications. ACM Transactions
on Computer Systems, 22(4):421–486, 2004.
[95] Acampora G., Loia V. Fuzzy control interoperability for adaptive domotic
framework. In the Second IEEE International Conference on Industrial Infor-
matics, pages 184–189, 2004.
[96] Peter Gardenfors. Conceptual Spaces: The Geometry of Thought. MIT press, 2000.
[97] F. Forno, L. Farinetti, S. Mehan. Can data mining techniques ease the se-
mantic tagging burden? In VLDB2003 - First International Workshop on
Semantic Web and Databases, Berlin, Germany, 2003.
[98] Alexander Maedche. Ontology Learning for the Semantic Web, volume 665. The Kluwer International Series in Engineering and Computer Science, 2001.
[99] T. Bäck. Selective pressure in evolutionary algorithms: A characterization of
selection mechanism. In First IEEE Conference on Evolutionary Computation,
pages 57–62, 1994.
[100] H.P. Schwefel. Natural evolution and collective optimum-seeking. Computa-
tional systems analysis: Topics and trends, pages 5–14, 1992.

[101] J. Heitkötter, D. Beasley. The hitch-hiker’s guide to evolutionary computation, 2000.
[102] Jürgen Branke. Evolutionary optimization in dynamic environments. Kluwer
Academic Publishers, 2001.
[103] The amaya W3C editor/browser. http://www.w3.org/Amaya/.
[104] On-to-knowledge-project. http://www.ontoknowledge.org.
[105] Autonomic computing: IBM perspective on the state of information technol-
ogy. International Business Machines corporation,
http://www.research.ibm.com/autonomic/manifesto/, 2001.
[106] OWL-S 1.0 release. http://www.daml.org/services/owl-s/1.0/.
[107] L. Denoue, L. Vignollet. An annotation tool for web browsers and its appli-
cations to information retrieval. In RIAO 2000, Paris, France, 2000.
[108] F. Bellifemmine, A. Poggi and G. Rimassa. Jade Programmer’s guide,
http://sharon.cselt.it/projects/jade edition, 2003.
[109] IBM research projects on autonomic computing.
http://www.research.ibm.com/autonomic/research/projects.html.
[110] Craig Boutilier et al. Cooperative negotiation in autonomic systems using
incremental utility elicitation. In 19th Conference on Uncertainty in Artificial
Intelligence, Acapulco, Mexico, 2003.
[111] Roy Sterrit, Dave Bustard. Towards an autonomic computing environment.
In IEEE Workshop on Autonomic Computing Principles and Architectures
(AUCOPA 2003), Banff, Alberta, Canada, 2003.
[112] J. Appavoo, K. Hui, C.A.N. Soules et Al. Enabling autonomic behavior in
systems software with hot swapping. IBM systems journal, 42(1), 2003.
[113] C. H. Crawford, A. Dan. eModel: Addressing the need for a flexible modeling
framework in autonomic computing. In 10th IEEE International Symposium
on Modeling, Analysis and Simulation of computer and Telecommunications
systems, Forth Worth, Texas, 2003.
[114] Rajan Kumar, Prathiba V. Rao. A model for self-managing java servers to make your life easier. http://www-106.ibm.com/developerworks/library/ac-alltimeserver/.
[115] ETTK: Emerging technologies toolkit.
http://www.alphaworks.ibm.com/tech/ettk/.
[116] M. Koivunen, R. Swick. Metadata based annotation infrastructure offers flex-
ibility and extensibility for collaborative applications and beyond. In K-CAP
2001 - Workshop on Knowledge markup and semantic annotation, Victoria,
BC, Canada, 2001.
[117] S. Handschuh, S. Staab, A. Maedche. CREAM: Creating relational metadata
with a component-based, ontology-driven annotation framework. In ACM K-
CAP 2001 - First International Conference on Knowledge Capture, Victoria,
BC, Canada, 2001.

[118] S. Staab, A. Maedche, S. Handschuh. An annotation framework for the semantic web. In First Workshop on Multimedia Annotation, Tokyo, Japan, 2001.
[119] L. Gangmin, V. Uren, E. Motta. ClaiMaker: weaving a semantic web of re-
search papers. In ISWC 2002 - First International Semantic Web Conference,
Sardinia, Italy, 2002.
[120] S. Handschuh, S. Staab, F. Ciravegna. S-CREAM: Semiautomatic creation of
metadata. In EKAW02 - 13th International Conference on Knowledge Engi-
neering and Knowledge Management, Siguenza, Spain, 2002.
[121] N. Collier, K. Takeuchi, K. Tsuji. The PIA project: learning to semantically
annotate texts from an ontology and xml-instance data. In SWWS 2001 -
International Semantic Web Working Symposium, Stanford University, CA,
USA, 2001.
[122] G. Salton. Developments in automatic text retrieval. Science, 253:974–980,
1991.
[123] W3C web services activity. http://www.w3.org/2002/ws/.
[124] David D. Lewis and Karen Sparck Jones. Natural language processing for
information retrieval. Communications of the ACM, 39(1):92–101, 1996.
[125] M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[126] Carr L., Bechhofer S., Goble C., Hall W. Conceptual linking: Ontology-
based open hypermedia. In WWW10 - 10th International World Wide Web
Conference, Hong Kong, China, pages 334–342, 2001.
[127] XPath Explorer. http://www.purpletech.com/xpe/index.jsp.
[128] Guha R., McCool R., Miller E. Semantic search. In WWW2003 - 12th Inter-
national World Wide Web Conference, Budapest, Hungary, 2003.
[129] S. Yu, D. Cai, J.R. Wen, W.Y. Ma. Improving pseudo-relevance feedback in
web information retrieval using web page segmentation. In WWW2003 - 12th
International World Wide Web Conference, Budapest, Hungary, pages 11–18,
2003.
[130] Gupta S., Kaiser G., Neistadt D., Grimm P. DOM-based content extraction
of HTML documents. In WWW2003 - 12th International World Wide Web
Conference, Budapest, Hungary, pages 207–214, 2003.
[131] Chen Y., W.Y. Ma, H.J. Zhang. Detecting web page structure for adaptive
viewing on small form factor devices. In WWW2003 - 12th International
World Wide Web Conference, Budapest, Hungary, pages 225–233, 2003.
[132] Jgraph. http://www.jgraph.com.
[133] M. Agnesund. Representing culture-specific knowledge in a multilingual on-
tology. In IJCAI-97 Workshop on Ontologies and Multilingual NLP, 1997.
[134] G. Salton. Developments in automatic text retrieval. Science, 253:974–980,
1991.

[135] R. Ghani, A.E. Fano. Using text mining to infer semantic attributes for retail
data mining. In ICDM 2002: 2002 IEEE International Conference on Data
Mining, pages 195–202, 2002.
[136] D. Bonino, F. Corno, L. Farinetti, A. Ferrato. Multilingual semantic elab-
oration in the dose platform. In SAC 2004, ACM Symposium on Applied
Computing, Nicosia, Cyprus, pages 1642–1646, 2004.
[137] P. Resnik. Using information content to evaluate semantic similarity in a
taxonomy. In 14 th International Joint Conference on Artificial Intelligence
(IJCAI’95), Montreal, Canada, pages 448–453, 1995.
[138] M. Vargas-Vera et al. Mnm: Ontology driven tool for semantic markup. In
European Conference on Artificial Intelligence (ECAI 2002), Workshop on
Semantic Authoring, Annotation & Knowledge Markup (SAAKM 2002), Lyon
France, 2002.
[139] IST Advisory Group (K. Ducatel, et al). Scenarios for ambient intelligence in
2010. final report, 2001.
[140] eperspace, european IST project. http://www.ist-eperspace.org/.
[141] COGAIN, european IST NoE project. http://www.cogain.org/.
[142] European home systems association (EHSA). http://www.ehsa.com/.
[143] Home audio video interoperability (HAVi). http://www.havi.org/.
[144] Microsoft AURA project. http://aura.research.microsoft.com/.
[145] MIT oxygen project. http://www.oxygen.lcs.mit.edu/Overview.html.
[146] CNR NICHE project. http://niche.isti.cnr.it/.
[147] Jini network technology. http://www.sun.com/software/jini/.
[148] Universal pnp. http://www.upnp.org/.
[149] D. Bonino, F. Corno, L. Farinetti. Domain specific searches using concep-
tual spectra. In 16th IEEE International Conference on Tools with Artificial
Intelligence. Boca Raton, Florida, 2004.
[150] D. Bonino, A. Bosca, F. Corno. An agent based autonomic semantic platform.
In First IEEE International Conference on Autonomic Computing. New York,
USA, 2004.
[151] The apache jakarta project. http://jakarta.apache.org.
[152] The protégé ontology editor and knowledge acquisition system.
http://protege.stanford.edu/.
[153] The frodo rdfsviz tool. http://www.dfki.uni-kl.de/frodo/RDFSViz/.
[154] Rdfauthor. http://rdfweb.org/people/damian/RDFAuthor/.
[155] Grigoris Antoniou, Frank van Harmelen. A Semantic Web Primer. MIT press,
2004.

Appendix A

Publications

1. An Evolutionary Approach to Web Request Prediction, D. Bonino, F. Corno, G. Squillero – poster at WWW2003 - The Twelfth International World Wide Web Conference, 20-24 May 2003, Budapest, HUNGARY - (International Conference)

2. A Real-Time Evolutionary Algorithm for Web Prediction, D. Bonino, F. Corno, G. Squillero – WI-2003, The 2003 IEEE/WIC International Conference on Web Intelligence, October 2003, Halifax, Canada - (International Conference)

3. Semantic annotation and search at the document substructure level, D. Bonino, F. Corno, L. Farinetti – poster at ISWC2003 - 2nd International Semantic Web Conference, Florida (USA), October 2003 (Poster)

4. DOSE: a Distributed Open Semantic Elaboration Platform, D. Bonino, F. Corno, L. Farinetti – ICTAI 2003, The 15th IEEE International Conference on Tools with Artificial Intelligence, November 3-5, 2003, Sacramento, California - (International Conference)

5. Dynamic Prediction of Web Requests, D. Bonino, F. Corno, G. Squillero – CEC03: 2003 IEEE Congress on Evolutionary Computation, Canberra, Australia, 8th - 12th December 2003, pp. 2034-2041 - (International Conference)

6. Multilingual Semantic Elaboration in the DOSE platform, D. Bonino, F. Corno, L. Farinetti, A. Ferrato – SAC 2004, ACM Symposium on Applied Computing, March 14-17, 2004, Nicosia, Cyprus - (International Conference)

7. An Agent Based Autonomic Semantic Platform, D. Bonino, A. Bosca, F. Corno – ICAC2004, First International Conference on Autonomic Computing (IEEE), New York, May 17-18, 2004 - (International Conference)

8. Dynamic Optimization of Semantic Annotation Relevance, D. Bonino, F. Corno, G. Squillero – CEC2004, Congress on Evolutionary Computation, Portland (Oregon), June 20-23, 2004 - (International Conference)

9. Domain Specific Searches using Conceptual Spectra, D. Bonino, F. Corno, L. Farinetti – ICTAI 2004, the IEEE International Conference on Tools with Artificial Intelligence, 15-17 Nov 2004, Boca Raton, Florida, USA, pp. 680-687 - (International Conference)

10. Ontology Driven Semantic Search, D. Bonino, F. Corno, L. Farinetti, A. Bosca – WSEAS Conference ICAI 2004, Venice, Italy, 2004 - (International Conference)

11. Ontology Driven Semantic Search, D. Bonino, F. Corno, L. Farinetti, A. Bosca – WSEAS Transaction on Information Science and Application, Issue 6, Volume 1, December 2004, pp. 1597-1605 - (International Journal)

12. Automatic learning of text-to-concept mappings exploiting WordNet-like lexical networks, D. Bonino, F. Corno, F. Pescarmona – 20th Annual ACM Symposium on Applied Computing, Santa Fe, New Mexico, March 13-17, 2005 - (International Conference)

13. H-DOSE: an Holistic Distributed Open Semantic Elaboration Platform, D. Bonino, A. Bosca, F. Corno, L. Farinetti, F. Pescarmona – SWAP2004: 1st Italian Semantic Web Workshop, 10th December 2004, Ancona, Italy - (National Conference)

14. Domotic House Gateway, P. Pellegrino, D. Bonino, F. Corno – SAC 2006, ACM Symposium on Applied Computing, April 23-27, 2006, Dijon, France - (International Conference)

15. OntoSphere: more than a 3D ontology visualization tool, A. Bosca, D. Bonino, P. Pellegrino – SWAP 2005 - Semantic Web Applications and Perspectives, 2nd Italian Semantic Web Workshop, Trento, Faculty of Economics, 14-15-16 December 2005 - (National Conference)
