DOCTORAL SCHOOL
PhD Programme in Computer and Automation Engineering – XVIII cycle
PhD Thesis
Dario Bonino
December 2005
Acknowledgements
Many people deserve a grateful acknowledgment for their role in supporting me
during these long and exciting three years. First I would cite my adviser, Fulvio Corno,
who always supported and guided me toward the best decisions and solutions.
Together with Fulvio I want to thank Laura Farinetti very much too: she was there
any time I needed her help, for both insightful discussions and stupid questions.
Thanks to all my colleagues in the e-Lite research group for their constant support,
for their kindness and their ability to ignore my bad moments. Particular thanks
go to Alessio for being not only my best colleague and competitor, but one of
the best friends I’ve ever had. The same goes to Paolo: our railway discussions have
been so interesting and useful!
Thanks to Mike, who introduced me to many Linux secrets, to Franco for being
the calmest person I’ve ever known, and to Alessandro “the Eye tracker”, the best
surfer I’ve ever met.
Thanks to my parents Ercolino and Laura and to my sister Serena: they have
always been my springboard and my unbreakable backbone.
Thank you to all the people I’ve met in these years who are not cited here;
I am very glad to have been with you, even if only for a few moments. Thank you!
Contents
Acknowledgements I
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.2.3 The “Semi-automatic classification” . . . . . . . . . . . . . . . 54
4.3 Non-functional requirements . . . . . . . . . . . . . . . . . . . . . . . 56
7.4.2 Typical Operation Scenario . . . . . . . . . . . . . . . . . . . 122
7.4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Bibliography 178
A Publications 186
Chapter 1
Introduction
The new generation of the Web will enable humans to gain wisdom of
living, working, playing, and learning, in addition to information search
and knowledge queries.
Ning Zhong
1.1 Motivation
Over the last decade the World Wide Web has gained great momentum, rapidly
becoming a fundamental part of our everyday life. In personal communication, as well
as in business, the impact of the global network has completely changed the way
people interact with each other and with machines. This revolution touches all
aspects of people’s lives and is gradually pushing the world toward a “Knowledge
society” where the most valuable resources will no longer be material but informative.
The way we think of computers has also been influenced by this development:
we are evolving from thinking of computers as “calculus engines” to considering
them “gateways”, or entry points, to the newly available information highways.
The popularity of the Web has led to the exponential growth of published pages
and services observed in recent years. Companies offer web pages to advertise and
sell their products. Learning institutions present teaching material and on-line
training facilities. Governments provide web-accessible administrative services to
ease citizens’ lives. Users build communities to exchange all kinds of information
and/or to form more powerful market actors able to survive in this global ecosystem.
This stunning success is also the curse of the current web: most of today’s Web
content is only suitable for human consumption, and the huge amount of available
information makes it increasingly difficult for users to find and access the information
they need. Under these conditions, keyword-based search engines, such as AltaVista,
Yahoo, and Google, are the main tools for using today’s Web. However, there are
serious problems associated with their use:
• High recall, low precision: relevant pages are buried among thousands
of pages of low interest.
• Low recall: although rarer, it sometimes happens that queries get no answers
because they are formulated with the wrong words.
Even if the search process is successful, the result is only a “relevant” set of web
pages that the user must scan to find the required information. In a sense, the
term used to classify the involved technologies, Information Retrieval, is in this case
rather misleading, and Location Retrieval might be better.
The critical point is that, at present, machines are not able, without heuristics
and tricks, to understand documents published on the web and to extract only the
relevant information from pages. Of course there are tools that can retrieve text,
split phrases, count words, etc., but when it comes to interpreting and extracting
useful data for the users, the capabilities of current software are still limited.
One solution to this problem consists in keeping information as it currently
is and developing sophisticated tools that use artificial intelligence techniques
and computational linguistics to “understand” what is written in web pages. This
approach has been pursued for a while but, for now, still appears too ambitious.
Another approach is to define web content in a more machine-understandable fashion
and to use intelligent techniques to take advantage of this representation.
This plan of revolutionizing the web is usually referred to as the Semantic Web
initiative, and is only a single aspect of the next evolution of the web, the Wisdom
Web.
It is important to notice that the Semantic Web does not aim at being parallel to
the World Wide Web; instead, it aims at evolving the Web into a new knowledge-centric,
global network. Such a network will be populated by intelligent web agents able
to act on behalf of their human counterparts, taking into account the semantics
(meaning) of information. Users will once more be the center of the Web, but they will
be able to communicate and to use information with a more human-like interaction,
and they will also be provided with ubiquitous access to such information.
These capabilities stem from several active research fields, including the Multi-
Agent community, the Semantic Web community and the Ubiquitous Computing
community, and have an impact on, or make use of, technologies developed for
databases, computational grids, social networks, etc. The Wisdom Web, in a sense,
is the place where most of the currently available technologies and their evolutions
will join in a single scenario, with a “devastating” impact on human society.
Many steps, however, still separate us from this future web, the Semantic Web
initiative being the first serious, world-wide attempt to build the necessary
infrastructure.
Semantics is one of the cornerstones of the Wisdom Web. It rests on the
formal definition of the “concepts” involved in web pages and in services
available on the web. Such a formal definition needs two critical components:
knowledge models, and associations between knowledge and resources. The
former are known as Ontologies, while the latter are usually referred to as Semantic
Annotations.
There are several issues related to the introduction of semantics on the web: how
should knowledge be modeled? How should knowledge be associated with real-world
entities or with real-world information?
The Semantic Web initiative builds on the evidence that creating a common,
monolithic, all-encompassing knowledge model is infeasible; instead, it assumes
that each actor playing a role on the web shall be able to define its own model
according to its own view of the world. In the SW vision, the global knowledge
model is the result of a shared effort in building and linking the single models
developed all around the world, much like what happened for the current Web.
Of course, in such a process conflicts will arise, to be solved by proper tools
for mapping “conceptually similar” entities in the different models.
The definition of knowledge models alone does not introduce semantics on the
web; to get such a result another step is needed: resources must be
“associated” with the models. The current trend is to perform this association by
means of Semantic Annotations. A Semantic Annotation is basically a link between
a resource on the web (a web page, a video, a piece of music) and one or
more concepts defined by an ontology. So, for example, the pages of a news site can
be associated with the concept news in a given ontology by means of an annotation,
in the form of a triple: about(site, news).
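This triple-based view of annotations can be sketched in a few lines; the page addresses and concept names below are purely illustrative, not drawn from a real ontology:

```python
# A semantic annotation links a web resource to an ontology concept.
# Annotations are represented here as plain (resource, relation, concept)
# triples; URLs and concept names are hypothetical.
annotations = [
    ("http://example.com/news/index.html", "about", "news"),
    ("http://example.com/news/sports.html", "about", "sport_news"),
]

def resources_about(concept, triples):
    """Return all resources annotated with the given concept."""
    return [s for (s, p, o) in triples if p == "about" and o == concept]

print(resources_about("news", annotations))
```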
Semantic annotations are not only for documents or web pages: they can be used
to associate semantics with nearly all kinds of informative, physical and meta-physical
entities in the world.
Many issues are involved in the annotation process; to mention just some of
them: at which granularity shall these annotations be defined? How can we decide
whether a given annotation is trusted or not? Who shall annotate resources: the
creator, or anyone on the web?
To answer these and the previous questions, standard languages and practices
for semantic markup are needed, together with a formal logic for reasoning about
the available knowledge and for turning implicit information into explicit facts.
Around them, database engines for storing semantics-rich data and search engines
offering new question-answering interfaces constitute the informatics backbone of
the new information highway so defined.
Other, more “technical” pillars of the forthcoming Wisdom Web include
autonomic systems, planning and problem solving, etc. For them, great
improvements are currently being provided by the ever-active community of Intelligent
and Multi-Agent systems. In this case the problems involved are slightly different
from the ones cited above, although semantics integration can be interesting in
many respects. The main research stream is in fact about designing machines able
to think and act rationally. In simple terms, the main concern in this field is to
define systems able to autonomously pursue goals and subgoals defined either by
humans or by other systems.
Meta-knowledge plays a crucial role in such a process, providing means for modeling
spatial and temporal constraints, or conflicts, that may arise between agents’
goals, and can in turn be strongly based on semantics. Knowledge modeling can, in
fact, support the definition and discovery of similarities and relationships between
constraints, in a way that is independent from the dialects with which each single
agent composes its understanding of the world.
Personalization, finally, is not a new discipline: it is historically concerned
with interaction modalities between users and machines, defining methodologies and
instruments to design interfaces usable by all people, be they “typical” users or
differently abled people. It has an impact on, or interacts with, many research fields,
starting from Human Computer Interaction and encompassing statistical analysis
of user preferences and prediction systems. Personalization is a key factor for the
Wisdom Web exploitation: until usable and efficient interfaces become available,
this new web, in which the available information will be many orders of magnitude
wider than in the current one, will not be adopted.
The good news is that all the cited issues can be solved without requiring
revolutionary scientific progress. We can in fact reasonably claim that the challenge
is only in engineering and technological adoption, as partial solutions to all the
relevant parts of the scenario already exist. At present, the greatest needs
appear to be in the areas of integration, standardization, development of tools and
adoption by the users.
1.3 Contribution
In this thesis, methodologies and techniques for paving the way that leads from
today’s web applications to the Wisdom Web have been studied, with
a particular focus on information retrieval systems, content management systems
and e-Learning systems. A new platform for supporting the easy integration of
semantic information into today’s systems has been designed and developed, and
has been applied to several case studies: a custom-made CMS, a publicly available
e-Learning system (Moodle [1]), an intelligent proxy for web navigation (Muffin [2])
and a life-long learning system developed in the context of the CABLE project [3]
(an EU-funded Minerva project).
In addition, some extensions of the proposed system to environments that share
with the Web the underlying infrastructure and the communication and interaction
paradigms have been studied. A case study is provided for domotic systems.
Several contributions to the state of the art in semantic systems can be found in the
components of the platform, including: an extension of the T.R. Gruber ontology
definition that transparently supports multilingual knowledge domains, a
new annotation “expansion” system that leverages the information encoded in
ontologies to extend semantic annotations, and a new “conceptual” search
paradigm based on a compact representation of semantic annotations called
Conceptual Spectrum. The semantic platform discussed in this thesis is named H-DOSE
(Holistic Distributed Semantic Elaboration Platform) and is currently available as
an Open Source project on SourceForge: http://dose.sourceforge.net.
H-DOSE has been entirely developed in Java, allowing better interoperability
with existing web systems, and is currently deployed as a set of web services
running on the Apache Tomcat servlet container. It is available in two
different forms: one intended for micro enterprises, characterized by a small footprint
on the server on which it runs, and one for small and medium enterprises, which
integrates the ability to distribute jobs on different machines by means of agents,
and which includes principles of autonomic computing for keeping the underlying
knowledge base constantly up to date. Rather than being an isolated attempt at
semantics integration in the current web, H-DOSE is still a very active project and is
undergoing several improvements and refinements for better supporting the indexing
and retrieval of non-textual information such as video clips, audio pieces, etc. There
is also ongoing work on the integration of H-DOSE into competitive intelligence
systems, as done by IntelliSemantic: a start-up of the Turin Polytechnic that builds
its business plan on the adoption of semantic techniques, and in particular of the
H-DOSE platform, for patent discovery services.
Finally, several side issues related to semantics handling and deployment in
web applications have been addressed during the H-DOSE design; some of them will
also be presented in this thesis. A newly designed ontology visualization tool based
on multi-dimensional information spaces is an example.
1.4 Structure of the Thesis
The thesis opens with an overview of currently available web applications, with a
particular focus on systems for information management such as Content Management
Systems, indexing and retrieval
systems and e-Learning systems. For every category of application, the points where
semantics can give substantial improvements, either in effectiveness (performance)
or in user experience, are highlighted.
Chapter 4 defines the requirements for the H-DOSE semantic platform, as they
emerge from interviews with web actors such as content publishers, site administrators
and so on.
Chapter 5 introduces the H-DOSE logical architecture, and uses that architecture
as a guide for discussing the basic principles and assumptions on which the
platform is built. For every innovative principle, the strong points are highlighted
together with the weaknesses that emerged either during the presentation of such
elements at international conferences and workshops or during the H-DOSE design
and development process.
Chapter 6 describes the H-DOSE platform in deep detail, focusing on the role
and interactions of every single component of the platform. The main
concern of this chapter is to provide a complete view of the platform, in its more
specific aspects, discussing the adopted solutions from a “software engineering” point
of view.
Chapter 7 presents the case studies that constituted the benchmark of the
H-DOSE platform. Each case study is addressed separately, starting from a brief
description of requirements and going through the integration design process, the
deployment of the H-DOSE platform, and the phase of results gathering and analysis.
Chapter 8 is about the H-DOSE related tools developed during the platform
design and implementation. They include a new ontology visualization tool and a
genetic algorithm for semantic annotation refinement.
Chapter 9 discusses the extension of H-DOSE principles and techniques to
non-Web scenarios, with a particular focus on domotics. An ongoing project on
semantics-rich house gateways is described, highlighting how the lessons learned
in the design and development of H-DOSE can be applied in a completely different
scenario while still retaining their value.
Chapter 10 presents related work in the fields of both the Semantic Web and
Web Intelligence, with a particular focus on semantic platforms and semantics
integration on the Web.
Chapter 11 eventually concludes the thesis and provides an overview of possible
future work.
Chapter 2
The Semantic Web vision
This chapter introduces the Semantic Web vision and discusses the
data model, standards, and technologies used to bring this vision into
being. These building blocks are used in the design of H-DOSE, trying to
maximize the reuse of already available and well-tested technologies, thus
avoiding reinventing the wheel.
The Semantic Web is developed layer by layer; the pragmatic justification for this
procedure is that it is easier to achieve consensus on small steps, whereas it is
much harder to make everyone agree on very wide proposals. In fact, many
research groups are exploring different, and sometimes conflicting, solutions.
After all, competition is one of the major driving forces of scientific development.
Such competition makes it very hard to reach agreement on wide steps, and often
only a partial consensus can be achieved. The Semantic Web builds upon the steps
for which consensus can be reached, instead of waiting to see which alternative
research line will be successful in the end.
The Semantic Web is such that companies, research groups and users must build
tools, add content and use that content. It is certainly myopic to wait until the full
vision materializes: it may take another ten years to realize the full extent of the
SW, and many more years for the Wisdom Web.
In evolving from one layer to another, two principles are usually followed:
• Downward compatibility: agents fully aware of a layer should also be able to
interpret and use information written at lower levels.
• Upward partial understanding: agents should be able to take, at least partial,
advantage of information at higher levels. So, an RDF-aware agent should also
be able to use information encoded in OWL [5], ignoring those elements that
go beyond RDF and RDF Schema.
The layered cake of the Semantic Web (due to Tim Berners-Lee) is shown in
Figure 2.1 and describes the main components involved in the realization of the
Semantic Web vision. At the bottom lies XML (eXtensible Markup Language),
a language for writing well-structured documents according to a user-defined
vocabulary. XML is a “de facto” standard for the exchange of information over the World
Wide Web. On top of XML sits the RDF layer.
RDF is a simple data model for writing statements about Web objects. RDF is
not XML, but it has an XML-based syntax, so it is located above the XML layer
in the cake.
RDF-Schema defines the vocabulary used in RDF data models. It can be seen
as a very primitive language for defining ontologies, as it provides the basic building
blocks for organizing Web objects into hierarchies. Supported constructs include
classes and properties, the subClass and subProperty relations, and domain and
range restrictions. RDF-Schema uses an RDF syntax.
The Logic layer further enhances the ontology support offered by RDF-Schema,
allowing the modeling of application-specific declarative knowledge.
The Proof layer, instead, involves the process of deductive reasoning, as well as
the process of providing and representing proofs in Web languages. Applications
lying at the proof level shall be able to reason about the knowledge defined in
the lower layers and to provide conclusions together with “explanations” (proofs)
of the deductive process leading to them.
Finally, the Trust layer will emerge through the adoption of digital signatures
and other kinds of knowledge, based on recommendations by trusted agents, by
rating and certification agencies or even by consumer organizations. The expression
“Web of Trust” means that trust on the Web will be organized in the same
distributed and sometimes chaotic way as the WWW itself. Trust is crucial for the
final exploitation of the Semantic Web vision: until users trust its operations
(security) and the quality of the information provided (relevance), the SW will
not reach its full potential.
<html>
<head></head>
<body>
<h1> SpiderNet internet consultancy,
network applications and more </h1>
<p> Welcome to the SpiderNet web site, we offer
a wide variety of ICT services related to the net.
<br/> Adam Jenkins, our graphics designer has designed many
of the most famous web sites as you can see in
<a href="gallery.html">the gallery</a>.
Matt Kirkpatrick is our Java guru and is able to develop
any new kind of functionalities you may need.
<br/> If you are seeking a great new opportunity
for your business on the web, contact us at the
following e-mails:
<ul>
<li>jenkins@spidernet.net</li>
<li>kirkpatrick@spidernet.net</li>
</ul>
Or you may visit us in the following opening hours
<ul>
<li>Mon 11am - 7pm</li>
<li>Tue 11am - 2pm</li>
<li>Wed 11am - 2pm</li>
<li>Thu 11am - 2pm</li>
<li>Fri 2pm - 9pm</li>
</ul>
Please note that we are closed every weekend and every festivity.
</p>
</body>
</html>
<company type="consultancy">
<service>Web Consultancy</service>
<products> Web pages, Web applications </products>
<staff>
<graphicsDesigner>Adam Jenkins</graphicsDesigner>
<javaDeveloper>Matt Kirkpatrick</javaDeveloper>
</staff>
</company>
2.1.2 Ontologies
The term ontology stems from philosophy. In that context, it names a
subfield of philosophy, namely the study of the nature of existence (from the Greek
ὀντολογία): the branch of metaphysics concerned with identifying, in general terms,
the kinds of things that actually exist, and how to describe them. For example, the
observation that the world is made up of specific entities that can be grouped into
abstract classes based on shared properties is a typical ontological commitment.
In today’s technology, however, the term ontology has been given a specific
meaning that is quite different from the original one. For the purposes of this thesis,
T.R. Gruber’s definition, later refined by R. Studer, can be adopted: An ontology
is an explicit and formal specification of a conceptualization.
In other words, an ontology formally describes a knowledge domain. Typically,
an ontology is composed of a finite list of terms and the relationships between these
terms. The terms denote important concepts (classes of objects) of the domain.
Relationships include, among others, hierarchies of classes. A hierarchy
specifies a class C to be a subclass of another class C′ if every object in C is also
included in C′. Apart from the subclass relationship (also known as the “is a” relation),
ontologies may include information such as:
• properties (X makes Y)
• value restrictions (only smiths can make iron tools)
• disjointness statements (teachers and secretary staff are disjoint)
• specification of logical relationships between objects
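Read extensionally, the subclass condition above is simple set inclusion; a sketch with illustrative class extensions (the member names are hypothetical):

```python
# A class C is a subclass of C2 exactly when every object in C
# is also included in C2 (the extensional reading of "is a").
humans = {"Dario", "Laura", "Fulvio"}
phd_students = {"Dario"}

def is_subclass(c, c2):
    """Extensional subclass test: C is a subclass of C2 iff C is a subset of C2."""
    return c <= c2  # set inclusion

print(is_subclass(phd_students, humans))  # every PhD student is a human
```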
In the context of the web, ontologies provide a shared understanding of a domain.
Such an understanding is necessary to overcome differences in terminology. As an
example, a web application may use the term “ZIP” for the same information that
another denotes as “area code”. Another problem arises when two applications
use the same term with different meanings. Such differences can be overcome by
associating a particular terminology with a shared ontology, and/or by defining
mappings between different ontologies. In both cases, it is easy to notice that
ontologies support semantic interoperability.
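The terminology-mapping idea can be sketched as follows, with hypothetical per-application term tables pointing into a shared ontology concept:

```python
# Each application maps its local terms onto concepts of a shared ontology;
# two terms are interoperable when they resolve to the same concept.
# The concept name "postal_code" is illustrative.
app_a_mapping = {"ZIP": "postal_code"}
app_b_mapping = {"area code": "postal_code"}

def same_concept(term_a, term_b):
    """True if the two local terms denote the same shared concept."""
    concept_a = app_a_mapping.get(term_a)
    concept_b = app_b_mapping.get(term_b)
    return concept_a is not None and concept_a == concept_b

print(same_concept("ZIP", "area code"))
```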
Ontologies are also useful for improving the results of Web searches: the search
engine can look for pages that refer to a precise concept, or set of concepts.
2.2 Logic
Logic is the discipline that studies the principles of reasoning. In general, it offers
formal languages for expressing knowledge, with well-understood formal semantics.
Logic usually works with so-called declarative knowledge, which describes what
holds without caring about how it can be deduced.
Deduction can be performed by automated reasoners: software entities that have
been extensively studied in Artificial Intelligence. Logical deduction (inference)
allows implicit knowledge defined in a domain model (ontology) to be transformed
into explicit knowledge. For example, if a knowledge base contains the following
axioms in predicate logic,
human(X) → mammal(X)
phdStudent(X) → human(X)
phdStudent(Dario)
an automated inferencing engine can easily deduce that
human(Dario)
mammal(Dario)
phdStudent(X) → mammal(X)
Logic can therefore be used to uncover ontological knowledge that is implicitly given
and, by doing so, can help reveal unexpected relationships and inconsistencies.
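The deduction above can be sketched with a tiny forward-chaining loop, a toy inferencing engine rather than a real reasoner; the predicate names follow the axioms in the text:

```python
# Facts and rules mirroring the predicate-logic axioms above:
# phd_student(X) -> human(X), human(X) -> mammal(X), phd_student(Dario).
facts = {("phd_student", "Dario")}
rules = [
    ("phd_student", "human"),   # phd_student(X) -> human(X)
    ("human", "mammal"),        # human(X) -> mammal(X)
]

def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, arg in list(derived):
                if pred == premise and (conclusion, arg) not in derived:
                    derived.add((conclusion, arg))
                    changed = True
    return derived

closure = forward_chain(facts, rules)
print(("mammal", "Dario") in closure)  # implicit knowledge made explicit
```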
But logic is more general than ontologies and can also be used by agents, for
example for making decisions and selecting courses of action.
Generally there is a trade-off between expressive power and computational
efficiency: the more expressive a logic is, the more computationally expensive
drawing conclusions becomes, and drawing conclusions can even be impossible
when non-computability barriers are encountered. Fortunately, a considerable
part of the knowledge relevant to the Semantic Web seems to be of a relatively
restricted form, and the required subset of logic is almost tractable and supported
by efficient reasoning tools.
Another important aspect of logic, especially in the context of the Semantic Web,
is its ability to provide explanations (proofs) for conclusions: the series of
inferences can be retraced. Moreover, AI researchers have developed ways of presenting
proofs in a human-friendly fashion, by organizing them as natural deductions and
by grouping into a single element a number of small inference steps that a person
would typically consider a single proof step.
Explanations are important for the Semantic Web because they increase
users’ confidence in Semantic Web agents. Even Tim Berners-Lee speaks of an “Oh
yeah?” button that would ask for an explanation.
Of course, for logic to be useful on the Web it must be usable in conjunction with
other data, and it must be machine-processable as well. From these requirements
stem the current research efforts on representing logical knowledge and proofs in
Web languages. Initial approaches work at the XML level, but in the future rules
and proofs will need to be represented at the level of ontology languages such as
OWL.
2.2.1 Agents
Agents are software entities that work autonomously and proactively. Conceptually
they evolved out of the concepts of object-oriented programming and component-
based software development.
According to Tim Berners-Lee’s article [7], a Semantic Web agent shall be
able to receive tasks and preferences from the user, seek information from Web
sources, communicate with other agents, compare information about user
requirements and preferences, make selections, and give answers back to the user.
Agents will not replace human users on the Semantic Web, nor will they necessarily
make decisions. In most cases their role will be to collect and organize information,
and present choices for the users to select from.
Semantic Web agents will make use of all the outlined technologies, in particular:
• Metadata will be used to identify and extract information from Web sources.
• Logic will be used for processing retrieved information and for drawing
conclusions.
2.3 Semantic Web Languages
the homonym problem that has plagued distributed data representation
until now.
Statements assert the properties of resources. They are object-attribute-value
triples, consisting respectively of a resource, a property and a value. Values can
be either resources or literals. Literals are atomic values (strings) that can have a
specific XSD type, xsd:double for example. A typical example of a statement is:
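As a sketch, reusing the fictional SpiderNet site from the earlier example (the page address is hypothetical; the property is the common Dublin Core creator element), such a statement is an object-attribute-value triple:

```python
# An RDF statement as a (resource, property, value) triple.
# The resource URL and the author are purely illustrative.
statement = (
    "http://www.spidernet.net/index.html",       # resource (subject)
    "http://purl.org/dc/elements/1.1/creator",   # property (Dublin Core creator)
    "Adam Jenkins",                              # value: a literal
)

subject, prop, value = statement
print(value)
```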
One of the major strong points of RDF is so-called reification: in RDF it
is possible to make statements about statements.
This kind of statement allows belief or trust in other statements to be modeled,
which is important for some kinds of applications. In addition, reification allows
non-binary relations to be modeled using triples. The key idea, since RDF only
supports triples, i.e. binary relationships, is to introduce an auxiliary object and
relate it to each of the parts of the non-binary relation through the properties
subject, predicate and object.
So, for example, suppose we want to represent the ternary relationship
referee(X,Y,Z).
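As a sketch, assuming referee(X,Y,Z) is meant to say that X referees a game played by Y and Z, the encoding introduces an auxiliary object and links it to each argument through per-argument properties (all names here are hypothetical):

```python
# RDF supports only binary relations (triples), so the ternary relation
# referee(X, Y, Z) is encoded through an auxiliary resource linked to
# each of its arguments. Property and individual names are illustrative.
def reify_referee(x, y, z, game_id):
    """Encode referee(X,Y,Z) as a set of binary triples."""
    return [
        (game_id, "rdf:type", "Game"),
        (game_id, "referee", x),   # the auxiliary object points to X
        (game_id, "player", y),    # ... and to each remaining argument
        (game_id, "player", z),
    ]

triples = reify_referee("Alice", "Bob", "Carol", "game42")
print(len(triples))
```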
As already mentioned, RDF only uses binary properties. This restriction could be
quite limiting, since we usually adopt predicates with more than two arguments;
fortunately, reification allows this issue to be overcome. However, some critical
aspects arise from the adoption of the reification mechanism. First, although the
solution is sound, the problem remains that some predicates are more natural with
more than two arguments. Secondly, reification is a quite complex and powerful
technique, which may appear misplaced in a basic layer of the Semantic Web; it
would have seemed more natural to include it in more powerful layers that provide
richer representational capabilities.
In addition, the XML syntax of RDF is quite verbose and can easily become too
cumbersome to be managed directly by users, especially for huge data models.
Hence the adoption of user-friendly tools that automatically translate higher-level
representations into RDF.
Eventually, RDF being a standard format, the benefits of drafting data in
RDF can be seen as similar to those of drafting information in HTML in the early
days of the Web.
The expressiveness of RDF and RDF-Schema (described above) is deliberately very
limited: RDF is roughly limited to modeling binary relationships, and
RDF-S is limited to sub-class hierarchies and property hierarchies, with domain and
range restrictions on the latter.
• a formal semantics,
• a convenience of expression.
The importance of a well-defined syntax is clear, and known from the area of
programming languages: it is a necessary condition for “machine understandability”
and thus for machine processing of information. Both RDF/RDF-S and OWL have
this kind of syntax. A formal semantics allows the meaning of knowledge to be
described precisely: “precisely” means that the semantics does not refer to subjective
intuitions and is not open to different interpretations by different people (or different
machines). The importance of a formal semantics is well established, for example, in
the domain of mathematical logic. Formal semantics is needed to allow people to
reason about knowledge. For ontologies, this means that we may reason about:
Automatic reasoning allows many more cases to be checked than could ever be checked manually. Such checks become critical when developing large ontologies, where multiple authors are involved, as well as when integrating and sharing ontologies from various sources.
A formal semantics is obtained by defining an explicit mapping between an ontology language and a known logical formalism, and by using automated reasoners that already exist for that formalism. OWL, for instance, is (partially) mapped onto description logic, and makes use of existing reasoners such as FaCT, Pellet and RACER. Description logics are a subset of predicate logic for which efficient reasoning support is possible.
OWL Full
The entire Web Ontology Language is called OWL Full and uses all the OWL language primitives. In addition, it allows these primitives to be combined in arbitrary ways with RDF and RDF Schema. This includes the possibility (already present in RDF) of changing the meaning of the predefined (RDF and OWL) primitives, by applying the language primitives to each other. For example, in OWL Full it is possible to impose a cardinality constraint on the class of all classes, essentially limiting the number of classes that can be described in an ontology.
The advantage of OWL Full is that it is fully upward-compatible with RDF, both syntactically and semantically: any legal RDF document is also a legal OWL Full document, and any valid RDF/RDF-S conclusion is also a valid OWL Full conclusion. The disadvantage of OWL Full is that the language has become so powerful as to be undecidable, ruling out complete and efficient reasoning support.
OWL DL
In order to regain computational efficiency, OWL DL (DL stands for Description Logic) is a sublanguage of OWL Full that restricts how the constructors from RDF and OWL may be used: applying OWL’s constructors to each other is prohibited, so as to ensure that the language corresponds to a well-studied description logic.
The advantage of this is that it permits efficient reasoning support. The disadvantage is the loss of full compatibility with RDF: an RDF document will, in general, have to be extended in some ways and restricted in others before becoming a legal OWL DL document. Conversely, every legal OWL DL document is a legal RDF document.
OWL Lite
An even further restriction limits OWL DL to a subset of the language constructors. For example, OWL Lite excludes enumerated classes, disjointness statements, and arbitrary cardinality.
The advantage is a language that is easier for users to grasp and easier for developers to implement. The disadvantage is, of course, a restricted expressivity.
<rdf:RDF
  xmlns:owl  = "http://www.w3.org/2002/07/owl#"
  xmlns:rdf  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
  xmlns:xsd  = "http://www.w3.org/2001/XMLSchema#">
An OWL ontology can start with a set of assertions for housekeeping purposes. These assertions are grouped under an owl:Ontology element, which contains comments, version control information, and inclusions of other ontologies.
The most important of the above assertions is owl:imports, which lists other ontologies whose content is assumed to be part of the current ontology. It is important to be aware that owl:imports is a transitive property: if ontology A imports ontology B, and ontology B imports ontology C, then A also imports C.
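The transitive behaviour of owl:imports can be sketched as a simple closure computation. The following Python fragment is a minimal illustration only: the ontology names and the imports table are invented, and real OWL tooling would of course resolve the imported documents themselves.

```python
# Hypothetical sketch of the transitive closure of owl:imports.
# The ontology names and the "imports" table are invented for illustration.

def import_closure(ontology, imports):
    """Return the set of all ontologies transitively imported by `ontology`."""
    closure, stack = set(), [ontology]
    while stack:
        current = stack.pop()
        for imported in imports.get(current, ()):
            if imported not in closure:
                closure.add(imported)
                stack.append(imported)
    return closure

# A imports B, B imports C: A therefore also imports C.
imports = {"A": ["B"], "B": ["C"], "C": []}
print(sorted(import_closure("A", imports)))  # ['B', 'C']
```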
Classes
Classes are defined using the owl:Class element and can be organized in hierarchies
by means of the rdfs:subClassOf construct.
<owl:Class rdf:ID="Lion">
  <rdfs:subClassOf rdf:resource="#Carnivore"/>
</owl:Class>
It is also possible to state that two classes are completely disjoint, such as herbivores and carnivores, using the owl:disjointWith construct.
<owl:Class rdf:about="#carnivore">
  <owl:disjointWith rdf:resource="#herbivore"/>
  <owl:disjointWith rdf:resource="#omnivore"/>
</owl:Class>
Properties
OWL defines two kinds of properties:
• Object properties, which relate objects to other objects. An example, in the savanna ontology, is the relation eats.
• Datatype properties, which relate objects to datatype values. Examples are age, name, and so on. OWL does not have any predefined datatypes, nor does it provide special definition facilities. Instead, it allows the use of XML Schema datatypes, making use of the layered architecture of the Semantic Web.
Here are two examples, the first of a datatype property and the second of an object property:
<owl:DatatypeProperty rdf:ID="age">
  <rdfs:range rdf:resource="&xsd;#nonNegativeInteger"/>
</owl:DatatypeProperty>

<owl:ObjectProperty rdf:ID="eats">
  <rdfs:domain rdf:resource="#animal"/>
</owl:ObjectProperty>
More than one domain and range may be declared; in such a case the intersection of the domains (ranges) is taken. OWL also allows “inverse properties” to be identified: a specific OWL element exists for them (owl:inverseOf), which relates a property to its inverse by interchanging the domain and range definitions.
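The effect of owl:inverseOf can be illustrated with a toy triple store: every triple using eats licenses the inverse triple using eatenBy. This is a hypothetical sketch using plain Python sets, not an actual reasoner.

```python
# Sketch of the owl:inverseOf semantics: every triple (x, eats, y)
# licenses the inverse triple (y, eatenBy, x). The triple store is a
# plain Python set of (subject, predicate, object) tuples.

def add_inverses(triples, prop, inverse_prop):
    """Materialize the inverse of `prop` for every triple that uses it."""
    derived = {(o, inverse_prop, s) for (s, p, o) in triples if p == prop}
    return triples | derived

triples = {("lion", "eats", "gazelle")}
triples = add_inverses(triples, "eats", "eatenBy")
print(("gazelle", "eatenBy", "lion") in triples)  # True
```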
<owl:ObjectProperty rdf:ID="eatenBy">
  <owl:inverseOf rdf:resource="#eats"/>
</owl:ObjectProperty>
Finally, equivalence of properties can be declared through the owl:equivalentProperty element.
Restrictions on properties
In RDFS it is possible to declare a class C as a subclass of a class C′; then every instance of C will also be an instance of C′. OWL additionally allows us to specify classes C that satisfy some precise conditions, i.e., such that all instances of C satisfy the conditions. This is done by defining C as a subclass of the class C′ which collects all the objects that satisfy the conditions. In general, C′ remains anonymous. In OWL there are three specific elements for defining classes based on restrictions: owl:allValuesFrom, owl:someValuesFrom and owl:hasValue, and they are always nested inside an owl:Restriction element. The owl:allValuesFrom element specifies a universal quantification (∀), while owl:someValuesFrom defines an existential quantification (∃).
<owl:Class rdf:about="#firstYearCourse">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#isTaughtBy"/>
      <owl:allValuesFrom rdf:resource="#Professor"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

<owl:Class rdf:about="#academicStaffMember">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#teaches"/>
      <owl:someValuesFrom rdf:resource="#undergraduateCourse"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
The first example states that first-year courses can be taught only by professors (universal quantification). The second example, instead, requires that there exist an undergraduate course taught by an instance of the class of academic staff members (existential quantification).
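The two restriction semantics can be paraphrased over a toy set of course-to-teacher facts. The following Python sketch is purely illustrative (the data is invented, and a real reasoner works on open-world semantics rather than on closed Python sets):

```python
# Illustrative check of the two restriction semantics over toy data.

def all_values_from(values, cls):
    """owl:allValuesFrom: every filler belongs to cls (universal, forall).
    Note: vacuously true when there are no fillers at all."""
    return all(v in cls for v in values)

def some_values_from(values, cls):
    """owl:someValuesFrom: at least one filler belongs to cls (existential)."""
    return any(v in cls for v in values)

professors = {"Smith", "Jones"}
taught_by = {"math101": ["Smith"], "art200": ["Smith", "Doe"]}

print(all_values_from(taught_by["math101"], professors))  # True
print(all_values_from(taught_by["art200"], professors))   # False
print(some_values_from(taught_by["art200"], professors))  # True
```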
In general, an owl:Restriction element contains an owl:onProperty element and one or more restriction declarations. Restrictions defining the cardinality of a given class are also supported, through the elements:
• owl:minCardinality,
• owl:maxCardinality,
• owl:cardinality.
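The meaning of these cardinality constructs can be sketched as a simple count check over a property's fillers; a hedged illustration (the function and data are hypothetical, not OWL API calls):

```python
# Sketch of cardinality restrictions on a property's fillers.

def satisfies_cardinality(values, min_card=0, max_card=None):
    """Check owl:minCardinality / owl:maxCardinality on a list of fillers.
    owl:cardinality corresponds to min_card == max_card."""
    n = len(values)
    return n >= min_card and (max_card is None or n <= max_card)

# e.g. a course must have exactly one responsible teacher:
print(satisfies_cardinality(["Smith"], min_card=1, max_card=1))         # True
print(satisfies_cardinality(["Smith", "Doe"], min_card=1, max_card=1))  # False
```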
Special properties
Some properties of the property element can be defined directly:
Instances
Instances of classes, in OWL, are declared as in RDF:
<rdf:Description rdf:ID="Kimba">
  <rdf:type rdf:resource="#Lion"/>
</rdf:Description>
Unlike typical database systems, OWL does not adopt a unique-names assumption; therefore, two instances with different names are not necessarily two different individuals. To ensure that different individuals are recognized as such by automated reasoners, inequality must be explicitly asserted.
<lecturer rdf:ID="91145">
  <owl:differentFrom rdf:resource="#98760"/>
</lecturer>
Because such inequality statements occur frequently, and the required number of statements would explode when stating the inequality of a large number of individuals, OWL provides a shorthand notation to assert pairwise inequality for all the individuals in a list: owl:AllDifferent.
<owl:AllDifferent>
  <owl:distinctMembers rdf:parseType="Collection">
    <lecturer rdf:about="#91345"/>
    <lecturer rdf:about="#91247"/>
    <lecturer rdf:about="#95647"/>
    <lecturer rdf:about="#98920"/>
  </owl:distinctMembers>
</owl:AllDifferent>
Note that owl:distinctMembers can only be used in combination with the
owl:AllDifferent element.
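The combinatorial saving behind the shorthand is easy to quantify: stating pairwise inequality for n individuals with owl:differentFrom alone would take n·(n−1)/2 statements. A small illustrative computation (the lecturer identifiers are those of the example above):

```python
from itertools import combinations

# Why owl:AllDifferent is a useful shorthand: expanding it into explicit
# owl:differentFrom statements yields n*(n-1)/2 pairs.

def pairwise_inequalities(individuals):
    return [(a, "owl:differentFrom", b) for a, b in combinations(individuals, 2)]

lecturers = ["91345", "91247", "95647", "98920"]
print(len(pairwise_inequalities(lecturers)))  # 6 statements for 4 individuals
```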
Chapter 3
Many businesses and activities, whether on the web or not, are human- and knowledge-intensive. Examples include consulting, advertising, media, high-tech, pharmaceuticals, law, software development, etc. Knowledge-intensive organizations have already found that a large number of problems can be attributed to uncaptured and unshared product and process knowledge, to the lack of “who knows what” information, to the need to capture lessons learned and best practices, and to the need for more effective distance collaboration.
These realizations are leading to a growing call for knowledge management. Knowledge capture and learning can happen ad hoc (e.g. knowledge sharing around the coffee maker or problem discussions around the water cooler). Sharing, however, is more efficient when organized; moreover, knowledge must be captured, stored and organized according to the context of each company or organization in order to be useful and efficiently disseminated.
The knowledge items that an organization usually needs to manage can take different forms and contents. They include manuals, correspondence with vendors and customers, news, competitor intelligence, and knowledge derived from work processes (e.g. documentation, proposals, project plans, etc.), possibly in different formats (text, pictures, video). The amount of information and knowledge that a modern organization must capture, store and share, the geographic distribution of
3 – Web applications for Information Management
sources and consumers, and the dynamic evolution of information make the use of technology nearly mandatory.
Tacit knowledge resides in people’s minds, while explicit knowledge is knowledge that resides in the knowledge base. Conversion of knowledge from one form to another often leads to the creation of new knowledge; such conversion may follow four different patterns:
3.2 – Software tools for knowledge management
access information stored in knowledge bases, and e-Learning systems for allowing users to perform “Internalization”, possibly in a personalized way.
These technologies are still far from being completely semantic, i.e. based on context- and domain-aware elements able to fully support knowledge operations at a higher level, similar to what is usually done by humans, while at the same time remaining accessible to machines.
The following subsections take as a reference three widely adopted technologies: content management systems (CMS), search engines and e-Learning systems. For each of them, they discuss the basic principles and “how and where” semantics can improve their performance. Performance is evaluated both from the functional point of view and from the knowledge-transfer point of view.
For each technology, a separate subsection is therefore provided, with a brief introduction, a discussion of currently available solutions and some considerations about the integration of semantic functionalities. Finally, the shortcomings and solutions identified will define the requirements for a general-purpose platform able to provide semantic integration on the web with minimal effort.
• Distribute content management and control. The Web manager has often been a critical bottleneck in the timely publication and ongoing maintenance of Web content. CMSs remove that bottleneck by distributing content-management responsibilities to individuals throughout the organization. The individuals responsible for content now have the authority and tools to maintain the content themselves, without any knowledge of HTML, graphic design, or Web publishing.
• Separate content from layout and design. In a CMS, content is stored sepa-
rately from its publication format. Content managers enter the content only
once, but it can appear in many different places, formatted using very differ-
ent layouts and graphic designs. All the pages immediately reflect approved
content changes.
• Create reusable content repositories. CMSs allow for reuse of content. Objects
such as templates, graphics, images, and content are created and entered once
and then reused as needed throughout the Web site.
• Build sophisticated content access and security. Good CMSs allow for sophis-
ticated control of content access, both for content managers who create and
maintain content and for users who view and use it. Web managers should be
able to define who has access to different types of information and what type
of access each person has.
• Include structures to collect and store metadata. Because data is stored separately from both layout and design, the database also stores metadata describing and defining the data, usually including author, creation date, publication and expiration dates, content descriptions and indexing information, category information, revision history, security and access information, and a range of other content-related data.
• Allow for customization and integration with legacy systems. CMSs allow cus-
tomization of the site functionality through programming. They can expose
their functionalities through an application programming interface (API) and
they can coexist with, and integrate, already deployed legacy systems.
• Allow for archiving and version control. High-end CMSs usually provide mechanisms for storing and managing revisions to content. As changes are made, the system stores archives of the content and allows reversion of any page to an earlier version. The system also provides means for periodically pruning the archived content, preferably on the basis of criteria including age, location, number of versions, etc.
Logical architecture
A typical CMS is organized as in Figure 3.2.
Semantics integration
As shown in the aforementioned paragraphs, CMSs are, at the same time, effec-
tive and critical components for knowledge exploitation, especially for explicit to
either the query or the document content, and the corresponding conceptual model has been effectively performed, matches can be found independently of vocabularies. This interesting capability of finding resources independently of vocabularies can be leveraged by “Semantic Web” CMSs to offer functionalities far more advanced than current ones. Language independence, for example, can easily be achieved: a user can write a query in a given language, likely his/her mother tongue, and ask the system to retrieve data both in the query language and in other languages that he/she can understand. Matching queries and documents at the conceptual level makes this process fairly easy. Currently available systems, instead, can usually provide results only in the same language as the query (in the rather fortunate case in which the query language corresponds to one of the languages adopted for keyword-based indexing).
Other retrieval-related functionalities include the contextual retrieval of “semantically related” pages during user navigation. When a user requests a given page of the site managed by a semantic CMS, the page’s conceptual description is picked up and used to retrieve links to pages of the site that have conceptual descriptions similar to that of the requested page. This allows, for example, browsing the published site by similarity of pages rather than by following the predefined interaction scenario fixed by the site map.
The impact of semantics on the CMS technology is not limited to storage, classi-
fication and retrieval of resources. Semantics can also be extremely useful in defining
the organization of published sites and in defining the expected work flow for re-
sources produced and reviewed within the CMS environment. There are several
attempts to provide the first CMS implementations in which content is automat-
ically organized and published according to a given ontology. In the same way,
ontologies are used for defining the complex interactions that characterize the docu-
ment review process and the expected steps required for a document to be published
by the CMS.
In conclusion, the introduction of semantics handling in content management systems can provide several advantages, both for document storage and retrieval and for site navigation and the publication workflow.
dog are very simple examples of information retrieval systems. More qualified and probably more widespread examples are also available, Google [9] above all.
A simple information retrieval system works on the concepts of document indexing, classification and retrieval. These three processes are at the basis of every computer-based search system. For each of the three, several techniques have been studied, from the early heuristic-based approaches up to today’s statistical and probabilistic methods. The logical architecture of a typical information retrieval system is shown in Figure 3.3.
Several blocks can be identified. The Text Operations block performs all the operations required for adapting the text of documents to the indexing process; for example, stop words are removed and the remaining words are stemmed. The Indexing block constructs an inverted index of word-to-document pointers. The Searching block retrieves from the inverted index all the documents that contain a given query token. The Ranking block, instead, ranks all the retrieved documents according to a similarity measure which evaluates how similar documents are to queries. The User Interface allows users to issue queries and view results; sometimes it also supports relevance feedback, which allows users to improve the search performance of the IR system by explicitly stating which results are relevant and which are not. Finally, the Query Operations block transforms the user query to improve the IR system’s performance: for example, a standard thesaurus can be used to expand the user query with new relevant terms, or the query can be transformed by taking into account users’ suggestions coming from relevance feedback.
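The blocks described above can be sketched in a few lines of Python. This is a deliberately minimal illustration: the stop-word list and documents are invented, stemming and ranking are omitted, and retrieval is plain boolean intersection.

```python
# Minimal sketch of the pipeline: text operations (stop-word removal),
# an inverted index, and boolean retrieval over it.

STOP_WORDS = {"the", "a", "of", "and"}

def text_operations(text):
    """Tokenize and remove stop words (stemming omitted for brevity)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return the documents containing every query token."""
    postings = [index.get(t, set()) for t in text_operations(query)]
    return set.intersection(*postings) if postings else set()

docs = {1: "the horse of the savanna", 2: "a lion eats the gazelle"}
index = build_inverted_index(docs)
print(search(index, "horse savanna"))  # {1}
```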
Describing in detail a significant part of all available approaches to information retrieval and their variants would require more than one thesis alone. In addition,
the scope of this section is not to be exhaustive with respect to available technologies and solutions. Instead, the main goal is to provide a rough description of how an information retrieval system works and a glimpse of the advantages that the adoption of semantics can bring to Information Retrieval. For readers more interested in this topic, the bibliography section reports several interesting works that can constitute a good starting point for investigation. Of course, the web is the most viable means for gathering other resources.
For the sake of simplicity, this section proceeds by adopting the tf · idf weighting scheme and the vector space model as guiding methodology, and tries to generalize the considerations provided whenever possible.
Indexing
In the indexing process, each searchable resource is analyzed to extract a suitable description. This description will be used, in turn, by the classification and retrieval processes. For now we restrict the description of indexing to text-based documents, i.e. documents which mainly contain human-understandable terms. In this case, indexing intuitively means taking into account, in some way, the information conveyed by the words contained in the document to be indexed. Since humans can understand textual documents, information is indeed contained in them, in a somewhat encoded form. The goal of indexing is to extract this information and store it in a machine-processable form. In performing this extraction, two main approaches are usually adopted: the first tries to mimic what humans do and leads to the wide and complex field of Natural Language Processing. The second, instead, uses information which is much easier for machines to process, such as statistical correlation between occurring words, term frequency and so on. This latter solution is the one adopted by current retrieval systems, while the former finds application only in more restricted search fields, where specific and rather well-defined sub-languages can be found. The tf · idf indexing scheme is a typical example of “machine level” resource indexing.
The base assumption of tf · idf, and of other more sophisticated methods, is that the information in textual resources is encoded in the adopted terms: the more specific a term is, the more easily the topic of a document can be inferred. The main indexing operations therefore deal with the words occurring in the resources being analyzed, trying to extract only the relevant information and to discard all the redundancies typical of written language.
In the tf · idf case, the approach works by inspecting the document terms. First, all the words that usually convey little or no information, such as conjunctions, articles, adverbs, etc., are removed. These are the so-called stop words and typically depend on the language in which the given document is written. Removing the stop words allows frequency-based methods to be adopted without the data being polluted
\[ tf_i(d) = \frac{freq(t_i,d)}{\max_{t_j \in d} freq(t_j,d)} \]
The term frequency alone is clearly too simplistic a feature for characterizing a textual resource. Term frequency, in fact, is only a relative measure of how important (statistically speaking) a word is within a document; no information is provided about the ability of the given word to discriminate the analyzed document from the others. Therefore, a weighting scheme shall be adopted which takes into account the frequency with which the same term occurs in the document base. This weighting scheme is materialized by the inverse document frequency term idf. The inverse document frequency takes into account the relative frequency of the term t_i with respect to the documents already indexed. Formally:
\[ idf_{t_i} = \log \frac{|D|}{|\{ d_k \in D : tf_i(d_k) > 0 \}|} \]
The two terms, i.e., the tf and idf values, are combined into a single value called tf · idf, which describes the ability of the term t_i to discriminate the document d from the others.
\[ tf \cdot idf_{t_i}(d) = tf_i(d) \cdot idf_{t_i} \]
The set of tf · idf values, over each term of the vocabulary L, defines the representation of a document d inside the Information Retrieval system.
It must be noted that this indexing process is strongly vocabulary-dependent: words not occurring in L are not recognized, and if they are used in queries they do not lead to results. The same holds for more complex methods where L is built from the indexed documents or from a training set of documents: words not occurring in the set of analyzed documents are not taken into account. So, for example, if in a set of textual resources only the word horse occurs, a query for stallion will not provide results, even if horse and stallion can be used as synonyms.
NLP is expected to solve these problems; however, its adoption in information retrieval systems still appears immature.
vector of features representing a given document and the vector representing a given
query. Similarity between documents and classes can be evaluated in the same way.
Formally, the cosine similarity is defined as:
\[ Sim(d_i,d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|} \]
horse as query keyword, only the first document is retrieved, while the second is
completely missed, and vice-versa.
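The cosine formula and the horse/stallion miss can be reproduced with a small sketch over term-frequency vectors (toy data, invented for illustration): with no shared terms the dot product, and hence the similarity, is zero, however close the meanings.

```python
import math

# Cosine similarity over sparse term-frequency vectors (dicts), showing
# that synonyms with no shared terms yield zero similarity.

def cosine(u, v):
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in terms)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

doc = {"horse": 3, "savanna": 1}
query_horse = {"horse": 1}
query_stallion = {"stallion": 1}

print(round(cosine(doc, query_horse), 3))  # > 0: the document is retrieved
print(cosine(doc, query_stallion))         # 0.0: missed, despite the synonymy
```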
Semantics Integration
Semantics can play a great role in information retrieval systems, on the one hand addressing all the issues related to vocabularies and differing terminologies, and on the other enabling users to perform “conceptual” queries. Conceptual queries specify the user’s information need with high-level “concept descriptions” rather than mere keywords, which can be imprecise and sometimes misleading (try searching for “jaguar” on Google: will you find cars or animals?).
With respect to the topics described in the previous sections, semantics can be integrated in the indexing process as well as in the retrieval process. In the indexing process, semantics can be adopted by mapping, i.e., by classifying, resources with respect to a formal ontology.
Clearly, the problem of term dependence still remains: resources are indexed as before and, in addition, a classification task is performed. However, this dependence is somewhat mitigated, because the ontology acts as a bridge and merging point for the different vocabularies adopted by the IR system. Whatever language is used, and whatever domain-specific vocabulary is adopted, the features resulting from indexing are always concepts, which are in turn language- and vocabulary-independent. The same holds for keywords in user queries: using an ontology as a semantic backbone, synonyms can easily be taken into account, for example by associating many keywords with each ontology concept (lexicon).
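A lexicon of this kind can be sketched as a many-to-one mapping from keywords to concept identifiers. The vocabulary and concept names below are invented for illustration; the point is only that synonyms and translations collapse onto the same language-independent concept.

```python
# Hypothetical lexicon mapping many keywords (in several languages)
# to one ontology concept. All entries are invented examples.

LEXICON = {
    "horse": "concept:Horse",
    "stallion": "concept:Horse",
    "cavallo": "concept:Horse",   # Italian
    "jaguar": "concept:Jaguar",
}

def to_concepts(keywords):
    """Translate query keywords into language-independent concepts."""
    return {LEXICON[k] for k in keywords if k in LEXICON}

# Synonyms and translations collapse onto the same concept:
print(to_concepts(["stallion"]) == to_concepts(["cavallo"]))  # True
```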
In the retrieval process, semantics can bypass synonym-related issues and can also provide new kinds of searches, in principle similar to the well-known category search. The user, in fact, can directly select the concepts modeled in the IR ontology as a query. Such a query is then easily converted into retrieved resources, without vocabulary and, most importantly, without domain-related problems. So, if a user browsing a naturalistic web site performs a search selecting the concept “jaguar”, only references to animals will be retrieved, the context being fixed by the site ontology, which is about nature.
Finally, the adoption of semantics can easily support queries for related documents and query-refinement processes. Queries for related documents start from a sample page, possibly retrieved by the IR system in a previous query, and request similar pages stored in the IR system’s knowledge base. The ontology is again the cornerstone of the process, allowing the system to find resources whose conceptual descriptions are similar to that of the sample page. The process is completely vocabulary-independent and transparently supports cross-lingual operations.
In query refinement, instead, the user specifies a query that provides bad or insufficiently relevant results. The user can then refine his/her query by selecting new
terms, either more specific or more general. Ontologies can help here too: the semantic relationships occurring between ontology concepts offer, in fact, a built-in method for query refinement. A semantic IR system can therefore easily allow users to widen or restrict their queries to find more relevant results. It can even run queries proactively, in order to respond to user demands in a timely manner.
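Ontology-driven refinement can be sketched as moving along the concept hierarchy: widening replaces a query concept with its parent, restricting replaces it with its children. The small taxonomy below is invented for illustration.

```python
# Sketch of query refinement over a toy concept taxonomy:
# widening a query moves up the hierarchy, restricting it moves down.

PARENT = {"Lion": "Carnivore", "Carnivore": "Animal", "Herbivore": "Animal"}

def widen(concept):
    """Replace a concept with its parent (no-op at the top of the taxonomy)."""
    return PARENT.get(concept, concept)

def restrict(concept):
    """Replace a concept with the set of its direct children."""
    return {c for c, p in PARENT.items() if p == concept}

print(widen("Lion"))       # 'Carnivore'
print(restrict("Animal"))  # {'Carnivore', 'Herbivore'}
```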
• Offer academic or educational credit via college and university on-line learning
As with any learning medium, the use of e-Learning offers benefits as yet unrealized in traditional training, while also presenting new risks to both producers and users. On the positive side, e-Learning products:
• Create stability and consistency of content, due to the ease with which revisions can be made
• Eliminate the travel and lodging expenses required for traditional, in-person training
Nevertheless, like any other training format, it also has disadvantages and risks associated with its production and use. Before committing to e-Learning, one must consider the following:
• User reaction and participation often depend on the level of individual computer literacy
• Development costs can exceed initial estimates unless clear production goals are established
Careful thought and planning must go into a decision to purchase, implement, and utilize e-Learning products, whether bought off-the-shelf or customized for specific purposes. This is often referred to as the “build or buy” decision. In either case, organizations considering e-Learning should conduct a comprehensive analysis of their needs, goals, education or training plans, and their current infrastructure to determine whether e-Learning is a suitable pursuit.
e-Learning standards
All the major features of e-Learning, i.e., the ability to customize courses, to track progress, and to offer “just-in-time” learning opportunities, are only feasible if the basic infrastructure of an e-Learning component is designed to interoperate and communicate with components from a variety of sources. These elemental units are usually called “learning objects” and are the basis of the standardization movement.
There are numerous definitions of a learning object, but it is basically a small “chunk” of learning content that focuses on a specific learning objective. Learning objects can contain one or many components, or “information objects”, including text, images, video, and the like. Reusability shall be supported both at the learning-object and at the information-object level; by standardizing the way in which these objects are built and indexed, both learning objects and information objects become easy to find and use.
• Facilitate the translation of human languages (to re-purpose the content for use in different cultures);
• Define a semantic framework that allows the integration of legacy systems and the development of data-exchange formats;
• Standardize the packaging of learning content, allowing simple transmission and activation of learning objects.
Learning management systems and applications: Learning management systems play an important role in facilitating a learning-object strategy. The management system serves as a type of gateway where content enters, is assembled into meaningful lessons based on the learner’s profile, and is presented to the learner, whose progress is then tracked by the management system. It is crucial, therefore,
that the system is able to operate with content and tools from multiple sources. The standards developed in this category:
• Allow lessons and courses to move from one computer-managed instruction (CMI) system to another, while maintaining their ease of use and functionality;
• Establish a protocol that aids communication between the software tools a learner is using (e.g. text editors and spreadsheets) and the instructional software agents that provide guidance to the learner.
Semantics Integration
questions, and ask your calendaring agent to make an appointment with one of the teachers in a few days.
A week later you feel confident enough to change the learning-objective status for Taylor expansions in your portfolio from ’active, questions pending’ to ’resting, but not fully explored’. You also mark as public in the portfolio your exploration sequence, the conceptual overviews you produced in discussions with the student, and some annotations. You conclude by registering yourself as a resource at the level ’beginner’, with a scope restricting visibility to students at your university only.
In this scenario some points can be identified where semantics integration plays
a crucial role:
portfolios. Every resource in RDF/S and OWL has a unique identifier, therefore annotations can be about whatever is needed. Attributes can define the annotation author, the annotator’s trust level, experience, etc., offering the basic infrastructure for building a potentially worldwide collaboration framework.
Chapter 4
4 – Requirements for Semantic Web Applications
2. The system shall serve different users, of different nationalities: all required
services must support the use of different languages.
3. The system shall enable cross-lingual queries, i.e. queries written in a given language that provide results in one or more different languages. As an example, a user might specify a query in English and require results in English, Italian and French.
4. The system shall provide a directory-like view of the ontology, for the knowl-
edge domain in which it works.
5. When a user selects a category label, a semantic web application shall provide the resources classified as pertaining to that category (ontology concept).
7. The system shall provide a classical search (textual search) functionality based
on resource contents. That is to say, it shall provide a search engine able to
work on synonyms and to provide results even if the words in the query do
not occur in the indexed resources.
8. The system shall allow for manual classification of resources with respect to
a given conceptual domain. The content editor (journalist) must therefore be
enabled to easily specify the concepts for which a resource is relevant.
9. For each selectable concept, the system shall provide a label and an extended
description, localized in the user language. This facilitates the user compre-
hension of the domain model, thus reducing misclassified resources.
10. The system shall facilitate manual classification of resources by providing the
available concepts for classification (the old keywords, in a sense) and by only
allowing selection of concepts actually present in the ontology.
4.1 – Functional requirements
13. The system shall be able to classify both owned and non-owned resources.
Therefore a site using the system should be able to offer conceptual searches
both on its internal resources and on resources of other related sites.
14. In response to a specific user setting, the system should provide semantic and
transparent search functionalities. In other words, the system should perform
searches in background, while the user is surfing the site, taking advantage
of the information coming from the user navigation to generate meaningful
queries.
15. In the transparent search mode, the system should provide additional information, i.e. retrieved pages, if and only if the relevance of the results, as perceived by the system, is above a reasonable threshold. Such a threshold should be set by the user and may be modified during site navigation (similarly to the “Google Personalized” system).
16. The system can provide a relevance feedback plug-in for the most widespread web browsers (at least Mozilla and Internet Explorer) in which the user can view the classification of visited pages as deduced by the system and can correct it.
17. The system should guide the logical organization of a site, by proposing a
suitable location for a new page in the site map, depending on the conceptual
classification of that resource. For example, a new page about municipality
aids for people with disabilities should be easily accessible from the pages
about the municipality services and from the pages about disability aids.
The highest priority needs resulting from the gathering phase concern the search-related part of web sites, mainly because of the poor performance of today’s syntactic search engines. Performance is perceived as poor despite the high precision and recall values of such engines, precisely because of the syntactic nature of this technology: syntactic engines are not able to retrieve resources on the basis of their conceptual content; they can only address retrieval through the occurrence of a finite set of keywords, which may not appear in a user query.
The first requirement states that the similarity between two resources, and the similarity between queries and resources, must be evaluated on the basis of resource content in a semantically rich way. The immediately following requirement is, surprisingly, not directly concerned with the search task but deals with multilingualism. In the era of global services, the ability to handle service requests and responses in different languages is perceived as a critical factor. Moreover, the capability to perform cross-language operations, i.e. operations triggered in one language and producing results in another, is an added value that many web operators would like to offer their users. Multilingualism is, in a sense, completely independent of the adoption of semantics in web applications; however, it is much simpler to obtain when the main business of searching, classifying and matching is performed at a conceptual level, which is language independent.
The other side of the coin, i.e. the classification task, is addressed by the requirements labeled from 8 to 13, which refer to three different degrees of automation in the semantic classification of resources. The first of these requirements assumes that, to provide results relevant to a human user, the classification must be performed by a human, who manually annotates the published resources by creating associations between such resources and a model of the conceptual domain into which the web application is deployed (the ontology).
The following requirement states that human classification is actually the only way to ensure the provision of “meaningful” results, as stated by requirement 8. However, this task can sometimes be overwhelming, especially if the conceptual domain is vast and complex. In such cases, intelligent systems must help by doing the hard work and involving humans only for the approval, modification or rejection of automatically extracted classifications.
4.2 – Use Cases for Functional Requirements
selects the directory service of the site where resources are subdivided into homo-
geneous thematic sets. He/she searches for “disability” and subsequently selects
the corresponding directory entry for retrieving resources. The system receives the
request, maps the directory entry to the corresponding ontology concepts and ex-
tracts the most relevant resources with respect to that query. In other words, the
system identifies the involved ontology concepts and searches its knowledge base for
resources annotated as “about” those concepts. Then it ranks the available data according to their semantic similarity with the user query (the selected concepts) and finally provides the results as a set of hyperlinks.
the wait list for publication, and of establishing the next steps in the site editorial
activity.
This use case (Figure 4.3) is focused on the process of creating new content and is specifically devoted to clarifying the content classification phase.
When a redactor has completed the editing of a new resource (article), he needs
to classify the new resource, taking into account the site knowledge base, in order
to allow users to retrieve and navigate the site pages according to their conceptual
content (see the previous use cases).
The classification is a manual process in this use case: the redactor has access to the ontology and is required to select the concepts that are relevant with respect to the newly created content, possibly specifying a measure of “relatedness” in the range between 0% and 100%. To perform such a task, he can navigate a tree-like ontology representation and select the relevant concepts.
For each concept, a simple interface allows the redactor to specify the degree of correlation between the article being classified and the concept, providing four choices for the relation strength: low (25%), middle (50%), high (75%) and very high (100%).
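This mapping from interface labels to numeric weights can be sketched as follows. This is a hypothetical illustration; all names (`RELATEDNESS`, `classify`) are assumptions, not part of the H-DOSE implementation.

```python
# Hypothetical sketch (names assumed): mapping the four relatedness choices
# offered by the classification interface to the numeric weights used
# internally when the annotations are stored.
RELATEDNESS = {
    "low": 0.25,
    "middle": 0.50,
    "high": 0.75,
    "very high": 1.00,
}

def classify(article_id, selections):
    """Turn the redactor's concept selections into weighted annotations."""
    return [(article_id, concept, RELATEDNESS[label])
            for concept, label in selections.items()]

annotations = classify("art-42", {"disability": "very high",
                                  "municipality": "middle"})
# each tuple is (article, concept, weight)
```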
As the ontology may contain hundreds or thousands of different and complexly related concepts, the classification task easily becomes infeasible, both in terms of time consumption and of cognitive overload on the redactor. The system must therefore provide some aid to the classifier, while preserving the quality of the classification as much as possible.
This can be done by quickly scanning the resource under examination and by providing suggestions to the redactor. The system does not need to perform a full and accurate classification of the resource being edited; it only needs to extract the most relevant conceptual areas that seem related to that resource. It can also modify the ontology navigation interface by highlighting the suggested concepts, thus allowing its human counterpart to refine the proposed annotations, or even to ignore them and proceed with the usual manual classification.
Once all the relevant ontology concepts have been selected, the redactor submits the new article to the system, which stores both the content and the classification, allowing users to semantically query the site for relevant resources, including the one just loaded.
1. The system deployment should require virtually no downtime of the publication system being integrated, or at least should limit the downtime, under normal operational conditions, to a few hours.
4.3 – Non-functional requirements
as much as possible the additional effort required for specifying semantic descriptions and, more generally, reducing the changes to the publication workflow from the user/redactor point of view.
3. The system should be integrable into different site publication systems, independently of the server-side technology adopted for content publication.
4. The system should allow the reuse of already existing technologies: databases,
servers, etc.
6. The system should allow the manipulation of different data formats, at least
HTML, XHTML and plain text.
7. The system should be extensible for incorporating new functionalities and for
allowing the handling of different resource types such as multimedia resources.
1. The system should scale from sites with few pages (around one hundred) up to very large sites with thousands of pages.
2. The system should be time effective, i.e., it should provide results in reasonable time even when it is loaded by several concurrent requests. The response time is effective if it falls within the user’s attention time-frame, which for this kind of application is around ten seconds.
The requirements reported above for semantics integration in today’s web applications can essentially be summarized by the words “easy integration”. The highest priority requirements state, in fact: a semantics-aware system shall be deployable without requiring extra downtime of the site publication system it integrates with. A semantic system should be transparent, i.e. its presence should be unnoticeable for both editors/publishers and end users. A semantic system shall not conflict with already deployed technologies and shall be accessible from any publication framework.
These requirements fall in the “usability” class; in the same class there are also some low priority requirements that represent further evolutions or specifications of what is intended by the high priority ones.
From these, in fact, emerges the necessity to reuse existing technologies such as databases, web servers, etc. as much as possible (see requirement 4), and the necessity to design platform independent systems (requirement 5). Finally, some requirements tackle the ability of a semantic system to handle many media. Web stakeholders (content publishers, editors, redactors, users) are in fact increasingly aware of the impact that new media have on clients and end users. As a consequence, while traditional technologies such as HTML pages and their evolutions (DHTML, XHTML, ...) are naturally included in semantic elaboration, new information media such as audio streams, videos, DVDs and multimedia in general shall also be taken into account and possibly supported.
Performance issues also deserve attention in the design process: once the other requirements are fixed, performance can significantly affect the effectiveness of a semantic system and can strongly influence its adoption in real world applications. Scalability and timely responses (requirements 1 and 2 under performance) are therefore high priority requirements that must be satisfied to fill the gap that still persists between academic applications and real world solutions.
Chapter 5
This chapter introduces the H-DOSE logical architecture, and uses such architecture as a guide for discussing the basic principles and assumptions on which the platform is built. For every innovative principle, the strengths are highlighted together with the weaknesses that emerged either during the presentation of such elements at international conferences and workshops or during the H-DOSE design and development process.
The requirements analyzed in the previous chapter are at the basis of the design
and implementation of a semantic web platform specifically targeted at offering low-
cost semantics integration for already deployed web sites and applications. Such
an integration is specifically oriented to information-handling applications such as
CMSs, Information Retrieval systems and e-Learning systems.
Designing a complete semantic framework involves roughly two levels of abstrac-
tion, the first one being concerned with the so-called logical design while the second
is more focused on practical implementation issues and is called deployment design.
In this chapter the logical design of the H-DOSE¹ platform is addressed, while the deployment design is tackled in Chapter 6. H-DOSE stands for Holistic Distributed Semantic Elaboration Platform. The reasons for calling it holistic will become clear in the following sections.
¹ The name H-DOSE comes from “Holistic DOSE”, since it integrates and reconciles different points of view (semantic web, web services, multi-agent systems, etc.). It is commonly pronounced “High-Dose”.
5 – The H-DOSE platform: logical architecture
5.1 – The basic components of the H-DOSE semantic platform
5.1.1 Ontology
The H-DOSE semantic platform considers multilingualism a critical issue that must be addressed, as stated by functional requirements 2 and 3 reported in Chapter 4, Section 4.1.
Multilingualism issues in ontology-based applications are still active research
topics in the Semantic Web community. For solving them two main approaches
are currently under investigation: the first is based on the integration of language-
specific ontologies via ontology merging techniques, while the second assumes that
a common set of concepts exists that could be shared by different languages. According to the first approach, people speaking different languages will model a given domain area by defining different ontologies, which will then be merged to provide a multilingual semantic environment. The resource requirements for language management will be comparable, in terms of human and hardware resources, to those of ontology merging, it actually being the same task.
Moreover, as many aspects of the predefined knowledge area will be common to all languages, many redundant “synonymous” relationships between language-specific ontologies will be defined, increasing resource waste and complexity. Developing a multilingual semantic framework using this methodology therefore involves the risk of obtaining an unmanageable entity, in which great care is required to define relationships between “equivalent ontologies” and to track changes and coherently update those relations.
The second methodology addresses multilingualism issues using a holistic approach: when a common knowledge domain is modeled, it is likely that most domain experts, even working in different languages, can identify a common “core” set of concepts. A single, language-independent ontology could therefore be used to model such an area, sharing concepts between different languages.
In its initial phase, this new stream of research was affected by a common misunderstanding strictly related to concept naming. Many times, in fact, the concept name and/or its definition is taken to be the concept itself, while the concept is actually an abstract entity to which people refer by using words.
A good definition of concept can be: “a well defined sequence of mental processes”. Words are only used as triggers for the mental associations that let us identify and instantiate concepts. In other words, a generic concept, a dog for example, can be correctly identified using any alphanumeric string without any loss of generality. On the other hand, ontology designers usually define concepts using words, or sequences of words, in order to easily identify the meaning of the described entity. Naming concepts with language-specific descriptions is just a usability trick rather than a strict requirement, and should have no semantic implications.
An ontology is by definition language independent, while its instantiation in a specific idiom is effectively achieved by adopting a proper set of textual descriptions for each concept, in each supported language. Such an approach is inherently scalable, as new languages can easily be supported by integrating new lexical entities and definitions, and possibly by slightly restructuring a small number of ontology nodes. Moreover, special purpose relationships could be defined (e.g., links to Wordnet [13] entities), providing ground for more sophisticated functionalities.
The H-DOSE approach uses a language-independent ontology² where concepts
² First presented at SAC 2004, ACM Symposium on Applied Computing, Cyprus.
are defined as high-level entities for which language dependent definitions are spec-
ified. Such semantic entities are linked to a set of different definitions, one for each
supported language, and to a set of words called “synset”. Operationally they can
be defined as:
concept ::= concept_ID, lex
lex ::= (lang_ID, description, synset)+
synset ::= (word)+
A concept definition is a short, human-readable text that identifies the concept meaning as clearly as possible, expressed in a specific language. A synset, instead, is composed of a set of near-synonymous words that humans usually adopt to identify the concept.
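The grammar above can be rendered as a small data model. This is only an illustrative sketch; the class and field names (`Concept`, `Lexicalization`) are assumptions, not taken from the H-DOSE sources.

```python
# Illustrative sketch (names are hypothetical): a concept is a language-
# independent identifier linked to one lexicalization per supported language,
# each carrying a human-readable definition and a synset of near-synonyms.
from dataclasses import dataclass, field

@dataclass
class Lexicalization:
    lang_id: str          # e.g. "en", "it"
    description: str      # short human-readable definition in that language
    synset: list          # near-synonymous words identifying the concept

@dataclass
class Concept:
    concept_id: str                         # language-independent identifier
    lex: list = field(default_factory=list)

dog = Concept("c#dog", [
    Lexicalization("en", "A domesticated canine animal.", ["dog", "hound"]),
    Lexicalization("it", "Animale canino domestico.", ["cane"]),
])
```

Supporting a new language then amounts to appending a new `Lexicalization` to the concept, without touching the concept identifier.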
The process of linking ontology concepts and lexical entities can be deployed according to three different approaches: integration, annotation and hybrid. In the integration approach, lexical entities are included in the ontology as new semantic entities, a sort of special instance defined for each ontology concept. This approach makes automated reasoning on lexical entities much easier. The disadvantage is that, whenever a new lexical entity must be added or an already existing term must be modified, the entire ontology is involved.
The annotation approach solves this shortcoming by keeping the ontology and the lexical entities separate. Lexical entities, which can be, for example, the senses of a Wordnet-like lexical network, are tagged as “related” to a given ontology concept. Tagging allows lexical entities to be modified, updated or even deleted without modifying the ontology.
The hybrid approach mixes the first two by keeping lexical entities and ontology concepts separate, and by restructuring the networks between lexical entities (Wordnet) to better reflect the ontology view of the given knowledge domain. This approach is a little more flexible than the integration approach, since modifications to lexical entities do not necessarily impact the ontology definition. However, the allowed changes can only be relatively small, otherwise the structure of lexical entities would likely require a revision in order to keep reflecting the domain view imposed by the ontology.
H-DOSE adopts the annotation approach, keeping the ontology physically distinct from definitions and synsets. This allows separate management of concepts and language-specific information, and a complete isolation of the semantic and textual layers. Language-specific semantic gaps are supported by including, for some concepts, the definition and synsets in the relevant languages only. This assumption guarantees sufficient expressive power to model conceptual entities typical of each language and, at the same time, reduces redundancy by collapsing all common concepts into a single multilingual entity. The final resource occupation is, by a
The two sets are modeled in the same way inside the ontology. However, concepts belonging to the first category will be linked to definitions and synsets expressed in each supported language, while those belonging to the second set will be linked to smaller subsets of languages.
The complex interaction between ontology designers, users and domain experts required by this approach at design time must build upon the availability of an international network in which people cooperate to model a defined knowledge domain. Such networks have already been proposed, for example in the EU Socrates Minerva CABLE project [3]. CABLE involves a group of partners with proven skills in learning and education and promotes cooperation to define learning materials and case studies for the continuous education of social workers. In CABLE, teams of experts in social sciences and education cooperate, with the support of “multilingual” domain experts, in defining case studies and teaching procedures and in identifying the related semantics in a so-called “virtuous cycle”. The CABLE project can effectively apply the H-DOSE approach to multilingualism, leveraging its “virtuous
cycles” and defining a multilingual ontology for education in social sciences. The
resulting ontology, definitions and synsets, can constitute an effective core for the
implementation of multilingual, semantic, e-Learning environments.
Formal definitions
An H-DOSE ontology is formally defined as a set O of concepts c ∈ C and relations
r ∈ R, together with a set of labels L, descriptions E and words S associated to the
concepts:
O : {C, R, χ, L, E, ψ}
where
C : concepts, c ∈ C
R : relations, r ∈ R, R ⊆ C × C
χ : R → [0,1]
W_lang : words for a language “lang”, w ∈ W_lang
S : sets of near-synonymous words, s ∈ S, s ⊆ C × W_lang
ψ : S → [0,1]
L : labels of concepts, l ∈ L, l ⊆ C × W_lang
E : descriptions of concepts, e ∈ E, e ⊆ C × W_lang
C is the set of concepts c in the ontology O, while R is the set of relations r relating ontology concepts to each other with a given strength χ in the range between 0 and 1. W_lang is, instead, the set of all possible words w in a given language; S is the set of near-synonymous words associated with each ontology concept c, with an association strength evaluated by ψ in the range between 0 and 1. L is the set of labels l associated with a given ontology concept c in a given language, and E is the set of descriptions e associated with such concepts in the same, given, language.
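A minimal sketch of these structures, under the assumption that weighted relations and synset memberships are represented as plain mappings (all identifiers here are hypothetical, not from the H-DOSE code):

```python
# Minimal sketch of the formal structure above: chi weighs relations between
# concept pairs, psi weighs the association of a word with a concept in a
# given language. All names and values are illustrative assumptions.
concepts = {"animal", "dog", "cat"}

# chi : R -> [0,1], with R a subset of C x C
chi = {("dog", "animal"): 1.0, ("cat", "animal"): 1.0, ("dog", "cat"): 0.3}

# psi : word-to-concept association strength, keyed by (concept, lang, word)
psi = {("dog", "en", "dog"): 1.0, ("dog", "en", "hound"): 0.8}

def relation_strength(c1, c2):
    """chi value for an (unordered) concept pair; 0 if no relation exists."""
    return chi.get((c1, c2), chi.get((c2, c1), 0.0))
```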
Ontologies in H-DOSE shall conform to this approach; however, they are allowed to refer or link to other already existing ontologies. In such a case, multilingualism is supported only for the internal ontology, or better, only for the ontologies adopting the above format. Multilingualism support for external ontologies depends on how such a concern has been addressed by the designers of the external models.
5.1.2 Resources
Resources in the H-DOSE platform are considered to be texts written either in HTML, XHTML or plain text. The main motivation for this assumption is that the platform has been designed for supplying semantic services to today’s web applications, and texts are currently the most widespread resources on the Web. However, in order to take into account the 7th non-functional requirement for a semantic platform, resource support is designed to be easily extensible to the multimedia case. In H-DOSE, therefore, a resource is “something, in the platform knowledge domain, about which some information can be provided”, where the term “something” is assumed to be a document, either textual or, in a future extension, multimedia.
Documents can be entire web pages, but they can also be simple, homogeneous chunks of text (or video, in the near future). In the latter case, support for defining relationships between fragments is provided, allowing the platform to specify which fragment belongs to which page.
Semantic annotation and retrieval of fragments is one of the innovative points of the H-DOSE platform. Fragments, in fact, make it possible to take into account different levels of granularity in the classification and retrieval of resources. This allows, for example, extracting and retrieving only those document pieces that are similar to a given user query, eliminating all the disturbing content that usually surrounds relevant information, such as banners, links, navigation menus, etc. Moreover, the ability to specify the mutual relationships between fragments, and between fragments and pages, allows redundancy to be minimized while maintaining the relevance of the provided results.
So, if many fragments of a given web page are relevant with respect to a user query, the entire page is retrieved rather than each of its components. In other cases, if a document about the life of jaguars is well articulated in sections, including, for example, an introduction, some more detailed sections and a conclusion, the retrieved result may vary depending on the level of granularity of the user query. Therefore, if a user asks for “documents about the life of jaguars”, only the introduction may be retrieved, while if the query is more specific, e.g. “documents about jaguar nutrition in the Bangladesh jungle”, or if the user chooses to deepen the previous query, the H-DOSE platform can retrieve the internal sections of the document, better adapting results to the user query.
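The page-versus-fragment policy can be sketched as follows. This is a hedged illustration, not the H-DOSE algorithm: the function name and the 0.5 threshold are assumptions introduced here.

```python
# Hedged sketch of the fragment-vs-page policy described above: when enough of
# a page's fragments match the query, the whole page is returned instead of
# the individual fragments. The 0.5 threshold is an illustrative assumption.
def select_results(page_fragments, relevant, threshold=0.5):
    results = []
    for page, fragments in page_fragments.items():
        hits = [f for f in fragments if f in relevant]
        if fragments and len(hits) / len(fragments) >= threshold:
            results.append(page)      # enough fragments match: return the page
        else:
            results.extend(hits)      # otherwise return only matching fragments
    return results
```

For instance, a page where two of three fragments match the query would be returned whole, while a page with a single matching fragment would contribute only that fragment.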
Besides the granularity with which resources can be manipulated by H-DOSE, the storage policy also deserves some attention. In H-DOSE, documents are not managed directly by the platform; that is to say, H-DOSE does not store indexed documents in its internal database. This design choice allows the platform, on one side, to leverage the already existing, and probably more efficient, storage facilities of CMSs, learning systems and IR systems. On the other side, it makes it possible to semantically describe resources that do not belong to the site in which the platform is deployed. H-DOSE makes no assumption on the ownership of the annotated resources;
they can be owned by whoever deploys the platform as well as by other actors. In any case, they can be indexed, classified and retrieved by the platform through the adoption of an external annotation scheme in which resources are identified by means of URLs and XPointers.
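An external annotation record along these lines might look as follows. The field names are hypothetical; only the idea of identifying a resource by URL plus XPointer comes from the text above.

```python
# Illustrative sketch (field names are hypothetical): an external annotation
# identifies a possibly non-owned resource by URL plus an XPointer locating
# the annotated fragment, so resources can be classified without being stored.
from dataclasses import dataclass

@dataclass
class ExternalAnnotation:
    concept: str      # ontology concept the fragment is "about"
    url: str          # location of the annotated resource
    xpointer: str     # fragment locator inside the resource
    weight: float     # degree of relatedness in [0, 1]

ann = ExternalAnnotation(
    concept="c#disability",
    url="http://example.org/services.html",
    xpointer="xpointer(//div[@id='aids'])",
    weight=0.75,
)
```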
With respect to the ontology formalization of the previous section, resources in
H-DOSE are formally described as follows:
D : documents d ∈ D
5.1.3 Annotations
Annotations are the most important component of a semantic platform since they
allow to describe, from a conceptual point of view, the resources, and since they
constitute the mean for providing conceptual based functionalities such as semantic
search.
H-DOSE, in accordance with functional requirement 13, adopts an approach that keeps annotations and resources well separated. This approach addresses several critical issues: first, one can annotate non-owned resources, i.e. resources for which the classifier has no editing rights; second, annotations are only loosely coupled with the resources they annotate. This allows a page to change formatting, or slightly change content, while the corresponding annotation stays unchanged. On the other side, if a resource becomes obsolete and is retired, the system can handle the situation by simply deleting the corresponding descriptions. Annotations in H-DOSE are formally defined as:
A : annotations a ∈ A
A⊆C×D
ρ : A → [0,1]
where D is the set of all resources d suitable for annotation, A is the set of semantic annotations relating the resources in D to the ontology concepts in O, and ρ is the association weight between resources and ontology concepts. The weight function ρ allows different degrees of “relatedness” to ontology concepts to be specified, similarly to what the Vector Space Model does for classical information retrieval systems, thus obtaining a flexible way of representing knowledge and of tackling resource ambiguity. In RDF notation, an H-DOSE annotation looks like:
<hdose:annotation rdf:ID="15643">
  <hdose:topic rdf:about="#jaguar"/>
  <hdose:document rdf:about="#doc123"/>
  <hdose:weight rdf:datatype="&xsd;double">0.233</hdose:weight>
  <dc:author>H-DOSE</dc:author>
  <hdose:type>auto</hdose:type>
</hdose:annotation>
Together with the ability to merge the fuzziness of documents with the crispness of the ontology specification, spectra are also useful for performing visual inspection of knowledge bases. In fact, they can be visualized using the ontology concepts as the x-axis and the σ(c) values as the corresponding y-values (Figure 5.3).
Since concepts do not possess an implicit ordering relation, the x-axis can be defined
in several ways, allowing the analysis of different aspects of a knowledge base. Depth-
first navigation of the ontology using the “subclassOf” relationship, as an example,
orders similar concepts, at the same granularity, into nearby positions, allowing good
discrimination of the ontology sub-graphs involved in a document annotation.
Breadth-first navigation, instead, allows the detection of the abstraction level of
the indexed resources by putting together concepts lying at the same level in the
ontology. In any case, the resulting graphs have exactly the same capabilities in terms of expressive power and matching properties, differing only in their visual interpretation.
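A depth-first ordering of this kind can be sketched with an iterative traversal. The tree representation and names are assumptions made here for illustration.

```python
# Sketch (tree structure assumed) of the depth-first ordering mentioned above:
# traversing the "subclassOf" hierarchy depth-first places similar concepts at
# nearby x-axis positions when a conceptual spectrum is plotted.
def depth_first_order(root, children):
    order, stack = [], [root]
    while stack:
        concept = stack.pop()
        order.append(concept)
        # push children in reverse so they are visited left-to-right
        stack.extend(reversed(children.get(concept, [])))
    return order

taxonomy = {"animal": ["feline", "canine"], "feline": ["cat"], "canine": ["dog"]}
x_axis = depth_first_order("animal", taxonomy)
# -> ["animal", "feline", "cat", "canine", "dog"]
```

A breadth-first variant (a queue instead of a stack) would instead group concepts by abstraction level, as the text notes.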
The conceptual spectrum σd associated to a document d measures how strong the association between ontology concepts and the document is, taking into account the contribution of the semantic relationships involved in the
³ First presented at ICTAI 2004, IEEE International Conference on Tools with Artificial Intelligence, Boca Raton, Florida.
5.2 – Principles of Semantic Resource Retrieval
i.e., for each ontology concept c, the document conceptual spectrum value is defined
as the sum of ρ contributions extracted from all the annotations associating the
document d with the concept c.
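In symbols, this definition can be reconstructed from the sets A and the weight function ρ introduced earlier (the exact formula in the original layout was lost, so this is a reconstruction consistent with the verbal description):

```latex
\sigma_d(c) = \sum_{\substack{a \in A \\ a = (c,\, d)}} \rho(a)
```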
to each of them. The resulting resource set is ranked according to the computed
resource relevance and proposed to the user.
The first phase of a conceptual search includes logic reasoning or logic inference.
The user query is represented as a clause or as a set of clauses to be logically
satisfied by facts and axioms defined in the domain ontology, and the knowledge
base is then traversed to find suitable instances. In other words, instances and
ontology concepts are merged into a cumulative graph that the reasoning engine
traverses in search of a match between the modeled knowledge and the user query. If
no match is found, nothing can be deduced except that the knowledge
base does not model the given domain with enough information to answer the
user query. Otherwise, the set of facts and axioms satisfying the query is returned,
allowing the identification of relevant resources.
It is important to notice that all provided results are equally relevant with respect
to the user query since they are all able to satisfy the query logical clauses. However,
discrimination should be performed, according to some external measures, in order
to provide a small, highly relevant set of results to the final user; this issue is
addressed in the second phase.
In that phase an ordering function is defined among the set of resources obtained
through logical inference. As an example, we might want to query our knowledge
base for lovely cats living in our city. The inference process would provide all cat
instances for which the properties “live in (city)” and “is lovely” hold. However,
from the logical point of view, there is no means to evaluate how lovely a cat is.
Some more information can therefore be taken into account to rank the results
according to user needs: the amount of “loveliness”, for instance, expressed as the
percentage of persons judging a given cat lovely.
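The two phases can be sketched as follows; the instance data, the predicates and the “loveliness” scores are all hypothetical:

```python
# Sketch of the two-phase conceptual search: (1) logic filtering,
# (2) ranking by an external, quantitative measure.

cats = [
    {"name": "Felix", "city": "Torino", "lovely": True,  "loveliness": 0.9},
    {"name": "Tom",   "city": "Torino", "lovely": True,  "loveliness": 0.4},
    {"name": "Grum",  "city": "Milano", "lovely": True,  "loveliness": 0.8},
    {"name": "Spike", "city": "Torino", "lovely": False, "loveliness": 0.0},
]

# Phase 1: every instance satisfying the logical clauses is equally relevant.
matches = [c for c in cats if c["city"] == "Torino" and c["lovely"]]

# Phase 2: an external measure (share of people judging the cat lovely)
# imposes an ordering on the logically equivalent results.
ranked = sorted(matches, key=lambda c: c["loveliness"], reverse=True)

print([c["name"] for c in ranked])
```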
span arguments broader than a single concept, they will be pointed to by a considerable
amount of weighted annotations, each relating the document with a given
ontology concept.
Annotations, unlike instances, deal with knowledge sources that
retain a certain degree of fuzziness, which is addressed through the definition of
annotation relevance; however, ontology constraints are still valid, and the domain
model of concepts and relationships still needs to be taken into account in order
to provide semantics-rich services. In particular, inference is still useful for enabling
systems to discover previously unknown knowledge; the only further requirement
is that the fuzziness of resources and the crispness of logic inference merge together
in a common environment.
In H-DOSE the means for such a merging is provided by the “Expansion Operator”
defined for conceptual spectra. Considering the spectrum definition provided in
section 5.1.3, it is easy to notice that spectra are simply a way of collapsing many
annotations into a single object. They do not take into account the ontology
structure, i.e., the conceptual specification of domain semantics provided by concepts
and relationships in the ontology. Moreover, when annotations are automatically
extracted, they retain a considerable noise component that affects the resource
spectrum, adding uncertainty to the correctness of the conceptual representation: there
can be wrong annotations, missing annotations, etc.
To overcome this issue it is mandatory to exploit the ontology model,
including all available information, with a particular focus on the semantic relationships
between modeled concepts. In particular, sub-symbolic information, expressed as
the strength of relationships, must be taken into account in order to correctly
evaluate conceptual spectra, reducing the sensitivity to annotation noise. Relation
weights, in fact, make it possible to take into account how much a concept is related
to another concept in the ontology, adding quantitative information that is complementary
to the logical constraints on ontology navigation defined by the relation semantics
(transitivity, inheritance, ...). Such values are therefore critical for an information-related
application, since they establish how much a relation correlates two different
concepts and how much such a relation should contribute to the definition of
resource semantics (spectra). Relation relevance weights must be specified during the
iterative ontology design process and must be validated by domain experts in order
to assess the conceptual coherence of the resulting knowledge base.
To gain a better focus on this issue, we can think of conceptual spectrum
components as concept clouds in the ontology. Strongly related concepts are grouped
together by means of relationships, and annotations can be seen as the starting
seeds of these groups. Even if the related concepts do not appear in the original
annotation set, due to wrong or missing mappings, they should take part in the
document conceptual specification, as they are conceptually related to the spectrum.
Therefore a new spectrum operator should be defined to discover the set of clouds
R∗ : transitive closure of R
The transitive closure R∗ includes the definition of the χ∗ function as the maximum-strength
path between c and c′, where the path strength is computed by multiplying
the χ values of all relationships included in the path.
In other words, the Spectrum expansion process takes a raw conceptual spec-
trum σ as input, i.e., a spectrum as it comes from the manual or the automatic
annotation process. Then, it processes such a spectrum by analyzing the ontology
and by propagating the relevance weights of each spectrum component, through
both hierarchical and non-hierarchical relationships. X consists, therefore, of a graph
navigation on the ontology, where each relation has an associated weight, assessing
the conceptual distance between linked concepts, and an orientation, identifying
the directions in which navigation can be performed. The navigation result
is an enhanced spectrum Xσ in which the original topics, together with their relevance
weights, appear surrounded by clouds of topics extracted by means of the expansion
operator.
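Following the definition of path strength above (product of the χ weights along a path, keeping the maximum over alternative paths), the expansion can be sketched as a weighted graph navigation; the relations and weights below are illustrative:

```python
# Sketch of the spectrum expansion operator X: each raw component is
# propagated along weighted relations; a concept reached through several
# paths keeps the maximum path strength (product of chi weights).

def expand(sigma, relations):
    """relations: (source, target, chi) edges with chi in (0, 1]."""
    expanded = dict(sigma)
    frontier = list(sigma.items())
    while frontier:
        concept, weight = frontier.pop()
        for src, dst, chi in relations:
            if src == concept:
                contribution = weight * chi
                # Keep only strictly stronger paths (also guarantees
                # termination in presence of cycles).
                if contribution > expanded.get(dst, 0.0):
                    expanded[dst] = contribution
                    frontier.append((dst, contribution))
    return expanded

relations = [("Cat", "Animal", 0.8), ("Animal", "LivingBeing", 0.5),
             ("Cat", "Pet", 0.9)]
raw = {"Cat": 1.0}
print(expand(raw, relations))
```

The original component (Cat) keeps its weight, while a cloud of related concepts (Animal, Pet, LivingBeing) appears around it with weights decreasing with conceptual distance.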
Expanded spectra such as the one in Figure 5.4 are much more useful for search tasks,
since they cover every possible nuance of document semantics according to a given
ontology. This makes it possible to retrieve both documents where the required concepts
directly occur and other, related documents, even if the original query does not
exactly match their raw (non-expanded) semantic classification.
5.2.3 Queries
One of the valuable properties of conceptual spectra is that they can be used to
represent documents as well as queries. A query is formally defined as follows:
Q : the set of queries, q ∈ Q, with Q ⊆ 2^W
where 2^W is the power set of W, i.e., the set of all possible combinations of words w in W.
Given that for each ontology concept c a synset s ∈ S is specified, then a query
conceptual spectrum can be extracted as reported below:
σq : σq(c) = Σ ψ(c′,w′),  over all (c′,w′) ∈ S such that c′ = c and w′ ∈ q
For each concept c in the ontology, a query spectrum is defined as the sum of
contributions of all query terms w ∈ W modulated by the ψ function associated to
the relation S.
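A sketch of this extraction; the synset relation S and the ψ weights below are illustrative:

```python
# Sketch: query spectrum sigma_q(c) = sum of psi(c', w') over synset
# pairs (c', w') in S whose word w' occurs in the query q.

def query_spectrum(query_words, synsets, psi):
    sigma = {}
    for concept, word in synsets:
        if word in query_words:
            sigma[concept] = sigma.get(concept, 0.0) + psi[(concept, word)]
    return sigma

synsets = [("Job", "lavoro"), ("Disability", "disabile"),
           ("Disability", "cieco"), ("Vehicle", "auto")]
psi = {("Job", "lavoro"): 1.0, ("Disability", "disabile"): 1.0,
       ("Disability", "cieco"): 0.7, ("Vehicle", "auto"): 1.0}

q = {"lavoro", "disabile", "cieco"}
print(query_spectrum(q, synsets, psi))
```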
Query matching
Document spectra are expanded at indexing time and stored into the annotation
base that will be used for query matching and relevant document retrieval. Conversely,
queries are translated into expanded spectra at runtime, before searching
the annotation base for a match. From a search engine point of view, however, both
are spectra in a common, homogeneous space, and the retrieval task simply
corresponds to resource matching in that space.
In order to perform spectra matching, a similarity function can be defined by
extending the Vector Space model for information retrieval. This extension interprets
the two spectra to be compared as two vectors in an n-dimensional space, where
ontology concepts c represent the dimensions and the corresponding relevance weights
σ(c), computed by means of the expansion operator X, are the vector components.
Searching for a match in terms of shape is, in that space, equivalent to searching
for vectors having the minimum angular distance between them, i.e., for
vectors with similar directions. Therefore, the similarity Sim(q,d) can simply be
computed by evaluating the cosine of the angle between the two spectra:
Sim(σq, σd) = cos(φq,d) = (σq · σd) / (|σq| · |σd|)
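A minimal sketch of this similarity over sparse spectra, with concepts as dimensions:

```python
# Sketch: cosine similarity between two (sparse) conceptual spectra,
# treating ontology concepts as dimensions of a vector space.
import math

def similarity(sq, sd):
    dot = sum(sq[c] * sd.get(c, 0.0) for c in sq)
    nq = math.sqrt(sum(v * v for v in sq.values()))
    nd = math.sqrt(sum(v * v for v in sd.values()))
    return dot / (nq * nd) if nq and nd else 0.0

query_spec = {"Job": 1.0, "Disability": 1.0}
doc_spec = {"Job": 0.5, "Disability": 0.5, "Vehicle": 0.1}
print(round(similarity(query_spec, doc_spec), 3))
```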
5.3 – Bridging the gap between syntax and semantics
⁴First presented at SAC 2005, the ACM Symposium on Applied Computing, Santa Fe, New Mexico.
and wholly relevant with respect to the concept specification. For example, the
concept “Business broker” could be represented by the expert-defined word “Agent”.
Unfortunately, “Agent” in the computer science context denotes a software entity with
no connection at all to financial markets. Therefore, simply retrieving synonyms
from lexical nets would produce an inconsistently expanded synset in which both
software and finance terms appear. With these premises, for every word to be used
in the expansion process it is necessary to discriminate between its different senses,
to identify the synonyms according to the relevant sense only, and to avoid adding
misleading results to the knowledge base. In the literature, this task is typically
called sense disambiguation. To perform sense disambiguation, H-DOSE uses a technique
inspired by the focus-based approach defined by Bouquet et al. [15]. In this
approach the focus is defined as a concept hierarchy containing the original node,
all its ancestors, and their children. Thus, the focus of a concept is the part of the
ontology necessary to understand its context. Let us consider an example to better
clarify the definition: referring to the ontology in Figure 5.5, we want to expand the
synset of the concept “Agent”.
Concepts surrounded by dashed lines compose the focus for the concept “Agent”.
If a search for “Agent” synonyms is performed in WordNet, as an example, one
of the provided terms is “broker”. However, in the above ontology, agents have
nothing in common with businessmen and are only proactive software entities. The
focus of the concept would easily allow rejecting the term “broker” since it is not
related to software, software entities and so on. This approach, although
quite effective in sense disambiguation, has some limitations in the definition given
in [15]. The original approach relies on label names to contextualize a concept,
and this brings some extra constraints for ontology designers. Beyond that, the
approach does not scale to multilingual environments in which concept labels are
meaningless. Finally, relying on a single word per concept to understand the context
increases the chances of misunderstanding the correct sense. For this reason, the
technique depicted here, while finding its inspiration in the approach described
⁵There is a clash between the term “synset” adopted for denoting the words associated with concepts
and the term “synset” adopted in lexical networks for defining sets of words with similar senses.
Here the term “synset” is used in the latter meaning.
The first step of these refinement strategies takes a synset as input and performs
a search on lexical nets, retrieving, for each synset term, all the available synonyms,
regardless of their senses. Then a set of policies is applied, basically searching the
synonym set for co-occurrences of words. This approach is purely statistical: every
term is ranked according to its occurrence frequency and filtered using an adaptive
threshold. Repeated terms are likely to be related to the context to which
the ontology refers. The pseudo-code of the method which, having received a
synset, returns the expansion, is presented in Figure 5.7. To achieve better results,
it is possible to integrate this approach with the formerly presented one by using
the output of the focus-based expansion as input for the statistical method. In
this way, the statistical integration, together with the major contribution from the
focus-based expansion process, takes part in the final synset definition, allowing
for the creation of a suitable set of words. The overall word-set precision, with
respect to the conceptual specification, may decrease in the expansion process, while
still remaining effective enough to be used by classification engines. Assuming that the
expert-created synsets have a precision of nearly 100%, the automatic expansion,
while bringing new useful information, may in fact include some misleading terms.
5.4 – Experimental evidence
Therefore, the increase in synset size (and conceptual coverage) is usually balanced
by a potential loss of precision.
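The statistical step can be sketched as follows; the candidate table stands in for a real lexical-net lookup, and using the mean candidate frequency as the adaptive threshold is an assumption of this sketch:

```python
# Sketch: statistical synset refinement. Candidate synonyms for each
# synset term are pooled; terms repeated across candidate lists are
# likely related to the ontology context and survive an adaptive
# frequency threshold. The table stands in for a real lexical net.

CANDIDATES = {  # word -> synonyms across all senses (illustrative)
    "agent": ["broker", "software_agent", "operative"],
    "program": ["software_agent", "broadcast", "code"],
    "process": ["software_agent", "operation", "lawsuit"],
}

def expand_synset(synset):
    counts = {}
    for term in synset:
        for candidate in CANDIDATES.get(term, []):
            counts[candidate] = counts.get(candidate, 0) + 1
    if not counts:
        return []
    # Adaptive threshold: mean occurrence frequency (an assumption).
    threshold = sum(counts.values()) / len(counts)
    return sorted(c for c, n in counts.items() if n > threshold)

print(expand_synset(["agent", "program", "process"]))
```

Only "software_agent" survives here: it co-occurs across several candidate lists, while one-off candidates such as "broker" or "lawsuit" are filtered out.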
and at the P level, respectively. It is easy to notice that the distributions of the
first set and of the second set are nearly separated: this result conforms to
the initial expectations and ensures that the DOSE platform can work effectively
in a multilingual environment. In fact, it shows that corresponding fragments, in
different languages, are annotated as belonging to a common set of concepts, while
non-corresponding fragments are kept distinct by annotating them with reasonably
different concepts.
As can easily be noticed in Figure 5.10, the spectra search engine is, on average,
able to provide more relevant results than the keyword-based one, and it also
provides a better ranking of the retrieved documents, placing more relevant results
in the first positions of the whole retrieved set.
Figure 5.10. Comparison between the previous search engine and the spectra one
on the query “lavoro disabile cieco” (Job Disabled-people Blind).
Results for the entire evaluation phase have been grouped into a chart (Figure
5.11) that shows the average relevance achieved by both search engines on the five
different queries, according to the evaluators’ judgment. Such results have been weighted
by taking into account the ranking position of the retrieved documents; the
mean relevance value R on the y-axis has therefore been computed as follows:

R = Σ_d  r_d / rank(d)

where r_d is the relevance of the document d and rank(d) its ranking order.
The overall performance of the proposed spectra search engine exceeds that
of the traditional search engine for every query issued. The weighted combination
of better ranking and better precision underlines the feasibility of the approach and
Figure 5.11. Mean relevance scores for both engines over all queries.
Figure 5.12. Recall results for 5 queries on the www.asphi.it web site.
Figure 5.13. Precision results for 5 queries on the www.asphi.it web site.
Chapter 6

The H-DOSE platform
The main motivation for the H-DOSE design is to provide a system actually usable
in today's Web. According to the vision of this work, such a result shall
be reached through the analysis of the requirements perceived by the web actors (and
reported in Chapter 4). In order to respond to some of these requirements, the non-functional
requirements especially, a so-called holistic approach has been
adopted. The term holistic refers to the integration of different techniques, in
particular web services and multi-agent systems, into a common platform for
semantic elaboration.
As emerges from the non-functional and usability requirements number 1 and 3,
the H-DOSE platform shall be easily integrable into already existing publication
frameworks and shall be independent of the server-side technologies adopted for
publication.
Web services are the state-of-the-art technology for supporting such functionalities.
They allow system interoperability by adopting open Internet standards,
by being able to describe their own functionality and location, and by interacting
with other web services. They are cost-effective since they replace tightly
coupled applications, with the related problems of data passing and interface
agreements, by offering a loosely coupled architecture in which the business logic is
completely separated from the data layer. Moreover, web services allow the reuse of
functionality by adopting standard description formats: WSDL for the service logic,
UDDI for advertising and SOAP for communication. The web service technology is very well
suited for “access-type” services, which are not very computationally expensive and
6 – The H-DOSE platform
that usually do not take advantage of replication and distribution amongst different
locations. Web services are, in fact, better suited to accomplishing interface tasks
than to implementing the internal business logic of applications, which is usually
delegated to more effective processes.
H-DOSE takes this into account by adopting web services as standard interfaces
for providing semantic services to SOAP-enabled applications. The internal complex
tasks are, instead, delegated to the platform's internal layers, where they are
executed through a quite different technology: multi-agent systems. Agents, in fact,
possess characteristics that make them suitable for accomplishing very intensive
tasks, exploiting replication, distribution and location-aware computing. They
are “living” software entities located in proper containers that constitute their
ecosystem; they have an adaptive and autonomous nature, and social capabilities
enabling them to coordinate their actions with others and to cooperate or
negotiate to reach a specific design goal.
Since the non-functional requirements, especially requirements number 1 and
2 under the category “performances”, indicate that scalability is one of the main
concerns for the platform deployment in real-world applications, H-DOSE uses agents
to perform computationally intensive tasks such as the automatic indexing of resources.
The adoption of agent services, in fact, allows the natural distribution of tasks, and
thus of the resulting load, toward the various information sources involved in
the classification. So, for example, if a given web site offers a suitable container
for agents, the indexing part of the H-DOSE architecture can be replicated on that
site, allowing for local indexing and distributing the indexing load across all the
available information sources, depending on resource location.
6.1 – A layered view of H-DOSE
Indexing service
The Indexing service offers a public, SOAP-based interface for performing the semantic
classification of textual resources. It is basically a queue manager for the corresponding
kernel-level service, implemented as a multi-agent system. Two main operation
types are supported, namely “indexing” and “batch-indexing”.
The indexing operation offers a simple, interactive way of semantically classifying
resources. An external application invoking this operation is basically required to
provide its name or unique identifier, and the URI of the document to be indexed.
Some additional information can be provided if, for example, the document is a
fragment belonging to a more complex resource. In such a case a “partOf” attribute
specifies the URI of the resource which contains the document. The service call, as
in the “batch indexing” case, is asynchronous, and the application can decide
whether to listen for the annotation results or not.
Internally, the indexing operation simply enqueues the URI of the document
into the kernel-level agent service and, if required, maintains a reference to the
calling application for notifying the indexing result. Such a reference is not a
classical “pointer”; it is simply the endpoint of a web service (either SOAP or
XML-RPC) to which the notification shall be propagated. This design choice keeps
the indexing service as stateless as possible, thus avoiding to store explicit
The batch-indexing operation is, as its name suggests, focused on the indexing of
many resources at a time. This functionality is usually required when a considerable set
of resources, e.g., a CMS article base, must be classified. The interface, in this case,
is very similar to that of the simple indexing operation; however, the parameters
are stored in arrays, one for the documents to be indexed and one for the optional
“container” resources. The internal implementation simply loads the queue of the
kernel-level service with the whole set of documents to be indexed. In addition,
an alternative interface is provided for enabling indexing transactions, in which
each simple indexing request is enqueued into a single batch-indexing request (Figure
6.3):
beginBatch(applicationURI)
index(documentURI1, applicationURI,
      superDocumentURI1*, notificationEndpoint*)
index(documentURI2, applicationURI,
      superDocumentURI2*, notificationEndpoint*)
...
index(documentURIn, applicationURI,
      superDocumentURIn*, notificationEndpoint*)
endBatch(applicationURI)
Typically, batch operations are accomplished by the same agent in the kernel-level
indexing service, making it easy to keep track of the corresponding classification
tasks. Unlike what happens in the simple indexing scenario, the possible
Search service
As stated in the previous chapter, the H-DOSE search services are semantic, i.e., the
relevance of results with respect to application queries is evaluated at a language-independent,
conceptual level. Such evaluation uses the methods explained in section
5.2 and provides, as output, a ranked set of resources in the form of a list of URIs.
Two operations are defined: a “search by concept” and a “what's related” search.
In the “search by concept” functionality, the calling application must specify a
list of relevant concepts, possibly accompanied by a corresponding set of weights,
between 0 and 1, that specify the importance of each concept in the list. These two
pieces of information are automatically combined into a conceptual spectrum, which is then
expanded using the expansion operator defined in section 5.2.2 and implemented by
the Expander service at the Kernel layer. Once expanded, the query spectrum is used
as seed information for the Annotation repository, which retrieves all the document
descriptions similar to the query spectrum. These descriptions are then ranked by
the Search engine according to their similarity to the original query. It must
be noted that descriptions are retrieved using an expanded query spectrum,
which makes it possible to also select relevant resources that are not explicitly correlated
with the initial query. The final ranking is performed by evaluating the similarity of
document descriptions with the original query, thus filtering out the possibly wrong
results due to the expansion process. In other words, the expansion process widens
the potential recall of the search system, while the final ranking tries to keep the
precision of the results as high as possible. Figure 6.5 reports the “search by concept”
interface in pseudo-code.
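The whole flow can be sketched as follows; the expander and repository passed in are toy stand-ins, not the actual platform services:

```python
# Sketch of the "search by concept" flow: concepts and weights become a
# raw query spectrum, which is expanded (widening recall), matched
# against the repository, then re-ranked against the ORIGINAL query
# (preserving precision by filtering noise added by the expansion).
import math

def cosine(a, b):
    dot = sum(a[c] * b.get(c, 0.0) for c in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search_by_concepts(concepts, weights, expander, repository,
                       max_results=10):
    raw = dict(zip(concepts, weights))   # raw query spectrum
    expanded = expander(raw)             # Expander service
    candidates = repository(expanded)    # approximate retrieval: uri -> spectrum
    ranked = sorted(candidates,
                    key=lambda uri: cosine(raw, candidates[uri]),
                    reverse=True)
    return ranked[:max_results]

# Toy stand-ins: identity expander, repository returning stored spectra.
store = {"docA": {"Job": 0.9, "Disability": 0.8},
         "docB": {"Vehicle": 1.0, "Job": 0.1}}
hits = search_by_concepts(["Job", "Disability"], [1.0, 0.5],
                          lambda s: s, lambda s: store)
print(hits)
```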
of the document identified by the given URI. Then the retrieved results are ranked
according to their similarity to the spectrum of the query document, provided by
the annotation repository, and returned to the calling application. If the number of
available results is larger than the maximum number of required results, only the
top maxResults URIs are provided.
A final remark concerns the search engine service: the similarity
function used by this service is exactly the one defined in section 5.2.4, while, in the
annotation repository, similarity is evaluated in a very approximate form, privileging
response time over precision.
Expander service
The expander service implements, inside the platform, the spectrum expansion operator.
It basically accepts as input a “raw”, or unexpanded, spectrum and performs
the spectrum expansion using the H-DOSE ontology. The result is again a conceptual
spectrum, which takes into account the implicit knowledge encoded in the
H-DOSE ontology (see Figure 6.8 for the pseudo-code of the expander interface).
The expansion operator can be profitably applied in the search phase, whatever
interaction paradigm is used, except the “what's related” one.
Annotation Repository
The annotation repository module is solely responsible for annotation storage,
retrieval and management. It accepts, on one side, storage requests, and writes the
received annotations into the platform database by means of a proper wrapper. On
the other side, it listens for search requests and is able to provide as result the
subset of the stored annotations that contain the concepts included in the received
query.
The Annotation repository is the most centralized module of the platform: it
manages all the annotation storage and search requests. For this reason it shall be as
fast as possible, especially in the retrieval phase, where the response time is a critical
factor for achieving user satisfaction. For the same reason, it can possibly constitute a
bottleneck in the platform information flow.
This risk can easily be avoided by using service replication, i.e., by properly
configuring the platform to work with more than one annotation repository.
To perform this configuration, both the indexing sub-system and the service-level
modules should be aware of the presence of more than one annotation repository
and should hold references to the proper one. The correspondence between service
consumers (i.e., the Service layer modules and the indexing sub-system) and service
providers (the Annotation repository copies) shall be fixed at configuration time.
Clearly, care must be taken when deploying the platform, in order to correctly configure
H-DOSE by choosing a good trade-off between the required performance and the
available computational resources.
From a more operational point of view, the annotation repository exposes four
Figure 6.9. Public interface of the store method of the annotation repository.
The retrieve function is, in a sense, the inverse operator of the store function:
given a set of concepts, i.e., a spectrum, it provides as result a list of URIs whose
spectra are somewhat similar to the specified one (see Figure 6.10). The similarity is
simply evaluated by checking the common spectrum components which are not null,
and by ranking results according to the number of co-occurring concepts (i.e., spectrum
components).
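This fast, approximate ranking can be sketched as follows (URIs and spectra are illustrative):

```python
# Sketch of the repository's fast, approximate retrieval: stored spectra
# are ranked by the number of non-null components shared with the query.

def retrieve(query_spectrum, store):
    q_concepts = {c for c, v in query_spectrum.items() if v}
    scored = []
    for uri, spectrum in store.items():
        shared = len(q_concepts & {c for c, v in spectrum.items() if v})
        if shared:
            scored.append((shared, uri))
    return [uri for shared, uri in sorted(scored, reverse=True)]

store = {"docA": {"Job": 0.9, "Disability": 0.8},
         "docB": {"Job": 0.1},
         "docC": {"Vehicle": 1.0}}
print(retrieve({"Job": 1.0, "Disability": 0.5}, store))
```

Counting shared components avoids the cosine computation, privileging response time; the precise similarity of section 5.2.4 is applied only by the search engine on this reduced candidate set.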
Figure 6.10. Public interface of the retrieve method of the annotation repository.
Figure 6.11. Public interface of the reverse search method of the annotation
repository.
The inverse map functionality concludes this section. It basically provides a means
for querying the persistent storage in order to retrieve the conceptual description of
an indexed document. It therefore needs as input a document URI, while providing
as result a conceptual spectrum (Figure 6.12).
Figure 6.12. Public interface of the inverse map method of the annotation repository.
Indexing sub-system
The indexing sub-system is one of the most complex modules of the H-DOSE
platform. For this reason a black-box description is provided first; only afterwards
is a more detailed system view given, discussing each element of the box.
As its name states, the indexing sub-system provides the required functionality
for the automatic indexing of textual resources. Two methods are offered
to external applications, which can either be the service-layer indexing module or
other modules that do not belong to the platform. These methods are, respectively,
for queuing resources to be indexed and for getting results. The latter is actually a
service callback registration.
Whenever a module needs to perform the automatic indexing of a resource, the
index method of the indexing subsystem is invoked. This method is available in two
variants: a single index function and a group index function. The two only differ
in the amount of data handled at a time. The latter, in particular, is used to index
a considerable group of textual resources in a unique, atomic operation. For both
methods, the corresponding signatures are reported in Figure 6.13.
The indexing operation can be successful, so that the resource conceptual descriptions
are stored into the H-DOSE persistent storage, or it can fail. In both
cases, calling applications may want a notification of the automatic annotation
result. To obtain this notification message, they shall register themselves with the
indexing sub-system. In the H-DOSE platform this task is performed automatically
for the indexing service at the service layer, which is registered by default with the
kernel-level service. However, if an external, semantics-aware application needs to
directly access the kernel-level indexing service, it must register itself with the service
using the register notify method (Figure 6.14).
Figure 6.14. Public interface of the register notify method of the kernel-level
indexing sub-system.
indexed. If an indexing squad is already deployed on the server publishing the site,
the new indexing task is forwarded to the squad residing on the remote machine.
Otherwise a discovery process starts, trying to find out whether the remote host has
an accessible agent container or not. If a container is available and access to it
can be gained, the agency manager packs a new indexing squad and migrates the
agents, together with the indexing request, to the newly found container. If, instead,
no containers are available, the agency manager backs up the request on a set of
“friend” machines. The “friend” machines differ from normal container providers
in that they not only allow the deployment of new agents on the container, but also
allow agents to perform tasks which involve other machines on the Web. A typical
example of a friend machine is the host on which the centralized part of the indexing
subsystem runs.
At the end of the “migration phase”, the URI(s) of the resources to be indexed are
received by the filter agent of an appropriate indexing squad. This agent first
contacts the media detector agent to determine both the language in which the
text is written and the format (HTML, XHTML, plain text) of the information. With
this data it can configure its internal parser and code filter, and can then extract
the simple textual information required by the annotation agent. The output of
the filtering process depends on the kind of annotation technique adopted by the
annotator agent. As an example, a simple bag-of-words annotator may require all
the word stems in the document, while an SVM-based classifier may require the set of
distinct words which occur in the textual resource, accompanied by a tf · idf weight
that specifies their ability to discriminate the document from the others already
indexed. Whatever design choice is taken, the filter agent and the annotator agent
in a squad shall be designed to work in symbiosis, with the same expected formats
and elaborations.
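As an illustration of the second case, a standard tf·idf weighting could be computed by the filter agent as sketched below; the corpus is illustrative and the exact formulation used by the platform may differ:

```python
# Sketch: filter-agent output for a tf-idf based annotator. Each distinct
# word in the target document gets a weight measuring how well it
# discriminates the document from the already indexed ones.
import math

def tf_idf(document_words, corpus):
    """corpus: list of word lists for already indexed documents."""
    n_docs = len(corpus) + 1  # corpus plus the target document
    weights = {}
    for word in set(document_words):
        tf = document_words.count(word) / len(document_words)
        df = 1 + sum(1 for doc in corpus if word in doc)
        weights[word] = tf * math.log(n_docs / df)
    return weights

corpus = [["cat", "pet", "animal"], ["car", "engine"]]
target = ["cat", "cat", "whiskers"]
w = tf_idf(target, corpus)
# "whiskers" outweighs "cat": it appears in no other indexed document,
# so it discriminates the target better despite its lower frequency.
print(w)
```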
In the end, the annotator agent takes as input the filtered data and builds up
a conceptual spectrum. Such a spectrum is then stored in the H-DOSE knowledge
base through a call to the store method of the annotation repository. Finally, the
annotator notifies the indexing result (annotations stored or annotation failed) to
the agency manager, which propagates the same information to the applications
registered with the kernel-level indexing service.
The deep search agent is involved when a particular agreement is established
between the people running H-DOSE and an organization managing a given site.
In such a case the organization might allow restricted access to its database (DB)
tables through an interface defined in the agreement. A special-purpose agent
can therefore be designed and deployed on the machine running the site. This
special agent, the deep search agent, has the access rights and capabilities required
to directly query the site database, extracting not only published information but
also metadata associated with the table structure of the DB. The deep search agent can
be very useful for highly dynamic sites where information changes very quickly and
semantic search services must be kept up to date at all times. Under these
conditions, this agent can be used to constantly track the site database, detecting
changes and triggering new indexing cycles whenever needed.
6 – The H-DOSE platform
that ensure the platform operations by giving access to the business objects, i.e. the
ontology, the resources and the storage of classification data (annotations). Three
main modules are deployed at this level: the ontology wrapper, which provides
programmatic access to ontologies written in RDF/S, DAML+OIL or OWL; the
annotation database, which defines a set of high-level primitives for the persistent
storage of conceptual spectra; and a document handling sub-system, which manages
document-related issues such as fragmentation, pre-processing, etc.
In normal operating conditions the enrichment agent stays idle, monitoring the
behavior of services such as the Annotation Repository and the Search Engine.
If a critical situation is detected, the agent extracts from these services a list of
concepts whose coverage should be enhanced and contacts the Synset Manager agent,
in the indexing sub-system, to find lexical entities associated with such topics. It
subsequently composes textual queries to be issued to classical, text-based
search engines. Once the textual queries have been composed, two concurrent processes
start: one interacts with the Agency Manager in order to trigger incremental
indexing on already known sites, while the second interacts with search web services
(the Google web API [9], as an example) in order to retrieve a list of possibly relevant
URIs. At the end of these processes the Enrichment Agent performs some filtering on
6.2 – Application scenarios
retrieved URIs, deleting resources that cannot be understood, such as “pdf” and “doc”
files; in the current setting H-DOSE can in fact only support HTML, XHTML and plain
text. After the filtering process, the agent composes a list of resources to be indexed,
identified by URIs, and sends an indexing request to the indexing sub-system which
subsequently performs semantic annotation and updates the Annotation Repository;
Figure 6.16 shows the corresponding sequence diagram.
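The URI filtering step described above can be sketched as a simple extension check; the supported and unsupported formats are taken from the text, while the helper name is invented for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Sketch of the Enrichment Agent's filtering step: URIs pointing at
 *  formats H-DOSE cannot parse (e.g. .pdf, .doc) are discarded before the
 *  indexing request is composed. The helper name is hypothetical. */
public class UriFilter {
    private static final List<String> UNSUPPORTED =
            java.util.Arrays.asList(".pdf", ".doc");

    public static List<String> keepParsable(List<String> uris) {
        return uris.stream()
                // keep only URIs whose extension is not in the unsupported list
                .filter(u -> UNSUPPORTED.stream()
                        .noneMatch(u.toLowerCase()::endsWith))
                .collect(Collectors.toList());
    }
}
```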
6.2.1 Indexing
The indexing scenario is characterized by many interesting aspects because it
involves several advanced techniques such as focused semantic indexing, code mobility,
deep search and collaboration. Whenever a web resource or a set of resources must
be indexed by the platform, the indexing process starts, performing several steps
that end in the addition of new knowledge to the platform KB (Figure 6.17).
First, when a set of resources, identified by their URIs, is scheduled for indexing,
the agency manager agent divides the URIs by location. All resources published
on the same Internet site are grouped into sets to be indexed by possibly different
indexing squads. After this “by site” separation the manager agent checks, for each
location, whether an existing indexing squad is available. If so, the corresponding
set of resources is passed to the remote squad for “in site” indexing; if, instead,
the location does not host an indexing squad but offers a suitable agent container,
a new squad is created and sent over the Web toward that location. In the case of
sites not offering agent containers, squads are migrated to a set of “friend” hosts in
order to balance the platform workload.
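The “by site” separation can be illustrated with standard Java; this is a sketch of the grouping idea, not the agency manager's actual code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the agency manager's "by site" separation: URIs scheduled for
 *  indexing are grouped by host, one group per prospective indexing squad. */
public class BySiteSeparation {
    public static Map<String, List<String>> groupByHost(List<String> uris) {
        Map<String, List<String>> bySite = new HashMap<>();
        for (String u : uris) {
            String host = URI.create(u).getHost();   // the site's location
            bySite.computeIfAbsent(host, h -> new ArrayList<>()).add(u);
        }
        return bySite;
    }
}
```

Each resulting group would then be handed to the squad responsible for that location, or trigger the creation of a new one.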
Once each indexing squad has been launched, the agents composing the squad
start to collaborate in order to perform semantic classification of the required resources:
if the web site under classification uses a known site-wide search engine, the squads
directly interface with that engine by means of a “deep search” agent, accessing
information stored in the site database and thus providing classification of resources
lying in the so-called deep web. Otherwise normal indexing is performed.
The results of the semantic classification process are conceptual spectra which
are sent back to the agency manager, which, in turn, calls the Annotation Repository
service for persistently storing the extracted semantic information.
At present, the indexing squads are able to classify every resource whose conceptual
domain overlaps the conceptual domain of the H-DOSE ontology. However, some
improvements can be foreseen. For example, it is possible to design a new indexing
squad able to perform so-called “focused indexing”. In focused indexing, a site
is first semantically analyzed to identify which part of the H-DOSE ontology its
resources are relevant to; then a light-weight indexing squad, able to work only on
the identified ontology subset, is migrated to that site. The advantage of this
approach is that it reduces both the computational load the indexing process imposes
on host machines and the amount of information that must travel over the network
between the agency manager and the indexing squads. This advantage is balanced
by a considerable increase in the complexity of managing the site semantic
characterization and the focused squads, and of designing effectively cooperating
agent colonies that avoid code replication on remote sites.
6.2.2 Search
The search scenario (Figures 6.19, 6.20) is organized as follows: an external
application requests a search on the platform knowledge base by interfacing with the
Search web service and by specifying a proper set of concepts. The Search service
first contacts the Expander service to obtain an expanded version of the query
spectrum. Then it starts communicating with the Annotation Repository service to
retrieve relevant annotations (i.e., document spectra). The Annotation Repository
takes the query spectrum given by the Search service and searches the database in
which annotations are stored to find relevant matches. If there are suitable resources,
i.e., resources annotated as relevant with respect to the concepts occurring in the
query conceptual spectrum, the Annotation Repository service provides the set of
retrieved annotations to the Search service. Otherwise, if no resources are available,
the Annotation Repository service throws an error that is caught by the Search service.
In the former case, i.e., when resources are available, the Search service checks
the similarity of the retrieved spectra with the original user query (not the expanded
one). Then it ranks the results according to the computed similarity values and
returns a list of URIs to the caller application. The list is as long as specified by
the application in the search request.
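The similarity check between retrieved spectra and the original query could, for instance, use a cosine measure over concept weights; the thesis does not fix the metric here, so the following is a hedged sketch under that assumption.

```java
import java.util.Map;

/** Hypothetical similarity between two conceptual spectra, each modeled as a
 *  concept-URI -> weight map. Cosine similarity is one plausible choice;
 *  H-DOSE's actual metric may differ. */
public class SpectrumSimilarity {
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // shared concepts only
            na += e.getValue() * e.getValue();
        }
        for (double w : b.values()) nb += w * w;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Ranking the retrieved annotations by this value against the non-expanded query yields the ordered URI list returned to the caller.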
In the second case, where no resources are available, the Search service manages
the error by providing an empty list to the caller application. At the same
time, it notifies the maintenance and management sub-system, and in particular
the enrichment agents, that some concepts of the H-DOSE ontology are not covered
by annotations. This notification, in turn, triggers the autonomic features of the
platform, which start a new indexing cycle that possibly “repairs” the source of the
search error. Figure 6.16 shows how the H-DOSE autonomic features can achieve
this result.
6.3 – Implementation issues
• the ontology wrapper uses the HP Jena library [23] for accessing RDF/S [4],
DAML and OWL [5] ontologies;
• the persistent storage module uses the PostgreSQL JDBC driver [24] for
interfacing with a PostgreSQL database server in which spectra are stored;
• the expander service is powered by the JGraphT API [28], which handles all the
issues related to modeling the H-DOSE ontology as a directed, weighted
graph;
• the Sun JAX-RPC [29] library is used by all web services to implement
communication methods, while all the platform agents use the JADE library for
implementing their behaviors and communication methods.
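The role played by JGraphT, i.e. modeling the ontology as a directed weighted graph for the expander service, can be illustrated with a minimal adjacency-map graph; the edge weights and the one-step attenuation rule below are invented for illustration and are not H-DOSE's actual expansion algorithm.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Minimal stand-in for the JGraphT-backed ontology graph used by the
 *  expander: concepts are nodes, relations are weighted edges, and a query
 *  weight propagates to neighbors attenuated by the edge weight. */
public class OntologyGraph {
    private final Map<String, Map<String, Double>> edges = new HashMap<>();

    public void addEdge(String from, String to, double weight) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, weight);
    }

    /** One-step expansion: each seed concept passes weight * edgeWeight to
     *  its direct neighbors, keeping the maximum when paths overlap. */
    public Map<String, Double> expand(Map<String, Double> seed) {
        Map<String, Double> out = new HashMap<>(seed);
        for (Map.Entry<String, Double> s : seed.entrySet()) {
            Map<String, Double> nbrs =
                    edges.getOrDefault(s.getKey(), Collections.emptyMap());
            for (Map.Entry<String, Double> e : nbrs.entrySet()) {
                out.merge(e.getKey(), s.getValue() * e.getValue(), Math::max);
            }
        }
        return out;
    }
}
```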
3 SourceForge is one of the most famous open source software repositories available on the web;
it can be reached at http://www.sourceforge.net. The H-DOSE web site, on SourceForge, can
be found at http://dose.sourceforge.net.
Chapter 7
Case studies
This chapter presents the case studies that constituted the benchmark of
the H-DOSE platform. Each case study is addressed separately, starting
from a brief description of requirements and going through the integration
design process, the deployment of the H-DOSE platform, and the phase of
result gathering and analysis.
As part of the work presented in this thesis, the H-DOSE platform has been
used for adding semantic search functionalities to four already deployed web
applications. The first is a legacy publication system written in PHP and owned by the
Passepartout service of the city of Turin. The second is a well-known and widely
adopted e-Learning environment: Moodle [1]. The third is the case-based e-learning
environment developed in the context of the European project CABLE [3], and the
last is a transparent search engine built on top of the Muffin intelligent proxy [2].
The following sections will address the integration of semantic functionalities into
these applications by using H-DOSE as semantic back end.
starting from the content creation to the final publication on the web, is managed
and supported by a legacy system written in PHP.
In the context of a collaboration between the author’s research group and the
Passepartout service, this publication system has been integrated with semantic
functionalities by means of the H-DOSE platform. Currently the semantic
functionalities supported by the Passepartout web site include the manual classification
of published documents, search by category, and the semantic what’s related
(requirements 4, 5, 6 and 8). The integration has been deployed as follows: first, the
legacy system has been extended to support the exchange of SOAP messages
with the semantic platform. A previously built module for SOAP communication
has been included in the system libraries, and a new PHP module for managing the
communication with the semantic platform has been developed. This last module is
essentially composed of a set of function wrappers for the services offered by H-DOSE,
and adds, where needed, the intelligence required to perform more complex tasks.
Secondly, the template for publishing pages has been extended to include the
semantic what’s related functionality: each published page has been augmented with
a link pointing to the semantically related pages. When a user clicks that link, the
integrated system interacts with H-DOSE using the what’s related search function,
which takes as input a resource URI and provides as result a set of ten other resources
that are conceptually similar to the starting one.
As a third step, a semantically populated directory has been built in PHP; such
a directory is quite similar to classical directory systems such as Yahoo’s or the
dmoz.org open directory. Here, however, resources belong to categories depending
on their conceptual description; as a consequence they can occur, at the same
time, in different category branches. Categorization is, in fact, directly related to the
ontology that defines the knowledge domain in which the system works: disability,
in this specific case.
From the implementation point of view, the category tree for this test case has
been built off-line and then included into the Passepartout system. This choice is
only related to performance issues: building the same tree at runtime means that
every time the category view is required, a full navigation of the ontology must be
performed. Since the ontology is usually large (in this small scenario it already
includes more than 80 concepts and 20 different relationships), the time for page
composition quickly grows too high and fails to satisfy usability criteria.
Eventually, the Passepartout publication system has been modified to support
the conceptual classification of pages. This last modification has little or no impact
on the publication process: as the system originally included a site-wide search
engine based on manually specified keywords, the publication interface can be fully
retained, with the keyword choice now limited to the set of ontology concepts.
This simple solution keeps the interference of the semantic extensions with the
usual workflow negligible, thus reducing the time required for the integration.
7.1 – The Passepartout case study
7.1.1 Results
In order to extract relevant information about the effectiveness of the proposed
approach to semantics integration, a full test plan involving three different test
groups has been set up. The groups are devoted, respectively: to the evaluation of
the conceptual representation of the Passepartout domain and of the semantic
functions newly introduced into the publication system; to the evaluation of the
performance of the semantic modules in terms of standard measures such as precision,
recall and F-measure; and to the evaluation of the effort required to use the new
concept-based interfaces, i.e., the usability evaluation.
The first tranche of tests has been called “ontological tests” and involves three
different sessions. In the first session, the ontology developed in collaboration with
the Passepartout (about assistive services for disabled people; 80 concepts and 20
different semantic relationships) is tested for completeness, i.e., the ability of the
ontology to cover the Passepartout conceptual domain is tested. In this test at
least six different redactors are required to conceptually classify a set of ten
pseudo-randomly selected pages; a minimal amount of overlap between the pages tested
by different operators is guaranteed in order to obtain suitable and comparable results.
For each page the classification is captured on paper sheets, and the redactors’
opinions are collected too. The aim of the test is to analyze whether or not the
ontology is complete enough to cover the contents published by the Passepartout
service.
The second session aims at identifying inconsistencies in the ontology model
by performing focused classification tests, i.e., classification of resources belonging to
the same conceptual area at different granularities. As in the first session, at least six
redactors perform the test and are required to provide both the page classifications
and opinions about the ability of the ontology to describe resource contents.
The second session differs from the first in aim and scope: while the first
session evaluates whether, given a uniform distribution of contents, the resulting
classification is also uniformly distributed, the second session evaluates whether,
given a general topic and a uniform distribution of content granularity, the resulting
classification is uniformly distributed in depth, i.e., whether the classification results
are uniformly distributed among the nodes descending from the given general topic
in the ontology hierarchy.
The last session has the objective of identifying the ontology areas that are poorly
modeled, i.e., the conceptual areas in which there are multiple collisions of resource
classifications. A collision is defined as the coexistence of several annotations linking
indexed resources to a given concept. If the number of collisions significantly
exceeds the ontology mean value, it is likely that the ontology has a modeling
problem for the concept under examination, for example because multiple, different
concepts have been modeled as a single one.
The second group of tests is aimed at evaluating the effectiveness of the semantic
search functionalities in terms of precision (p) and recall (r) and of their combination
in the F-measure (F = 2·p·r/(p+r)).
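These measures can be computed directly from the sets of retrieved and relevant resources; the following self-contained sketch uses the standard definitions, and the example data in the usage below are invented for illustration.

```java
import java.util.Set;

/** Standard IR evaluation measures used in the second test group. */
public class Measures {
    public static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / retrieved.size();   // fraction of retrieved that is relevant
    }

    public static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0;
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();    // fraction of relevant that is retrieved
    }

    /** F = 2*p*r / (p + r), the harmonic mean of precision and recall. */
    public static double fMeasure(double p, double r) {
        return (p + r == 0) ? 0 : 2 * p * r / (p + r);
    }
}
```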
As in the previous case, there are three sessions, concerning respectively the
“what’s related” function (first session) and the directory search (last two sessions).
In the semantic “what’s related” test, the semantic system is checked for both search
and classification effectiveness: first, the degree of similarity between the conceptual
descriptions of the starting page and of the retrieved pages is evaluated; secondly,
the correctness of page annotation is evaluated.
In this test, the operators involved are domain experts, the redactors of the
Passepartout site; therefore, although the evaluation remains subjective, it is
supposed to be significant. The test is deployed as follows: each operator is provided
with a set of 5 pages extracted pseudo-randomly from a test set composed of
20 different pages. He/she is required to browse the Passepartout site, to reach the
pages in the set and to select, for each of them, the “related pages” link that activates
the semantic “what’s related” functionality. For each page the operator shall
compare the starting page and the retrieved pages. The similarity between pages shall
then be evaluated and reported on a proper test sheet. On the same test sheet, the
operator is also required to report his/her evaluation of annotation correctness, i.e.,
to report whether pages are correctly classified according to his/her knowledge of the domain.
The directory search test is further subdivided into two steps. In the first step,
the ability of the system to retrieve all the relevant information, and no other, is
tested. This case represents a quite unusual operation scenario that is mainly useful
for testing the platform functionality: the user can only select a concept in the
site directory, and the system is required to retrieve all the resources indexed as
relevant to that concept (in this test the expander module of the H-DOSE platform
is disabled).
Theoretically, assuming that the classification process has been performed well,
all relevant resources should be retrieved, and only those. In terms of precision and
recall, both values should be 100%. However, the classification process is never
perfect; the retrieved results will therefore show lower values of precision and recall,
approaching the maximum the better the manual classification has been performed.
On the other hand, assuming that the classification is “perfect”, since it is provided
by domain experts, this test allows possible problems in the platform operations
to be detected. In practice, the test involves 10 operators, each required to perform
3 predefined searches (different for each operator). The search results are reported
by the test operators on proper test sheets, and the collected results are then
elaborated and organized.
In the second step, the directory search is tested at its full potential: the test
operators are allowed to select as many concepts in the directory as they like. This
initial concept specification is then expanded by the H-DOSE platform using the
implicit knowledge modeled by the platform ontology; finally, resources are
compared to the resulting conceptual specification, ranked, and provided back to
the user.
This test involves ten different operators, each of whom is asked to perform a given
search task, described by means of a goal statement, and to evaluate the relevance
of the retrieved pages. Recall, precision and the F-measure are computed on the basis
of the evaluation results collected from all the operators.
Finally, the last group of tests is designed to evaluate the additional effort required
of content editors (journalists) and redactors to include the semantic information
needed for the site operations. This evaluation has been performed as a
set of interviews with the Passepartout site crew, which has been asked to use the
integrated system for a month. It is important to notice that the usual publication
load is about four new pages published per day per redactor, which means a
total of around 500 operations in a month.
Unfortunately, as is easily noticeable from the absence of tables and graphs, the
envisioned time frame for the evaluations has been widely overrun and, at present,
the first test phase is not yet completed. The other phases are still to be started.
These problems are not related to the platform deployment; in that respect, some
preliminary results have been collected which seem to indicate a relatively simple
integration. Instead, the main problems are related to communication issues between
the research group and the Passepartout crew and to the coincidence of several
circumstances (such as the 2006 Winter Olympic Games, held in Turin) that prevented
the timely execution of the test program. Nevertheless, results are slowly being collected
and, once complete, they will be submitted, as a paper, to an international journal
on semantic web technologies and applications.
7.3 – The CABLE case study
a well-defined ICT infrastructure, which reuses already available, effective solutions
as much as possible, avoiding reinventing the wheel. Three main components
emerge, composing the basic structure of the CABLE architecture: an e-Learning
environment, a repository of case studies and good practice examples, and a
semantic module able to leverage the formal metadata associated with both learning
objects and case studies for composing and discovering associations between courses
and good practice examples. As the domain of application of the CABLE framework
requires the experience of users and teachers to grow by allowing comparison and
sharing of similar case studies and solutions, the semantic module has the responsibility
of automatically establishing correspondences between new learning paths and
existing case studies, as well as the capability to correlate, at runtime, newly added
case studies to the already existing cases and learning modules. Figure 7.2 shows
the logical organization of the CABLE framework.
The two interaction paradigms can also be described in the form of interaction
diagrams. In the first case (Figure 7.3), a user, or a teacher, uses the VLE to participate
in some learning activity. At some point in the e-learning process, case studies must be
analyzed to better understand how to tackle a given pedagogical scenario. As
the CABLE framework hosts many case studies provided by several entities, in a
Europe-wide environment, the resources relevant to the learning module are
extracted from the case repository at runtime. Among other advantages, this allows
newly added knowledge to be taken into account automatically, in a transparent way.
Therefore, following the user request, the VLE contacts the semantic module for
relevant case studies, providing, at the same time, a conceptual description of the
learning object viewed by the user. The semantic module retrieves from the case
studies repository all descriptions of case studies that match, at least partially, the
VLE specification. Then, by applying ontology navigation techniques, it ranks the
retrieved results and provides back to the VLE a list of URLs of good practice
examples. The VLE retrieves the case studies and presents them to the user in a
convenient way.
In the second scenario, instead, the user is already accessing the case studies
As can easily be noticed, in both cases there are no predefined matches
between case studies, or between case studies and learning objects; instead, they are
discovered at runtime by comparing the respective conceptual descriptions. As the
comparison is ontology-driven, non-explicit associations can easily be discovered, thus
leveraging the power of semantics to provide conceptually relevant results (which
are hopefully more relevant than the ones that would be extracted by applying
simple keyword matching techniques).
7.3.2 mH-DOSE
The overall adaptation process has taken no more than one day for adaptation
and one day for testing, thus demonstrating the ease of integrating the H-DOSE
platform into other web applications.
7.4 – The Shortbread case study
The system described in this section offers a semantics-based what’s related
functionality whose presence remains hidden to the user, at least as long as no relevant
information can be found. In order to understand whether this goal can be reached,
and how to effectively tackle the problems related to the semantic retrieval of
related resources, the web navigation process has been analyzed. The interaction
scenario is as follows: the user surfs the web using a browser. For each page requested
by the user, the browser contacts a given server or a given proxy in order to fetch
the proper content. Then it interprets the received page and shows the content,
properly formatted, to the user (Figure 7.6).
As can easily be noticed, the interaction scenario involving an HTTP proxy
(Figure 7.7) is very well suited for introducing transparent functionalities into the user
navigation process. The proxy, in fact, intercepts both user requests and server
responses, and can be exploited as an access point for the semantic what’s related
system. For each incoming request, the page returned by the web server is analyzed,
semantically classified if necessary, and its semantic description is extracted. The
description is, in a sense, a snapshot of the user’s wishes in terms of information
needs, at the semantic level. Such snapshots are combined into a user model that,
under certain conditions, can drive the retrieval of related pages, i.e., of pages
semantically similar to the user needs as modeled by the proxy. Every time a new
request from the user’s browser is received, the user model is updated and possibly
the related pages are retrieved. Then, a new page is automatically composed as the
sum of the requested page and a list of suggested resources, and it is finally sent
back to the user browser.
The user can virtually be completely unaware of the search system located in
the proxy and can, in principle, think that the received pages are exactly the ones
he/she requested by typing a URL in the browser or by clicking a link on a web
page. In such a case the what’s related system is actually “transparent”.
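The incremental user-model update just described might look like the following sketch; the blending factor and the minimum-page policy are invented for illustration and are not Shortbread's actual values.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the User Profile's incremental model: each visited page's
 *  conceptual spectrum is blended into the running model, and suggestions
 *  are withheld until enough pages support the model. DECAY and MIN_PAGES
 *  are hypothetical parameters. */
public class UserModel {
    private static final double DECAY = 0.8;  // weight kept by the old model
    private static final int MIN_PAGES = 5;   // policy: no early suggestions
    private final Map<String, Double> model = new HashMap<>();
    private int pagesSeen = 0;

    public void update(Map<String, Double> pageSpectrum) {
        model.replaceAll((c, w) -> w * DECAY);          // let old interests fade
        for (Map.Entry<String, Double> e : pageSpectrum.entrySet())
            model.merge(e.getKey(), (1 - DECAY) * e.getValue(), Double::sum);
        pagesSeen++;
    }

    /** Suggestions are enabled only once the model is deemed accurate. */
    public boolean accurateEnough() { return pagesSeen >= MIN_PAGES; }

    public double weight(String concept) { return model.getOrDefault(concept, 0.0); }
}
```

The exponential decay lets the model track drifting interests while the page-count threshold implements the "no suggestions while inaccurate" policy described in the text.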
resource description was already present in H-DOSE), the U.P. updates its internal
user model by adding the knowledge related to the current page. This process
allows the system to incrementally extrapolate the user’s information needs, while
the model becomes more and more accurate at each new navigation step. Clearly,
the user model is accurate only if the user navigation is longer than a given
number of pages. The U.P. therefore has some policies to prevent the system from
making suggestions when the model is poorly accurate, avoiding disturbing the
user with irrelevant information. Once the model becomes accurate enough, the User
Profile can query the Semantic Information Retrieval system for resources semantically
related to the user model. In other words, a snapshot of the user model (i.e.,
a conceptual spectrum) is passed to the S.I.R. which, in turn, asks H-DOSE to
provide the URIs of those resources whose conceptual spectrum is reasonably similar
to the received user model. These URIs are then forwarded to the proxy, which mixes
the suggested information into the currently elaborated page, thus providing to the
user a complete page composed of the original content plus the newly generated
links to related pages.
Referring to Figure 7.8, the arrow labeled (4) exists, therefore, only when the
user model is accurate enough and there are resource descriptions available in the
DOSE platform. Otherwise only the arrow (5) exists, meaning that the proxy is
actually working as a simple relay.
The Semantic Information Retrieval system is basically a query generator for
the H-DOSE semantic platform. On the one hand, it can compose reverse queries,
i.e., queries that, starting from a simple URL, request the semantic descriptions
associated with the page identified by that URL. On the other hand, it can take as
input a conceptual spectrum, which encodes the model of user needs, and perform
a search, using H-DOSE, for those resources having similar conceptual descriptions.
Finally, the S.I.R. also has the ability to trigger H-DOSE indexing whenever a
reverse query fails, meaning that the page currently transiting through the proxy
has not yet been classified.
The XML-RPC to SOAP gateway is a protocol translator allowing simple
connections between the Semantic Information Retrieval system, which uses XML-RPC,
and the H-DOSE platform, which is based on web services and uses SOAP as its
communication protocol.
particular, it allows multiple concurrent connections from remote clients
requiring different tasks (search or indexing). Therefore, whenever a user deems it
necessary not to run H-DOSE on his/her personal machine, Shortbread can be configured
to access a remote H-DOSE server and to use that server as the semantic backbone
of the system.
Currently there are still some open issues when working remotely with H-DOSE:
user authentication, user management and ontology switching functionalities are in
fact very preliminary and do not allow efficient and secure management of user
rights and information. In other words, the current version of the H-DOSE platform
does not limit users to seeing only the information they are entitled to, but
offers the same visibility to all the users connected to a given H-DOSE server. This
becomes a critical issue if the user navigation concerns confidential topics/resources.
The author’s research group is currently addressing such issues and, in the
forthcoming new version of the H-DOSE platform, they will be tackled, allowing
for more effective and secure operation. As a last remark, it must be noticed that,
due to the modular nature of the approach and to the backward compatibility of
H-DOSE, the entire system will be able to work on future versions of the platform
without changes to the other modules.
7.4.3 Implementation
The system has been implemented in the Java programming language. This choice is
mainly related to the availability of a programmable proxy written in this language,
called Muffin [2], and to the fact that the H-DOSE platform itself has been
developed in Java. In more detail, Muffin is a programmable Java proxy based on
the notion of filter. A filter can act either on an entire page, as a whole, or on
single chunks (tokens) of the HTML code that represents the page itself. Page-level
filters can act both on the path between the user browser and the remote server
and on the inverse path. Token filters, instead, can only work on the return path.
The proxy component of ShortBread has been implemented as a token filter in
Muffin. Basically, it sniffs the URLs of the pages requested by the user while they
are on their way back to the browser. Then, if some suggestions are available, it
modifies the HTML code of the transiting page by adding a “what’s related” section
either at the end or at the beginning of the page. The Shortbread core, constituted
by the User Profile, the Semantic Information Retrieval system and by the XML-
RPC to SOAP gateway has been developed as a set of separate Java classes which
are included in the muffin filter and thus executed as part of the muffin process.
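The page-rewriting step performed by the token filter can be sketched as follows. This is an illustrative reconstruction, not Muffin's actual filter API: the class and method names are invented, and only the HTML-injection logic is shown.

```java
import java.util.List;

// Illustrative sketch of the ShortBread page-rewriting step: given the HTML of a
// transiting page and a list of suggested URLs, inject a "what's related" section
// before the closing </body> tag (or append it if no such tag exists).
// Names are hypothetical; Muffin's real filter interface differs.
public class RelatedInjector {

    public static String inject(String html, List<String> suggestions) {
        if (suggestions == null || suggestions.isEmpty()) {
            return html; // nothing to suggest: leave the page untouched
        }
        StringBuilder section =
                new StringBuilder("<div class=\"whats-related\"><h4>What's related</h4><ul>");
        for (String url : suggestions) {
            section.append("<li><a href=\"").append(url).append("\">")
                   .append(url).append("</a></li>");
        }
        section.append("</ul></div>");

        int bodyEnd = html.toLowerCase().lastIndexOf("</body>");
        if (bodyEnd < 0) {
            return html + section; // malformed page: append at the end
        }
        return html.substring(0, bodyEnd) + section + html.substring(bodyEnd);
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Dog care tips</p></body></html>";
        System.out.println(inject(page, List.of("http://www.dogs.org/nutrition")));
    }
}
```

In the real system this logic runs inside the Muffin process, on the return path from the remote server to the browser.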
H-DOSE, instead, is completely separate from ShortBread and is based on both
web services and intelligent agents. It runs in a servlet container (Apache
Tomcat) and can be distributed among several hardware devices in order to
maximize its performance. In the preliminary experiment setup ShortBread has
7 – Case studies
Chapter 8
H-DOSE related tools and utilities
This chapter is about the H-DOSE related tools and methodologies developed
during the platform design and implementation. They include a new ontology
visualization tool, a genetic algorithm for semantic annotation refinement, etc.
During the H-DOSE design and development, many side issues related to the main
stream of research have been addressed. In particular, two tools have been
developed, addressing respectively the problem of annotation generation/storage
and the problem of ontology creation/visualization.
The former research effort is an outcome of the first platform experiments.
When documents were annotated automatically, the corresponding spectra often
turned out to be very dense, i.e., they usually involved many concepts in the
ontology. This feature is sometimes undesirable, since it retains a great amount
of redundancy and may host several annotation errors, which easily go undetected,
buried among many low-relevance annotations. In order to tackle this issue, an
evolutionary refinement method for annotations has been developed, trying to
eliminate as much redundancy as possible while maintaining, at the same time, the
maximum annotation relevance. Evolutionary techniques have been chosen since this
is a classical optimization process, in which a trade-off between annotation
conciseness and relevance must be reached, and these techniques are known to be
well suited to this kind of problem.
The latter tool, instead, is related to the H-DOSE experimentation in the
Passepartout and CABLE test cases. One of the main aspects that emerged from
these case studies is that, often, even before appreciating the value added by
semantic integration, non-technical people do not understand the formal knowledge
model. They are, in fact, experts in the domain, but not in formalization
techniques. As a consequence, modeling errors often remain undetected, since RDF
and OWL are too cumbersome to be understood by non-experts.
8 – H-DOSE related tools and utilities
This situation prompted, in the author's research group, several insightful
discussions on how to make the formalization commitments in ontologies clearer.
Such discussions resulted in a new ontology visualization tool able to represent
conceptual models as enhanced trees in a 3D space. The tool has been tried out at
the CABLE final meeting as well as at the SWAP2005 conference, and proved quite
useful: in almost all cases, comments were about the visualized models and their
correctness rather than about the tool's interface and navigation paradigms.
This is clearly not a demonstration of effectiveness; however, it provides good
feedback on the tool's ability to ease the process of ontology design and
revision, especially when that process is performed by non-technicians.
8.1 – Genetic refinement of semantic annotations
Figure 8.1. The grammar tree for “Peter goes to Rome”. (NP = noun phrase,
N = noun, VP = verbal phrase, V = verb, PP = prepositional phrase, PREP =
preposition)
is related, i.e. to which such words have been associated by human experts and/or
automated rule extraction.
set, producing a set of semantic annotations (this is, for example, the basic
principle of SVM classification in the forthcoming version of H-DOSE).
Regardless of the method used to extract semantic descriptors from syntactic
information, the generated set of semantic annotations has at least three
peculiar characteristics: it possesses an accuracy degree, which describes how
well the resource semantics is captured by the annotations; it is composed of
many entities, in a number that is likely not manageable by human experts; and it
usually models redundant semantic information, at different granularity levels.
Useful semantic information should be classified as humans do it, trying to
preserve typical features of the human perception of reality while remaining
understandable by machines. Typical features of annotations created by experts
are conciseness, expressiveness, and focus. Human beings cannot handle huge
amounts of information; they have therefore evolved mental processes able to
extract the key elements of an external object or event, building concise models
that can guide decision processes when a given situation occurs. At the same
time, since the real world can offer different but similar situations manageable
in the same way, a conceptual model should be general and expressive enough to
allow reuse. Finally, mental models are usually focused on a precise sequence of
external stimuli, defining a sort of domain-specific knowledge similar to the
philosophical formalization of ontology.
Effective automatic annotation should therefore mimic these characteristics in
order to be useful in the Semantic Web; however, the available technologies are
still not able to provide the required level of accuracy, conciseness, and
expressiveness of semantic descriptors. Refinement is therefore needed to provide
effective semantic descriptors and, from a user perspective, effective
applications. Annotation refinement requires knowledge about domain concepts and
about the relationships between them, and should take into account the
granularity level of the available descriptors, since they can refer to entire
resources as well as to single paragraphs of a document. This process can be
modeled as an optimization problem with several constraints, some of which are
not explicit (the relation between semantic annotations referring to the same
resource, for example).
Genetic algorithms can be applied to such problems, providing at the same time a
highly dynamic system able to react quickly to changes in the initial annotation
base and to face changes in user behavior effectively.
There are several measures that allow assessing annotation quality and selecting
appropriate descriptors. For each semantic annotation, a relevance weight is
usually provided, quantifying how strongly a web resource is linked to a given
ontology concept. A first filtering phase could therefore select the most
relevant descriptors from the whole extracted set.
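Such a naive first cut, which the following paragraphs argue is insufficient on its own, could look like the sketch below; the Annotation record and the method names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;

// Naive first-phase filtering: keep only the k most relevant annotations for a
// resource, ranked by their relevance weight. The Annotation record is a
// hypothetical stand-in for the real annotation objects.
public class RelevanceFilter {

    public record Annotation(String topic, double weight) {}

    public static List<Annotation> topK(List<Annotation> annotations, int k) {
        return annotations.stream()
                .sorted(Comparator.comparingDouble(Annotation::weight).reversed())
                .limit(k)
                .toList();
    }

    public static void main(String[] args) {
        List<Annotation> all = List.of(
                new Annotation("dog", 1.0),
                new Annotation("care", 0.5),
                new Annotation("dog nutrition", 0.8));
        System.out.println(topK(all, 2)); // the two heaviest annotations
    }
}
```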
Unfortunately, this simple selection method is not able to discriminate between
relevant and irrelevant descriptors, since it relies on inaccurate data. The
relevance value, in fact, is provided by the automatic annotation process and may
be affected by errors in classification, association-rule extraction, etc. Even
in the simplest scenario, in which a set of words is associated with each
ontology concept, the relevance value is strongly influenced by that set, and a
flaw in the word definition process can compromise the real semantic relevance of
the produced annotations.
On the other hand, manually checking each generated annotation is infeasible,
even for small sets, and would be undesirable, since the entire annotation
process must be unsupervised.
The relevance value associated with each annotation should therefore be handled
with caution, as an indicative value, and should be complemented by other
considerations, taking into account the domain knowledge specification (the
ontology) and the granularity of the annotations. Let us clarify the scenario
with an example, supposing we have a set of seven annotations to refine. All
annotations refer to the same web page, about dog care. The web page is composed
of three paragraphs, about dog nutrition, fitness, and psychology respectively.
The extracted annotations are shown in Figure 8.3, while the ontology branch to
which they point is shown in Figure 8.4.
In this simple case, it is possible to see how an automatic process could reach
performance similar to human expert classification. First, the automatic process
should sort annotations by raw relevance; in the example, this would produce the
following sets: dog care, dog, dog nutrition, nutrition, and care for the first
resource, and dog nutrition, dog, and care for the second one. Second, it should
leverage the ontological relationships between concepts to set up selection
policies based on semantic links between topics and annotations, even if those
links are not explicitly modeled (inference). As an example, it should be able to
understand that dog nutrition and dog care are subclasses of the more general
concepts nutrition and care, and that both topics can be applied to dogs.
Finally, the refinement process should analyze the granularity level of the
annotated resources with respect to the ontology hierarchy, and should change the
selection order so that more general resources are annotated with more general
topics.
In the proposed example, the resulting annotation sets would be ordered as
follows: dog, dog care, nutrition, care, dog nutrition for the first resource,
and dog nutrition, dog, and care for the second resource. Finally, a filtering
process selects the most relevant annotations, providing results that, compared
with those provided by human experts, possess a satisfying level of conciseness
and expressiveness.
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ex="http://www.example.org/ontology#">
<!-- whole resource -->
<rdf:Description rdf:nodeID="A0">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:dog" />
<ex:weight>1.0</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A1">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:care" />
<ex:weight>0.5</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A2">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:dog care" />
<ex:weight>1.0</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A3">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:nutrition" />
<ex:weight>0.5</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A4">
<ex:uri rdf:resource="http://www.dogs.org/care"/>
<ex:topic rdf:resource="ex:dog nutrition" />
<ex:weight>0.8</ex:weight>
</rdf:Description>
<!-- first paragraph -->
<rdf:Description rdf:nodeID="A5">
<ex:uri rdf:resource="http://www.dogs.org/care#p[1]"/>
<ex:topic rdf:resource="ex:dog" />
<ex:weight>0.5</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A6">
<ex:uri rdf:resource="http://www.dogs.org/care#p[1]"/>
<ex:topic rdf:resource="ex:care" />
<ex:weight>0.2</ex:weight>
</rdf:Description>
<rdf:Description rdf:nodeID="A7">
<ex:uri rdf:resource="http://www.dogs.org/care#p[1]"/>
<ex:topic rdf:resource="ex:dog nutrition" />
<ex:weight>3.0</ex:weight>
</rdf:Description>
</rdf:RDF>
are also able to track changes in the global optimum, i.e., changes in the
fitness landscape, by storing time-dependent information in the population state
[34], thus allowing the implementation of effective and reactive annotation
refinement systems.
In the evolutionary refinement of semantic annotations, the individual size in
terms of annotated topics fixes the number of relevant annotations allowed for
each indexed resource, and should be tuned to achieve the best compromise between
conciseness and expressiveness, limiting information loss as much as possible.
Mutation introduces innovation into the population, crossover changes the context
of already available, useful information, and selection directs the search toward
better regions of the search space. Acting together, mutation and recombination
explore the annotation space, while selection exploits the information
represented within the population. The balance between exploration and
exploitation, i.e., between the creation of diversity and its reduction by
focusing on the fitter individuals, determines whether the evolutionary algorithm
(EA) achieves reasonable performance.
Design
The goal of the evolutionary algorithm is to evolve a population of semantic
annotation sets, each referring to a specific web resource, using genetic
operators such as mutation and crossover. Individuals are composed of a fixed
number of genes, each identifying a single link between a topic and a web
resource (an annotation), together with the corresponding relevance value
(Figure 8.5).
The proposed solution uses a steady-state evolutionary paradigm E(µ + λ): at
each generation, λ individuals are selected to undergo genetic modifications, the
resulting µ + λ population is evaluated, and the best µ individuals are
transferred to the next generation. The selection operator is a tournament
selection with a tournament size of 2, equivalent to a roulette wheel on the
linearized fitness.
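One generation of this scheme can be sketched as follows. Individuals are reduced to bare fitness values and the genetic modification is a simple stand-in, so the sketch only illustrates the selection and survival mechanics, not the thesis implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch of one generation of the E(mu + lambda) steady-state scheme: lambda
// individuals are picked by tournament selection (size 2), modified, pooled with
// the current population, and the best mu survive. All names are illustrative.
public class SteadyStateStep {

    static final Random RND = new Random(42);

    // Tournament selection with size 2: the fitter of two random individuals wins.
    static double tournament(List<Double> population) {
        double a = population.get(RND.nextInt(population.size()));
        double b = population.get(RND.nextInt(population.size()));
        return Math.max(a, b);
    }

    public static List<Double> generation(List<Double> population, int mu, int lambda) {
        List<Double> pool = new ArrayList<>(population);
        for (int i = 0; i < lambda; i++) {
            double parent = tournament(population);
            // Stand-in for mutation/crossover: perturb the parent's fitness.
            pool.add(parent + RND.nextGaussian() * 0.1);
        }
        // Keep the best mu individuals of the mu + lambda pool.
        pool.sort(Comparator.reverseOrder());
        return new ArrayList<>(pool.subList(0, mu));
    }

    public static void main(String[] args) {
        List<Double> pop = new ArrayList<>(List.of(0.2, 0.5, 0.9, 0.4, 0.1));
        System.out.println(generation(pop, 5, 2));
    }
}
```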
The tournament size fixes the convergence rate of the algorithm; its value is
intentionally kept low in order to avoid premature convergence on sub-optimal
solutions. Evolutionary annotation refinement is organized into two main phases:
firstly, the annotations available for each indexed resource are collapsed into a
small set, according to relevance weight, topic occurrence in the annotation
base, and structural hierarchy. Secondly, the annotation base is updated with
information from the extracted set, taking into account the effect of the new
topic coverage distribution. At the end of this second phase, the process
restarts, thus providing a continuous refinement cycle (Figure 8.6) able to cope
with changes in the annotation base.
In the first phase, the available set of annotations is extracted for each
indexed resource. Each annotation, together with its relevance value, becomes a
gene stored in a resource-specific gene repository G(R). From this repository,
individuals are created at random, taking as DNA a manually tuned number of
genes, which fixes the conciseness degree of the refined semantic descriptors.
Once the initial population has been generated, the evolutionary process evolves
a population of individuals whose fitness is given by a combination of the
relevance-weight contribution and the annotation-base topic-coverage
contribution, with respect to the annotation granularity level. The annotation
base has, in fact, a dual nature: on one side, it stores annotations pointing at
web resources, thus modeling indexing relevance (the value associated with each
annotation). On the other side, it models how well the conceptual specification
covers the real knowledge domain. Basically, a certain number of annotations
point to each ontology concept; this value measures how well the conceptual
descriptions fit the domain model: ontology concepts pointed to by many semantic
annotations and located at deep levels of the concept hierarchy usually identify
redundant information, or information for which the syntax-to-semantics
conversion has given poor results.
d(gi, gj) ∝ 1 / dist(gi.topic(), gj.topic())
Although the fitness function is morphological rather than time-dependent, i.e.,
the fitness value depends explicitly on factors related to the individual's
physical characteristics, there is a non-explicit constraint that turns the
overall behavior of the algorithm into a dynamic one.
At each refinement cycle, the annotation base is, in fact, updated by associating
with the whole annotation set the topic coverage distribution reached in the last
optimization cycle. In the subsequent refinement cycle, individuals having the
same DNA as those in the current cycle will possess a different fitness value,
according to the results of the last optimization. Moreover, indexing cycles may
happen during the refinement phase, adding variability to the fitness landscape
that the algorithm walks to reach the optimum. In this sense, the overall
evolutionary refinement cycle assumes a highly dynamic behavior.
Currently, one mutation operator and one crossover operator have been defined.
The mutation operator simply extracts a new gene from the gene repository
associated with a resource and substitutes it for a randomly chosen gene in the
individual's DNA.
Recombination is performed uniformly, by selecting two individuals and
alternately cloning individual genes. In other words, since individuals have the
same DNA size, each DNA element is selected with a uniform probability
distribution between the individuals participating in the crossover. This
operator can produce invalid individuals, i.e., individuals having one or more
duplicated genes. Such individuals are invalid because resources referring more
than once to the same ontology concept are not as meaningful as needed. The
author designed a simple policy to avoid this eventuality: individuals with
duplicated genes are automatically dropped and cannot take part in the new
population. Table 8.1 summarizes the adopted genetic operators.
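The two operators can be sketched as follows. Genes are reduced to topic identifiers and all names are illustrative, not the thesis implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

// Sketch of the two genetic operators: mutation draws a fresh gene from the
// resource-specific repository G(R); uniform crossover clones each gene from one
// of the two parents with equal probability, and children with duplicated genes
// are rejected. Representations are illustrative.
public class GeneticOperators {

    static final Random RND = new Random();

    // Mutation: replace a randomly chosen gene with one drawn from the repository.
    public static String[] mutate(String[] dna, List<String> repository) {
        String[] child = dna.clone();
        child[RND.nextInt(child.length)] = repository.get(RND.nextInt(repository.size()));
        return child;
    }

    // Uniform crossover; returns null for invalid (duplicate-gene) children,
    // which the caller simply drops from the new population.
    public static String[] crossover(String[] p1, String[] p2) {
        String[] child = new String[p1.length];
        for (int i = 0; i < child.length; i++) {
            child[i] = RND.nextBoolean() ? p1[i] : p2[i];
        }
        boolean valid = new HashSet<>(Arrays.asList(child)).size() == child.length;
        return valid ? child : null;
    }

    public static void main(String[] args) {
        String[] a = {"dog", "care", "nutrition"};
        String[] b = {"dog", "fitness", "psychology"};
        String[] child = crossover(a, b);
        System.out.println(child == null ? "dropped (duplicate genes)" : Arrays.toString(child));
    }
}
```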
A stop criterion has also been defined for the evolutionary refinement of each
resource's annotations, based on population uniformity: whenever the diversity
among individuals falls under a given threshold, or whenever the best
individual's fitness stops increasing, the cycle is interrupted and the best set
of semantic descriptors (individual) is selected as the local optimum. The
iterated nature of the refinement ensures that the global optimum can be
approached, keeping in mind that there is no absolute recipe for defining the
optimum from the user's point of view, which is the ultimate goal of refinement.
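The stop criterion can be sketched as follows, with diversity measured, for illustration, as the fraction of distinct DNAs in the population; the text does not prescribe a specific diversity measure, so this choice is an assumption.

```java
import java.util.HashSet;
import java.util.List;

// Sketch of the stop criterion: the refinement cycle for a resource halts when
// population diversity falls under a threshold, or when the best fitness has
// stopped increasing. DNAs are reduced to strings; names are illustrative.
public class StopCriterion {

    // Diversity as the fraction of distinct DNAs in the population (assumed measure).
    public static double diversity(List<String> dnas) {
        return (double) new HashSet<>(dnas).size() / dnas.size();
    }

    public static boolean shouldStop(List<String> dnas, double diversityThreshold,
                                     double bestFitness, double previousBestFitness) {
        boolean uniform = diversity(dnas) <= diversityThreshold;
        boolean stagnant = bestFitness <= previousBestFitness;
        return uniform || stagnant;
    }

    public static void main(String[] args) {
        List<String> population = List.of("abc", "abc", "abc", "abd");
        // Diversity 0.5 is above the threshold and fitness still improves:
        System.out.println(shouldStop(population, 0.3, 0.71, 0.70)); // prints false
    }
}
```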
Implementation
The evolutionary refinement strategy has been deployed as a Java class defined by
a high-level interface, for abstraction from implementation details, as well as done
Parameter                        Value
Penalty Ph                       1
Dependency d(gi, gj)             1 / dist(gi.topic(), gj.topic())
Topic coverage factor T(g)       #Annotations
Annotation granularity FR(R)     #( / ∈ resource URI)
Table 8.2. Fitness-specific parameters.
Results
Several testbeds have been defined for the evolutionary refinement of semantic
annotations proposed in this section. First, the evolutionary algorithm
parameters were set up to allow fair comparisons across different annotation
bases (Table 8.3). The experiments used the latest available version of the
H-DOSE platform, which allows semantic indexing of web resources in many
languages. The underlying ontology has been developed by the H-DOSE authors in
collaboration with the Passepartout service of the city of Turin.
The ontology counts nearly 450 concepts organized into 4 main areas; for each
ontology concept, a definition and a set of lexical entities has been specified
[35], for a total of over 2500 words, allowing shallow NLP-based text-to-concept
mapping.
The first experiment involved information from the Passepartout web site: 50
pages were indexed using the standard methods provided by H-DOSE (a simple bag of
words) and the corresponding annotations were stored in the Annotation
Repository. The newly created module for evolutionary refinement ran on this data
and produced as output a runtime version of the Annotation Repository, accessible
from the search engine service. The initial evaluation aimed at assessing the
feasibility of the approach and the validity of the results in a simple static
scenario. The annotations stored in the repository
Parameters Values
Diversity threshold 0
Fitness increase threshold for the best individual 0
Number of individuals (µ) 50
Number of new individuals (λ) 20
Probability of mutation vs. crossover 0.1
Individual size 3
Table 8.3. Evolutionary strategy parameters.
were therefore refined a fixed number of times (10), without allowing
modifications to the original set of semantic descriptors.
A total of 276 annotations, corresponding to 28 relevant resources, was obtained
as the indexing result; the 22 remaining documents were judged not relevant by
H-DOSE. After the evolutionary refinement, the runtime version of the Annotation
Repository contained about 97 annotations referring to a total of 21 resources.
The difference in the number of annotated resources was caused by the fixed-size
individuals, which were not able to model resources with fewer than 3 (the
individual size) annotations. Besides the fact that resources with a very low
number of annotations are likely to be too specific or even wrongly annotated,
this problem can be fixed by propagating into the refined annotation base the
descriptors referring to such resources.
The semantic annotations stored in both repositories (the original and the
refined one) were evaluated by human experts, who specified a relevance value
between 0 and 100 for each annotation. The mean relevance was evaluated both at
the single-annotation level and at the resource level; Table 8.4 summarizes the
obtained results.
As can easily be noticed, the two mean relevance values seem to contradict each
other, but this is not the case. Relevance at the single-annotation level
increases because many annotations with low relevance values were judged not
relevant by the experts and were purged by the evolutionary refinement, thus
raising the mean relevance figure. At the resource level, on the other hand, each
annotation contributes to the total relevance value; therefore, all other
parameters being equal, the higher the number of annotations for each resource,
the higher the corresponding relevance value (Figure 8.7).
In the best scenario, all annotations purged by the evolutionary system would be
judged completely non-relevant by human experts, and the mean relevance values
would be the same for the original and the refined annotation base.
The overall performance was interesting, since the total expressiveness reduction
of the annotation base, in terms of relevance at the resource level, was around 36%
while the repository size reduction was about 65%. The expressive density de,
defined as the mean relevance over the number of stored annotations, was
respectively 0.035 for the original annotation base and 0.063 for the refined one:
de = mean(relevance) / #Annotations
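The computation behind these figures is straightforward. In the sketch below, the annotation counts come from the text, while the mean relevance values are back-computed from the reported densities and counts, for illustration only.

```java
// Sketch of the expressive density figure used to compare the two annotation
// bases: mean resource-level relevance divided by the number of stored
// annotations. The mean relevance inputs are back-computed from the reported
// densities (0.035 and 0.063) and counts (276 and 97), not measured data.
public class ExpressiveDensity {

    public static double density(double meanRelevance, int annotationCount) {
        return meanRelevance / annotationCount;
    }

    public static void main(String[] args) {
        System.out.println("original: " + density(9.66, 276)); // ~0.035
        System.out.println("refined:  " + density(6.11, 97));  // ~0.063
    }
}
```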
The second experiment involved a real-time scenario, in which the H-DOSE platform
works on-line and indexing can be triggered by automatic processes, external
applications, and search failures at any time during the refinement process.
Since the evolutionary refinement works continuously, taking a sort of snapshot
of the Annotation Repository content at each refinement cycle, changes in the
original repository propagate to the refined one with a maximum delay of one
cycle.
The experiment involved the indexing of on-line web resources from the “asphi”
web site [36]. Asphi is a private foundation that promotes the adoption of
informatics aids for people with disabilities by financing several research
projects.
A snapshot of the two repositories was taken and the same evaluation as in
the static scenario was performed. Table 8.5 details the corresponding results,
while the chart in Figure 8.8 depicts the corresponding relevance, at the
resource level, for each indexed fragment.
On-line operation and user-related changes in the original Annotation Repository
do not affect the refiner's behavior too much, allowing satisfying performance to
be reached even on a dynamic fitness landscape. As can easily be seen from the
experimental data, the relevance loss can be estimated at around 50%, while the
repository size reduction is about 75%, with a corresponding expressive density
of 0.060 for the refined annotation repository and of 0.031 for the original one.
8.2 OntoSphere
In recent years, the Semantic Web has been constantly evolving from the vision of
a few people into a tangible presence on the Web, with many tools, portals,
ontologies, etc.
This great evolution has involved many researchers from different countries and
has been primarily focused on technologies. By now, a web developer can seriously
consider the opportunity of providing semantically tagged content, as the needed
tools and standards are available. However, the current web panorama shows very
little adoption of semantics. The motivations for such low adoption can be
various and related to very different aspects: technology immaturity, failing
dissemination, user and developer resistance to change, etc. In this sea of
possible failures and shortcomings, interfaces play a relevant role, often
discriminating good solutions from bad ones. This is particularly true for tools
related to knowledge modeling and visualization, where the involved information
can be quite complex and multidimensional.
Several attempts aim at providing effective interfaces for knowledge modeling,
i.e., for ontology creation and visualization. Protégé [37] and OntoEdit [38],
for example, are complete IDEs (Integrated Development Environments) that address
in a single application all the aspects related to ontology creation, checking,
and visualization (through proper plug-ins). Such tools, although adopting rather
different paradigms for editing and inspecting ontologies, have in common a
two-dimensional approach to ontology visualization. The two-dimensional approach
can be variously efficient, and there are good solutions available on the web:
GraphViz [39], Jambalaya [40], and OntoViz [41], just to name a few.
Nevertheless, mapping the many dimensions involved in an ontology, such as the
concept hierarchy, the semantic relationships, the instances, and the possible
axioms defining a given knowledge domain, onto only two dimensions can sometimes
be too restrictive.
The author, in collaboration with some colleagues2, proposes OntoSphere3, a new
tool for inspecting and, in the near future, for editing ontologies using a
more-than-3-dimensional space. The proposed approach visualizes the mere
topological information in a 3D viewport, thus leveraging one more dimension than
current solutions. This allows, at least, a better organization of the visual
occupation of the represented data. Since the 3-dimensional view is quite natural
for humans, especially as far as navigation is concerned, the proposed approach
can be more effective in browsing, as it involves “manipulation-level” operations
such as zooming, rotating, and translating objects.
In addition, many more dimensions are introduced to convey information about the
visualized knowledge model (meta-information). The extension of the subtrees
lying under the currently viewed concepts is, for example, visually rendered by
increasing the size of the visual cues adopted for them. The same approach is
applied to colors, which are used to add insight into the representation: blue
spheres, for example, indicate that the corresponding concepts are terminal nodes
in the
2 Alessio Bosca, who developed the 3D visualization panel, and Paolo Pellegrino
3 First presented at SWAP2005, Semantic Web Applications and Perspectives, Trento, Italy
RootFocus scene
This perspective presents a big “Earth-like” sphere bearing on its surface a
collection of concepts represented as smaller spheres (Figure 8.9). The scene
does not visualize any taxonomic information and only shows the direct “semantic”
relations between elements of the scene, usually a not completely connected
graph. Atomic nodes, those without any subclass, are smaller and depicted in
blue, while the others are colored in white, with a size proportional to the
number of elements contained in their sub-tree.
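The mapping from a concept to its visual cue can be sketched as follows; the base radius and scale factor are invented for the example, as the text only states that size grows with the sub-tree and that atomic nodes are blue.

```java
// Sketch of the visual-cue mapping described for the RootFocus scene: atomic
// concepts (no subclasses) are small blue spheres, while the others are white,
// with a radius that grows with the size of their sub-tree. The base radius and
// scale factor below are hypothetical.
public class ConceptCue {

    public record Cue(String color, double radius) {}

    public static Cue cueFor(int subTreeSize) {
        if (subTreeSize == 0) {
            return new Cue("blue", 0.5); // atomic node: small blue sphere
        }
        // Non-atomic node: white, radius proportional to sub-tree size.
        return new Cue("white", 0.5 + 0.1 * subTreeSize);
    }

    public static void main(String[] args) {
        System.out.println(cueFor(0));  // leaf concept
        System.out.println(cueFor(12)); // heavily sub-classed concept
    }
}
```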
This view is particularly intended for representing the ontology primitives,
i.e., the root concepts, but it can also be used, during navigation of the
ontology, to visualize the direct children of a given node, a pretty useful
option in the case of heavily sub-classed concepts. The topmost concepts within
the ontology, and the relations between them, define the conceptual boundaries of
the domain and provide a very good hint at the question: “what is the ontology
about?”
TreeFocus scene
The scene shows the sub-tree originating from a concept; it displays the
hierarchical structure as well as the semantic relations between classes. Since
usage evidence shows that too many elements on the screen at the same time hinder
user attention, the scene presents only three fully-expanded levels at a time
and, as the user browses the tree, the system automatically performs expansion
and collapse operations in order to maintain a reasonable scene complexity. The reader
may note in Figure 8.10 how focusing the attention on the concept
“ente pubblico locale”, on the left in the figure, causes (with a simple mouse
click) the vanishing of the uninteresting branches, which are collapsed into
their respective parents (Figure 8.10, on the right). Collapsed elements are
colored in white, with a size proportional to the number of elements in their
sub-tree; concepts located at the same depth level within the tree, instead,
share the same color, in order to easily spot groups of siblings. Hierarchical
relationships within the scene are displayed in a neutral color (gray) and
without labels, whereas other semantic relations involving two concepts already
in the scene are displayed in red, accompanied by the name of the relation (as in
the “RootFocus” perspective). If an element of the tree is related to a node that
is not present in the scene, a small sphere is added for that node close to the
element's visual cue, terminating the arrow: in such cases, incoming relations
are represented with a green arrow, outgoing ones with a red one.
ConceptFocus scene
This perspective depicts all the available information about a single concept at
the highest possible level of detail; it reports the concept's children and
parent(s), its ancestor root(s), and its semantic relations, both those directly
declared for the given concept and those inherited from its ancestors.
Semantic relations are drawn as arrows terminating in a small sphere: red if the
relation is outgoing and green otherwise (Figure 8.11). Direct relations are drawn
close to the concept and in an opaque color, while inherited ones are located a
bit farther from the center and depicted in a fairly transparent color.
This scene is pretty useful during consistency-checking operations, because it
eases the spotting of inconsistent relations whenever a concept inherits from an
ancestor a property that “conceptually” contrasts with other features of its own.
Checking the ontology for “conceptual consistency” is rather different from the
formal consistency checking done by logic reasoners. What has to be checked is
not the ontology's consistency for reasoning and inference, but whether a user
can detect domain-related inconsistencies introduced during the ontology design
process. For example, a concept may inherit some relationships that are not
appropriate for it, either because of a wrong parent-child relation or because of
a previously undetected error in the domain modeling: in this case the ontology
is formally consistent, but not conceptually.
The ontologies involved in this test were the same used in the previous one, and so
were the users. Some interesting aspects emerged from the experimentation.
When the proposed application is used for ontology development, the support
provided for detecting conceptual inconsistencies is much more evident. The adoption
of OntoSphere for inspecting the work in progress, in fact, allows modeling errors to
be detected easily. In particular, the most frequently recognized errors concern
relationship propagation along the ontology hierarchy and wrong definitions of
parent-child (isA) relationships.
Although it is quite difficult to fill in a table showing how, and to what extent,
the proposed application supports the process of ontology creation, interviews
with users show that the experimenters are often able to quickly spot the
modeling errors. In their opinion, the intuitive visualization and the capability
to visually represent inherited and inferred relationships are the main factors
behind the success of their own modeling process. This last experiment actually
lies between the functional tests and real-world test cases. However, to provide a
more grounded experimentation (please note that the results presented here are still
very preliminary) the authors performed a real-world test on the occasion of the final
meeting of CABLE, a European MINERVA project on “CAse Based e-Learning
for Educators”. In that meeting, a demo of the OntoSphere application was
presented to visualize the ontology developed in the context of the CABLE project.
The exciting result is that, rather than complaining about the complexity of the
provided interface, or about the appearance or the controls for browsing the ontology,
the first observation was: “No! That relation cannot subsist between those two
concepts!”. What happened, surprisingly, is that the application was able to highlight
the inherited relations so that the errors were spotted within a few minutes of ontology
browsing. This is clearly not a scientific result, since experiments must be conducted
in a controlled environment, must have a clear objective and must be carried out by
a significant group of users; nor does this paragraph aim to present such a reaction
as a proper result by itself. However, the user reactions at the CABLE meeting are
encouraging signals that the still preliminary OntoSphere application can be a
valuable instrument in ontology design and development.
As a last experiment, a simple scalability test was performed: the goal was to
understand whether OntoSphere is able to load and visualize ontologies with large
numbers of concepts and relationships. The entire SUMO ontology was therefore
loaded and browsed: the loading process took around 3.5 seconds, while navigation
was performed in real time. SUMO, the Suggested Upper Merged Ontology, is
currently released under a GPL license and counts about 20,000 concepts related by
over 60,000 axioms.
There are still some issues to be fixed when browsing very large ontologies: the
visualized concepts tend to collide when many of them are displayed at the same
time, and the labels tend to overlap, making the visualization harder to manage
(as in many other viewers). Moreover, since a human cannot keep track of more
than a reasonable number of objects at a time, large graphs should be collapsed
and different ontology navigation patterns and interfaces should be provided.
Chapter 9
Semantics beyond the Web
9.1 Architecture
In this section an overview of the Domotic House Gateway (DHG), developed by the
author and his colleagues, is presented, followed by a more detailed description of
the involved components. The proposed system has been designed to support
intelligent information exchange among heterogeneous domotic systems
which would not otherwise natively cooperate in the same household environment.
The concept of event is used to exchange messages between a device and the
DHG. As will be seen, these low-level events are converted into logical events inside
the DHG, so as to clearly separate the actual physical issues from the semantics
that lies behind the devices and their role in the house (functional semantics). In
this way, it is also possible to abstract a coherent and human-comprehensible view
of the household environment.
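The conversion from driver-specific messages to device-independent logical events can be sketched as follows. The registry layout, driver names and event vocabularies here are hypothetical, chosen only to illustrate the separation between physical and functional semantics.

```python
# Sketch of the low-level -> logical event conversion: a raw, driver-specific
# message is mapped to a logical event expressed in terms of the device's role
# and location in the house, before any rule-based reasoning takes place.

# Hypothetical registry: (driver, device id) -> (logical device, location)
REGISTRY = {("BTicino", "11"): ("light", "kitchen")}

# Hypothetical per-driver translation of raw command codes to logical events
EVENT_MAP = {"BTicino": {"ON": "switched_on", "OFF": "switched_off"}}

def to_logical_event(driver, dev_id, raw_code):
    """Translate a low-level message into a device-independent logical event."""
    device, location = REGISTRY[(driver, dev_id)]
    return {"device": device,
            "location": location,
            "event": EVENT_MAP[driver][raw_code]}

event = to_logical_event("BTicino", "11", "ON")
```

The rest of the gateway only ever sees events such as "the kitchen light switched on", never the bus-specific codes of the underlying domotic system.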
The general architecture of the DHG is deployed as shown in Figure 9.1, and can
be divided into three main layers: the topmost layer involves all the devices that
can be connected to the DHG and their software drivers, which adapt the device
technologies to the gateway requirements. The central layer is mainly responsible
for routing low-level event messages to and from the various devices and the DHG,
and also includes the generation of special system events to guarantee platform
stability.
The last layer is the actual intelligent core of the system. It is devoted to
event management at the logical (or semantic) level and, to this purpose, it includes
a rule-based engine which can be dynamically adjusted either by the system itself or
manually through external interfaces. The rules define the expected reactions to
incoming events, which can be generated either by the house (“the door is opening”,
for example) or by an application (“open the door”).
Complex scenarios may involve the rule engine: for instance, a rule might generate
a “switch on the light in room x” event if two events occur: room x is
dark and a sensor has detected the presence of somebody in that room. Additionally, an
automatic rule learning algorithm is under study: the logged events are processed
to infer common event patterns that are very likely to be repeated in the future.
New rules can first be proposed to the household members and then possibly accepted
and added to the existing rule set. Dedicated interfaces allow each rule to be checked,
modified, added or deleted, and different rule sets may eventually be used.
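A minimal sketch of such a reaction rule is given below, assuming a simple condition/action format and a flat state dictionary; this is an illustration of the idea, not the actual DHG rule engine.

```python
# Sketch: a rule fires an action once all of its conditions hold in the
# current house state. The example reproduces the scenario above: switch on
# the kitchen light when the room is dark AND presence has been detected.
def make_rule(conditions, action):
    """Build a rule firing `action` when every condition holds in `state`."""
    def fire(state):
        if all(state.get(k) == v for k, v in conditions.items()):
            return action
        return None
    return fire

light_rule = make_rule(
    {("kitchen", "dark"): True, ("kitchen", "presence"): True},
    {"device": "light", "location": "kitchen", "event": "switch_on"})

state = {("kitchen", "dark"): True}
assert light_rule(state) is None          # only one condition holds: no event
state[("kitchen", "presence")] = True
generated = light_rule(state)             # both conditions hold: event fires
```

The generated event would then be routed back through the Event Handler like any other logical event.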
In the next sub-sections, a detailed description is presented for each block of
Figure 9.1.
Basically, at the low level each device is associated with the appropriate driver, while
at the high (or logical) level the device is assigned a unique identifier and type, as well
as a logical placement in a location of the house (e.g., the living room). For instance,
a light might be registered as the 11th device within the driver that controls its
domotic system (and which is presumably connected to its main control device),
while its type is Light, supporting on and off events, and its location is the kitchen:
<house>
  <room name="kitchen">
    <device name="light" devID="11"
            devType="Light" driver="BTicino" />
  </room>
  <room name="living room">
    <device name="set top box"
            devID="192.168.1.33"
            devType="STB" driver="STB" />
  </room>
</house>
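A house description of this kind can be read at start-up to build the gateway's device registry; a self-contained sketch (with the XML embedded, using cleaned-up quoting and a corrected "set top box" label) could look as follows. The registry layout is an assumption for illustration.

```python
# Sketch: parse the house configuration and index every device by its
# (driver, devID) pair, recording its logical name, type and room.
import xml.etree.ElementTree as ET

HOUSE_XML = """
<house>
  <room name="kitchen">
    <device name="light" devID="11" devType="Light" driver="BTicino" />
  </room>
  <room name="living room">
    <device name="set top box" devID="192.168.1.33" devType="STB" driver="STB" />
  </room>
</house>
"""

def load_devices(xml_text):
    """Return {(driver, devID): (name, type, room)} from a house description."""
    registry = {}
    for room in ET.fromstring(xml_text).iter("room"):
        for dev in room.iter("device"):
            key = (dev.get("driver"), dev.get("devID"))
            registry[key] = (dev.get("name"), dev.get("devType"),
                             room.get("name"))
    return registry

registry = load_devices(HOUSE_XML)
```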
It should be noted that the registration of a new device type implies the
extension of an existing type (class or concept), whose events are inherited, and
may also involve the registration of new events (Figure 9.4).
A number of predefined device types are initially provided, as well as a list
of supported events for each device type. This information will become part of the
house knowledge base (or House Model), and should facilitate the design of drivers
and devices, especially in terms of interoperability.
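The type-extension rule stated above (a new device type extends an existing one, inherits its events and may add new ones) can be sketched with a plain class hierarchy; the type and event names are invented examples, not the actual House Model vocabulary.

```python
# Sketch: each device type inherits the events of the type it extends and may
# declare additional ones, mirroring the class/concept extension in the model.
class DeviceType:
    events = {"on", "off"}                       # events of the base type

class Light(DeviceType):
    events = DeviceType.events | {"dim"}         # inherits on/off, adds dim

class ColorLight(Light):
    events = Light.events | {"set_color"}        # inherits on/off/dim

assert "on" in ColorLight.events and "dim" in ColorLight.events
```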
The DIS can generate new (logical) events at run-time, based either on events
coming from the house through the Event Handler or on predefined or inferred
“rules” that may, for instance, act at a specified time. The current implementation
of the DIS is based on a run-time engine that processes rules according to the events
received from the Event Handler.
When certain conditions are met, new events may be generated and sent back
to the Event Handler, which routes them to the correct devices through the Communication
Layer. The rules can be preloaded, added manually via external interfaces
such as a simple console, or added through the Rule Miner, which examines
the event log (see the Event Logger) to infer new rules. At the moment, some rule
control mechanisms are being studied to prevent annoying or dangerous situations.
At the very least, it is possible to save the rules and the status of the DIS at any
time for future reloading.
The Event Logger receives events from the Event Handler and saves them to a file, in
order to facilitate the identification of possible erratic behaviors. Both logical events
and system events can be logged, and through external interfaces it is possible to
specify filters limiting the amount of stored data. The Event Logger is
also used by the Rule Miner to (semi-automatically) generate rules for the Domotic
Intelligence System.
The Rule Miner tries to infer new rules for the Domotic Intelligence System by
reading and analyzing the event log. The idea is to identify common event patterns
so as to forecast and automatically generate events according to such patterns.
However, this is still work in progress, as it is also necessary to take into account
dangerous situations as well as possibly conflicting actions, which should actually
prevent the execution of certain inferred rules. To this purpose the Rule Miner
might exploit the House Model and other sources of information as a knowledge base,
to achieve a more consistent understanding of what the events represent, and therefore
of what is happening in the house.
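A deliberately simple sketch of the pattern-mining idea follows: scan the log for pairs of consecutive events and propose a rule "when A happens, do B" whenever the pair recurs often enough. The Rule Miner itself is still work in progress, so the log format and support threshold here are assumptions.

```python
# Sketch: count consecutive event pairs in the log and propose a rule for
# every pair observed at least `min_support` times.
from collections import Counter

def propose_rules(log, min_support=3):
    """Return [(trigger, action)] for event pairs seen >= min_support times."""
    pairs = Counter(zip(log, log[1:]))
    return [pair for pair, n in pairs.items() if n >= min_support]

# Hypothetical log: the kitchen light is repeatedly switched on right after
# somebody enters the kitchen.
log = ["enter_kitchen", "light_on", "enter_kitchen", "light_on",
       "leave_kitchen", "enter_kitchen", "light_on"]
rules = propose_rules(log)
```

Before being activated, any proposed rule would still have to pass the safety checks (and the user confirmation step) discussed above.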
9.2 Testing environment
by a common PC running Linux. The second and last one is the MServ open source
program [47], again running under Linux, capable of remotely choosing and playing
music files.
9.3 Preliminary Results
The simple rules inserted into the system were also appropriately executed. So,
for instance, the light was automatically switched on after the windows in the same
room were all shut, the shutters being controllable domotic devices. However,
particular attention is necessary when inserting rules into the system. In fact,
referring to the previous example, if the shutters remain closed one must still be able
to switch off the light without causing the system to switch it back on shortly
thereafter. These types of issues, as well as cases regarding more critical devices such
as an oven which may contain improper items, are to be considered with great care,
especially when designing an automatic way to infer new rules.
Chapter 10
Related Works
This chapter presents the related work in the fields of both the Semantic Web
and Web Intelligence, with a particular focus on semantic platforms and
semantics integration on the Web.
In May 2001, Tim Berners-Lee published the Semantic Web manifesto in “Scientific
American”; in that article he proposed a new vision of the web: “The Semantic
Web (SW) is not a separate Web but an extension of the current one, in which
information is given well-defined meaning, better enabling computers and people to
work in cooperation.”. In his view the next generation of the web will be strongly
based on semantics, in order to allow effective communication between humans and
machines, leading to a powerful collaboration between them in accomplishing tasks.
As he said, the Semantic Web will bring structure to the meaningful content of
Web pages, creating an environment where software agents roaming from page to
page can readily carry out sophisticated tasks for users. Such an agent coming to a
clinic’s Web page will know not just that the page has keywords such as “treatment,
medicine, physical, therapy” (as might be encoded today) but also that Dr. Hartman
works at that clinic on Mondays, Wednesdays and Fridays [7].
The ideas formalized by Berners-Lee emerged after years of research on Artificial
Intelligence and from the relatively recent research on the Web, and gathered
many researchers from all over the world, promoting further exploration toward the next
generation of the web. During the past five years the Semantic Web community
has been one of the most active research communities in the world, producing many
diverse technologies and applications that try to turn the SW vision into reality. Many
milestones have been reached in this endless and exciting process; in particular, a
sufficiently large agreement on semantics integration on the web has been reached.
There is actually no unique recipe for inserting “meaning descriptors” into the existing
web, but the main requirements to satisfy for the development of scalable and useful
semantic applications are quite clear. Researchers found that for an
effective inclusion of semantics on the current web, the meaning information should
be definable by people or machines potentially different from the content creators, and
the commonly agreed way to fulfill this requirement is the definition of entities called
“semantic annotations” pointing at the described resources. Consequently, several works
proposed techniques to provide semantic information through independent annotations,
offering services for annotation editing, storage and retrieval [49].
As those systems reached a significant diffusion in the academic world, some
problems were noticed; in particular, it became clear that manually annotating
the whole existing web was not feasible, so the subsequent evolution of the research
involved the design of automatic annotation platforms.
10.2 Multilingual issues
The words associated with each ontology concept, for each supported language, play
a crucial role in the process of automatic or semi-automatic annotation of web
resources. Simple “bag of words” techniques can use these words as a means to
automatically annotate a given document: the occurrence of a term in a document,
in fact, indicates that the document is probably related to the associated concept.
Other types of mapping rely on Natural Language Processing (NLP) [63] [53] and
adaptive information extraction [64]. All of them need a preliminary learning and
training phase; this phase depends on the scenario in which the mapping is applied,
on the ontology and on the corpus of documents to be classified (i.e., on the
significant terms).
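The bag-of-words mapping just described can be sketched in a few lines; the concept lexicons and the occurrence threshold below are invented for illustration and do not reproduce any specific annotation platform.

```python
# Sketch: each concept carries a set of lexical entities; a document is
# annotated with a concept when enough of that concept's words occur in it.
import re

LEXICON = {"Gastronomy": {"recipe", "cooking", "oven"},
           "Music": {"guitar", "concert", "melody"}}

def annotate(text, threshold=2):
    """Return the concepts whose lexicon overlaps the text enough."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(c for c, lex in LEXICON.items()
                  if len(words & lex) >= threshold)

doc = "A recipe for slow cooking vegetables in the oven."
annotations = annotate(doc)
```

Multilingual support then reduces to maintaining one such lexicon per supported language for each concept, while the concept layer itself stays language-independent.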
10.3 Semantic search and retrieval
by the early work on semantic search engines by Heflin and Hendler [71]. In their
approach, the user is allowed to perform queries by example, using data from
ontologies. Basically, the user is presented with a set of ontologies about different
domains. After selecting the ontology related to the domain in which the search is
to be performed, a tree navigation interface is provided to select concepts
and instances similar to the desired ones. The user can therefore select the resources
he is interested in, and a search template is automatically built from that selection.
Finally, the search is issued to the query subsystem, which provides the relevant results.
Figure 10.1. TreeViews: indented (A), nodes and arcs (B), TreeMap (C).
The main advantage of tree views is that they can be displayed with rather little
10.4 Ontology visualization
sub-classing and multiple inheritance, certain nodes and their sub-trees are cloned
and visualized multiple times, so that the number of crossed arcs can be reduced and
the readability enhanced. The duplicate nodes are displayed using an ad-hoc color
in order to avoid confusion. Unfortunately, this application does not support editing
and can only manage RDF data. Finally, the approaches and functionalities of
each of the mentioned applications are summarized in the following table.
View Scheme
Viewer Editor Network Hierch. Neighb. Hyperb.
Protégé × × ×
OntoViz × × × ×
Jambalaya × ×
TGViz × × ×
ezOwl × × ×
OntoEdit × × ×
Visualizer × ×
IsaViz × × ×
OntoRama ×
Table 10.1. Summary of ontology visualization tools.
10.5 Domotics
In the context of domotics, solutions for device connectivity are driven by local
commercial leaders. North America, Europe and Japan are three main areas oriented to
different wiring technologies and protocols, and the picture gets even more confused
when looking at the national level. However, three main approaches can be identified.
The installation of a specific and separate wired network is a common approach,
based either on proprietary solutions (EIB/KNX [80], LonWorks [81], etc.)
or on more widespread technologies (e.g., Ethernet, Firewire), the latter being preferred
for computer-related networks and wide-bandwidth requirements. On top of these,
various protocols are implemented (X10 [82], CEBus [83], BatiBus [84], HBS, just
to name a few) and none of them has yet prevailed on a global scale. The Konnex
open standard [85], involving EIB, EHS and BatiBus, is one of the major initiatives
in Europe aiming at global interoperability.
Another common approach is based on the reuse of existing wirings, such as
power or phone lines (the former being more frequently used as a carrier, as in EHS,
X10 and Homeplug [86]). However, these solutions generally present higher noise levels
than dedicated wirings and are therefore less versatile. Lastly, wireless technologies
are also becoming more and more attractive, mostly based on either infrared
technology or radio links (e.g., IEEE 802.11b or Bluetooth [87]), the latter usually
adopted to guarantee connectivity over longer distances.
As each of these alternatives is better suited to a given context, they are still
far from converging into a unique solution, also because of strong commercial
influences. Instead, in a home one can frequently find different technologies side by
side, such as Ethernet for multimedia connectivity and Bluetooth for personal
connectivity, while dedicated buses reliably support home automation tasks.
However, as processing complexity comes at increasingly lower costs, technologies
that are well established for common PCs are being transferred to numerous
devices. It is therefore not surprising that simple computer systems are being used
to bridge this interconnection gap, especially as far as ambient intelligence is concerned.
In this context, most of the efforts are confined to limited domains: each domotic
system vendor usually proposes intelligent approaches to the management of the
supported devices, and remote control is often also offered through the Internet or
via a phone call. However, the intelligence provided to the system is generally basic,
unless it is enhanced using a more complex and versatile device, such as a computer.
In effect, some programming software can be found, but it is generally oriented to
the control of one specific technology and offers little support for ambient intelligence
techniques (ETS [88], commercial; EIBcontrol [89], open source).
Furthermore, smart devices and information appliances (like PDAs, set-top boxes,
etc.) are not usually produced by domotic system vendors, and have only recently
started facing the intercommunication and standardization issues related to domotic
systems [90]. A “neutral” device capable of interfacing all these parts therefore
becomes desirable. In this context, the OSGi Alliance [91] has already proposed a
rather complete and complex specification for a residential gateway (and the like).
Here, protocol and integration issues, as well as intelligence-related behaviors, are
delegated to third-party “bundles” that can be dynamically plugged into the OSGi
framework at run-time. An additional effort is therefore required to enhance the
system with a compact yet general approach to intelligent and automatic
interoperability among the involved devices.
Other research efforts in the context of ambient intelligence include the GAS
(Gadgetware Architectural Style) project [92], which proposes a general architecture
supporting peer-to-peer networking among computationally enabled everyday objects,
and the event-based iRoom Operating System (iROS) [93], which is mainly focused on
communication among devices in a room and does not particularly consider the
possibility of a general and proactive interaction of the environment with the user.
The one.world project [94] offers a framework for building pervasive applications in
changing environments. Other researchers propose a fuzzy and distributed approach
to device interaction, exploiting a special rule-based system [95]: special-purpose
agents mutually collect, process and exchange information, although at the expense
of higher complexity, especially in terms of adaptation.
Chapter 11
Conclusions and Future works
11.1 Conclusions
The work presented in this thesis shall be seen as an engineering exercise rather
than as a work on the theoretical aspects involved in the Semantic Web, which are only
partially tackled in the previous sections. The motivation for such a “practical”
approach is to demonstrate to what extent the currently available technologies and
solutions can promote the adoption of semantic functionalities in today's web
applications. According to this vision, the work presented in these pages is a sort of
software engineering project that, starting from the analysis of web applications
which more or less explicitly deal with knowledge elaboration, exchange and
storage, extracts the requirements for a semantic platform readily usable in those
applications, with successful results.
These requirements clearly impose several constraints on the extent to which
semantics can be applied in this context. Available solutions are to be gathered and
harmonized in order to provide common middleware platforms, easily accessible
through the adoption of off-the-shelf technologies such as SOAP and Web
Services. H-DOSE, rather than being a complete, sound solution to this exercise, is
a first attempt to answer the demand for exploitable technologies that seems to
arise from the actors of the current Web.
What should be rather clear, after reading the presented work, is that the exercise
can be effectively solved and semantic functionalities can be introduced on the current
web. H-DOSE, in this sense, has the merit of practically providing a sample
solution that allows the aforementioned integration in today's web applications
such as content management systems, e-learning systems and search systems. Results,
in terms of performance, are sometimes preliminary or are still being
gathered; however, performance is not the main goal of the platform. The main goal
is to practically demonstrate that research in the Semantic Web field can be easily
exploited by the nowadays Web, although in a relatively low-power version. Performance
is important too, as demonstrated by the executed and on-going experiments;
however, web practitioners are rather aware of the difficulties involved in the
technology maturation process, and seem to pay much more attention to integration
issues than to the performance of a preliminary adoption. Over a long time frame,
performance will indeed make the difference between successful and failing solutions;
for now, however, integration and ready exploitability are the main concerns and the
main goals to be reached for spreading Semantic Web technologies on the real Web.
The approach presented in this thesis moves exactly in this direction and does
not aim at providing commercial-level solutions. It is still based on the traditional
search paradigm used by current keyword-based engines. Information, therefore, is
still seen as encoded into documents, and searches still target relevant documents
instead of relevant knowledge. Applying this paradigm greatly limits the
potential of semantics adoption, both in terms of user satisfaction and of providable
results. However, it has the advantage of overcoming the barriers that usually
prevent the adoption of new technologies in already deployed applications.
One of the main problems of the Semantic Web is, according to the literature, the
absence of a killer application. Nevertheless, as demonstrated by the Web itself,
killer applications are not the only way for a technology to be adopted. Permeation
is another channel through which the same result can be reached. The key to
permeation is ease of integration, the ability to readily use new technologies
without changing what is already offered to the users. The result is a sort of silent
invasion of the Web by semantic technologies. If permeation takes place, the
transition between the nowadays Web and the full Semantic Web will be nearly
unnoticeable, and will lead to the materialization of the powerful theories now being
developed by the research community. H-DOSE is designed with this idea in mind:
permeating the currently available solutions instead of substituting them, as a killer
application would. The small experiments presented in the previous chapters
show that the first factor for achieving permeation, i.e., ease of integration, can be
attained with the available semantic technologies; the next step is then to increase the
importance of this silent invasion, favoring the final exploitation of the full-powered
Semantic Web.
The approach adopted by H-DOSE, although promoting this idea of permeation,
also moves along the path traced by another consideration: “for the Semantic Web,
partial solutions will work and even if an agent will not be able to reach a human
level of understanding and thus will not be able to come to all conclusions that a
human might draw, the agent will still contribute to a Web far superior to the
current Web”.
The functionalities provided by the platform are clearly much less sophisticated
than what is now being investigated by SW research initiatives. However, it seems
rather unlikely that sophisticated semantic search paradigms and solutions will be
able to spread across the Web before simpler semantics is adopted, except in
very specific domains in which solutions and agreements on technological issues can
be defined, as in the Multi-Agent systems community.
Moreover, even when the full power of logic and inference can be applied, these
technologies often prove to be too rigid to deal with real-world applications where
uncertainty and contradictions exist. To tackle this issue, several research efforts
attempt to define a common framework for logical reasoning in the presence of
contradictory and inconsistent environments. However these solutions, although
effective as research elements, still appear too preliminary to be successfully
engineered into off-the-shelf products.
To conclude this section, some philosophical issues should be addressed, as they
still agitate the sea of Semantic Web researchers and will probably characterize
the various phases of SW exploitation. The ontology-based modeling of
real-world entities and situations, and of their relations, seems in fact to be too
rigid and too distant from the way humans approach the representation of the same
knowledge. A growing group of skeptics, working either inside or outside the
SW initiative, is now involved in a deep discussion of the foundational technologies
on which the Semantic Web is, or will be, built. The underlying idea is that
representing human knowledge, which is intrinsically fuzzy, uncertain and informal,
with a rigid, formal model is probably not a good solution, although at present it
seems more feasible than the alternatives.
Moving from this critical movement inside the SW, the idea of using “Conceptual
Spaces” [96] instead of ontologies is, for example, appealing since the
representation invented by Peter Gardenfors allows real objects and situations to be
modeled in a much more natural way. The notion of conceptual space is based on
the so-called “quality dimensions”. According to Gardenfors' definition, quality
dimensions are the mechanisms that allow the “qualities” of objects to be evaluated.
They correspond to the ways different stimuli are judged to be similar or different.
Introductory examples of quality dimensions include temperature, weight,
brightness, pitch and the three ordinary spatial dimensions: height, width and depth.
Dimensions are seen as the building blocks of the conceptual level. Without
going into too much detail (see [96] for further explanations), a conceptual space is
defined as a class of quality dimensions D1, ..., Dn, and a point in a conceptual
space is subsequently represented as a vector v = ⟨d1, ..., dn⟩ with one index for
each dimension. Each dimension is endowed with a certain topological or metrical
structure. As an example, the weight dimension is isomorphic to the half-line of
non-negative numbers, while the “taste” quality can be represented in a 4-dimensional
space whose components are generated by four different types of sensors: saline,
sour, sweet and bitter.
A natural concept in a conceptual space is then defined as a convex region of
the space. From this assumption several properties can be derived which justify the
adoption of such a theory as a semantic representation of the real world. However,
there is one main factor that still prevents a rapid development of applications of
conceptual spaces: the lack of knowledge about the relevant quality dimensions.
It is almost only for perceptual dimensions that psychophysical research has
succeeded in identifying the underlying geometrical and topological structures. For
example, we have only a confused understanding of how we perceive and conceptualize
things according to their shapes.
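The definitions above can be sketched numerically: each quality dimension is a numeric axis, a point is a vector, and categorizing a point by its nearest prototype partitions the space into Voronoi cells, which are convex regions as the theory requires. The dimensions and prototypes below are invented purely for illustration.

```python
# Sketch: points in a conceptual space are vectors over quality dimensions;
# nearest-prototype categorization induces convex (Voronoi) concept regions.
import math

DIMENSIONS = ("temperature", "weight", "brightness")

# Hypothetical concept prototypes, one point per concept
PROTOTYPES = {"hot_object": (80.0, 1.0, 0.5),
              "cold_object": (5.0, 1.0, 0.5)}

def distance(p, q):
    """Euclidean distance over the quality dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def categorize(point):
    """Assign a point to the concept with the nearest prototype."""
    return min(PROTOTYPES, key=lambda c: distance(point, PROTOTYPES[c]))

label = categorize((70.0, 2.0, 0.4))
```

The hard part, as noted above, is not this machinery but knowing which quality dimensions (and which metric on them) actually capture a domain.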
In addition to the criticisms about the representation of concepts, the idea of
ontologies being developed by every actor playing a role on the web appears a little
visionary. Redundancy is in fact likely to diverge even with very few ontologies, as
can already be seen for the available knowledge models, and the task of merging,
linking and sharing the enormous amount of differently modeled knowledge will soon
become infeasible. If it is already difficult to reach ontological agreements when only
a few people are involved, on a worldwide scale this is clearly impossible. SW people
usually rebut the above assertions by citing the current Web as a successful example
of this self-organization approach. Nevertheless, it must be said that the current
web is actually a completely heterogeneous repository of information where the
only standard agreements concern the format of published data, not its meaning or
metadata. This characteristic is actually the key to the Web's success but, at the
same time, it has posed severe problems for data exchange between organizations and
between software applications, as demonstrated by the several initiatives aimed at
solving this problem, XML being one example. Whenever the exchanged data must
also be understood, especially by machines, the wild reality of the web proves
too wild to allow effective knowledge manipulation. Agreements can certainly
be reached on formats, as happened for the nowadays Web (RDF/S and OWL are
examples), but reaching the same agreement on knowledge modeling seems quite
visionary. Clearly, effective solutions can be found for small, specific domains, but
the point is the scalability of this approach rather than its feasibility in “controlled”
scenarios.
Solutions will probably involve some “bottom-up” approach to the problem,
mimicking what humans do in their everyday interactions. It is likely that the
final solution will include some commonly agreed general model for basic facts
(almost all humans can recognize other humans without being specifically
instructed, for example) and some domain-specific knowledge defined autonomously
by people and shared between interacting parties, as happens, for example, when
two persons of different nationalities meet and must find a common knowledge base
on which to build their subsequent interaction. Independently of these future
11 – Conclusions and Future works
evolutions, which are hard to predict, it is by now clear that today's
“top-down” semantics can be fully exploited only in limited domains where
agreements on meaning can be reached and reasoning can deal with inconsistencies,
uncertainty and contradictions. A whole Semantic Web, although stimulating as a
vision, still seems far from being reached, and research efforts are still needed
to move in this exciting direction.
11.2 – Future Works
able to translate the interaction between users and domotic homes from the current
command-based paradigm to an objective-based one.
As far as theoretical research is concerned, the author and some of his colleagues
are now starting to work in the context of the so-called Semantic Desktop
initiative. The main objective is to migrate the technologies developed in the
wider context of the Semantic Web initiative to users' computer desktops, enabling
a more effective interaction between humans and machines in performing everyday
tasks. In particular, the work under research involves a semantic-based cataloging
system able to better organize files and directories on the user's machine,
allowing for their easier retrieval and modification. This semantic-powered file
search system will then be integrated with a semantic-based service composition
framework, currently developed by Alessio Bosca, a researcher working in the
e-Lite group, and applied to a project developed in collaboration with Tilab.
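The cataloging idea can be sketched, in a purely illustrative way, as an inverted index from ontology concepts to file paths; every name below is invented for the example and does not reflect the actual system under development:

```python
# Minimal sketch of semantic file cataloging: files are annotated with
# ontology concepts and retrieved by concept rather than by path or name.
# The catalog structure and function names are hypothetical.

catalog = {}  # concept -> set of annotated file paths

def annotate(path, *concepts):
    """Attach one or more concepts to a file."""
    for c in concepts:
        catalog.setdefault(c, set()).add(path)

def retrieve(concept):
    """Return all files annotated with the given concept, sorted."""
    return sorted(catalog.get(concept, set()))

annotate("/home/user/thesis.tex", "thesis", "semantic-web")
annotate("/home/user/dose.pdf", "semantic-web", "paper")
print(retrieve("semantic-web"))  # ['/home/user/dose.pdf', '/home/user/thesis.tex']
```

In a real semantic desktop the concepts would come from an ontology and retrieval could exploit subsumption (a query for a concept also returning files annotated with its sub-concepts), which a plain dictionary cannot express.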
The service composition will make it possible, as a vision, to accomplish complex
and coordinated tasks in a truly natural manner. For example, to send a fax, a
user will simply compose a pseudo-natural request such as “I want to send a fax
to Julie”. The machine will then provide a text editor for editing the fax
content. When the fax is complete, the semantic composition service will check
the availability of a fax device on the user's computer. Suppose that no such
device exists: in this case the semantic composition service will, for example,
look up a free fax service on the Web. Then, it will convert the text of the fax
into the correct format. During this conversion it will look up “Julie” in the
user's agenda and notice that two entries named “Julie” appear. Therefore, it
will ask the user for clarification and finally complete the fax sending process.
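The coordination logic of this scenario can be sketched as follows. This is a hedged illustration only: the function names, the agenda structure and the `choose` callback are all invented for the example and are not part of the framework under development.

```python
# Sketch of the envisioned fax scenario: resolve the recipient in the
# agenda, prefer a local fax device, fall back to a Web fax service, and
# delegate ambiguity resolution back to the user.

def send_fax(text, recipient, agenda, local_fax=None, web_fax=None, choose=None):
    """Coordinate the steps of the envisioned composition service."""
    matches = [c for c in agenda if c["name"] == recipient]
    if not matches:
        raise ValueError("no agenda entry for " + recipient)
    if len(matches) > 1:
        contact = choose(matches)  # two "Julie" entries: ask the user
    else:
        contact = matches[0]
    service = local_fax or web_fax  # prefer a local device, else a Web service
    return service(text, contact["fax"])

agenda = [
    {"name": "Julie", "fax": "+1-555-0101"},
    {"name": "Julie", "fax": "+1-555-0202"},
]
sent = []

def web_service(text, number):
    sent.append((text, number))
    return number

result = send_fax("Hello", "Julie", agenda,
                  web_fax=web_service,
                  choose=lambda options: options[1])  # user picks the second Julie
print(result)  # +1-555-0202
```

The point of the sketch is the control flow, not the plumbing: in the envisioned framework each step (device discovery, format conversion, agenda lookup) would be an independent semantic service composed at run time rather than a hard-coded function call.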
The last front on which the e-Lite group is moving concerns conceptual spaces.
In this context, the author is planning to experiment with a first, very simple
implementation of Peter Gärdenfors' ideas, to demonstrate the feasibility of the
approach. This work, differently from the others introduced above, is still in an
ideation phase in which available knowledge and solutions are being gathered, and
time for design and implementation has yet to be allocated.
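As a toy illustration of Gärdenfors' proposal (not the planned implementation), concepts can be represented as prototype points in a space of quality dimensions, with categorization performed by nearest-prototype assignment; the dimensions and prototype values below are invented for the example:

```python
# Toy conceptual space: concepts as prototypes in a space of quality
# dimensions; categorizing an observation means finding the nearest
# prototype, which induces a Voronoi tessellation of the space.
import math

prototypes = {  # hypothetical quality dimensions: (hue, size)
    "apple":  (0.9, 0.3),
    "cherry": (1.0, 0.1),
    "melon":  (0.4, 0.9),
}

def categorize(observation, prototypes):
    """Return the concept whose prototype is closest in Euclidean distance."""
    return min(prototypes,
               key=lambda c: math.dist(observation, prototypes[c]))

print(categorize((0.95, 0.15), prototypes))  # cherry
```

Even this minimal geometric reading already captures what symbolic ontologies struggle with, namely graded similarity between concepts, which is the motivation given earlier for looking at conceptual spaces at all.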
Appendix A
Publications
8. Dynamic Optimization of Semantic Annotation Relevance,
D. Bonino, F. Corno, G. Squillero – CEC2004, Congress on Evolutionary Com-
putation, Portland (Oregon), June 20-23, 2004 - (International Conference)