
Data Warehousing and Mining:
Concepts, Methodologies, Tools, and Applications

John Wang
Montclair State University, USA

Information Science Reference


Hershey New York
Acquisitions Editor: Kristin Klinger
Development Editor: Kristin Roth
Senior Managing Editor: Jennifer Neidig
Managing Editor: Jamie Snavely
Typesetter: Michael Brehm, Jeff Ash, Carole Coulson, Elizabeth Duke, Jamie Snavely, Sean Woznicki
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by


Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com/reference

and in the United Kingdom by


Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanonline.com

Library of Congress Cataloging-in-Publication Data

Data warehousing and mining : concepts, methodologies, tools and applications / John Wang, editor.
p. cm.
Summary: "This collection offers tools, designs, and outcomes of the utilization of data mining and warehousing technologies, such as
algorithms, concept lattices, multidimensional data, and online analytical processing. With more than 300 chapters contributed by over 575
experts from around the globe, this authoritative collection will provide libraries with the essential reference on data mining and
warehousing"--Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-59904-951-9 (hbk.) -- ISBN 978-1-59904-952-6 (e-book)
1. Data mining. 2. Data warehousing. I. Wang, John, 1955-
QA76.9.D343D398 2008
005.74--dc22
2008001934
Copyright 2008 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by
any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not
indicate a claim of ownership by IGI Global of the trademark or registered trademark.

British Cataloguing in Publication Data


A Cataloguing in Publication record for this book is available from the British Library.

Chapter 2.3
Rule-Based Parsing for Web Data Extraction
David Camacho
Universidad Carlos III de Madrid, Spain

Ricardo Aler
Universidad Carlos III de Madrid, Spain

Juan Cuadrado
Universidad Carlos III de Madrid, Spain

Copyright 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

Abstract

Building intelligent, robust applications that work with the information stored in the Web
is a difficult problem for several reasons that arise from the essential nature of the Web:
the information is highly distributed, it is dynamic (both in content and format), it is not
usually correctly structured, and the web sources will be unreachable at some times. To build
robust and adaptable web systems, it is necessary to provide a standard representation for
the information (i.e., using languages such as XML and ontologies to represent the semantics
of the stored knowledge). However, this is currently an open research field, and most web
sources do not provide their information in a structured way.

This chapter analyzes a new approach that allows us to build robust and adaptable web
systems by using a multi-agent approach. Several problems, including how to retrieve,
extract, and manage the stored information from web sources, are analyzed from an agent
perspective. Two difficult problems will be addressed in this chapter: designing a general
architecture to deal with the problem of managing web information sources; and how these
agents could work semiautomatically, adapting their behaviors to the dynamic conditions of
the electronic sources.

To achieve the first goal, a generic web-based multi-agent system (MAS) will be proposed and
applied to a specific problem: retrieving and managing information from electronic
newspapers. To partially solve the problem of retrieving and extracting web information, a
semiautomatic
web parser will be designed and deployed as a reusable software component. This parser uses
two sets of rules to adapt the behavior of the web agent to possible changes in the web
sources. The first one is used to define the knowledge to be extracted from the HTML pages;
the second one represents the final structure to store the retrieved knowledge. Using this
parser, a specific web-based multi-agent system will be implemented.

Introduction

The World Wide Web (Web) is an interesting and growing environment for different research
fields, e.g., Agents and multi-agent systems (see Balabanovic et al., 1995; Knoblock et al.,
2000), Information Retrieval (Baeza-Yates & Ribeiro-Neto, 1999; Jones & Willett, 1997),
Software Engineering (Petrie, 1996), etc. Over the past two decades, the evolution of the
Web, and especially the stored information that can be obtained from the connected
electronic sources, has led to an explosion of system development and research efforts.

However, the success of the Web could be its main pitfall: the enormous growth of the
information stored creates so many problems that building and maintaining a web application
is difficult. Currently, there is increasing interest in building systems which could reuse
the information stored in the Web (Fan & Gauch, 1999). To build these systems, several
problems need to be analyzed and solved, namely how to retrieve, extract, and reuse the
stored information.

Information extraction (see Freitag, 1998; Kushmerick et al., 1997) is a complex problem
because many of the electronic sources connected in the Web do not provide their information
in a standardized way. So, it will be necessary to use several types of specialized agents
(or any other type of application) to retrieve and extract the stored knowledge from the
HTML pages. Once this knowledge is extracted, it could be used by the other agents.

Several solutions for information extraction have been proposed. Some of the most popular
solutions, which have actually been implemented, are related to the Semantic Web
(Berners-Lee et al., 2001). Others use XML-based specifications (Bremer & Gertz, 2002) and
ontologies (Gruber, 1993) to represent, in a coherent way, the information stored in the
Web. In the near future, this approach will provide the possibility of building robust
distributed web applications. However, the Semantic Web is still evolving. So, if we wish to
build an application that could reuse the information, we need to use other approaches that
allow the system to extract the information.

The Wrapper approach (Sahuguet & Azavant, 1999) is one of the most widely used. It uses
wrappers (see Sahuguet & Azavant, 2001; Serafini & Ghidini, 2000) which allow access to the
Web as a relational database (see Ashish & Knoblock, 1997; Camacho et al., 2002c; Fan &
Gauch, 1999). Building those wrappers may be a complex task because, when the information
source changes, it is necessary to reprogram the wrappers as well. Several toolkits,
including W4F (Sahuguet & Azavant, 2001) and WrapperBuilder (Ashish & Knoblock, 1997), have
been deployed to help engineers build and maintain wrappers.

The main goal of this work is to search for mechanisms that allow for the design and
implementation of robust and adaptable multi-agent web systems. These mechanisms should also
integrate, as a particular skill of some specialized agents (web agents), the ability to
automatically filter and extract the available web knowledge. Toward this end, our approach
will use a semiautomatic web parser, or simply WebParser, that is deployed as a reusable
software component.

The WebParser is used by different web agents, and they can change its behavior by modifying
two sets of rules. The first set of rules is used by the agents to define the knowledge to
be extracted from the HTML pages (i.e., different agents can access different sources), and
the second set of rules is used to represent the final structure for
storing the knowledge that has been retrieved (so that any agent can adapt the extracted
knowledge). Finally, this parser will be used as a specific skill in several agents to build
a specific multi-agent web system (such as SimpleNews).

Generic Multi-Agent Web Architecture

Several authors have proposed multi-agent approaches in different domains to deal with web
information (see Camacho et al., 2002b; Decker et al., 1997; Knoblock et al., 2000). Some
general conclusions can be drawn from those works to describe a possible generic multi-agent
architecture, which could be used to implement adaptable and robust web systems. Figure 1
shows a schematic representation of this architecture.

Figure 1. Generic Web multi-agent based architecture

The architecture is built using a three-layer model. The functionality of those layers can
be summarized as follows:

• User - System Interaction. This layer usually provides a set of agents that is able to
  deal with the users. These agents (UserAgents, InterfaceAgents, etc.) could use different
  techniques, such as learning (see Howe & Dreilinger, 1997; Lieberman, 1995), to facilitate
  the communication between the users and the whole system. In the past few years, this
  interaction has sparked interest in Human-Computer Interaction (Lewerenz, 2000).
• User Agents - Task Agents. This layer is usually built by a set of specialized agents
  which achieve a specific goal. We call these specialized agents Task Agents, although
  different architectures refer to them as Middle Agents, Execution Agents, Planning or
  Learning Agents, etc. Several models of MAS require that the control agents necessary for
  the system to work correctly be in this layer (usually named as ANS agents, AMR, Control
  Agents, etc.). Characteristics of this layer, i.e., coordination, organization,
  cooperation, negotiation, etc., are widely
  studied (see Nwana, 1996; Rosenschein, 1985; Sycara, 1989).
• Middle Agents - Web Agents. This layer involves agents, such as Information Agents, Web
  Agents, SoftBots, Crawlers (Selberg & Etzioni, 1997), and Spiders (Chen et al., 2001),
  which specialize in accessing, retrieving, and filtering information from the Web. These
  types of agents retrieve entire pages or specific parts of those pages (usually the
  information belongs to <meta> tags). These agents are characterized by their ability to
  access different web servers, extract some information, filter that information, and
  finally store the retrieved document. These agents usually use one or more wrappers (see
  Kushmerick, 2000; Sahuguet & Azavant, 2001; Serafini & Ghidini, 2000) to wrap the web
  source and retrieve the available information.

The multi-agent approaches that deal with web information have several advantages and
disadvantages. They are summarized as follows:

Advantages of a Multi-Agent Approach:

• These systems have better adaptability when unexpected problems in web servers occur.
• It is easy to add new agents specialized in new web sources.
• The software maintenance is usually simpler than in traditional monolithic applications
  because the whole system can be split into several simple elements.
• These systems are more robust (have a better fault tolerance) because, if some of the
  agents are down, the whole system could still work.

Disadvantages of a Multi-Agent Approach:

• It is more complex to design the system than with a single-agent approach because it is
  necessary to design the different relations between the agents; new problems, such as
  coordination, control, or organization, need to be solved to obtain a complete operative
  system.
• These systems could cause new problems, e.g., the coordination or cooperation among the
  agents, that the engineer may need to solve. These new problems arise from the utilization
  of multi-agent techniques and could be avoided in a monolithic approach.
• The increasing number of elements (agents) involved in achieving the goal set by the user
  increases the number of communication messages between the agents. The communication
  process could be a serious obstacle to good performance by the whole system.

However, if the main goal is to obtain robust, adaptable, and fault-tolerant web-based
systems, we believe that the multi-agent based approach is a suitable one, which provides
many important advantages in obtaining the desired systems.

Characteristics to Implement Robust MAS-Web Systems

From the previous generic MAS architecture, three important characteristics (related to the
layers shown in Figure 1) need to be addressed to achieve the desired goal:

1. It is necessary to provide a flexible and user-friendly user agent to adapt the behavior
   of the system to the needs of the user.
2. The agent and multi-agent model used to implement the final system is a critical aspect.
   To build the system, it is possible to use several frameworks and toolkits, such as Jade
   (Bellifemine et al., 1999), JATLite (Petrie, 1996), ZEUS (Collis et al., 1998), etc.
   These frameworks allow the engineers to reuse libraries and agent templates to facilitate
   the design and implementation phases.
   The selection of the agent and multi-agent architecture will be an important aspect in
   the implementation of the system.
3. For any system that uses web sources, the problems of accessing, retrieving, filtering,
   representing and, finally, reusing this information need to be overcome.

This chapter addresses the latter characteristic. The first two characteristics will not be
analyzed. The process of knowledge extraction is difficult, but it is an essential
characteristic for any system that needs to solve problems using web knowledge.

WebParser: Semi-Automatic Web Knowledge Extraction

This section describes our approach to designing flexible and simple methods for information
extraction from web sources. We have designed and implemented a parser, named WebParser,
which, through the definition of several rules, can extract knowledge from HTML pages. If
any changes are produced in the web page, it will only be necessary to redefine the rules to
allow the parser to work correctly again. The utilization of rules creates flexibility and
adaptability for the web agents that may use this parser.

However, it is necessary to define what kind of knowledge can be extracted and filtered from
the available web pages. We will consider that all web pages can be roughly classified into
two knowledge categories:

1. Non-structured knowledge. The stored information in the page is represented using natural
   language, so it will be necessary to apply NLP (Natural Language Processing) techniques
   to allow the information extraction.
2. Semi-structured knowledge. It is possible to find, inside the page, a structure (e.g., a
   table or list) which stores the information by using some kind of marks to delimit the
   data (e.g., <table>, </table>, <ul>, </ul>, <ol>, </ol>... tags in HTML).

The WebParser proposed is a simple software module, which is specialized in the extraction
of knowledge stored in the second kind of pages. Therefore, the knowledge that the parser
extracts is the knowledge stored in a specific structure inside the web page.

WebParser Architecture

A parser can be defined as: A module, library or program that is able to translate an input
(usually a text file) into an output with an internal representation. The main goal of the
WebParser is to accept web pages and generate a data-output structure that contains the
filtered information.

Figure 2. Semi-automatic Web Parser architecture
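The distinction between non-structured and semi-structured pages can be illustrated with a
minimal sketch. It is written in Python for brevity (the WebParser itself is described as a
Java component) and is not the authors' implementation: the names StructureCounter and
classify_page are hypothetical, and treating <dl> as the container of a definition list is
an assumption of this example.

```python
from html.parser import HTMLParser

# Tags that open a table or list structure (assumption: <dl> wraps definition lists).
STRUCTURE_TAGS = {"table", "ul", "ol", "dl"}


class StructureCounter(HTMLParser):
    """Records top-level table/list structures and counts nested ones."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # current nesting depth inside structures
        self.top_level = []   # tag names of top-level structures, in page order
        self.nested = 0       # number of structures found inside another structure

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURE_TAGS:
            if self.depth == 0:
                self.top_level.append(tag)
            else:
                self.nested += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in STRUCTURE_TAGS and self.depth > 0:
            self.depth -= 1


def classify_page(html_text):
    """Return (category, top-level structures, nested structure count)."""
    counter = StructureCounter()
    counter.feed(html_text)
    category = "semi-structured" if counter.top_level else "non-structured"
    return category, counter.top_level, counter.nested
```

A page made only of prose is reported as non-structured, while any page with at least one
table or list is reported as semi-structured, which is the only kind of page the WebParser
is said to handle.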


The WebParser uses as input the HTML page to be filtered along with several sets of rules
that must be defined by the engineer to obtain the information.

Figure 2 shows the rule-based architecture for the semiautomatic web knowledge parser. The
definition of several rules allows the engineer to modify the behavior of the parser and
adapt it in a simple way. These rules will be used by the parser to represent the knowledge
to extract and the output structure to store the knowledge, respectively. Two main types of
rules must be defined:

• HTML-Rules. These rules define the type of knowledge to extract (tables, lists, etc.), and
  the position where this knowledge is inside the page.
• DataOutput-Rules. This set of rules defines the final data (output) structure that will be
  generated by the parser when the extraction process ends.

We have used the term semiautomatic because, once the engineer defines the two sets of rules
to describe the knowledge to be extracted, the rest of the processes are automatic. If the
page changes, or if we want to extract other knowledge inside the same page, it will only be
necessary to modify those rules. Several limitations and conditions have been considered in
the process of designing the parser. These can be summarized as:

1. The WebParser uses a set of predefined rules (special characters) which are used by the
   parser to preprocess the HTML page. These special characters (e.g., á, é, ..., ñ, etc.),
   which have their HTML representations (as: &aacute; &eacute; ..... &ntilde; etc.), are
   first translated into standard characters (e.g., a, e, n, etc.) to avoid possible
   problems in the extraction process.
2. These sets of rules are written by the engineer, and those rules will be stored in text
   files to facilitate their modification.
3. Only the following types of web pages are currently parsed:
   • Web pages which contain one or more tables (<table>...</table>).
   • Web pages which contain one or more lists. It is possible to extract information from
     unordered (<ul>...</ul>), ordered (<ol>...</ol>), and definition (<dt>...</dt>) lists.
   • Web pages that contain nested structures built using tables and/or lists.
4. The code and final implementation of the WebParser will be written in the Java language
   to obtain portable and reusable software. This implementation decision was made after
   taking into account that our main goal is to integrate this software into web agents.
   Currently, Java is a suitable and very popular language used by a large number of
   researchers and companies to implement their agent-based and web applications.

Definition of the Sets of Rules in the WebParser

From the architecture designed for the WebParser (shown in Figure 2), it is necessary to
provide two different rules to extract the information from a given page.

HTML-Rules. Although it is possible to define different rules, the WebParser uses a specific
HTML-Rule for filtering each page. This rule is used to select what structures will be
filtered from the page. These filtering rules have two attributes:

• Type. This attribute tells the parser what type of structure will be filtered. Only the
  values list and table are allowed.
• Position. If the web page stores several structures (tables, lists, etc.), this attribute
  is used to locate which of those structures is the target of the extraction process. If
  there are nested structures, we can use the dot (.) to locate the exact position of the
  structure, i.e., struc1.struc2.strucj represents that information stored
  in the j-th structure, which is nested two levels deep, will be extracted.

DataOutput-Rules. These rules define the output data structure and what knowledge will be
extracted from the page. Only one of those rules (as in the HTML-Rules) is used for every
page. These rules are built using the following attributes:

• Data Level. This attribute shows where the data is located within the structure.
• Begin-mark/End-mark. Once the cells that store the data are fixed (using the previous
  attribute), it is necessary to set the begin and end patterns which are used to enclose
  the data. For instance, when the data is stored in a table (between the tags <td> and
  </td>, as in <td>...data...</td>), it is possible to use the symbol <td> as begin-mark to
  show that the string that represents the data begins from this symbol (it is possible to
  use any string to indicate the beginning and ending of the pattern).
• Attribute-name. Once the information is selected, it is necessary to provide the names of
  the attributes that will be associated with the retrieved information.
• Attribute distribution. This attribute shows how the attribute names are distributed when
  the structure to be filtered is a table. In this situation, we have a horizontal (Table 2
  inside the third structure in Figure 3) or a vertical (Table 3 inside the third structure
  in the example) distribution in the tables. This attribute could have a null value if the
  structure does not have any attribute name (e.g., a table with only numerical
  information). If the structure to filter is a list, the value will be null because no
  distribution is necessary for the parser (the different items retrieved will be stored in
  a Java vector).
• Data types. The predefined type of any attribute or data extracted is String. However, the
  parser can extract other types of data such as: integer (int), float (flo), doubles
  (doub), etc. The WebParser will cast the extracted string into the desired type of data.
• Data structure. Finally, it is necessary to provide the final data output structure that
  the parser will generate. It can be either a vector or a table. It is possible to select a
  horizontal table (tableh: the attributes will be put in the first row and the data in the
  next rows) or a vertical table (tablev: the attributes will be put in the first column and
  the data in the next columns). If the extracted information is a list, it will be stored
  in a vector.

Figure 3 shows an example of a simple web page (and its related HTML code) that stores three
different structures: a simple unordered list, a table, and a nested structure which
combines lists and tables recursively. For instance, if we wish to extract only the second
simple table and the ordered list stored in the third structure (which is nested inside a
table, which is in turn inside a list) from the web page, it will only be necessary to
define the rules (HTML and DataOutput) shown in Figure 4. The attributes shown in these
rules are used by the parser as follows:

• HTML-Rule (a) describes that the structure to extract is a table, and that it is the
  second (position = 2) structure stored in the page. DataOutput-Rule (a) shows that the
  data is in the second cell of the table, and that the <td> tag is used as begin-end
  pattern. The names of the attributes are provided in that order to the parser. So, if the
  distribution of the attributes in the table is horizontal/vertical (distrib=hv), we will
  first indicate the names of the attributes in the rows ({att1,1+att1,2+att1,3}) and then
  name the attributes in the columns ({att2,1+att3,1}). The data type to retrieve will be
  String values and, finally, the WebParser will generate a table (data struc= tablehv) to
  store the retrieved data.
as: integer (int), float (flo), doubles (doub),


Figure 3. Web page example and HTML code with several types of structures
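Since the page of Figure 3 is not reproduced here, the dotted position notation can be
illustrated on a small stand-in page. The following Python sketch is only one plausible
reading of that notation (each number selects the k-th structure, counting from 1, among
the structures nested directly inside the previous one); the page content, the class
StructureTree, and the function locate are hypothetical, and the actual WebParser is a Java
component whose internals are not shown in this chapter.

```python
from html.parser import HTMLParser

STRUCTURE_TAGS = {"table", "ul", "ol", "dl"}  # assumption: <dl> wraps definition lists


class StructureTree(HTMLParser):
    """Builds a tree of table/list structures, keeping the text found in each one."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "page", "children": [], "text": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURE_TAGS:
            node = {"tag": tag, "children": [], "text": []}
            self.stack[-1]["children"].append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in STRUCTURE_TAGS and len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Text is attached to the innermost open structure; text outside any is ignored.
        if data.strip() and len(self.stack) > 1:
            self.stack[-1]["text"].append(data.strip())


def locate(html_text, position):
    """Follow a dotted position such as '3.2' down the structure tree."""
    tree = StructureTree()
    tree.feed(html_text)
    node = tree.root
    for index in position.split("."):
        node = node["children"][int(index) - 1]  # positions are 1-based
    return node


# Hypothetical page: a list, a table, and a list that nests a table and an ordered list.
PAGE = """
<ul><li>alpha</li><li>beta</li></ul>
<table><tr><td>10</td><td>20</td></tr></table>
<ul>
  <li><table><tr><td>x</td></tr></table></li>
  <li><ol><li>1</li><li>2</li></ol></li>
</ul>
"""

second_table = locate(PAGE, "2")    # the second top-level structure
nested_list = locate(PAGE, "3.2")   # second structure nested in the third one
```

With this reading, a rule that changes only its position value is enough to retarget the
extraction when a page gains or loses a structure.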

• HTML-Rule (b) describes that the structure to extract is a list which is stored inside the
  third structure (position=3). DataOutput-Rule (b) shows that the data is stored in the
  second cell of the table, which is stored in third position in the list (data
  Level=3.3.2), and that the <li> tag can be used as the begin-end pattern. There are no
  names associated with the data to retrieve (attrib={null}), and no distribution of them is
  necessary. The data type to retrieve will be integer values and, finally, the WebParser
  will generate a vector (data struc=sortlist) to store the retrieved data.

The output of the WebParser is a Java object (vector or tables), so this output can be
modified by the agent as needed.

Deploying a Web-MAS Using the WebParser

The WebParser has been implemented as a reusable Java software component. This allows us to:

• Modify, in a flexible way, the behavior of the parser by changing only the rules.
• Integrate this component, as a new skill, in a specialized web agent.

We have deployed a Java application from this WebParser to test different rules retrieved
from the selected web pages. This allows the engineer to test the behavior of the parser
before it is integrated as a new skill in the web agent.

We used a simple model to design our web agents. This model allows us to migrate the
designed agent to any predefined architecture that
Figure 4. HTML and DataOutput-Rules to extract the information stored in the selected structures

Rule (a): table                          Rule (b): list

- HTML rule:                             - HTML rule:
    type= table                              type= list
    position= 2                              position= 3
- DataOutput rule:                       - DataOutput rule:
    data Level= 2                            data Level= 3.3.2
    begin-mark= <td>                         begin-mark= <li>
    end-mark= </td>                          end-mark= </li>
    attrib= {att1,1+att1,2+att1,3}           attrib= {null}
            {att2,1+att3,1}
    distrib= hv                              distrib= (null)
    data type= (str)                         data type= (int)
    data struc= tablehv                      data struc= sortlist
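Because the rules are kept in text files, a web agent needs to load them before calling the
parser. The following Python sketch shows one way such a file could be read. It assumes,
hypothetically, that a rule file mirrors one column of Figure 4 (a "- ... rule:" header
followed by "key= value" lines); the chapter does not specify the real file syntax, and
parse_rules is an illustrative name, not part of the WebParser API.

```python
def parse_rules(text):
    """Parse rule text laid out like one column of Figure 4 into nested dictionaries."""
    rules = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-") and line.endswith(":"):
            # Section header such as "- HTML rule:" opens a new group of attributes.
            section = line.lstrip("- ").rstrip(":").strip()
            rules[section] = {}
        elif "=" in line and section is not None:
            # Attribute line such as "begin-mark= <li>"; split on the first "=" only.
            key, value = line.split("=", 1)
            rules[section][key.strip()] = value.strip()
    return rules


# Rule (b) of Figure 4, written out as it might appear in a rule file (assumed layout).
RULE_B = """
- HTML rule:
    type= list
    position= 3
- DataOutput rule:
    data Level= 3.3.2
    begin-mark= <li>
    end-mark= </li>
    attrib= {null}
    distrib= (null)
    data type= (int)
    data struc= sortlist
"""

rules = parse_rules(RULE_B)
```

All values are kept as strings here; interpreting them (for example, splitting the dotted
data level or mapping the data type code to a cast) is left to the code that applies the
rule.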

will ultimately be used to deploy the multi-agent web system. Figure 5 shows the model that
defines a basic web agent using the following modules:

• Communication module. This module defines the protocols and languages used by the agents
  to communicate with other agents in the system [i.e., KQML (Finin et al., 1994) or
  FIPA-ACL (FIPA.org, 1997)].
• Skill: wrapper. This basic skill must be implemented by every agent that wishes to
  retrieve information from the Web. This module needs to be able to do the following tasks:
  access the web source automatically; retrieve the HTML answer; filter the information;
  and, finally, extract the knowledge.
• Control. This module coordinates and manages all the tasks in the agents.

Figure 5. Architecture for a Web agent

To correctly integrate the WebParser in the web agent, or to change the current wrapper
skill if the agent is already deployed, it will be necessary to adapt this agent's
functionality to the behavior of the software component. To achieve successful migration to
the WebParser, it will be necessary to change or modify the following:

• The processes which are used by the agent to access the information source and to extract
  the knowledge (Automatic Web Access and Parser modules, respectively).
• The answers retrieved by the agent (HTML pages) will be provided as input to the parser
  (along with the rules defined by the engineer).

Once the web agent is correctly designed, the integration of the WebParser only requires
defining the two sets of rules analyzed in the previous section, and then using the API
provided with the WebParser to correctly execute this software module.

SimpleNews: A Metasearch System for Electronic News

SimpleNews (Camacho et al., 2002a) is a meta-search engine that, by means of several
specialized and cooperative agents, searches for news in a set of electronic newspapers.
SimpleNews uses a very simple topology (as Figure 6 shows), where all of the web agents
solve the queries sent by the UserAgent. The motivation for designing and implementing
SimpleNews was to obtain a
web system that could be used to evaluate and compare empirically different multi-agent
frameworks in the same domain. Currently, SimpleNews has been implemented using the Jade,
JATLite, SkeletonAgent (Camacho et al., 2002b), and ZEUS frameworks.

The SimpleNews engine uses a set of specialized agents, each of which retrieves information
from a particular electronic newspaper. SimpleNews can retrieve information from the
selected electronic sources, filter the different answers from the specialized agents, and
show them to the user. As Figure 6 shows, the architecture of SimpleNews can be structured
in several interconnected layers:

• UserAgent Interface. This agent only provides a simple Graphical User Interface to allow
  users to make requests for news from the selected electronic papers. SimpleNews uses a
  UserAgent that provides a simple graphical user interface for making queries, setting the
  number of solutions requested, and selecting the agents that will be consulted. The
  interface used by this agent allows the user to know: the current state of the agents
  (active, suspended, searching or finished) and the messages and contents sent between the
  agents. Finally, all answers retrieved by the agents are analyzed (only distinct answers
  are taken into account) and the UserAgent builds an HTML file, which is subsequently
  displayed to the user.
• Control Access Layer. Jade, JATLite, or any other multi-agent architecture needs to use
  specific agents to manage, run or control the whole system (AMS, ACC, DF in Jade, or AMR
  in JATLite). This level represents the set of necessary agents (for the architecture
  analyzed) that will be used by SimpleNews to work correctly. This layer accounts for the
  differences between the two versions of SimpleNews that are implemented (one on the Jade
  framework and one on JATLite).


Figure 6. SimpleNews architecture
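The topology of Figure 6, in which one UserAgent queries every web agent and merges their
answers, can be sketched in a few lines. This Python sketch is a simplification under
stated assumptions: the agents call each other directly in one process, whereas the real
system exchanges messages through a framework such as Jade or JATLite; the classes WebAgent
and UserAgent, the stub wrappers, and the agent names other than www-ElPais are
hypothetical.

```python
class WebAgent:
    """A web agent whose wrapper skill is a pluggable callable: query -> list of headlines."""

    def __init__(self, name, wrapper):
        self.name = name
        self.wrapper = wrapper

    def search(self, query):
        return self.wrapper(query)


class UserAgent:
    """Sends the query to every web agent and merges the distinct answers."""

    def __init__(self, web_agents):
        self.web_agents = web_agents

    def ask(self, query, max_results):
        seen, merged, failed = set(), [], []
        for agent in self.web_agents:
            try:
                answers = agent.search(query)
            except Exception:
                # One unreachable source must not stop the whole search.
                failed.append(agent.name)
                continue
            for headline in answers:
                if headline not in seen:  # only distinct answers are kept
                    seen.add(headline)
                    merged.append(headline)
        return merged[:max_results], failed


def server_down(query):
    """Stub wrapper that simulates an unreachable newspaper."""
    raise ConnectionError("server down")


agents = [
    WebAgent("www-ElPais", lambda q: ["Headline A", "Headline B"]),
    WebAgent("www-ElMundo", server_down),
    WebAgent("www-Marca", lambda q: ["Headline B", "Headline C"]),
]
merged, failed = UserAgent(agents).ask("bush", max_results=5)
```

Because the wrapper is passed in as a callable, replacing a hand-written wrapper with a
rule-driven parser would not require changes to the rest of the agent, which is the kind of
substitution the integration steps in this section describe.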

• Web Access Layer. Finally, this layer represents the specialized web agents that retrieve
  information from the specific electronic sources in the Web.

The meta-search engine includes a UserAgent and six specialized web agents. The specialized
web agents can be classified into the following categories:

1. Financial information. Two web agents have been implemented that specialize in financial
   newspapers: Expansion (http://www.expansion.es) and CincoDias (http://www.cincodias.es).
2. Sports information. Two other web agents specialize in sports newspapers: Marca
   (http://www.marca.es) and Futvol.com (http://www.futvol.com).
3. General information. Finally, two more web agents have been implemented to retrieve
   information from generic newspapers: El Pais (http://www.elpais.es) and El Mundo
   (http://www.elmundo.es).

The selected electronic sources are Spanish to allow a better evaluation of the retrieval
process. It is difficult to evaluate the performance of a particular web agent when using a
query in a different language. Another reason is that most of those sources are widely used
in Spain, so the information stored in them should be enough to test an MAS built with those
web agents.

From all the available versions of SimpleNews, we selected the Jade 2.4 version for several
reasons:

• This framework provides an excellent API, and it is easy to change or modify some Java
  modules of the agents.
• The performance evaluation shows us that the system has good fault tolerance.
• It is a multi-agent framework widely used in this research field. So, more researchers can
  analyze the possible advantages of integrating this module into their agent-based or
  web-based applications.

WebParser Integration into SimpleNews

This section provides a practical example which shows how the WebParser could be integrated
into several web agents that belong to a deployed
Figure 7. HTML and DataOutput-Rules to extract the headlines from the Web page request

- HTML rule:
type= table
position= 1
- DataOutput rule:
data Level= 1.2
begin-mark= <td><b>
end-mark= </b></td>
attrib= {null}
distrib= (null)
data type= (str)
data struc= sortlist
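To make the effect of such a rule concrete, the following Python sketch applies the
begin-mark, end-mark, and data type of Figure 7 to a small results page. It is a
hypothetical illustration, not the WebParser code (which is a Java component): the sample
HTML and headlines are invented, the type, position, and data Level attributes are ignored
so that the marks are matched against the whole page, and sortlist is assumed to mean a
plain list in page order. The helper to_plain mimics the preprocessing step that turns
characters such as &ntilde; into plain letters.

```python
import html
import unicodedata


def to_plain(text):
    """Decode HTML character references, then drop accents (e.g. &ntilde; -> n)."""
    decoded = html.unescape(text)
    decomposed = unicodedata.normalize("NFKD", decoded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def extract_between(page, begin_mark, end_mark):
    """Return every substring enclosed by begin_mark ... end_mark, in page order."""
    items, start = [], 0
    while True:
        begin = page.find(begin_mark, start)
        if begin == -1:
            break
        begin += len(begin_mark)
        end = page.find(end_mark, begin)
        if end == -1:
            break
        items.append(page[begin:end].strip())
        start = end + len(end_mark)
    return items


# Type codes as written in the rules; Python's float already has double precision.
CASTS = {"(str)": str, "(int)": int, "(flo)": float, "(doub)": float}


def apply_output_rule(page, rule):
    """Extract the marked strings and cast them to the type named in the rule."""
    cast = CASTS[rule["data type"]]
    return [cast(to_plain(item)) for item in extract_between(page, rule["begin-mark"], rule["end-mark"])]


# Hypothetical results page; the real newspaper markup is not reproduced in the chapter.
RESULT_PAGE = """
<table>
  <tr><td><b>Bush visita Espa&ntilde;a</b></td></tr>
  <tr><td><b>Reuni&oacute;n en Madrid</b></td></tr>
</table>
"""

FIGURE_7_RULE = {
    "begin-mark": "<td><b>",
    "end-mark": "</b></td>",
    "data type": "(str)",
    "data struc": "sortlist",
}

headlines = apply_output_rule(RESULT_PAGE, FIGURE_7_RULE)
```

If the newspaper changed its markup, for example by wrapping headlines in a different tag,
only the two marks in the rule would need to change, not the extraction code.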

MAS (SimpleNews). The following steps must be taken by the engineer to replace the current skill (wrapper) in the selected agent:

1. Analyze the current wrapper used by the web agent. Then, identify which modules (or classes) are responsible for the extraction of the information.
2. Analyze the web source. It will be necessary to generate the set of rules that extract the information and produce it in the appropriate format. The WebParser provides a simple Java object for storing the extracted information. If a more sophisticated structure is necessary, the engineer may need to program a method that translates these objects into the internal representation of the agent.
3. Replace the current wrapper module with the WebParser and use the tested rules as input to the parser.
4. Test the web agent. If the integration is successful, the behavior does not change in any of the possible situations (information not found, server down, etc.) managed by the agent.

For instance, assume that we want to integrate the WebParser into the specialized web agent www-ElPais that belongs to SimpleNews. The method of achieving the previous steps is outlined below:

1. to the one shown in Figure 5. The wrapping process is achieved by several Java classes that belong to a specialized package (agent wrapper).
2. The web source is shown in Figure 8. The figure on the left shows the answer (using the query: bush) to the search engine used by this electronic newspaper. The figure on the right shows the HTML request. It is interesting to see how the information is stored in a nested table (our system only retrieves news headlines). Figure 7 shows the HTML and DataOutput rules necessary to extract the information.
3. Once the previous rules have been tested (using several pages retrieved from the information sources) and the different situations have been considered, the classes or package identified in the first step are replaced by the WebParser and the related rules.
4. Finally, the web agent is tested with tests that have been used previously, and the results are compared.

These two rules will be stored in two different files, which will be used by the WebParser when the wrapping skill of the agent is used to extract the knowledge in the source request. The final integration of the WebParser will be achieved by the engineer through exchanging


Figure 8. Web page example and HTML code provided by www.elpais.es
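
The integration described in this section reduces the agent's wrapper classes to a single call that receives the rule files and the page, plus an optional translation into the agent's internal representation. The sketch below illustrates that shape in Python; the WebParser itself is a Java module, and its real method names and signature are not given in the text. `NewsItem`, `WrapperSkill`, the `ParserFn` signature, and the rule file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NewsItem:
    # Hypothetical internal representation used by the web agent.
    source: str
    headline: str


# Assumed parser signature: (html_rule_file, output_rule_file, page) -> strings.
ParserFn = Callable[[str, str, str], List[str]]


class WrapperSkill:
    """Wrapping skill that delegates extraction to a rule-driven parser."""

    def __init__(
        self,
        source: str,
        html_rule_file: str,
        output_rule_file: str,
        parser: ParserFn,
    ) -> None:
        self.source = source
        self.html_rule_file = html_rule_file
        self.output_rule_file = output_rule_file
        self.parser = parser

    def wrap(self, page: str) -> List[NewsItem]:
        # The single method invocation that replaces the old wrapper classes.
        raw = self.parser(self.html_rule_file, self.output_rule_file, page)
        # Translate the parser's simple output into the agent's own objects.
        return [NewsItem(self.source, headline) for headline in raw]
```

Because the rule file names are plain parameters, adapting the agent to a changed source would mean editing the rule files rather than the skill class.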

the current Java classes in the agent for a simple method invocation with several parameters (like the name of the rules and the page to be filtered).

Verification and Test

This section shows how complex it is to integrate the WebParser into a deployed system. To do so, the previous process was repeated for every agent in the selected version of SimpleNews. This version uses six web agents, specialized in three types of news. Two groups of three different agents were made and were modified by different programmers. We have evaluated seven phases:

1. Architecture Analyses. This stage is used by the engineer to study and analyze the architecture of the implemented MAS.
2. WebParser Analyses. It is necessary to study the API (http://scalab.uc3m.es/~agente/Projects/WebParser/API) provided by the WebParser to correctly integrate the new software module.
3. Web Source Analyses. The possible answers and requests from the web source are analyzed.
4. Generate/Test Rules. The HTML and DataOutput Rules are generated by the engineer. Using the WebParser application, the engineer tests the rules, using as examples some of the possible HTML requests from the web source.
5. Change Skill. When the rules work correctly, the current Java classes are replaced by the WebParser.
6. Test Agent. The agent with the integrated parser is tested in several possible situations.
7. Test Multi-Agent System. Finally, all the web agents are tested. If the integration process is successful, then the new system should not present any differences.

Table 1 shows the average measures obtained by two students who used the deployed system to change the skill of all the web agents. Each programmer modified three agents (one of each type), and the table shows the average effort for


Table 1. Average time (hours) for each integration phase
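
The Test Agent phase checks that the agent behaves the same before and after the skill change, including in failure situations such as information not found or server down. One way to sketch such a check, assuming both wrappers can be called on saved pages, is shown below in Python. `compare_wrappers` and the `Wrapper` type are hypothetical and are not part of SimpleNews or the WebParser.

```python
from typing import Callable, Dict, List, Tuple

# Assumed shape of a wrapper: takes a page, returns the extracted headlines.
Wrapper = Callable[[str], List[str]]


def _run(wrapper: Wrapper, page: str) -> Tuple[str, object]:
    """Run a wrapper and capture either its result or its exception type."""
    try:
        return ("ok", wrapper(page))
    except Exception as exc:  # a failure is also a behavior to compare
        return ("error", type(exc).__name__)


def compare_wrappers(
    old: Wrapper, new: Wrapper, pages: Dict[str, str]
) -> Dict[str, bool]:
    """For each saved page, report whether both wrappers behave identically."""
    return {
        name: _run(old, page) == _run(new, page)
        for name, page in pages.items()
    }
```

Any page name mapped to `False` points at a situation in which the rule-based skill diverges from the original wrapper and the rules need another look.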

each integration phase. It is important to note that the order of modifying the different agents is shown in the table; the General Information agents (www-ElPais, www-ElMundo) were modified first. The table shows that once the multi-agent system has been analyzed (and the functionality and software modules of the agents are properly understood by the engineers), changing the current wrapping skill required only a few hours for the first agent. When this process is successfully implemented, the next agents only need about one hour to build the rules, to change the skill and to test the new agents. This average time is measured over deployed agents (so it is necessary to change the implemented modules). If we are building new agents from scratch or using an MAS framework, it is not necessary to implement the Change Skill phase, so the implementation of a wrapper agent could take about 30 minutes (the wrapper skill). This means that the time and effort needed to program the agent (and to reprogram these classes when the sources change) is greatly reduced.

Conclusions and Future Work

The main contributions of this chapter can be summarized as follows:

- Our approach tries to obtain reusable and portable software that could be used by different web agents to extract knowledge from specific web sources.
- The utilization of rules provides two advantages: the flexible modification of the parser behavior when the source changes, and easy reuse of well-tested rules for similar web sources.
- Once the API of the WebParser is analyzed by the programmer, it is easy to use it as a new skill module inside the web agent. This could improve the implementation of web-based multi-agent systems and gathering systems.

Currently, we have implemented an initial version of the WebParser, and have integrated it into several web agents that belong to a simple multi-agent web system.

However, there are several important points that will be addressed to obtain fully portable and reusable software for extracting web knowledge. These points can be summarized as:

- To study the flexibility of the rules that can be defined in the WebParser. Is it possible to extract other types of stored knowledge with this simple representation?
- To study other agent-based and multi-agent technologies and frameworks that are currently used by different researchers


and companies, including ZEUS (Collis et al., 1998) and JATLite (Petrie, 1996), and to see if it is possible to implement web or wrapper agents which integrate the parser inside the agents implemented with those technologies.

References

Ashish, N. & Knoblock, C. A. (1997). Semi-automatic wrapper generation for Internet information sources. Second IFCIS Conference on Cooperative Information Systems (CoopIS), Charleston, South Carolina.

Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Boston, MA: Addison-Wesley Longman.

Balabanovic, M., Shoham, Y., & Yun, T. (1995). An adaptive agent for automated web browsing.

Bellifemine, F., Poggi, A., & Rimassa, G. (1999). Jade - A FIPA-compliant agent framework. In Proceedings of the Conference on Practical Applications of Agents and Multi-Agents (PAAM99), London (pp. 97-108).

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American.

Bremer, J. M. & Gertz, M. (2002). Xquery/ir: Integrating XML documents and data retrieval.

Camacho, D., Aler, R., Castro, C., & Molina, J. M. (2002). Performance evaluation of Zeus, Jade and skeleton agent frameworks. In Proceedings of the IEEE Systems, Man, and Cybernetics Conference (SMC-2002), Hammamet, Tunisia. New York: IEEE Press.

Camacho, D., Molina, J. M., Borrajo, D., & Aler, R. (2002). Mapweb: Cooperation between planning agents and web agents. Information & Security: An International Journal, Special issue on Multi-agent Technologies, 8.

Camacho, D., Molina, J. M., Borrajo, D., & Aler, R. (2002). Solving travel problems by integrating web information with planning. In Proceedings of the 13th International Symposium on Methodologies for Intelligent Systems (ISMIS 2002), Lyon, France. Berlin: Springer-Verlag.

Chen, H., Chung, Y., Ramsey, M., & Yang, C. (2001). A smart itsy bitsy spider for the Web. Journal of the American Society for Information Science.

Collis, J., Ndumu, D., Nwana, H. S., & Lee, L. (1998). The Zeus agent building tool-kit. The British Telecom Technology Journal, 16(3), 60-68.

Decker, K., Sycara, K. & Williamson, M. (1997). Middle-agents for the Internet. Proceedings of the 15th International Joint Conference on Artificial Intelligence, Nagoya, Japan.

Fan, Y. & Gauch, S. (1999). Adaptive agents for information gathering from multiple, distributed information sources. Proceedings of the 1999 AAAI Symposium on Intelligent Agents in Cyberspace, Stanford University, California.

Finin, T., Fritzson, R., Mackay, D. & McEntire, R. (1994). KQML as an agent communication language. In Proceedings of the 3rd International Conference on Information and Knowledge Management (CIKM94), Gaithersburg, Maryland (pp. 456-463). New York: ACM Press.

FIPA.org. (1997). Agent communication language. Foundation for Intelligent Physical Agents. (Technical Report).

Freitag, D. (1998). Information extraction from HTML: Application of a general learning approach. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98). Menlo Park, CA: AAAI Press.

Gruber, T. R. (1993). A translation approach to portable ontologies. Knowledge Acquisition, 5(2), 199-220.


Howe, A. E. & Dreilinger, D. (1997). Savvysearch: A metasearch engine that learns which search engines to query. AI Magazine, 18(2), 19-25.

Jones, K. S. & Willett, P. (1997). Readings in Information Retrieval. San Francisco, CA: Morgan Kaufmann.

Knoblock, C. A. et al. (2000). The Ariadne approach to web-based information integration. International Journal on Cooperative Information Systems (IJCIS) Special Issue on Intelligent Information Agents: Theory and Applications.

Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2), 15-68.

Kushmerick, N., Weld, D. S., & Doorenbos, R. B. (1997). Wrapper induction for information extraction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI97) (pp. 729-737).

Lewerenz, J. (2000). Automatic generation of human-computer interaction in heterogeneous and dynamic environments based on conceptual modelling.

Lieberman, H. (1995). Letizia: An agent that assists web browsing. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI95) (pp. 924-929).

Nwana, H. S., Lee, L. C. & Jennings, N. R. (1996). Coordination in software agent systems. The British Telecom Technical Journal, 14(4), 79-88.

Petrie, C. (1996). Agent-based engineering, the Web, and intelligence. IEEE Expert, 11(6), 24-29.

Rosenschein, J. (1985). Rational interaction: Cooperation among intelligent agents. Stanford, CA: Stanford University. (PhD Thesis).

Sahuguet, A. & Azavant, F. (1999). Building lightweight wrappers for legacy web data-sources using W4F. Proceedings of the International Conference on Very Large Databases (VLDB).

Sahuguet, A. & Azavant, F. (2001). Building intelligent web applications using lightweight wrappers. Data & Knowledge Engineering, 36(3), 283-316.

Selberg, E. & Etzioni, O. (1997, January/February). The metacrawler architecture for resource aggregation on the Web. IEEE Expert, 8-14.

Serafini, L. & Ghidini, C. (2000). Using wrapper agents to answer queries in distributed information systems. Proceedings of the 4th International Conference on Multiagent Systems.

Sycara, K. P. (1989). Multiagent compromise via negotiation. In Distributed Artificial Intelligence (vol. II, pp. 119-138). San Mateo, CA: Morgan Kaufmann.

This work was previously published in Intelligent Agents for Data Mining and Information Retrieval, edited by M. Mohammad-
ian, pp. 65-87, copyright 2004 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
