
Parallel Computing 27 (2001) 71–89

www.elsevier.com/locate/parco

Search engine case study: searching the web using genetic programming and MPI

Reginald L. Walker 1
Computer Science Department, University of California at Los Angeles, Los Angeles, CA 90095-1596, USA

Received 15 October 1999

Abstract

The generation of a Web page draws on distinct sources of information. The earliest of these sources were organized displays of known information determined by the page designers' interests and/or design parameters. The sources may have been published in books or other printed literature, or disseminated as general information about the page designer. Due to the growth in Web pages, several new search engines have been developed in addition to the refinement of already existing ones. The use of the refined search engines, however, still produces an array of diverse information when the same set of keywords is used in a Web search. Some degree of consistency in the search results can be achieved over a period of time when the same search engine is used; yet most initial Web searches on a given topic are treated as final after some form of refinement/adjustment of the keywords used in the search process. To determine the applicability of a genetic programming (GP) model to the diverse set of Web documents, the search strategies behind the current search engines for the World Wide Web were studied. The development of a GP model resulted in a parallel implementation of a pseudo-search engine indexer simulator. The training sets used in this study provided a small snapshot of the computational effort required to index Web documents accurately and efficiently. Future results will be used to develop and implement Web crawler mechanisms that are capable of assessing the scope of this research effort. The GP model results were generated on a network of SUN workstations and an IBM SP2. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Genetic programming; Distributed computing; Information retrieval; World Wide Web; Search
engines

1 This work was supported by the Raytheon Fellowship Program. The implementation results associated with the network of workstations were generated on computers located in the Department of Computer Sciences, Purdue University, West Lafayette, IN 47907.
E-mail address: rwalker@cs.ucla.edu (R.L. Walker).

0167-8191/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0167-8191(00)00089-2

1. Introduction

Web pages used in this study consisted mostly of Yahoo business headline documents [35] which, because their contents are mostly text, were incorporated into the training sets [13]. Some of the other documents (ordinary Web pages) contained errors that may occur when a chosen browser displays erroneous data and/or error messages as it parses the selected Web document. A browser produces erroneous results when a Web page designer attempts to create a compact Web page containing a large number of HyperText Markup Language (HTML) related tags, attributes, and values [20]. The page designer creates a compact Web page by concatenating several of the HTML-related tags on a single line. The problem occurs when the host computer attempts to read each line of text into an internal buffer: a line length greater than the machine's line limit causes the process to crash. Moreover, different machines impose different line buffer limits for Web browser software. This technique for compact Web pages attempts to reduce the network load associated with the retrieval and transmission of large documents to/from remote sites.
This paper is an extended version of papers that appeared as [30] and [32]. Section 2
presents the basis for current search engines. Section 3 presents the characteristics of
the current search engines. Section 4 presents the need for improvements of the in-
formation retrieval (IR) systems. Section 5 presents studies of distinct pattern
searches. Section 6 presents preliminary implementation issues. Section 7 presents
the preliminary results. Section 8 presents the conclusion.

2. The early search engines

The early Web search engines [16] were explicitly divided into three separate components: a search engine index browser, search engine indexer(s), and the search engine itself. The early popular search engine and index browser were provided by the Gopher integrated search engine software, whose indexing mechanism, however, was determined to be inefficient. This inefficiency of the indexing mechanism led to the development of the earliest indexer, Archie. The lack of multiple-word searches in the Archie indexer led in turn to the development of Veronica and eventually Jughead. Yet each of these associated indexers only provided an indexing mechanism for the Gopher search engine. The eventual development of the wide area information service, WAIS, introduced an integrated search engine [31] with supplemental indexer and browser mechanisms that provided users with indexed results generated by the Gopher-related indexers. Mosaic 2 [20] thus provided users with an alternative to the Gopher index browser for its indexed results as well as its file transfer protocol (FTP) archives of software.

2 It should be noted that Netscape [21] and Mosaic share design methodologies in conjunction with some of the same software developers.

3. Comparison of current search engines

Tables 1 and 2 show comparisons of the general characteristics that comprise current search engine technology [8,20]. Since it was initially developed to track Gopher, FTP, and Telnet sites, Yahoo [35] had the closest ties to the early search engine strategies and Mosaic. This search engine also followed an early biology-based classification system, the result of work performed by the 18th century botanist Linnaeus. The AltaVista [1] and Inktomi [12] search engines followed the indexer strategy of WAIS in their reliance on unspecified categories for indexing millions of documents.
The inclusion methodologies for all entries follow two approaches: via a Web crawler and/or human editors. In the case of the former, each engine provided access to millions of Web pages through their databases and/or through Inktomi's or AltaVista's database (with AltaVista providing the most up-to-date database by rebuilding it every 24 h). Such an approach eliminated problems associated with the addition, reorganization, and deletion of pages, problems prominent in other databases whose updates are not as frequent (every 7–10 days).
The IR system [15,25,29] associated with these two indexers worked very well. LookSmart [17] used 24,000 categories, a reflection of its IR system's response time. This slow response time, an indication of a distributed computer system, was also a major factor with the Lycos search engine. Infoseek [11], however, used both

Table 1
Comparison of the current search engine indexer technology

AltaVista
  Unique point: one of the most powerful and comprehensive search engines.
  Indexer technology: dynamic categorization technology (AKA COW9); no assumptions are made about what a user is looking for.

Excite/NetCenter
  Unique point: uses concept-based searching.
  Indexer technology: intelligent concept extraction (ICE); finds relationships that exist between words and ideas.

HotBot
  Unique point: joint venture of HotWired and Inktomi.
  Indexer technology: partnership with leading information companies to provide users with free searchable access to valuable data.

Infoseek
  Unique point: provides two search methods.
  Indexer technology: Ultraseek server context classification engine (CCE) taxonomy; applies the idea of knowledge management (KM) by using inverted document frequency.

LookSmart
  Unique point: extensive subject categories.
  Indexer technology: (1) sites selected by editors; (2) Web pages indexed by AltaVista.

Lycos
  Unique point: core technology identifies and categorizes online information.
  Indexer technology: partnership with leading information companies to provide users with free access to valuable data.

Yahoo
  Unique point: initially built to track Gopher, FTP, and Telnet sites; closest in spirit to the work of Linnaeus, the 18th century botanist.
  Indexer technology: (1) Yahoo categories; (2) Web sites tested by Yahoo; (3) Web pages indexed by AltaVista.

Table 2
Comparison of the current search engine indexer-related properties

AltaVista
  Indexing mechanism: Web spider (Scooter); indexes every word.
  Database size in pages: 140 million.
  Database structure: no subject categories.
  Database update rate: every 24 h.

Excite/NetCenter
  Indexing mechanism: homepage inclusion only by Excite editors; link inclusions by WebCrawler.
  Database size in pages: 50 million.
  Database structure: uses 14 categories (channels).
  Database update rate: once about every 7–10 days.

HotBot
  Indexing mechanism: Web spider (Slurp).
  Database size in pages: 110 million.
  Database structure: no subject categories.
  Database update rate: once about every 7–10 days.

Infoseek
  Indexing mechanism: Ultrasmart/Ultraseek Web spider.
  Database size in pages: millions.
  Database structure: Ultrasmart uses 12 categories; Ultraseek uses no categories.
  Database update rate: once about every 7–10 days.

LookSmart
  Indexing mechanism: reviewed by editors.
  Database size in pages: 300,000 + AltaVista's database.
  Database structure: 24,000 categories.
  Database update rate: once about every 7–10 days.

Lycos
  Indexing mechanism: a staff of 25 Web editors.
  Database size in pages: 110 million.
  Database structure: 18 topic categories.
  Database update rate: once about every 7–10 days.

Yahoo
  Indexing mechanism: Web submittal form reviewed by Yahoo staff.
  Database size in pages: 500,000 + AltaVista's database.
  Database structure: 14 categories.
  Database update rate: once about every 7–10 days.

indexing approaches while allowing the requester to choose the IR system with or without the categories. This made the process speedier and more efficient, since the number of categories supplied by the indexer and the speed of the IR system are inversely proportional.
Different IR systems used advanced database categorization [15] schemes. AltaVista, for example, used an IR system based on dynamic categorization technology [1] (also known as COW9), which does not make any a priori or a posteriori assumptions about individual search patterns. Through their partnership with Inktomi, HotBot [10] and Lycos [18] used a similar technology, while Excite [6]/NetCenter 3 [21] used an a priori approach in their IR system by means of its intelligent concept extraction (ICE) technology [6]. The ICE technology allowed the indexer to grow in relation to the past search queries executed by previous users of its IR system. Such a stored list of search queries allows the system to stabilize after the ICE system has been exposed to variations of a query word combination. Infoseek, for instance, used a context classification engine (CCE) taxonomy [11] that applied technology similar to knowledge discovery in databases (KDD) [26] for its IR system, while Yahoo and LookSmart made use of the query technology provided by AltaVista. These technologies used derivations of KDD as their foundation.

3 NetCenter was produced by the Netscape Communication Corporation.

Inclusion methodologies rely either upon human editors and/or Web crawlers. The shortcoming of human editors [20] resulted from their biased knowledge about certain subject matters. Yet bias is never eliminated, for it is also found in the guidelines provided by the classification methodologies that were chosen for individual search engines. Yahoo, for instance, provided the Web designer with the initial freedom of categorizing her/his Web page, but only after a Yahoo staff editor reviewed the submitted form for the appropriateness of the chosen category.
The use of a Web crawler for database inclusion eliminated any such bias that was not initially incorporated into the methodology for the inclusion mechanism. Excite used a human editor for homepage inclusion and a Web crawler to traverse the links from the homepage. Human editors reviewed some of the Web page sites and judged their contents in addition to the site's design as well as overall appeal. This approach may add some bias towards an appealing, well-designed site.

4. Current need for improvements of the IR systems

4.1. Overview

The indexers for the early and current search engines used existing state-of-the-art techniques to retrieve and rank the relevant documents for each request. However, the coverage of the relevancy rankings was not consistent, since it depends on human and computer constraints. The purpose of improving the early IR systems was twofold. First, by correcting the precision and recall mechanisms [15] for millions of documents that may be centrally or distributively located throughout the Internet [4], the current IR systems eliminate transmission delays and hence increase remote server efficiency. Second, a lack of knowledge about a given indexer can result in several searches over the Internet on the same set of keywords, yet with no relevant results.
The goal of precision mechanisms in the IR systems is to discover commonalities [23,26] among distinct subsets of documents. The accuracy of the relevancy rating for the appropriateness of a document has been increased in some of the current search engines by coupling the methodologies of their IR systems with KDD. The improved relevance feedback determined the fitness measure of the requester's database query by computing the degree of fitness between the precision and recall of relevant documents.
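The precision and recall quantities referred to above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the harmonic-mean F-measure is one standard way of combining the two into a single fitness value.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: set of document ids returned by the IR system
    relevant:  set of document ids actually relevant to the query
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

def f_measure(precision, recall):
    """Harmonic mean of precision and recall, a common combined fitness measure."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, retrieving documents {1, 2, 3, 4} when {2, 3, 5} are relevant yields a precision of 0.5 and a recall of 2/3.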

4.2. Models of IR systems

The IR system model [15] has some characteristics that are similar to those of GP. This model consists of:
1. a set of documents or records;
2. a set of index terms (single words or phrases); and
3. a fitness function that is used to measure the closeness of a query to retrieving a relevant document.
The components that comprise the GP model [34] are divided into two major units: (a) the search space and (b) a component that controls the quality and speed of the GP system. The search space consists of:
1. the terminal set;
2. the function set; and
3. the fitness function.
The quality and speed control component consists of:
1. the algorithm control parameters; and
2. the termination criterion.
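The two GP model units listed above can be captured in a small container type. The field names and default parameter values here are illustrative assumptions, not values taken from the paper:

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class GPSearchSpace:
    terminal_set: List[Any]          # e.g. index terms and constants
    function_set: List[str]          # e.g. query operators such as AND, OR, NEAR
    fitness: Callable[[Any], float]  # closeness of a query to a relevant document

@dataclass
class GPControl:
    population_size: int = 500       # algorithm control parameters (illustrative values)
    crossover_rate: float = 0.9
    mutation_rate: float = 0.01
    max_generations: int = 50        # termination criterion

@dataclass
class GPModel:
    search_space: GPSearchSpace      # unit (a)
    control: GPControl               # unit (b): quality and speed control
```

Separating the search space from the control parameters mirrors the two-unit decomposition: the former defines what programs can exist, the latter how the search over them proceeds and stops.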

4.3. Improvements

The methodology for KDD outlines the possible approaches taken by search engines to improve their IR systems. The conventional approach provided the requester with the query results based on the user's knowledge of the respective IR systems. Since a typical user usually has limited knowledge about the structural and search methodologies that pertain to the individual search engines, this constitutes a limitation of the current search engines.
The KDD approach, however, derives queries from the resulting databases built by the search engines; and the IR system, in turn, organizes the database and presents the user with useful information. Thus, the incorporated structure and search methodologies of KDD systems do not require in-depth knowledge by their end-users. The KDD IR system does require an intelligent tool [2] coupled with the methodology to eliminate repeated queries.
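One reading of an "intelligent tool coupled with the methodology to eliminate repeated queries" is a cache placed in front of the IR system, so that a repeated (or trivially re-worded) query is answered from stored results rather than triggering a new search. The whitespace and case normalization rule below is an assumption for illustration, not the paper's mechanism:

```python
class QueryCache:
    """Memoizes search results so repeated queries skip the IR system."""

    def __init__(self, search_fn):
        self.search_fn = search_fn  # the underlying IR system call
        self.cache = {}
        self.misses = 0             # number of times the IR system was invoked

    def search(self, query):
        # Normalize so "Sassafras  Tea" and "sassafras tea" share one entry
        # (assumption: the engine treats such queries identically).
        key = " ".join(query.lower().split())
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.search_fn(key)
        return self.cache[key]
```

A second, identical query then costs one dictionary lookup instead of a full retrieval pass.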

5. Study of searches using the current search engines

5.1. Overview

Web page growth and stability were studied by using two approaches to track
information. The ®rst approach looked at the di€erent character patterns employed
by the search engines for Web page classi®cation. This particular study used the term
sassafras tea as the basis for the search comparisons (see Tables 3±5). The results of
the search string underscores the diversity that exists in Web searches that rely on the
same and/or di€erent search engines. In this study, 15 distinct search patterns were
used to generate the results for the chosen search engines.
The second approach examined the generation of new Web pages following a
natural disaster. The natural disaster chosen for this search was Hurricane Mitch.
The initial date for the ®rst Web search occurred 5 November 1998. This data col-
lection e€ort focused on the impact of Hurricane Mitch on Florida over an eight-day
period ± the time it takes most search engines to update their IR systems. In this
study, three distinct search patterns were used to generate the results for the search
strings.

Table 3
Number of relevant pages using "sassafras tea"

Search pattern               AltaVista   Excite/NetCenter   HotBot   Infoseek   LookSmart   Yahoo
"sassafras tea"              336         91                 0        126        337         73
+"sassafras tea" +herb       66          21                 0        5          66          0
"sassafras tea" NEAR herb    143 311     –                  –        –          143 311     –
"sassafras tea" OR herb      555 226     –                  –        –          555 226     –
"sassafras tea" AND herb     679 028     –                  –        –          679 028     –

Table 4
Number of relevant pages using "Sassafras tea" (a)

Search pattern               AltaVista   Excite/NetCenter   HotBot   Infoseek    LookSmart   Yahoo
"Sassafras tea"              195         91                 0        126         196         73
+"Sassafras tea" +herb       42          21                 0        23/54484    42          0
"Sassafras tea" NEAR herb    142 324     –                  –        –           142 324     –
"Sassafras tea" OR herb      582 609     –                  –        –           582 609     –
"Sassafras tea" AND herb     679 028     –                  –        –           679 028     –

a Note. Search results of the form 23/54484 imply Web page/Web site.

Table 5
Number of relevant pages using "Sassafras Tea"

Search pattern               AltaVista   Excite/NetCenter   HotBot   Infoseek    LookSmart   Yahoo
"Sassafras Tea"              159         91                 0        126         160         73
+"Sassafras Tea" +herb       26          21                 0        23/54483    26          0
"Sassafras Tea" NEAR herb    142 072     –                  –        –           142 072     –
"Sassafras Tea" OR herb      583 773     –                  –        –           583 773     –
"Sassafras Tea" AND herb     677 866     –                  –        –           677 866     –

5.2. Results for search patterns in study 1

The search for the three major strings "sassafras tea," "Sassafras tea," and "Sassafras Tea" showed some of the current search engines' indexers to be case sensitive for different string patterns. Supplying an additional keyword, such as "herb," narrowed the scope of the search engines (see Tables 3–5). The use of a sequence of words inside double quotes indicated the searched-for phrase, while the use of the + symbol indicated that the string pattern MUST occur within the document. Only the AltaVista and Lycos indexers used the NEAR reserved word, with AltaVista employing a 10-word range and Lycos a 25-word range. The use of the NEAR, AND, and OR reserved words in conjunction with the strings yielded unexpected results. The search pattern
  "sassafras tea" NEAR herb
yielded
  hits(sassafras NEAR herb) + hits(tea NEAR herb)
which is equivalent to the following two independent searches
  hits(+sassafras NEAR +herb) + hits(+tea NEAR +herb).
The search pattern should have been
  +"sassafras tea" NEAR +herb.
Similar search problems occurred when the keywords AND and OR were used, resulting in significant increases in the number of relevant pages (hits). To generate its hits, LookSmart used the AltaVista indexer and database. Since this search phrase was not a typical topic searched on a regular basis, HotBot returned 0 hits for all of the tested search patterns in this study.

5.3. Impact of capitalization associated with variations of ``sassafras tea''

Search engine results associated with the permutations of the "sassafras tea" search pattern demonstrate the impact of capital letters in a search string in this particular study. For the AltaVista general searches, there was a consistent decrease in the number of hits as (1) no capital letters, (2) one capital letter, and (3) two capital letters were used. Search engines that did not access the AltaVista results (with the exception of Infoseek, which produced a Web page/Web site combination) showed no effects of case sensitivity. The results associated with advanced search strategies for AltaVista showed: (1) consistent hit decreases associated with the use of NEAR, (2) consistent increases for the use of OR, and (3) inconsistent results for the use of AND.

5.4. Results for search patterns in study 2

Search results for AltaVista varied the most during the interval between the first days (see Table 6). These differences show the instability of the initial search. The results for Days 2–5 showed little or no variation, since the greatest variation resulted from a decrease in the number of hits. The transition from Day 5 to Day 6 revealed an increase in the number of pages for each of the search patterns, while the Day 6 and Day 8 results were stable and consistent.

Table 6
Number of relevant pages for AltaVista and Excite/NetCenter using "hurricane mitch" (a)

            AltaVista                              Excite/NetCenter
            hurricane   +hurricane   +hurricane    hurricane   +hurricane   +hurricane
                        +mitch       +mitch                    +mitch       +mitch
                                     +florida                               +florida
Day 1       197 465     7785         3136          74 757      440          138
Day 2       517 700     94           31*           74 757      440          138
Day 3       517 700     94           31*           74 757      440          138
Day 4       483 951     94           31*           74 757      440          138
Day 5       517 700     94           31*           492         4653         55
Day 6       518 240     160          46*           85 674      4653         1230
Day 8       518 240     160          46*           85 674      4653         1230
Mean        467 285     1212         479           67 267      2246         438
S.D.        110 776     2684         1085          27 674      2085         502

a Note. The * means the search pattern was +"hurricane mitch" +florida.

The search results for Excite/NetCenter remained stable for Days 1–4. The transition from Day 4 to Day 5 resulted in a 152-fold decrease in the "hurricane" search pattern, a 10-fold increase in the "hurricane mitch" search pattern, and a 2-fold decrease in the last search pattern. The transition from Day 5 to Day 6 resulted in a 174-fold increase in the "hurricane" search pattern. The "hurricane mitch" search pattern produced stabilized results, with the final search pattern producing a 22-fold increase.
The search results for HotBot were similar for the first two days of this study (see Table 7), while the number of hits during the transition from Day 2 to Day 3 increased 11-fold for the "hurricane" search pattern, 3-fold for

Table 7
Number of relevant pages for HotBot and Infoseek using "hurricane mitch" (a)

            HotBot                                 Infoseek
            hurricane   +hurricane   +hurricane    hurricane   +hurricane   +hurricane
                        +mitch       +mitch                    +mitch       +mitch
                                     +florida                               +florida
Day 1       23 571      439          176           192 999     195          18
Day 2       23 609      441          178           192 999     30           9*
Day 3       271 514     1542         560           192 999     30           9*
Day 4       271 514     1542         560           192 999     30           9*
Day 5       267 465     1526         560           192 999     30           9*
Day 6       271 514     1542         560           192 999     30           9*
Day 8       271 514     1542         560           366 140     27           5*
Mean        200 100     1225         451           217 733     53           10*
S.D.        111 643     496          173           60 587      58           4*

a Note. The * means the search pattern was +"hurricane mitch" +florida.

``hurricane mitch'' search pattern, and 3-fold for the ``hurricane mitch ¯orida''
search pattern. Days 3±8 remained constant, with minor ¯uctuations in the
number of hits.
Search results for Infoseek remained stable for the ``hurricane'' search pattern
with a 2-fold increase on Day 8. The ``hurricane mitch'' search pattern showed a
6-fold decrease from Day 1 to Day 2. The ®nal search pattern showed 2-fold
decreases during the transitions periods between Day 1 to Day 2, and Day 6 to
Day 8.
With the exception of the first day of the search (see Table 8), the search results for LookSmart proved identical to the search results for AltaVista. The discrepancy between the first-day results of these search engines can be accounted for by LookSmart's reliance on the refined results of the initial AltaVista searches. The results for Days 2–8 are the same as the results for AltaVista.
The search results for Yahoo reveal the instability of the initial search: it decreased 175-fold between Day 1 and Day 2 for the "hurricane" search pattern and then increased 250-fold between Day 2 and Day 3. Also, there was a 281-fold decrease from Day 1 to Day 2 for the "hurricane mitch" search pattern, with the number of hits for Day 2 totaling approximately the same for each search pattern. The results were constant for Days 3–8. The final search pattern showed a 32-fold decrease/increase from Day 2 to Day 3 and Day 3 to Day 4, respectively.
All the search engines displayed instability during the first three days, with the exception of Excite/NetCenter, whose instability occurred during a three-day period starting with Day 4. All of the search engines, except for Infoseek, showed an increase in the number of hits with the progression of time. The means and standard deviations associated with each search pattern registered the inconsistencies within each individual search engine. These values also displayed the variations among the distinct search engines as a group.

Table 8
Number of relevant pages for LookSmart and Yahoo using "hurricane mitch" (a,b)

            LookSmart                              Yahoo
            hurricane   +hurricane   +hurricane    hurricane   +hurricane   +hurricane
                        +mitch       +mitch                    +mitch       +mitch
                                     +florida                               +florida
Day 1       483 985     81           1             64 382      0/103 046    0/0
Day 2       517 700     94           31*           367         0/367        0/357
Day 3       517 700     94           31*           91 826      2/368        0/11
Day 4       483 951     94           31*           91 826      2/366        0/361
Day 5       517 700     94           31*           91 826      2/369        0/364
Day 6       518 240     160          46*           91 826      2/378        0/373
Day 8       518 240     160          46*           91 826      2/380        0/375
Mean        508 217     111          31*           74 840      1/15 039     0/263
S.D.        15 338      31           14*           31 844      1/35 929     0/163

a Note. Search results of the form 0/0 imply Web page/Web site.
b Note. The * means the search pattern was +"hurricane mitch" +florida.

5.5. The impact of search engine IR system updates

The search engine results associated with the variations of the search pattern hurricane demonstrate the impact of the IR system update rate (see Table 2). The occurrence of a possible IR update reads as a hiccup phenomenon in the number of relevant pages. The number of relevant pages associated with AltaVista when searching for the hurricane search pattern reveals an increase from Day 1 (see Table 6). The results associated with +hurricane +mitch and +hurricane +mitch +florida reflect similar patterns. The number of relevant pages was the greatest on Day 1, while there were drastic decreases in the number of relevant documents on Day 2. This decrease indicated a refinement of the IR system and a possible optimization of the associated search strategy. The number of relevant documents then remained constant until Day 6, which showed an increase in the number of relevant pages. This increase indicates the IR system's incorporation of additional relevant pages. These results do not reflect a steady increase in the number of relevant pages; rather, they reflect the fluctuations that were associated with the hurricane search pattern. AltaVista-based results were associated with the LookSmart search engine (see Table 8).
The hurricane search pattern results associated with the HotBot IR system reflect an increase in relevant pages for Day 1 and Day 2 (see Table 7). With the exception of Day 5, when the hiccup occurred, the results remained stable and constant during the following days. Corresponding increases and hiccups in the number of relevant Web pages were apparent in the +hurricane +mitch as well as the +hurricane +mitch +florida search patterns. The only exception occurred with the latter search pattern, which did not reflect the hiccup.
The hurricane search pattern results associated with Infoseek were constant until Day 6, when there occurred an increase that lasted until Day 8. This increase reflects the inclusion of additional Web pages. The corresponding +hurricane +mitch and +hurricane +mitch +florida search patterns show decreases in the number of relevant pages.
The hurricane search pattern results associated with Yahoo showed a hiccup on Day 2 that was also displayed in the associated search patterns (see Table 8). The relevant hurricane pages stabilized and remained constant after Day 2. The +hurricane +mitch search pattern showed a steady increase beginning on Day 4. The +hurricane +mitch +florida search pattern reflected the hiccup on Days 2 and 3. A steady increase began on Day 4.
AltaVista's generation of the most inconsistent results might provide the study's most accurate results. Excite/NetCenter's and HotBot's display of similar results reflects the use of a priori knowledge of other users. The Yahoo results for the multi-string search pattern provide the relevant Web sites associated with the pattern. This may reflect an inclusion mechanism such as the reliance on human editors.

6. Parallel implementation of the pseudo-search engine indexer

6.1. Overview

The results associated with the study of search patterns for the current search
engines provided feedback on the sea on information generated by the relevant hits.
This information maybe consistent or very diverse based on the chosen search
engine(s). The previous results were incorporated into the preliminary design of a
pseudo-search engine indexer model for a GP tool [2,13,34] using the message
passing interface (MPI) [9,19] to organize Web pages data that comprised training
sets. This initial pseudo-search engine indexer model will be used to develop im-
provements for the current IR systems [29]. The GP tool will be used to develop
methodologies to organize the sea of diverse information into a meaningful format
[23]. This will be bene®cial to end users. The improved IR system will also require an
intelligent tool [2] coupled with the methodology to eliminate repeated queries. The
improved IR system will be derived by coupling improvements in the methodologies
for GP and KDD.
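The MPI-based organization of training-set Web pages can be outlined schematically: a master deals pages to worker indexer processes, much as an MPI scatter would, and each worker builds a partial index over its share. This round-robin sketch stands in for the MPI version and makes no claim about the paper's actual decomposition:

```python
def distribute_pages(pages, n_workers):
    """Deal Web pages round-robin to n_workers indexer processes,
    analogous to scattering work across MPI ranks."""
    if n_workers < 1:
        raise ValueError("need at least one worker")
    buckets = [[] for _ in range(n_workers)]
    for i, page in enumerate(pages):
        buckets[i % n_workers].append(page)
    return buckets
```

Round-robin dealing keeps the per-worker page counts within one of each other, which matters when the fitness evaluations over each partition run in parallel.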

6.2. The implementation of the pseudo-search engine indexer

The initial implementation used a hash table to store the non-stop words [8] in the various Web pages that comprised the training sets. This approach was chosen for its fast retrieval, and it provides a basis for future modifications of the pseudo-indexer's IR system as in Fig. 1. By utilizing indexed memory [28], an efficient memory mechanism associated with a genetic programming application can transmit cultural information between fitness function evaluations. This mechanism will be implemented so as to improve the effectiveness of the computations associated with the fitness evaluations.
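A minimal form of such a hash table over non-stop words might look like the following; the stop-word list, the punctuation stripping, and the page identifiers are illustrative assumptions, not the paper's actual lists:

```python
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # illustrative subset

def index_page(index, page_id, text):
    """Store each non-stop word in a hash table mapping word -> set of page ids."""
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if word and word not in STOP_WORDS:
            index.setdefault(word, set()).add(page_id)
    return index
```

A Python dict is itself a hash table, so lookups of an indexed word run in expected constant time, which is the fast-retrieval property motivating the choice.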
The pseudo-indexer was developed in a series of ongoing stages to initially index the format of basic HTML documents. The following problems arose at this stage: (1) HTML tags found in parts of a word, (2) the use of colons, (3) the elimination of plurals, and (4) the parsing of C program fragments and/or HTML-related scripts. The next stage incorporated additional lookup mechanisms essential in parsing the HTML tags and their attributes. Additional issues surfaced from the use of the period character. This character, which separates sentences, also occurs in abbreviations such as etc., a.m., p.m., U.S., and et al. The period can also be found as part of two-character punctuation strings such as ".)" and ").". Similar problems can occur with the usage of the comma character.
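The period-handling problem can be illustrated by a sentence splitter that protects a fixed list of abbreviations before splitting. The masking trick below is a sketch of one possible approach, not the indexer's actual rule:

```python
ABBREVIATIONS = {"etc.", "a.m.", "p.m.", "U.S.", "et al."}  # list from the text above

def split_sentences(text):
    """Split on '. ' while keeping known abbreviations intact."""
    # Temporarily mask the periods inside abbreviations with a sentinel byte.
    masked = text
    for abbr in ABBREVIATIONS:
        masked = masked.replace(abbr, abbr.replace(".", "\x00"))
    parts = [p.strip() for p in masked.split(". ") if p.strip()]
    # Restore the masked periods.
    return [p.replace("\x00", ".") for p in parts]
```

Without the masking pass, "9 a.m. and noon" would be broken into spurious sentence fragments at each abbreviation period.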
To decipher HTML tag attributes such as alt and its value, certain mechanisms were put into effect (the value of such an HTML tag provides page designers with a textual alternative to an image that certain older browsers and text-only browsers fail to display). For instance, a cost reduction mechanism was needed to convert any uppercase string to a string of only lowercase characters. Because the HTML language does not impose any case sensitivity constraints on page designers, a document can be developed for a non-case-sensitive environment such as various operating system (OS) environments [5], including different versions of DOS as well as various Windows environments. The cost reduction occurred during the lookup process for an HTML tag and its attribute, since the HTML language allows the designer to use any combination of upper- and lowercase characters for its HTML tags and/or tag attributes.

Fig. 1. Distribution of Web pages.
Mechanisms were developed to flush comments, JavaScript, Common Gateway
Interface (CGI) scripts, and other embedded code tags incorporated in HTML
documents. One variation of the HTML comment tag has the form ``<!-- xxxx''
and ``yyyy -->'', where xxxx and yyyy are character strings. The character strings
``<!--'' and ``-->'' were intended to be used without intermediate spaces. The
opening comment string, ``<!--'', should be followed by a space, and the closing
character string, ``-->'', should be preceded by a space. Implementation issues can
occur when the last character is not the ``greater than'' character for a string ending
with the substring ``--''.
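A tolerant comment-flushing pass along these lines can be expressed with regular expressions that accept an unterminated comment rather than failing on it. This is a sketch under that assumption, not the indexer's actual implementation.

```python
import re

# Remove HTML comments; if the closing '-->' never arrives, give up at the
# end of input instead of leaking the comment text into the word count.
COMMENT = re.compile(r"<!--.*?(?:-->|$)", re.DOTALL)
SCRIPT = re.compile(r"<script.*?</script>", re.DOTALL | re.IGNORECASE)

def flush_embedded(html):
    """Drop comments and embedded script blocks before word counting."""
    return COMMENT.sub(" ", SCRIPT.sub(" ", html))
```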
The newly installed mechanisms helped to distinguish the HTML tag attributes
with attribute values that may or may not contain a space from those attributes that
appear alone, separated by spaces, with no values. The HTML tag attribute values
containing spaces were character strings, including the NULL string, composed of
one or more words. A need arose to dump the contents of the hash tables to a file for
debugging purposes. This mechanism served a second purpose, as it provided a
foundation for computing and/or recomputing the GP fitness measure
[15] for each Web page that was added to the training set. Such a fitness measure is
typically used in the computations for the relevancy ratings of each document in the
IR system.
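The paper does not give the fitness formula. As a hedged illustration, a tf-idf style score of the kind associated with the inverted-document-frequency approach cited from [15] could be recomputed from the dumped hash tables roughly as follows; all names here are assumptions.

```python
import math

def tfidf_scores(pages):
    """pages: {page_name: {term: count}}. Returns {page_name: {term: tf*idf}},
    a plausible stand-in for the relevancy-rating computation (illustrative
    only; not the paper's actual fitness measure)."""
    n = len(pages)
    # document frequency: number of pages containing each term
    df = {}
    for counts in pages.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1
    scores = {}
    for name, counts in pages.items():
        total = sum(counts.values())
        scores[name] = {
            t: (c / total) * math.log(n / df[t]) for t, c in counts.items()
        }
    return scores
```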

6.3. Implementation issues

Initially, several problems precluded an efficient implementation of the pseudo-
search engine indexer. The indexer development was conducted on a dedicated
system that did not have the same memory constraints as most multi-user computer
systems. The dedicated system, an otherwise ideal environment, suffered from the
same memory constraints, but only after indexing over 200 Web pages. The indexer's
ideal model might consist of a series of hash tables that can be used for the retrieval
of indexed Web pages (see Fig. 1). This implementation became impractical as the
initial training set grew in size and the system limitations began to restrict the
allocation of dynamic memory. Memory constraints were documented in the GP
literature [14,24,27] as a major limiting factor in the size of the training sets
associated with different GP applications.
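The ``series of hash tables'' retrieval model can be sketched as an inverted index mapping each term to the set of pages containing it. This minimal form is an assumption; the implemented structure is not published.

```python
class PageIndex:
    """Minimal sketch of the hash-table retrieval model: one table maps each
    term to the pages containing it. The dynamic memory these tables consume
    is exactly the kind of cost that limited the training-set size."""
    def __init__(self):
        self.postings = {}

    def add_page(self, name, terms):
        for term in terms:
            self.postings.setdefault(term, set()).add(name)

    def retrieve(self, term):
        return self.postings.get(term, set())
```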
The performance of the parser for the pseudo-search engine indexer revealed a
diversity of approaches [5] among current Web page designers. This diversity
appeared in the different combinations of Java applets and CGI files embedded in
the varied HTML documents versus standard HTML documents.
The page parser was not developed to act as a language compiler because parsing
exactness was not necessary. The abuse of the basic HTML formats [20] increases
the developer's difficulties more than the user's, since the information is not well
conveyed. This in turn leads to the user's inability to recognize which portions of the
Web document are missing. An example of the basic HTML format is
<html>
<head>
<title>Title of the HTML document</title>
</head>
<body>
<h3>Title of the HTML document</h3>
Related text for this document.
<!-- Comments associated with the text. Or the script commands associated
with the cascading style sheets (CSS) model and JavaScript-based style sheets
(JSS) -->
More related text for this document.
</body>
</html>
When text might be substituted for an image, certain language-dependent HTML
tags can be consistently used to display different versions of a page (the use of the alt
attribute provides one example of this phenomenon). Otherwise, some information
will be dropped when browsers cannot decipher the designer's tag or tag attributes.
The worst misuse of a tag occurred when the designer does not adhere to an HTML
tag format, such as the one for the character string ``<!--'' associated with the beginning of a
comment, but instead uses some malformed permutation thereof. The resulting
designer's use of comments as HTML text adds unnecessary words to the document
count. Also, the addition of words can eventually impact the accuracy of a search
engine's relevancy rating [8]. A few of the current search engines employ some form
of inverted document frequency [15] to compute their ratings, while others may
use the results to compute a similarity metric [33] used to implement various
information filters.
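A similarity metric over indexed pages is commonly a cosine measure on term-count vectors. The sketch below is one conventional form, offered as an assumption rather than the specific metric of [33]; it illustrates how comment-inflated word counts would shift the result.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two {term: count} vectors. Pages whose word
    counts are padded by uncommented 'comment' text skew these vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```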
Problems can also occur when new Web page formats [22] are added by designers
to increase efficiency and flexibility while the existing indexer is not updated to
reflect these changes. The dynamic environment of the Internet will spur several
derivations and enhancements of the current HTML formats until they too become
limiting, due to network constraints such as throughput and reliability. These en-
hancements will also lead to the development of a greater number of Web pages that
might not be rendered correctly.

7. Experimental results

One set of results was generated on a cluster of nine Sun workstations. The
workstations were labeled Node0, Node1, ..., Node9, with Node1 and Node9 being the
same machine. A second set of results was generated on an eight-node IBM SP2
cluster. The nodes were labeled Node0, Node1, ..., Node8, with Node0 and Node8
being the same machine (see Fig. 1). The inclusion of a dual client node, used to
represent multiple clients, causes a degradation in performance.
Fitness computations begin after the training set is parsed. These computations
are necessary for indexing (ordering) the training subset on each individual node;
their results will be merged, and the computations repeated for several generations
[13], on the training set as well as on a constantly growing set of Web documents.
The effects of multiple client indexer nodes on the GP population of pages become
more apparent as the number of parsed Web pages increases.
Tables 9 and 10 present the timing measurements associated with each sub-
cluster of nodes for powers-of-two numbers of Web pages. Node0 was used as a
dedicated server node, with the other nodes treated as its clients when the number
of nodes was greater than one. The client nodes sent requests to the server node for
Web pages and for access to the output channel. For consistency in the timing
results, this study suppressed all output while the message passing mechanisms
remained in place. The initial approach for channel access was a mechanism similar
to a token ring [3]. The inconsistency of Web page file sizes showed, however, that
the token would sit in a node's message queue while that node was indexing a
variable-size file at its local Web site. Each node was treated as a Web site storing
a distinct set of Web pages. A discussion of the load balancing model was
presented in [31].
The MPI implementation used the message tags from the server node to de-
termine which subroutine would be executed by the client node(s) for the infor-
mation stored in the transmitted message. The client node(s) informed the server

Table 9
Network of workstations execution times for n processors indexing m Web pages
(columns: number of Web pages)

nprocs      1     2     4     8    16    32     64    128    256     512
   1      3.6   8.9  18.1  32.1  59.6  89.2  174.9  275.1  719.6  1346.1
   2      1.5   3.6   7.4  13.1  24.8  37.9   75.1  139.6  313.7   923.6
   3      1.5   2.2   4.6   6.6  12.6  19.5   39.3   61.5  161.7   304.5
   4      1.6   2.2   4.0   4.8   9.9  13.4   27.1   43.1  112.9   209.0
   5      1.7   2.2   2.6   4.3   8.5  11.2   22.2   36.0   91.9   168.1
   6      1.5   2.2   2.7   3.7   7.8  10.4   17.4   30.4   77.4   144.0
   7      1.5   2.2   2.8   3.2   6.3   7.4   14.9   25.9   67.5   120.7
   8      1.5   2.2   2.7   2.8   6.1   7.0   13.6   23.4   61.2   110.0
   9      1.5   2.2   2.6   2.8   5.8   6.4   12.9   22.0   57.6   103.0
  10      1.6   2.2   2.6   2.8   5.7   7.1   13.2   22.1   57.3   101.7

Table 10
IBM SP2 execution times for n processors indexing m Web pages
(columns: number of Web pages)

nprocs      1     2     4     8    16    32     64    128    256     512
   1      1.1   2.4   4.6   8.4  30.1  40.7   81.7   73.3  382.0   590.2
   2      0.9   2.3   4.7   8.2  42.6  41.1   82.0  128.8  199.5   593.6
   3      1.5   3.1   3.4   6.8  21.5  15.0   41.9   63.8  149.1   361.4
   4      1.4   2.1   3.1   5.0  12.9  12.6   31.7   42.9  123.2   442.3
   5      2.8   2.8   3.9   4.7  12.7  10.7   25.8   35.3  112.4   171.8
   6      1.8   2.7   3.5   4.4  10.0   9.6   22.0   30.5   94.8   145.3
   7      1.4   2.9   3.8   4.2  10.4   8.7   18.7   27.2   81.0   130.7
   8      1.7   3.4   3.4   4.0   8.1   7.4   15.9   23.9   74.6   117.0
   9      1.5   2.6   3.6   3.7   8.2   7.4   14.2   21.0   70.8   103.1
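As a quick reading of Table 9's 512-page column, the workstation cluster shows apparently superlinear speedup at higher node counts, consistent with the memory constraints noted earlier. The snippet below simply recomputes the ratios from the published numbers.

```python
# Execution times from Table 9 (512 Web pages), nprocs 1..10.
times_512 = [1346.1, 923.6, 304.5, 209.0, 168.1,
             144.0, 120.7, 110.0, 103.0, 101.7]

# speedup(n) = T(1) / T(n)
speedups = [times_512[0] / t for t in times_512]
# At 10 processors: 1346.1 / 101.7 ~ 13.2, i.e. superlinear, plausibly
# because per-node working sets shrink below the memory limits.
```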

node, by a general message tag, that it was available to index/search another Web
page. Until the server node commands the processing of a data stream for a Web
page, the client nodes sit idle. This study suggests the existence of a hierarchy
among the nodes which eliminates the duplication of Web pages on several nodes
simultaneously. The group server concept ensures that members of a group
never parse the same page on multiple machines. The search overhead and fitness
measure computations increase with the duplication of an indexed Web page.
Duplication also decreases the overall efficiency for a group of Web sites, as
evidenced in the variance of memory resources derived from Web page contents
and sizes.
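The tag-driven dispatch can be sketched without MPI as a plain mapping from message tags to handlers; the actual implementation used MPI message tags [9,19], and the tag values and handler names below are invented for illustration.

```python
# Simulated stand-in for MPI tag-based dispatch (tags/handlers invented).
TAG_INDEX_PAGE = 1
TAG_SHUTDOWN = 2

def handle_index(payload):
    # stand-in for the client's page-indexing subroutine
    return f"indexed:{payload}"

def handle_shutdown(payload):
    return "client stopping"

HANDLERS = {TAG_INDEX_PAGE: handle_index, TAG_SHUTDOWN: handle_shutdown}

def client_step(tag, payload):
    """A client receives (tag, payload) and runs the subroutine the tag
    selects, mirroring how server message tags drove client behavior."""
    return HANDLERS[tag](payload)
```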
The problems that exist for page duplication are similar to the problems that
routers [7] have on computer networks. The routers use router tables [4] to provide
an efficient mechanism for determining the most effective route for transmitting data
between communicating computers. A similar mechanism can be used by the server
node to determine if any client node has received the page that needs to be parsed.
This information will be based on the Web page name as opposed to the Web page
contents.
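Analogous to a router table, the server could keep a table keyed by page name that records which client already holds each page. The structure below is an assumed sketch of that mechanism, not the paper's implementation.

```python
class AssignmentTable:
    """Server-side table, analogous to a router table: maps a Web page name
    (not its contents) to the client node that already received it, so the
    same page is never parsed on multiple machines."""
    def __init__(self):
        self.owner = {}

    def assign(self, page_name, node):
        """Assign the page to node unless another node already owns it;
        returns the owning node either way."""
        return self.owner.setdefault(page_name, node)
```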

8. Conclusion

The results of this eight-day Web search indicate a stabilization of distinct search
engine databases after the completion of the initial search. Searches repeated over a
series of days showed the effect of refined database indexers for the chosen search
engines. The user can improve upon the results initially produced by repeating the
search over a period of time.
These results also showed that the most popular search engines did not produce
accurate results for the initial search, which may contain some inherent errors that
are not easily observed/detected or documented. Literature on search strategies, as
well as the actual search engine Web pages, informs the user that the results may
vary among different search engines. However, this information does not mention
that the results produced by each search engine for the same keywords may vary
over a period of days due to undocumented index refinement.
The search engines used in this study were AltaVista, Excite, HotBot, Infoseek,
LookSmart, and Yahoo. These search engines were chosen because their respective
indexers returned numeric values for the relevance rating of the chosen search
patterns. The Lycos search engine did not receive consideration in these studies
because its index browsers returned only the first 25 relevant Web documents for a
given search pattern.
This preliminary study provides a glimpse into the computational effort needed to
index Web documents both accurately and efficiently. The results generated by the
development and implementation of the Crawler mechanisms will assess the scope
of this research effort. The pseudo-indexer results showed that each Web site will
have an adequate workload when all the indexing mechanisms coupled with GP are
employed. The communication overhead will increase when a mechanism equivalent
to cache coherency is implemented to maintain consistency among all of the Web
sites when calculating the relevancy ratings for simple and complex queries.

Acknowledgements

The author wishes to thank Walter Karplus and Zhen-Su She for their direction
and suggestions. The author acknowledges Elias Houstis, Ahmed K. Elmagarmid,
Apostolos Hadjidimos, and Ann Catlin for their support, encouragement, and access
to the computer systems at Purdue University. Special thanks to Martha Lovett for
her assistance with the figure in this paper.

References

[1] AltaVista, AltaVista Web Page, Digital Equipment Corporation, Maynard, MA, November 1998.
[2] P.J. Angeline, Genetic programming: a current snapshot, in: A.V. Sebald, L.J. Fogel (Eds.),
Proceedings of the Third Annual Conference on Evolutionary Programming, World Scientific,
Singapore, February 1994, pp. 224–232.
[3] R. Bagrodia, Process synchronization: design and performance evaluation of distributed algorithms,
IEEE Transactions on Software Engineering 15 (9) (1989) 1053–1064.
[4] T. Ballardi, P. Francis, J. Crowcroft, Core based trees (CBT): an architecture for scalable inter-
domain multicast routing, ACM Transactions on Computer Systems (1993) 85–93.
[5] T. Berners-Lee, Information management: a proposal, Technical report, CERN: European
Laboratory of Particle Physics, Geneva, Switzerland, 1989.
[6] Excite, Excite Web Page, Excite Inc., Mountain View, CA, November 1998.
[7] S. Floyd, V. Jacobson, The synchronization of periodic routing messages, IEEE/ACM Transactions
on Networking (1994) 1–28.
[8] A. Glossbrenner, E. Glossbrenner, Search Engines for the World Wide Web, Peachpit Press,
Berkeley, 1998.
[9] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-
Passing Interface, MIT Press, Cambridge, MA, 1994.
[10] HotBot, HotBot Web Page, HotBot Inc., San Francisco, CA, November 1998.
[11] Infoseek, Infoseek Web Page, Infoseek Corporation, Santa Clara, CA, November 1998.
[12] Inktomi, Inktomi Web Page, Inktomi Corporation, San Mateo, CA, November 1998.
[13] J.R. Koza, Survey of genetic algorithms and genetic programming, in: Proceedings of
WESCON'95, IEEE Press, New York, November 1995, pp. 589–594.
[14] J.R. Koza, D. Andre, Parallel genetic programming on a network of transputers, Technical Report
STAN-CS-TR-95-1542, Stanford University, 1995.
[15] D.H. Kraft, F.E. Petry, B.P. Buckles, T. Sadasivan, The use of genetic programming to build queries
for information retrieval, in: Proceedings of the First IEEE Conference on Evolutionary Compu-
tation, IEEE Press, New York, June 1994, pp. 468–473.
[16] E. Krol, The Whole Internet User's Guide and Catalog, O'Reilly and Associates, Inc., Sebastopol,
CA, 1994.
[17] LookSmart, LookSmart Web Page, LookSmart Ltd., San Francisco, CA, November 1998.
[18] Lycos, Lycos Web Page, Lycos Inc., Framingham, MA, November 1998.
[19] Message Passing Interface Forum (MPIF), MPI: A Message-Passing Interface, University of
Tennessee, Knoxville, Tennessee, 1995.
[20] C. Musciano, B. Kennedy, HTML: The Definitive Guide, O'Reilly and Associates, Inc., Sebastopol,
CA, 1997.
[21] Netscape, Netscape Web Page, Netscape Communications Corporation, Mountain View, CA,
November 1998.
[22] H.F. Nielsen, J. Gettys, A. Baird-Smith, E. Prud'hommeaux, H.W. Lie, C. Lilley, Network
performance of HTTP/1.1, CSS1, and PNG, in: Proceedings of the ACM SIGComm '97, ACM Press,
Baltimore, MD, 1997.
[23] E.H.N. Oakley, Genetic programming, the reflection of chaos, and the bootstrap: towards a useful
test for chaos, in: J.R. Koza, D.E. Goldberg, D.B. Fogel, R.L. Riolo (Eds.), Proceedings of the First
Annual Genetic Programming Conference, MIT Press, Cambridge, MA, July 1996, pp. 175–181.
[24] M. Oussaidene, B. Chopard, O.V. Pictet, M. Tomassini, Parallel genetic programming: an application
to trading models evolution, in: J.R. Koza, D.E. Goldberg, D.B. Fogel, R.L. Riolo (Eds.),
Proceedings of the First Annual Genetic Programming Conference, MIT Press, Cambridge, MA, July
1996, pp. 357–362.
[25] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, Genetic programming for improved data
mining – application to the biochemistry of protein interactions, in: J.R. Koza, D.E. Goldberg,
D.B. Fogel, R.L. Riolo (Eds.), Proceedings of the First Annual Genetic Programming Conference,
MIT Press, Cambridge, MA, July 1996, pp. 375–380.
[26] T.W. Ryu, C.F. Eick, MASSON: discovering commonalities in collection of objects using genetic
programming, in: J.R. Koza, D.E. Goldberg, D.B. Fogel, R.L. Riolo (Eds.), Proceedings of the First
Annual Genetic Programming Conference, MIT Press, Cambridge, MA, July 1996, pp. 200–208.
[27] J. Sherrah, R.E. Bogner, B. Bouzerdoum, Automatic selection of features for classification using
genetic programming, in: V.L. Narasimhan, L.C. Jain (Eds.), Proceedings of the 1996 Australian
New Zealand Conference on Intelligent Information Systems, IEEE Press, New York, November
1996, pp. 284–287.
[28] L. Spector, S. Luke, Cultural transmission of information in genetic programming, in: J.R. Koza,
D.E. Goldberg, D.B. Fogel, R.L. Riolo (Eds.), Proceedings of the First Annual Genetic Programming
Conference, MIT Press, Cambridge, MA, July 1996, pp. 209–214.
[29] M. Stillger, M. Spiliopoulou, Genetic programming in database query optimization, in: J.R. Koza,
D.E. Goldberg, D.B. Fogel, R.L. Riolo (Eds.), Proceedings of the First Annual Genetic Programming
Conference, MIT Press, Cambridge, MA, July 1996, pp. 388–393.
[30] R.L. Walker, Assessment of the Web using genetic programming, in: W. Banzhaf, J. Daida,
A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, R.E. Smith (Eds.), GECCO-99: Proceedings of
the Genetic and Evolutionary Computation Conference, Morgan Kaufmann, San Francisco, July
1999, pp. 1750–1755.
[31] R.L. Walker, Development of an indexer simulator for a parallel pseudo-search engine, in:
Proceedings of the 2000 Simulation Multiconference, SCS Press, San Diego, CA, April 2000,
pp. 1750–1755, to appear.
[32] R.L. Walker, Implementation issues for a parallel pseudo-search engine indexer using MPI and
genetic programming, in: Proceedings of the Sixth International Conference on Applications of High-
Performance Computers in Engineering, WIT Press, Ashurst, Southampton, UK, January 2000, to
appear.
[33] A.M.A. Wasfi, Collecting user access patterns for building user profiles and collaborative filtering, in:
M. Maybury (Ed.), Proceedings of the 1999 International Conference on Intelligent User Interfaces,
ACM Press, Baltimore, MD, January 1999, pp. 57–64.
[34] M.J. Willis, H.G. Hiden, P. Marenbach, B. McKay, G.A. Montague, Genetic programming: an
introduction and survey of applications, in: Proceedings of Genetic Algorithms in Engineering
Systems, IEE Press, London, UK, September 1997, pp. 314–319.
[35] Yahoo, Yahoo Web Page, Yahoo Inc., Santa Clara, CA, November 1998.
