
Architectural Design and Evaluation of an Efficient Web-crawling System

Hongfei Yan, Jianyong Wang, Xiaoming Li, and Lin Guo


Department of Computer Science and Technology, Peking University, P.R. China
{yhf,jwang,lxm,guolin}@net.cs.pku.edu.cn
Abstract
Efficiently collecting Web pages plays an important role in the distributed information retrieval research area. This paper presents the architectural design and evaluation results of an efficient Web-crawling system. The design involves a fully distributed architecture, a URL allocating algorithm, and a method to assure scalability and reconfigurability. Simulation experiments show that load balance, scalability and efficiency can be achieved in the system. Currently this distributed Web-crawling subsystem has been successfully integrated with WebGather, a well-known Chinese and English Web search engine, aiming at collecting all the Web pages in China and keeping pace with the rapid growth of Chinese Web information. In addition, we believe that the design can also be used in other contexts such as digital libraries.

1. Introduction
During the short history of the World Wide Web (Web), Internet resources have grown day by day and the number of home pages has increased rapidly. How can one quickly and accurately find what one needs on the Web? A search engine is a useful tool for this, and it is becoming more and more important. The number of indexed pages plays a vital role in a search engine: by indexing a larger number of visited pages, a search engine can satisfy users' requests better. As the Web changes every day, to index more pages we must collect more pages in a limited time frame. Thus collecting pages efficiently is essential for a quality search engine.
It is natural to think of distributed systems and parallel processing when talking about efficiently executing tasks over large data sets. Previously, WebGather 1.0 [1], which answers more than 30,000 queries every day, adopted a centralized processing method to collect Web pages (a main process manages many crawlers working in parallel), and one million page indices are maintained after the pages are crawled and analyzed. With a crawling capability of 100,000 pages a day, WebGather 1.0 takes about ten days to refresh the whole set of Web pages it hosts. We note that Google [2], which originated at Stanford University, could index 560 million pages in Jul 2000 [3]. The centralized version, WebGather 1.0, cannot update the database in a reasonable period of time. For example, with the crawling capability of WebGather 1.0, it would take 100 days to collect 10 million pages; because pages are often refreshed, some of the collected pages would lose their value. Of course, it is possible to improve the performance of the system by improving the crawling algorithm and adopting more powerful machines and higher network bandwidth, but given the exponential increase of Web pages, this is not a good and efficient approach after all. So adopting parallel processing technology to collect more pages in a limited time frame is essential in developing a large-scale search engine.
This paper primarily concerns how to design a parallel and distributed scheme to achieve this design goal. We present an architecture and propose methods of collecting Web pages for a distributed search engine system. Based on WebGather 1.2 and its log data, we have designed and implemented an experimental model to validate the architecture, design ideas and methods.

2. Related work
2.1. Harvest: a typical distributed architecture
Harvest [4] is a typical system that makes use of
distributed methods to collect and index Web pages.
Harvest is made of several subsystems. The Gatherer
subsystem collects indexing information (such as
keywords, author names, and titles, etc.) from the
resources available at Provider sites (such as FTP and
HTTP servers). The Broker subsystem retrieves indexing
information from one or more Gatherers, eliminates
duplicate information, incrementally indexes the collected
information, and provides a Web query interface to it. The
Replicator subsystem efficiently replicates Brokers around
the Internet. Users can efficiently retrieve located
information through the Cache subsystem. The Harvest
Server Registry is a distinguished Broker that holds
information about each Harvest Gatherer, Broker, Cache,
and Replicator on the Internet. Harvest provides a distributed architecture to gather and search information on the Internet, and it is worth studying and learning from. However, Harvest is a huge and complicated system with complex algorithms and high costs, which hinder its adoption. As far as collecting Web pages is concerned, Harvest has the following aspects that do not fit the aims of a search engine.
1. A search engine has requirements on the rate of collecting information, but Harvest does not consider this aspect.
2. The Gatherer of Harvest works well if it runs on the providers' machines. However, it is impossible to make every information provider do so.
3. A Gatherer will discard URLs that it cannot visit itself, even though other Gatherers may be able to visit those URLs. So the Harvest system does not resolve how to use cross URLs.
4. Harvest has less effective control over its Gatherers when it is used to collect information within a particular scope. For example, Gatherers should be required to abide by the Web Robot protocol and to accept some guidance for crawling.

2.2. Google: a typical centralized architecture


Google is one of the biggest search engines in the
world. Though it indexes the largest number [3] of Web
pages, it adopts a centralized architecture. Based on [5],
we learn that Google has only one URLserver which sends
lists of URLs to be fetched to the crawlers. The
URLserver is a single point of failure, so if it crashes, the
entire system may go down. In addition, in a large system,
a centralized component like URLserver may become a
performance bottleneck.

2.3. Our work


Combining the characteristics of search engines with WebGather 1.2, which uses a centralized architecture, we present an architecture and propose methods to collect Web pages, and employ them in WebGather 2.0, which uses a distributed architecture. After analyzing and summarizing data obtained from an experiment, we find that our method can avoid both the high administration costs of setting up a large number of installations of a distributed system like Harvest and the disadvantages of a centralized system. We then implemented our strategies in the actual system. Our final goal is to collect all the Web pages in China and keep pace with the rapid growth of Chinese Web information using the 2.0 version of WebGather. The architecture described here is not only suitable for designing and implementing a search engine, but also fits building a digital library.

3. A model for a distributed Web-crawling system
3.1. Design Goal
In terms of IP blocks routed inside China [6], WebGather's goal is to collect all the Web pages in China. According to the statistical reports [7], China had about 1,500 Web sites in Oct 1997, about 3,700 in Jul 1998, about 5,300 in Jan 1999, about 9,906 in Jul 1999, about 15,153 in Jan 2000 and about 27,289 in Jul 2000, and the number is expected to reach 40,000 Web sites in Dec 2000. Inktomi and the NEC Research Institute, Inc. have completed a new study that verifies the Web has grown to more than one billion unique pages distributed on 4,217,324 Web sites [8]. Every site thus hosts about 238 unique pages on average, and this number is relatively stable. Therefore, WebGather needs to collect about 9,520,000 Web pages. Taking ten days as a crawling cycle, with the current collecting rate of WebGather, at least ten workstations need to cooperate.
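The estimate can be spelled out as follows (the rate of 100,000 pages per day per workstation is the WebGather 1.0 figure quoted in the introduction):

$$\frac{10^9\ \text{pages}}{4{,}217{,}324\ \text{sites}} \approx 238\ \text{pages per site}, \qquad 40{,}000\ \text{sites} \times 238 \approx 9{,}520{,}000\ \text{pages},$$

$$\frac{9{,}520{,}000\ \text{pages}}{10\ \text{days} \times 100{,}000\ \text{pages per day per workstation}} \approx 10\ \text{workstations}.$$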
Besides the above goal, we expect the distributed system to have the following characteristics.
1. Load balance.
2. A low amount of communication between main controllers.
3. High scalability, that is, the more main controllers, the higher the performance.
4. Dynamic reconfigurability, meaning that main controllers can be added or removed while the system is running.

3.2. An architecture and primary design ideas


3.2.1. Distribution strategies. For convenience, we refer to the distributed system as the whole distributed search engine system, the main controller as the subsystem executing on each workstation, the scheduler as the module which schedules all main controllers in the distributed system, and a cross URL as a reference URL in a visited Web page that points to another Web page. Since the performance bottleneck of a search engine lies in the network bandwidth and the capacity of a single workstation, the following distribution strategies can be taken into account.
1. Main controllers are physically dispersed at different places. Suppose a distributed system for China includes four main controllers; they can be placed in Shenyang, Beijing, Wuhan and Guangzhou.
2. This method divides IP addresses into different parts by handling every URL with a hash function and then allocating the URLs to main controllers, so that every main controller is only in charge of collecting URLs within its own scope. When a main controller gets a cross URL not belonging to its own scope, to avoid losing information, it should transmit the URL to the main controller which is responsible for it. Each main controller gets URLs through the hash function H(URL) = (DNS(URL's host part)) MOD n, in which n denotes the number of main controllers, and DNS(URL's host part) denotes the sum of the integer parts of the IP address obtained by resolving the host part of the URL, or an integer derived directly from the URL string without resolution (a sketch of this hash appears below).
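As a rough sketch of this allocation rule (Python, with illustrative names such as host_key and controller_for; the paper does not show WebGather's actual code):

```python
# Hedged sketch of the URL-allocation hash H(URL) = DNS(host part) MOD n.
import socket
from urllib.parse import urlparse

def host_key(url: str) -> int:
    """DNS(URL's host part): the sum of the integer parts of the resolved IP,
    or an integer derived from the URL string when resolution is not used."""
    host = urlparse(url).hostname or ""
    try:
        ip = socket.gethostbyname(host)
        return sum(int(octet) for octet in ip.split("."))
    except OSError:
        return sum(ord(c) for c in host)    # fallback: integer from the URL string

def controller_for(url: str, n: int) -> int:
    """Index of the main controller responsible for this URL."""
    return host_key(url) % n
```

A main controller that extracts a cross URL would compare controller_for(url, n) with its own index and forward the URL when they differ.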

3.2.2 Main controllers' communication strategies. There are mainly the following two kinds of methods:
1. Circular communication: there is an interconnection between two adjacent main controllers, forming a ring graph. Cross-URL transmission can be clockwise or anti-clockwise.
2. Mesh communication: there is an interconnection between any two main controllers, forming a fully connected graph. Cross-URL transmission can be completed directly from one controller to another.

From the viewpoint of theory and practical applications, we explain the above two strategies. The Web can be viewed as a directed graph G = (V, E) which consists of a set of vertices, V, and a set of edges, E. Each vertex is a Web page's URL, and an edge represents a hyperlink between two Web pages, meaning that one page has a hyperlink pointing to another. For any Vi, Vj ∈ V, Vi and Vj are connected if there is an edge between them. Suppose there is a set of vertices Vs which includes the seed URLs; we say graph G is connected if there is a path from some Vsi (Vsi ∈ Vs) to any vertex Vi (Vi ∈ V) in G. So Web crawling is a process of traversal from the set Vs to the other vertices in graph G. In order to find all vertices in graph G as quickly as possible, there should be many crawler subsystems starting from the seed URLs. Considering the limitation of network bandwidth and the capacity of a single workstation, a distributed architecture should be introduced.
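To make the traversal concrete, here is a minimal single-crawler sketch (Python; fetch_and_extract_links is a hypothetical placeholder for the real fetcher/parser, which the paper does not describe at this level):

```python
# Crawling viewed as breadth-first traversal of G = (V, E) from the seed set Vs.
from collections import deque
from typing import Callable, Iterable, Set

def crawl_from_seeds(seeds: Iterable[str],
                     fetch_and_extract_links: Callable[[str], Iterable[str]]) -> Set[str]:
    visited: Set[str] = set()
    frontier = deque(seeds)                        # the seed URL set Vs
    while frontier:
        url = frontier.popleft()                   # visit vertices in breadth-first order
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_and_extract_links(url):  # follow the out-edges of this vertex
            if link not in visited:
                frontier.append(link)
    return visited                                 # the vertices reachable from Vs
```

In the distributed setting, each main controller runs such a loop over its own share of the URL space and forwards cross URLs to the responsible peers.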

The prerequisite for the first distribution method is to allocate all the Web sites in China to the different main controllers optimally, which means allocating Web sites according to the communication time between a Web site and each main controller. First, we should collect as many Web sites as possible and have the scheduler assign them to the main controllers; after measuring communication times, we re-allocate the Web sites according to the result, at which point the system is in an optimized state. Second, when the main controllers begin working, if a main controller finds a new Web site, it should transmit the Web site to the scheduler, which is in charge of allocating the new Web site by measuring the communication time to every main controller; the main controller with the shortest time then gets the privilege of collecting the Web site. However, because all main controllers are placed in different locations, a lot of trouble arises, such as debugging and maintaining the subsystems, analyzing data after collection, and so on. The second method is a little slower than the first in collecting speed, but it is simple to initialize and does not have any of the shortcomings of the first. Additionally, through reasonable tuning of the hash function that controls the allocation of URLs, the second strategy can approach the speed of the first.
The first main controllers' communication strategy, even though it is simple to initialize, may transmit cross URLs again and again, so its large amount of communication is a fatal shortcoming. The second main controllers' communication strategy has an evident advantage: it transmits URLs quickly and achieves load balance easily, because any two main controllers are connected, so any change in the whole distributed system can be detected immediately.

3.3 Analyzing the whole architecture


We adopt the second distribution strategy and the second main controllers' communication strategy in WebGather. Figure 1 shows the system architecture. The WebGather Server Registry (WSR) is the scheduler module, which stores information including the IPs and ports of all registered main controllers in the distributed system. When any main controller's state has changed, the WSR delivers the new information to the other main controllers so that they can reestablish connections. Every main controller is in charge of collecting Web pages within its own scope. Each gatherer belongs to a corresponding main controller: it receives URLs from its main controller, crawls the Web pages pointed to by the URLs and transmits the content to its main controller. When a main controller finds cross URLs in the content, it sends them to the corresponding main controllers. To reduce the amount of communication, main controllers only send URLs among each other.

[Figure 1. The distributed WebGather architecture: main controllers MainCtrl1, MainCtrl2, ..., MainCtrlN, each with its gatherer Gather1, Gather2, ..., GatherN, all registered with the WSR]

3.4 Key technologies and their analyses


1. Use a hash table structure to store mass URL data. WebGather uses an Informix database to store the visited Web pages. In the experiment, to improve efficiency, we use a hash table structure backed by a file instead of a database to store URLs; in this way, the main controllers access the database only to a limited extent. The drawback is the potential inconsistency of the data after a machine failure, and it takes a great deal of work to maintain data consistency. From the viewpoint of data consistency, a commercial database is safer; however, if the system's safe running is guaranteed, an economical and efficient way is to use files directly instead of a database (see the sketch after this list).
2. Use uniform domain name resolution to ensure consistency in the experiment environment. Since it is normal for some domain names and Web pages to change while domain names are being parsed, running the same group of experiments may lead to different results. In addition, it is very slow to parse a domain name that no longer exists. So we parse all of the domain names at one time and store them in a file; each experiment takes the same data set, thus the results are meaningful for comparison.
3. The WSR module in the distributed system makes sure each main controller holds the newest and consistent information about the system. This is a prerequisite for making the system feasible and reconfigurable.
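A rough sketch of the file-backed URL store in item 1 (Python; class and file names are illustrative, and the 16-byte MD5 fingerprint is the one mentioned in Section 3.5):

```python
# Visited-URL store kept as an in-memory hash structure persisted to a flat file,
# trading the consistency guarantees of a database for speed.
import hashlib

class VisitedUrlStore:
    def __init__(self, path: str = "visited_urls.dat"):
        self.path = path
        self.seen = set()
        try:
            with open(self.path, "rb") as f:      # reload 16-byte fingerprints on startup
                data = f.read()
            self.seen = {data[i:i + 16] for i in range(0, len(data), 16)}
        except FileNotFoundError:
            pass

    def add(self, url: str) -> bool:
        """Return True if the URL is new, appending its fingerprint to the file."""
        digest = hashlib.md5(url.encode("utf-8")).digest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        with open(self.path, "ab") as f:          # append only; no transactional guarantees
            f.write(digest)
        return True
```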

3.5 Dynamic reconfigurability: an enhancement to the model

The existence of the WSR module (see the description of Figure 1) makes dynamic reconfiguration of the system possible, which guarantees high availability and scalability. Under the condition of maintaining the load balance of the system, we consider three feasible URL-allocation methods.
1. Use a hash function to dynamically allocate URLs.
2. On the basis of the first method, each main controller additionally maintains a table of Web sites. The tables are identical among the different main controllers. Every record in the table contains a Web site (IP) and the information of the corresponding main controller computer.
3. Use a two-stage logical mapping. First, we map URLs into a logical table by a hash function, and then map parts of the logical table onto different main controllers.

By comparing the performances of the three methods when adding or subtracting one main controller, we can determine which is best. Let the number of Web sites be M, the initial number of main controllers be N, let Ni and Nj represent two arbitrary main controllers, and let N→N+1 represent the condition of adding one main controller and N→N−1 the condition of subtracting one main controller.
Let's first look at the first method. After the system is initialized, each main controller is responsible for M/N Web sites. The hash function is h(x) = x MOD N (where x is the sum of the integer parts of the IP, or a result obtained through other methods), so the load balance of the system can be guaranteed. When main controllers are added or subtracted, N changes; as a result, URLs previously belonging to Ni may be allocated to Nj, which leads to some Web pages being collected repeatedly.
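The scale of this re-allocation can be illustrated with a small script (illustrative code, not from the paper):

```python
# With h(x) = x MOD N, changing N moves most Web sites to a different main
# controller, so their pages would be crawled again by the new owner.
def reassigned_fraction(num_sites: int, n_before: int, n_after: int) -> float:
    moved = sum(1 for x in range(num_sites) if x % n_before != x % n_after)
    return moved / num_sites

print(reassigned_fraction(50000, 10, 11))   # about 0.91: most sites change controller
```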
To overcome the drawback of the first method, in the second method each main controller maintains two extra tables: a Web sites table and a visited URL table. All of the Web sites tables are identical, but the visited URL tables differ between main controllers. Because the number of Web sites is limited, a newly joined main controller may not have enough Web sites to crawl. To maintain the load balance of the system, we have to shift some Web sites, plus the corresponding information in the visited URL table, from the existing main controllers to the new one. Under such conditions, an extra step is needed after calculating a URL with the hash function. 1) When a main controller determines that a URL should be crawled by itself, it must first judge, according to the Web sites table, whether the IP to which the URL belongs has been crawled by another main controller; if not, it can do its work. 2) When a main controller determines that a URL should be crawled by another main controller, say controller A, it must also judge whether the corresponding IP has been crawled by a main controller other than controller A; if not, it sends the URL on to controller A. In this method we must maintain the consistency of the Web sites table across the different main controllers and transfer some information about Web sites and the corresponding visited URLs when the number of main controllers changes. As a result, the amount of communication among main controllers increases. We have decided to fingerprint URLs using the MD5 algorithm, so each URL takes only 16 bytes.
Our third method uses a two-stage logical mapping of URLs. Here we use an array, say A, to store the logical nodes. Each array element's subscript represents the logical node's sequence number, and the sequence number of the corresponding main controller is stored in that array element. For example, assume N=10 and M=50000, where M is the number of logical nodes. The initial state of A is shown in the left part of Figure 4; A[1], A[2], ..., A[50000] are called logical nodes. First we map the URL to a logical node by a hash function; then, in the second stage, we map A[1], A[2], ..., A[5000] to the 1st main controller, A[5001], A[5002], ..., A[10000] to the 2nd main controller, ..., and A[45001], A[45002], ..., A[50000] to the 10th main controller.

[Figure 4. Two-stage logical mapping of URLs (the left part is the initial state of the logical array; the right part is the state after adding one main controller)]

When adding a main controller, every existing main controller should give part of its logical nodes to the new one, so parts of A must be changed. In the example, the change is shown in the right part of Figure 4: n11 is made up of nodes chosen from n1, n2, ..., n10 respectively, and the corresponding elements of array A are set to 11. When subtracting a main controller from the system, the removed main controller should hand its Web sites over to the other main controllers; again we need to change some parts of array A.
Compared with the second method, the third method adds a mapping, and it also has to maintain the visited URL table and shift some of the table items to other main controllers when the number of main controllers changes. But in this method it is not necessary to store the Web sites table; only the logical array A is stored by each main controller. As a result, the amount of communication among main controllers decreases, and the extra step after calculating URLs by the hash function, needed in the second method, is no longer required.
For now, because the first method is simple, we use it in the simulation system. But the third method excels the other two, so we will use the third method in WebGather 2.0. It will guarantee the good scalability of the Web-crawling subsystem.
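As an illustration of the third method, here is a minimal sketch (Python, illustrative names, 0-based indices unlike the 1-based figure) of the two-stage mapping and of handing logical nodes to a newly added main controller:

```python
# Stage 1: hash a URL onto one of M logical nodes (using the 16-byte MD5 fingerprint).
# Stage 2: array A maps each logical node to a main controller.
import hashlib

M = 50000                                   # number of logical nodes, as in Figure 4
N = 10                                      # initial number of main controllers
A = [i // (M // N) for i in range(M)]       # nodes 0..4999 -> controller 0, and so on

def logical_node(url: str) -> int:
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % M

def controller_of(url: str) -> int:
    return A[logical_node(url)]

def add_controller() -> int:
    """Each existing controller hands part of its logical nodes to the new one (Figure 4)."""
    n_old = max(A) + 1                              # current number of controllers
    new_id = n_old
    hand_over = (M // n_old) - (M // (n_old + 1))   # e.g. 5000 - 4545 = 455 nodes each
    for old in range(n_old):
        owned = [i for i, c in enumerate(A) if c == old]
        for i in owned[-hand_over:]:                # give away the tail of its range
            A[i] = new_id
    return new_id
```

Only the reassigned logical nodes (and the visited URLs hashed onto them) move, which is why the third method needs less communication than re-hashing every URL with MOD N.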

4. The result of the experiment

In Jun 2000, while WebGather was running, we used a program to obtain 507 megabytes of simulation data, including Web page URLs and cross URLs. After running the program, we obtained simulated Web data covering 761,129 Web pages; this data set is the object of our experiment. All of our measurements were made on a general Intel PC with two 550 MHz Intel processors, 512 megabytes of memory and a 36-gigabyte hard disk. The operating system is Solaris 8.0.
Based on the above experiment environment, we separately test four situations by varying the number of main controllers (2, 4, 8, and 16). The four experiments are done independently. A centralized main controller runs alongside the distributed system consisting of n main controllers. Every group takes at least three days to finish testing, and we obtained a large number of results from the experiment.

4.1 Analyzing the load balance


To maintain the load balance of the system, we use a hash function to dynamically allocate URLs to every main controller. The outcome can be evaluated by analyzing the Web pages collected by each main controller every hour; after analyzing the results of the first ten hours, we can deduce the final result of the system.
We take a group of reference data, as illustrated in Table 1. If the four groups of experimental data are all better than the reference data, we deem that the system meets the requirement of load balance.
Table 1. Reference data (suppose two main controllers)

t      1   2   3   4   5   6   7   8   9   10
ref1   2   4   6   8  10  12  14  16  18  20
ref2   3   6   9  12  15  18  21  24  27  30

By computing variances for the four groups of results, we can measure their degree of divergence, which can be taken as a standard for judging whether the load is balanced. The computation uses the following formulas.

$$E(X) = \sum_{k} x_k p_k, \quad k = 1, 2, \ldots \qquad (1)$$

$$D(X) = \sum_{k} \left[ x_k - E(X) \right]^2 p_k, \quad k = 1, 2, \ldots \qquad (2)$$

$$x_k' = \frac{x_k}{\sum_{j} x_j}, \quad j = 1, 2, \ldots \qquad (3)$$

To explain formulas (1) and (2): X is a discrete random variable taking finitely many values, with distribution P{X = x_i} = p_i, where i = 1, 2, ..., n; that is, X takes the values x_1, x_2, ..., x_n with probabilities p_1, p_2, ..., p_n. Formula (1) computes the mean of X, and formula (2) computes the variance of X.
To compare the results, we regulated the four groups of experimental data and the reference data using formula (3). Table 2 shows the result of regulating Table 1: x_k is a value in one column of Table 1, x_k' is its regulated value, and the sum in the denominator is the sum of that column of Table 1.

Table 2. Regulated reference data

t      1    2    3    4    5    6    7    8    9    10
ref1   0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4
ref2   0.6  0.6  0.6  0.6  0.6  0.6  0.6  0.6  0.6  0.6

After regulating the results of the four groups of experimental data and the reference data, we use formulas (1) and (2) to compute the variances, as shown in Table 3.

Table 3. Load variances

t (hour)    1          2          3          4          5          6          7          8          9          10
n=2         0.000110   0.001454   0.000501   0.000309   8.18E-05   6.18E-05   2.14E-07   1.25E-05   2.74E-05   8.24E-06
n=4         0.000326   0.00059    0.000564   0.000375   0.000315   0.000465   0.000702   0.000672   0.000662   0.000568
n=8         0.000124   7.04E-05   6.11E-05   4.98E-05   5.32E-05   4.18E-05   4.25E-05   7.44E-05   5.91E-05   5.79E-05
n=16        1.06E-05   1.57E-05   1.43E-05   1.11E-05   1.34E-05   1.42E-05   1.48E-05   1.51E-05   1.58E-05   1.82E-05
Reference   0.01       0.01       0.01       0.01       0.01       0.01       0.01       0.01       0.01       0.01

We can see from Table 3 that, when there are two, four, eight, or sixteen main controllers, the variances are all less than the corresponding reference values. That is to say, each main controller is responsible for about the same share of the Web page set. Thus the expected goal of load balance in the distributed system is achieved.
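One plausible reading of this procedure, as a short script (illustrative, not the authors' code; it assumes equal weights p_k = 1/n in formulas (1) and (2)):

```python
# Regulate per-controller page counts with formula (3), then compute the variance
# of the regulated shares with formulas (1) and (2).
from typing import List

def load_variance(pages_per_controller: List[int]) -> float:
    n = len(pages_per_controller)
    total = sum(pages_per_controller)
    shares = [x / total for x in pages_per_controller]   # formula (3)
    mean = sum(shares) / n                                # formula (1), p_k = 1/n
    return sum((s - mean) ** 2 for s in shares) / n       # formula (2)

# The Table 1 reference counts for hour 1 (2 and 3 pages on two controllers)
# regulate to 0.4 and 0.6 and give a variance of 0.01, the reference value in Table 3.
print(load_variance([2, 3]))   # approximately 0.01
```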

4.2 Amount of communication between main controllers

To ensure the consistency of the experiment environment, we parse all the domain names at one time, so the amount of communication only includes transferring cross URLs. Every main controller only sends cross URLs, and each URL is no longer than 128 bytes. In the actual system, to improve the utilization of domain-name resolution, the domain names parsed by each main controller should be transmitted to the others. Every main controller maintains a table of the correspondences between domain names and IPs; every record is no longer than 72 bytes (64 bytes to store the host name, 4 bytes to store the IP and 4 bytes to store the visiting time). So the amount of communication is small. Additionally, to maintain the dynamic reconfigurability of the system, when the number of main controllers changes, each main controller needs to modify some of its tables (e.g., the Web sites table). To keep the tables consistent, the system needs an extra amount of communication; however, this situation is rare, so it has little effect on the amount of communication (see the details in Section 3.5). Considering that in the previous two situations one copy of a message should be sent to many main controllers, we use multicast technology in the actual system.
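For illustration, the 72-byte record layout could be packed as follows (a sketch with assumed field order and example values; the paper only specifies the field sizes):

```python
# One record of the domain-name table: 64-byte host name, 4-byte IPv4, 4-byte visit time.
import socket
import struct
import time

RECORD_FORMAT = "!64s4sI"        # network byte order; 64 + 4 + 4 = 72 bytes

def pack_record(host: str, ip: str, visited_at: int) -> bytes:
    return struct.pack(RECORD_FORMAT,
                       host.encode("ascii")[:64],   # zero-padded / truncated host name
                       socket.inet_aton(ip),        # 4-byte IPv4 address
                       visited_at)                  # 32-bit visiting time

record = pack_record("www.example.edu.cn", "10.0.0.1", int(time.time()))
assert len(record) == 72
```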

4.3 Analyzing scalability


The results of the first ten hours are shown in Table 4 and Figure 2. In Figure 2, the X-axis represents time in hours and the Y-axis is the number of visited Web pages; the four groups of results are drawn with different types of lines. For each line type, the higher curve depicts the result of the distributed system with n main controllers, and the lower one depicts the result of the centralized system running alongside the distributed system.

Table 4. Four groups of experimental results

Main controllers   Main controllers   Line type     Visited Web pages
(distributed)      (centralized)      in Figure 3   Centralized   Distributed   Ratio
2                  1                  Square        56199         96130         1.710529
4                  1                  Diamond       52712         177131        3.360354
8                  1                  Plus          51055         304854        5.97109
16                 1                  Star          24763         290344        11.72491

Due to the restriction of resources, the distributed system and the centralized system run on one computer simultaneously. From Figure 2 we can see that the performance of the centralized system (with only one main controller) remains unchanged when resources are shared with 2, 4 or 8 distributed main controllers, but with 16 distributed main controllers its performance decreases greatly. On the other hand, when the number of main controllers is below 8, the more main controllers, the higher the crawling efficiency of the distributed system. Finally, when the number reaches 16, because of the overload on the system resources, the performance decreases, as does that of the centralized system.

[Figure 2. Crawling efficiency of both the distributed system and the centralized system (X-axis: time in hours; Y-axis: number of visited Web pages)]
In Figure 3, the X-axis is the number of main controllers and the Y-axis is the ratio of the number of Web pages crawled by the distributed system to the number crawled by the centralized system. We can conclude from Figure 3 that Y increases nearly linearly with X when n is not too big, so the distributed system has good scalability.
[Figure 3. The speedup of the distributed system (X-axis: number of main controllers; Y-axis: acceleration ratio)]

5. Conclusion
The parallel and distributed architecture described in this paper provides a method for efficiently crawling massive numbers of Web pages. The simulation results demonstrate that the system realizes our design goal. At present, we are applying the architecture and methods to implement WebGather 2.0. In the real system (visit http://e.pku.edu.cn for a look), we temporarily run two main controllers, which show the expected outcome, collecting about 300,000 pages a day. At the same time, we realize that the success of the distributed crawling system brings many new issues for research and development, such as parallel indexing and retrieving. In addition, we believe that the architecture proposed in the paper can be used to build the information system infrastructure in a digital library context.

6. Acknowledgement

The work of this paper was supported by the National Grand Fundamental Research Program (973) of China (Grant No. G1999032706). We are grateful to Zhengmao Xie, Jianghua Zhao, and Songwei Shan for their helpful comments.

7. References
[1] J. Liu, M. Lei, J. Wang, and B. Chen. Digging for gold on the Web: Experience with the WebGather. In Proceedings of the 4th International Conference on High Performance Computing in the Asia-Pacific Region, Beijing, P.R. China, May 14-17, 2000. IEEE Computer Society Press, pp. 751-755.
[2] Google Search Engine. http://www.google.com
[3] http://searchenginewatch.com/reports/sizes.html
[4] C. Mic Bowman, et al. The Harvest Information Discovery and Access System. Technical Report, University of Colorado, Boulder, 1995.
[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In 7th International World Wide Web Conference, Brisbane, Australia, 1998.
[6] CERNIC information service, http://www.nic.edu.cn/INFO/cindex.html
[7] China Internet network development status statistical reports, http://www.cnnic.net.cn/develst/report.shtml
[8] http://www.inktomi.com/Webmap
