
An Environment for Mirroring    Proceedings of JENC7    L. Kovács

An Environment for Mirroring Hypermedia Documents

László Kovács <laszlo.kovacs@sztaki.hu>


András Micsik <micsik@sztaki.hu>
Gábor Schermann <schermann@sztaki.hu>

Abstract

Mirroring (the creation of a remote copy) of WWW hypermedia documents is discussed. First a Portable Hypermedia Document (PHM) format is defined. Then a mirroring service and a software environment for mirroring are introduced. The mirroring environment automates the process of mirroring and transforms WWW hypermedia documents into PHM format. The algorithm and the architecture of the environment of the mirroring service are described in detail.

I. Introduction

The World Wide Web (WWW) is a networked hypermedia architecture [1], [4], which nowadays joins millions of documents together via hypertext links. Documents are stored on server machines, and client software running on practically any kind of networked computer is used to retrieve documents through the Internet.

Mirroring in Internet jargon means the creation of a remote copy of some data or of complete hypermedia documents. This technique is used for information that is very popular or served via low-speed connections. It can help to decrease network traffic over the Internet backbone. Various mirroring techniques work well for other types of Internet services such as FTP [9] or USENET News, and they have enormous significance for the World Wide Web, which generates the most traffic of all services over the Internet. Although a few public domain scripts exist for WWW mirroring [10], the topic is still immature compared to the evolving needs of the Internet community. This may be caused by the spread of WWW caches.

I.A. Caching versus Mirroring

Caching on the Internet is similar to disk caching on computers. Caches serve recently viewed documents to avoid repeated downloads of the same data. The table below shows the differences in functionality between caching and mirroring.

Caching is an on-request, dynamic technique, whilst mirroring is a static, request-independent way of speeding up document access. With caching, a document is downloaded at the time of the first request. The second and later requests are served locally. With mirroring, the documents are downloaded before any requests, and all requests (including the first) are served locally.

Caching generates low-profile, permanent traffic. Mirroring generates "burst-like" traffic: short, heavily loaded periods and long, silent periods. Internet traffic at night is usually much lower than during the day, so mirroring at night causes less traffic and less trouble in the daytime.

Mirroring downloads all the documents that belong together; caching does not. Mirroring is therefore useful when the documents are big and/or the connection is too slow to download documents on request.

If documents are downloaded in the mirroring scheme, the data are persistent. A cache holds documents temporarily: if the cache buffer is filled, it will throw out some data. Mirroring guarantees that the document is available locally.

Under mirroring the allocated disk space is static. Under caching this size changes dynamically. There are various caching techniques, and the utilization of disk space depends heavily on the one used.

The goal of caching is an overall performance improvement. Mirroring is frequently used when fast access is needed to some given documents.

These two techniques can be used together, because they speed up different types of requests. For browsing the Internet, caching is the better choice. But if something important must be accessible at any time, mirroring is necessary, because only mirroring ensures that the documents are held locally. For example, if some programs or programming languages have descriptions, FAQs, tutorials, etc. in HTML format, and you want to use them, it is simpler to mirror the documents at night than to download them one by one on request.

As a result one can conclude that caches do not copy WWW documents completely, and therefore caches are not able to provide a true secondary location of a document with a high degree of availability. For this one needs solutions for mirroring. The process of mirroring should be an intelligent and automatic task. As a side effect the exchange of complex HTML documents over the Internet would become easier.

Section II describes the nature of WWW links. Section III concentrates on the portability of complex HTML documents. The mirroring algorithm and the environment are presented in Section IV.


                       Caching                          Mirroring
Time of downloading    On the first request             Before the first request
Downloading            Partial documents                Complete documents
Data persistence       Temporary                        Persistent
Network load           Evenly spread                    Burst-like
Allocated disk space   Dynamically changing             Static
Usage                  Overall performance improvement  Fast access on a per-document basis
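
The first rows of the table can be illustrated with a small sketch. It is not taken from the paper's software: the class names, the `fetch` callback and the least-recently-used eviction rule are assumptions chosen for illustration (the paper only says that a full cache throws out some data).

```python
from collections import OrderedDict
from typing import Callable, Dict, Iterable

# Hypothetical downloader supplied by the caller: URL -> document content.
Fetch = Callable[[str], bytes]


class OnRequestCache:
    """Caching: download on the first request, keep a bounded number of documents."""

    def __init__(self, fetch: Fetch, capacity: int) -> None:
        self._fetch = fetch
        self._capacity = capacity
        self._store: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, url: str) -> bytes:
        if url in self._store:
            # Later requests are served locally.
            self._store.move_to_end(url)
            return self._store[url]
        data = self._fetch(url)  # first request goes to the network
        self._store[url] = data
        if len(self._store) > self._capacity:
            # Assumed policy: evict the least recently used document.
            self._store.popitem(last=False)
        return data


class Mirror:
    """Mirroring: download the complete set before any request arrives."""

    def __init__(self, fetch: Fetch, urls: Iterable[str]) -> None:
        # One burst of downloads; afterwards the stored set stays fixed.
        self._store: Dict[str, bytes] = {url: fetch(url) for url in urls}

    def get(self, url: str) -> bytes:
        # Every request, including the first, is served locally.
        return self._store[url]


if __name__ == "__main__":
    downloads = []

    def fake_fetch(url: str) -> bytes:
        downloads.append(url)
        return url.encode()

    cache = OnRequestCache(fake_fetch, capacity=2)
    for url in ["a", "a", "b", "c", "a"]:
        cache.get(url)
    print("cache downloads:", downloads)
```

With a capacity of two documents the cache has to download "a" a second time after it was evicted, whereas the mirror never touches the network again once it has been built.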

II. Links in HTML documents

II.A. Standard notation of hypermedia links on the Internet

The language of World Wide Web documents is called Hypertext Markup Language [5]. Links in HTML files are expressed with Uniform Resource Locators [11], which give a WWW client the information needed to retrieve the linked document. A URL has the following general syntax:

<protocol name>:<server descriptor><local descriptor>

where protocol name can be substituted with ftp, gopher, news, telnet, mailto or http. HTTP stands for Hypertext Transfer Protocol, the data transfer protocol of WWW [6]. Server descriptors consist of standard IP machine names optionally accompanied by port numbers. Local descriptors provide the address of a part of the service inside the server, thus their format is specific to the protocol chosen. When it is meaningless, either the server or the local descriptor is omitted. Examples for the different types of URLs:

mailto:micsik@sztaki.hu
news:comp.sys.next.announcements
telnet://www.sztaki.hu:80
gopher://gopher.eunet.hu/00/ripe/About-ripe
ftp://ftp.sztaki.hu/pub/unix/INDEX
http://www.sztaki.hu/sztaki/contactinfo.html

II.B. Links between nodes of the World Wide Web

The http-URLs are discussed in more detail, since replicating other types of Internet services is not our intention yet. An http-URL can point at different types of data:

• hypertext file
  http://www.sztaki.hu/sztaki/contactinfo.html
• section of a hypertext file
  http://www.sztaki.hu/sztaki/contactinfo.html#phones
• in-line image
  http://www.sztaki.hu/pictures/logo.gif
• files of other types
  http://www.sztaki.hu/papers/TR95-1.ps
• executables
  http://www.sztaki.hu/cgi-bin/news.pl?comp.sys

The server response header contains the appropriate MIME type of the response content. This informs the client how to handle the content of the downloaded document. The client either presents it or activates the proper viewer application to display the content.

Generally, file path information is provided in the URL as a part of the local descriptor. In this case the information on a WWW server is stored on a per-file basis inside the operating system’s file system. The file path points to a file, the type of which can be deduced from the file name suffix. These suffixes are mapped to appropriate MIME types, which travel together with the content of the file to the client application.

II.C. Links to executable files

If a URL points at a file that is executable according to the configuration of the server, that file is executed, and the results are transferred to the client. This mechanism is provided by the Common Gateway Interface [2]. CGI programs are recognized either by their suffix or by residing in CGI directories. Parameters can be passed to the script in two ways, of which only one is of interest here: parameters encoded into the URL, just after the file path. Parameters are separated from the file path with a question mark. Even the file path may contain parametric information, the so-called extended path info, which starts right after the path of the program. For example:

http://www.sztaki.hu/cgi-bin/imagemap/sztaki/2ndfloor.map?12,12

This means that the cgi-bin/imagemap program is called with the extended path info of sztaki/2ndfloor.map and parameters 12,12. This CGI program handles clickable images. The parameter sztaki/2ndfloor.map points at the map file containing the


data of clickable areas on the image. The parameter 12,12 gives the coordinates of the actual click.

With present Web server software and its current usage, it is not always possible to decide whether a link points to an executable or to an ordinary (static) file. There are cases when neither a special suffix nor a parameter part is present, usually when the server has special directories containing CGI programs. There is no way to find out which directories these are on a server. The hidden danger for a mirroring program is to regard the output of a CGI program as a static file and thus create a false mirror. Therefore it would be desirable to distinguish executables from other files by their suffixes.

II.D. Absolute and relative links

Abbreviations of URLs are called relative links, where omitted parts of the URL are copied from the URL of the referencing document. Accordingly, a URL is called absolute when all semantic parts of it are present. So let us consider this absolute URL:

http://www.sztaki.hu/sztaki/contactinfo.html

And have a look at some of its possible relative representations:

contactinfo.html (from the same directory)
/sztaki/contactinfo.html (from the same server)
//www.sztaki.hu/sztaki/contactinfo.html (with the same protocol)

In relative URLs one can also use the .. notation for the parent directory:

../../divisions/afe.html

An important property of URLs is that any URL can be converted from relative to absolute and vice versa with respect to the URL of its referencing page.

III. Portable hypermedia documents

III.A. Structure of HTML documents

An HTML document can be considered as a set of files and a set of links. The file set contains all files needed for the document. Files are generally HTML hypertexts, but other file formats can occur as well. The link set can be split into real hypertext links, links for in-line images, and links for executables or for other Internet protocols. The link set can also be split according to the destination of the link, so that outside links (pointing out of the file set) and inside links are distinguished.

III.A.1. Link structure

The link structure of an HTML document is represented by a graph with files as vertices and links as edges, possibly labeled with a position inside the file. The most important feature of this directed graph is a kind of connectedness: in most cases all vertices are accessible via edges from a root vertex, also called an entry point. This is not a syntactic rule but rather a semantic requirement; otherwise the document can be considered buggy or useless.

III.A.2. Storage structure

There is another, secondary structure of an HTML document, which we call the storage structure. It represents how the file set is stored in the directory hierarchy of a file system. This structure is called secondary because it need not be revealed to the user and has no impact on the usage of the document.

Usually HTML links are given with the help of the storage structure, so the link structure and the storage structure become closely related. However, the storage structure of an HTML document can be modified while the link structure remains the same.

III.B. Idea of Portable Hypermedia

Generally, moving hypermedia documents consisting of several files corrupts the links inside them. Correction can be time-consuming (manual) work. A portable format for hypertext documents is needed. Documents in this format are freely movable inside file systems or between machines. Furthermore, they can be served without changes from any WWW server or can be archived (compressed). This issue has been raised spontaneously by some webmasters, and a similar effort can be seen in Rohit Khare’s eText software [7], but there is no well-established and standardized format yet.

A PHM is a directory or an archive containing the document file set and a set of associated parameters. Parameters may contain: title, authors, copyright, abstract, entry point, file format descriptions, file format statistics, etc. Minimally the entry point should be present. The included document file set has to meet the following requirements:

1. URLs pointing at files inside the file set are relative
2. URLs pointing outside the file set or referring to other Internet protocols are absolute

A set of different operations can be defined on a PHM, such as flattening the directory hierarchy of the file set or masking out some types of files (images, movies, sounds, etc.) from the file set. The PHM format is used in our mirroring software environment for representing the result.

IV. A software environment for mirroring

A robust mirroring software was built with a clearly specified behavior and an extendible architec-


ture. First the two-phase mirroring algorithm is presented. Afterwards the architecture of the software environment is discussed.

IV.A. Two-phase mirroring algorithm

The mirroring algorithm navigates the link structure of the HTML files, which is a directed graph, so it is natural to use a specialized graph search as the skeleton algorithm. The input of the procedure is a URL serving as the entry point of the result. The result is given in PHM format, which will be moved to the desired final location.

The algorithm has two phases: the first phase downloads all related document files, and in the second phase (relocation) the URLs are converted. The first phase operates on a set of URLs to be processed. This contains the URL of the entry point when the process is started. When a file is retrieved, links are extracted from it, analyzed and placed into the URL set. The next URL to be downloaded is chosen from the set, and that is the point where heuristics can be applied in determining the order of the retrievals. Retrievals are logged with enough information to handle time-outs or fatal errors. In case of a time-out the retrieval is attempted again later. If the mirroring process is interrupted, it can be resumed at the point where the abort occurred, so earlier processing is not wasted.

The conversion of links takes place in the second phase, after the retrieval of all files is finished and the URL set is empty. Then all URLs are analyzed again and converted to relative or absolute according to the rules of PHM. This operation is called relocation of links. Postponing relocation until after the retrieval phase has advantages: first, it can be checked whether the file pointed at by a local link exists; second, it is easier to add several types of relocation or to combine it with other operations on PHMs.

The skeleton of the algorithm:

program Mirror
(input: root-URL, options; output: result-PHM)
    Initialize URL set
    Check remote server's limitations for robots
    while URL set is not empty do
        Choose next URL for retrieval from the set (heuristics)
        Retrieve document for selected URL
        Handle errors, write log
        if document is successfully downloaded then
            if document has links then
                Select URLs to download,
                convert them to absolute,
                add them to URL set
            end if
            Store document
        end if
    end while
    for every retrieved document with links do
        Convert links to relative where necessary
    end for
    Create PHM as output
end program

IV.A.1. Classification of URLs with respect to retrieval

The program has to decide which files are to be downloaded and which are not. The first limiting factor is a de facto standard on the World Wide Web. Servers maintain a list of forbidden areas in their document space. This list is stored in a file called robots.txt at the root of the document space. Operations of automatic network retrieval processes (network robots) have to be limited compared to human users on Web servers, because robots can destroy the performance of the server by unintelligently repeating retrievals at high speed. For example, robots can step into an infinite retrieval loop. This can be due to a bug or to failing to recognize that they are retrieving documents generated by CGI programs. Since it falls into the robot category, our algorithm has to check the limitations for robots on the server.

Another limitation is that mirroring is done only via HTTP; all other types of URLs remain unprocessed. The functionality to handle multiple protocols can be added, but the result is questionable. One solution could be to integrate files retrieved via other protocols into the PHM, so that on the mirror side they would be served via HTTP.

If a URL matches the above criteria, there are still two kinds of information which normally cannot be retrieved via HTTP: CGI programs and maps for clickable images. Therefore requests for these services are not mirrored but forwarded from the mirror site to the original host. With HTML 3.0 separate imagemap files will not be needed. CGI programs themselves seem irrelevant to mirror, but the safe identification of links to CGI programs is still needed. A possible solution would be to standardize a set of suffixes for CGI programs.

There are also options which control the locations and formats to be mirrored. A scope is given for determining the locations; URLs outside of the scope are not mirrored. Usually the scope is the directory of the entry point. Other scopes can be the server of the entry point or several directories on a server. For the formats, either wanted or unwanted file types are listed. This way one can mirror a document without images or without PostScript files, etc.

IV.A.2. Choosing the next URL to download

Document images should be downloaded right after the document itself. After the images a URL is chosen from the set by applying heuristics. The set also includes those links for which the retrieval failed with a time-out. The retrieval for these links


should be retried after well-chosen intervals. Heuristics should take into account the actual behavior of network traffic and the user’s expectations. The possible positive effects of using depth-first, breadth-first or more elaborate heuristic strategies under certain circumstances should be investigated.

IV.B. Architecture of the software environment

The environment provides services such as mirroring on demand, timed mirroring and continuous mirroring. Mirroring on demand can be initiated by any user. The environment collects the required document and sends it to the user in a PHM archive. Timed mirroring is a preprogrammed task where the document is downloaded at a preset time. This can help to perform mirroring in a low-traffic period on the network. Continuous mirroring maintains a stable copy of a changing document. This is achieved by intelligently repeated mirroring updates. During an update only the modified files are retrieved and put into the existing mirrored document.

The user interface for the environment is implemented as a set of WWW forms. The user can set the address (URL) of the requested document, the options for the mirroring software, the options for the result document format, and the mode of delivery.

For continuously mirrored documents the preferred rate and time of the refresh can be configured together with the previously mentioned options. A mirror scheduler is responsible for the control of the timed mirroring tasks. Every completed operation is logged, and the log is used when an interrupted mirroring is restarted. The mirror log is stored and analyzed; if necessary, the next mirroring is rescheduled or the administrator is notified of the abnormal completion of the process. The following figure illustrates the elements of the environment.

IV.C. Experiences

Using the mirroring environment, approximately 15 Internet sites have been mirrored, transferring 150 Megabytes of information.

One of the problems during usage was the lack of information about the remote server's storage structure and conventions. It is troublesome when the directory-index file (index.html by convention) is renamed and is referred to both by its full name and by its directory at the same time. Also, the server limitations for CGI programs mentioned in IV.A.1 (robots.txt) were missing several times.

The solution could be an improvement of the HTTP protocol. For example, there could be a request to the server like 'NNN Show server conventions'. The answer would be a text, each line containing a 'variable=value' pair, like 'directory-index=index.html' or 'binary-directory=/cgi-bin'. This should be clarified and a practical description should be worked out.

Another problem is the image-maps in HTML-2. Every image-map has a description which cannot be retrieved, so if the user clicks the map, a request is sent to the original site. The answer could be anything, e.g. HTML pages with links. These links cannot be mirrored. In HTML-3 this problem is solved: the HTML-3 specification contains a well-defined way of inserting an image-map into an HTML document, and in this way all links can be followed.

Prolonged usage of the program raised the problem of handling separately downloaded Portable Hypermedia documents together. If a document has a link to another document, and both are mirrored, but separately, the link in the local mirror will point to the original location. One would expect the link in the mirror to point to the other mirrored document. To handle this problem, the idea of a “mirrorspace” could be defined, which is a set of PHMs where links between PHMs are converted to relative. Maintaining a mirrorspace can be difficult. If one or more documents (text or image) are added to it, each existing document has to be checked for links to the added documents. Similarly, the newly added HTML document has to be checked against every document in the other PHMs. Deletion of one or more documents raises the same problem. These operations are very resource-demanding and take a lot of time and data transfer. If these problems can be solved, the mirrorspace could be a good direction for future development.

The program could also be accelerated. The two passes of the algorithm could run in parallel, and an implementation in a parallel programming language would be faster. Pass one (downloading the documents) could be divided by URLs, with each task retrieving one URL. There should be a limit on the number of processes running at the same time, because this activity burdens both the remote server and the network connection. Pass two (relocating) is an operation executed on pairs of elements of URL sets, so it can be scheduled on URL pairs.

V. Summary

A mature algorithm for mirroring and a standardized portable hypermedia format can ease the distribution of hypermedia documents through the World Wide Web. In this paper a two-phase mirroring algorithm was developed. The algorithm is able to create a remote copy of a complex HTML document stored on another WWW server. The algorithm delivers the mirrored document in a portable hypermedia format (PHM), which is defined in this paper as well. Hypermedia documents in PHM format can be transferred with no need for further semantic transformations.

A software environment based on the previously mentioned algorithm for mirroring hypermedia documents was built. The environment provides different


[Figure: Elements of the mirroring environment. Labeled components: User Interface, Mirror Scheduler, Mirroring, Logging, Log, Selecting next URL, URL set (with status, format, etc.), Error registration, Document parsing, Relocation, File set, Store/Load files locally, Download files over the network, Limitation check, Server limitations.]
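
Several components named in the figure (URL set, limitation check, download, document parsing, relocation, file set) can be tied together in a minimal sketch of the two-phase algorithm of Section IV.A. This is an illustration under stated assumptions, not the authors' implementation: the `fetch` and `allowed` callbacks, the regular expression used for link extraction, the breadth-first choice of the next URL and the "directory of the entry point" scope are all chosen here for brevity. Logging, retries after time-outs and real robots.txt handling are omitted.

```python
import posixpath
import re
from typing import Callable, Dict, Optional
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical callbacks: fetch returns the HTML text or None on failure;
# allowed stands in for the server limitation (robots.txt) check.
Fetch = Callable[[str], Optional[str]]
Allowed = Callable[[str], bool]

# Simplified link extraction (assumes double-quoted href/src attributes).
LINK = re.compile(r'(href|src)="([^"]*)"', re.IGNORECASE)


def mirror(root_url: str, fetch: Fetch,
           allowed: Allowed = lambda url: True) -> Dict[str, str]:
    """Return the mirrored file set: absolute URL -> relocated HTML text."""
    scope = urljoin(root_url, ".")      # assumed scope: directory of the entry point
    url_set = [root_url]                # URLs still to be processed
    seen = {root_url}
    files: Dict[str, str] = {}

    def wanted(url: str) -> bool:
        parts = urlsplit(url)
        # Only http, inside the scope, no parameter part (likely a CGI call).
        return (parts.scheme == "http" and url.startswith(scope)
                and not parts.query and allowed(url))

    # Phase 1: retrieval.
    while url_set:
        url = url_set.pop(0)            # heuristic assumed here: breadth-first
        html = fetch(url)
        if html is None:
            continue                    # a real tool would log and retry later
        files[url] = html
        for _attr, link in LINK.findall(html):
            absolute = urldefrag(urljoin(url, link)).url
            if absolute not in seen and wanted(absolute):
                seen.add(absolute)
                url_set.append(absolute)

    # Phase 2: relocation according to the two PHM rules.
    def relocate(base: str, link: str) -> str:
        absolute, fragment = urldefrag(urljoin(base, link))
        if absolute not in files:
            return urljoin(base, link)  # outside the file set: absolute
        start = posixpath.dirname(urlsplit(base).path or "/")
        relative = posixpath.relpath(urlsplit(absolute).path or "/", start)
        return relative + ("#" + fragment if fragment else "")

    relocated: Dict[str, str] = {}
    for url, html in files.items():
        relocated[url] = LINK.sub(
            lambda m: f'{m.group(1)}="{relocate(url, m.group(2))}"', html)
    return relocated
```

Passing a dictionary's `get` method as `fetch` is enough to try the sketch without network access. Links to documents that were retrieved come back relative, while links to the image outside the scope and to the CGI program stay absolute.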

high-level, intelligent, automatic mirroring services via the usual WWW interface (a set of forms). The proper use of this software environment can decrease the network load during peak periods and can increase the accessibility of selected hypermedia documents.

The mirroring technique developed here can be the first step in the direction of introducing a separate protocol and/or protocol extension for mirroring purposes, similar to the one proposed in [8].

VI. References

[1] Tim Berners-Lee, Robert Cailliau, Jean-Francois Groff, Bernard Pollermann: “World-Wide Web: the Information Universe”, Electronic Networking, Vol. 2. No. 1.
[2] Rob McCool: The CGI Specification, URL: http://hoohoo.ncsa.uiuc.edu/cgi/interface.html
[3] Douglas E. Comer: Internetworking with TCP/IP, Volume I., Prentice Hall International Editions, 1991
[4] Andrew Ford: Spinning the Web, How to provide Information on the Internet, International Thomson Publishing, 1995
[5] Hypertext Markup Language, URL: http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html
[6] Hypertext Transfer Protocol, URL: http://www.w3.org/hypertext/WWW/Protocols/Overview.html
[7] Rohit Khare: “eText: An Interactive Hypermedia Publishing Environment”, Proceedings of ACM Hypertext’93 - Demonstrations
[8] László Kovács, András Micsik: Replication within Distributed Digital Document Libraries. Proceedings of the 8th ERCIM Database Research Group Workshop on Database Issues and Infrastructure in Cooperative Information Systems, Trondheim, Norway, 1995
[9] L. McLoughlin et al.: Mirror - Mirror Packages on Remote Sites (UNIX man-pages)
[10] Oscar Nierstrasz, Gorm Haug Eriksen, Karl Guggisberg: w3mir, URL: ftp://sauce.uio.no/pub/src/w3mir
[11] WWW Names and Addresses, URIs, URLs, URNs, URL: http://www.w3.org/hypertext/WWW/Addressing/Addressing.html


Author Information

László Kovács works for MTA SZTAKI, the Computer and Automation Institute of the Hungarian Academy of Sciences, as head of the Distributed Systems Department. After his studies he was involved in different projects in the areas of computer network protocol specification, verification and implementation. During his career he taught for years at various foreign universities and research establishments, including the University of Delaware, Newark/Delaware/USA and the Ecole Normale Supérieure de Cachan, Cachan/France. In recent years his interests have included research and development of distributed applications, World Wide Web services, CSCW, groupware systems and distributed digital library systems. At present, multimedia services, audio/video conferencing and virtual art are also included in his professional activities.

András Micsik works for the Distributed Systems Department of MTA SZTAKI. His activities include design and implementation of World Wide Web services, setting up audio and video conference environments, and teaching different Internet technologies at the application level. He is a Ph.D. student in Computer Science at Eötvös Loránd University, Budapest, where he got his M.Sc. degree in 1992. His research topic is distributed digital libraries.

Gábor Schermann studies for his M.Sc. degree in Computer Science at Eötvös Loránd University, Budapest. He also works for the Distributed Systems Department, MTA SZTAKI. He is involved in implementing World Wide Web services as well as algorithms for digital libraries.
