An Environment for Mirroring Proceedings of JENC7 L. Kovács
                       Caching                          Mirroring
Time of downloading    On the first request             Before the first request
Downloading            Partial documents                Complete documents
Data persistence       Temporary                        Persistent
Network load           Evenly spread                    Burst-like
Allocated disk space   Dynamically changing             Static
Usage                  Overall performance improvement  Fast access on a per-document basis
data of clickable areas on the image. The parameter 12,12 gives the coordinates of the actual click.

With present Web server software and its current usage, it is not always possible to decide whether a link points to an executable or to a static file. There are cases when neither a special suffix nor a parameter part is present, usually when the server has special directories containing CGI programs. There is no way to find out which directories these are on a server. The hidden danger for a mirroring program is to regard the output of a CGI program as a static file and thus create a false mirror. Therefore it would be desirable to distinguish executables from other files by their suffixes.

II.D. Absolute and relative links

Abbreviations of URLs are called relative links, where the omitted parts of the URL are copied from the URL of the referencing document. Accordingly, a URL is called absolute when all of its semantic parts are present. Consider this absolute URL:

http://www.sztaki.hu/sztaki/contactinfo.html

Some of its possible relative representations are:

contactinfo.html (from the same directory)
/sztaki/contactinfo.html (from the same server)
//www.sztaki.hu/sztaki/contactinfo.html (with the same protocol)

In relative URLs one can also use the .. notation for the parent directory:

../../divisions/afe.html

An important property of URLs is that any URL can be converted from relative to absolute and vice versa, given the URL of its referencing page.

III. Portable hypermedia documents

III.A. Structure of HTML documents

An HTML document can be considered as a set of files and a set of links. The file set contains all files needed for the document. Files are generally HTML hypertexts, but other file formats can occur as well. The link set can be divided into real hypertext links, links for in-line images, and links for executables or for other Internet protocols. The link set can also be divided according to the destination of the link, so outside links, which point out of the file set, are distinguished from inside links.

III.A.1. Link structure

The link structure of an HTML document is represented by a directed graph with files as vertices and links as edges, possibly labeled with a position inside the file. The most important feature of this directed graph is a form of connectedness: in most cases all vertices are accessible via edges from a root vertex, also called the entry point. This is not a syntactic rule but rather a semantic requirement; otherwise the document can be considered buggy or useless.

III.A.2. Storage structure

An HTML document has another, secondary structure, which we call the storage structure. It represents how the file set is stored in the directory hierarchy of a file system. This structure is considered secondary because it need not be revealed to the user and has no impact on the usage of the document.

Usually links in HTML are given with the help of the storage structure, so the link structure and the storage structure become closely related. However, the storage structure of an HTML document can be modified while the link structure remains the same.

III.B. Idea of Portable Hypermedia

Generally, moving hypermedia documents that consist of several files corrupts the links inside them. Correction can be time-consuming (manual) work. A portable format for hypertext documents is needed. Documents in this format are freely movable inside file systems or between machines. Furthermore, they can be served without changes from any WWW server or can be archived (compressed). This issue has been raised spontaneously by some webmasters, and a similar effort can be seen in Rohit Khare's eText software [7], but there is no well-established and standardized format yet.

A portable hypermedia document (PHM) is a directory or an archive containing the document file set and a set of associated parameters. Parameters may include: title, authors, copyright, abstract, entry point, file format descriptions, file format statistics, etc. At a minimum, the entry point should be present. The included document file set has to meet the following requirements:

1. URLs pointing at files inside the file set are relative
2. URLs pointing outside the file set or referring to other Internet protocols are absolute

A set of different operations can be defined on PHMs, such as flattening the directory hierarchy of the file set or masking out some types of files (images, movies, sounds, etc.) from the file set. The PHM format is used in our mirroring software environment for representing the result.

IV. A software environment for mirroring

A robust mirroring software was built with a clearly specified behavior and an extensible architecture. First the two-phase mirroring algorithm is presented. Afterwards, the architecture of the software environment is discussed.

IV.A. Two-phase mirroring algorithm

The mirroring algorithm navigates the link structure of the HTML files, which is a directed graph, so it is natural to use a specialized graph search as the skeleton algorithm. The input of the procedure is a URL serving as the entry point of the result. The result is given in PHM format, which will be moved to the desired final location.

The algorithm has two phases: the first phase downloads all related document files, and in the second phase (relocation) the URLs are converted. The first phase operates on a set of URLs to be processed. This set contains the URL of the entry point when the process is started. When a file is retrieved, links are extracted from it, analyzed and placed into the URL set. The next URL to be downloaded is chosen from the set, and that is the point where heuristics can be applied in determining the order of the retrievals. Retrievals are logged with enough information to handle time-outs or fatal errors. In case of a time-out the retrieval is attempted again later. If the mirroring process is interrupted, it can be resumed at the point where the abort occurred, so earlier processing is not wasted.

The conversion of links takes place in the second phase, after the retrieval of all files is finished and the URL set is empty. Then all URLs are analyzed again and converted to relative or absolute according to the rules of PHM. This operation is called relocation of links. Postponing relocation until after the retrieval phase has advantages: first, it can be checked whether the file pointed at by a local link exists; second, it is easier to add several types of relocation or to combine it with other operations on PHMs.

The skeleton of the algorithm:

program Mirror
(input: root-URL, options; output: result-PHM)
  Initialize URL set
  Check remote server's limitations for robots
  while URL set is not empty do
    Choose next URL for retrieval from the set (heuristics)
    Retrieve document for selected URL
    Handle errors, write log
    if document is successfully downloaded then
      if document has links then
        Select URLs to download,
        convert them to absolute,
        add them to URL set
      end if
      Store document
    end if
  end while
  for every retrieved document with links do
    Convert links to relative where necessary
  end for
  Create PHM as output
end program

IV.A.1. Classification of URLs with respect to retrieval

The program has to decide which files are to be downloaded and which are not. The first limiting factor is a de facto standard on the World Wide Web. Servers maintain a list of forbidden areas in their document space. This list is stored in a file called robots.txt at the root of the document space. Operations of automatic network retrieval processes (network robots) have to be limited compared to human users on Web servers, because robots can ruin the performance of a server by unintelligently repeating retrievals at high speed. For example, robots can step into an infinite retrieval loop. This can be due to a bug or to failing to recognize that they are retrieving documents generated by CGI programs. Since it falls into the robot category, our algorithm has to check the limitations for robots on the server.

Another limitation is that mirroring is done only via HTTP; all other types of URLs remain unprocessed. The functionality to handle multiple protocols can be added, but the result is questionable. One solution could be to integrate files retrieved via other protocols into the PHM, so that from the mirror site they would be served via HTTP.

If a URL matches the above criteria, there are still two kinds of information which normally cannot be retrieved via HTTP: CGI programs and maps for clickable images. Therefore requests for these services are not mirrored but forwarded from the mirror site to the original host. With HTML 3.0 separate imagemap files will not be needed. CGI programs seem to be irrelevant to mirror, but the safe identification of links to CGI programs is still needed. A possible solution would be to standardize a set of suffixes for CGI programs.

There are also options which control the locations and formats to be mirrored. A scope is given for determining the locations; URLs outside of the scope are not mirrored. Usually the scope is the directory of the entry point. Other scopes can be the server of the entry point or several directories on a server. For the formats, either wanted or unwanted file types are listed. This way one can mirror a document without images or without PostScript files, etc.

IV.A.2. Choosing the next URL to download

Document images should be downloaded right after the document itself. After the images, a URL is chosen from the set by applying heuristics. The set also includes those links for which the retrieval failed with a time-out. The retrieval for these links
should be retried after well-chosen intervals. Heuristics should take into account the actual behavior of network traffic and the user's expectations. The possible positive effects of using depth-first, breadth-first or more elaborate heuristic strategies under certain circumstances should be investigated.

IV.B. Architecture of the software environment

The environment provides services such as mirroring on demand, timed mirroring and continuous mirroring. Mirroring on demand can be initiated by any user. The environment collects the required document and sends it to the user in a PHM archive. Timed mirroring is a preprogrammed task where the document is downloaded at a preset time. This can help to perform mirroring in a low-traffic period on the network. Continuous mirroring maintains a stable copy of a changing document. This is achieved by intelligently repeated mirroring updates. During an update only the modified files are retrieved and put into the existing mirrored document.

The user interface for the environment is implemented as a set of WWW forms. The user can set the address (URL) of the requested document, the options for the mirroring software, the options for the result document format and the mode of delivery. For continuously mirrored documents the preferred rate and time of the refresh can be configured together with the previously mentioned options. A mirror scheduler is responsible for the control of the timed mirroring tasks. Every completed operation is logged, and the log is used when an interrupted mirroring is restarted. The mirror log is stored and analyzed; if necessary, the next mirroring is rescheduled or the administrator is notified of the abnormal completion of the process. The following figure illustrates the elements of the environment.

[Figure: elements of the mirroring environment. Components: User Interface; Mirror Scheduler; Mirroring, comprising Logging (Log), Selecting next URL, Error registration, URL set (with status, format, etc.), Document parsing, Relocation, File set, Store/Load files locally, Download files over the network.]

IV.C. Experiences

Using the mirroring environment, approximately 15 Internet sites have been mirrored, transferring 150 megabytes of information.

One of the problems during usage was the lack of information about the remote server's storage structure and conventions. It is troublesome when the directory-index file (index.html by convention) is renamed and is referred to both by its full name and by directory reference at the same time. Also, the server limitations for CGI programs mentioned in IV.A.1 (robots.txt) were missing several times.

A solution could be an improvement of the HTTP protocol. For example, there could be a request to the server like 'NNN Show server conventions'. The answer would be a text, each line containing a 'variable=value' pair, like 'directory-index=index.html' or 'binary-directory=/cgi-bin'. This should be clarified and a practical description should be made.

Another problem is the image-maps in HTML-2. Every image-map has a description which cannot be retrieved, so if the user clicks the map, a request is sent to the original site. The answer could be anything, e.g. HTML pages with links. These links cannot be mirrored. In HTML-3 this problem is solved: the HTML-3 specification contains a well-defined manner of inserting an image-map into an HTML document, and in this way all points can be followed.

Continued use of the program raised the problem of handling several downloaded Portable Hypermedia documents together. If a document has a link to another document, and both are mirrored, but separately, the link in the local mirror will point to the original place. One would expect the link in the mirror to point to the other mirrored document. To handle this problem, the idea of a "mirrorspace" could be defined, which is a set of PHMs where links between PHMs are converted to relative. Maintaining a mirrorspace can be difficult. If one or more documents (text or image) are added to it, each existing document should be checked for links to the added documents. Similarly, the newly added HTML document should be checked for links to every document in the other PHMs. Deletion of one or more documents raises the same problem. These operations are very resource-demanding; they take a lot of time and data transfer. If these problems could be solved, the mirrorspace could be a good direction for future development.

The program could also be accelerated. The two passes of the algorithm could run in parallel. An implementation in a parallel programming language would be faster. Pass one (downloading the documents) could be divided by URL, with each task retrieving one URL. The number of processes running at the same time would have to be limited, because this action burdens both the remote server and the network connection. Pass two (relocating) is an operation executed on pairs of elements of URL sets, so it can be scheduled on URL pairs.

V. Summary

A mature algorithm for mirroring and a standardized portable hypermedia format can ease the distribution of hypermedia documents through the World Wide Web. In this paper a two-phase mirroring algorithm was developed. The algorithm is able to create a remote copy of a complex HTML document stored on another WWW server. The algorithm delivers the mirrored document in a portable hypermedia format (PHM), which is also defined in this paper. Hypermedia documents in PHM format can be transferred with no need for further semantic transformations.

A software environment based on the previously mentioned algorithm for mirroring hypermedia documents was built. The environment provides different high-level, intelligent automatic mirroring services via the usual WWW interface (a set of forms). The proper use of this software environment can decrease the network load during peak periods and can increase the accessibility of selected hypermedia documents.

The mirroring technique developed here can be the first step in the direction of introducing a separate protocol and/or protocol extension for mirroring purposes, similar to what was proposed in [8].

VI. References

[1] Tim Berners-Lee, Robert Cailliau, Jean-Francois Groff, Bernard Pollermann: "World-Wide Web: the Information Universe", Electronic Networking, Vol. 2, No. 1.
[2] Rob McCool: The CGI Specification, URL: http://hoohoo.ncsa.uiuc.edu/cgi/interface.html
[3] Douglas E. Comer: Internetworking with TCP/IP, Volume I, Prentice Hall International Editions, 1991
[4] Andrew Ford: Spinning the Web, How to provide Information on the Internet, International Thomson Publishing, 1995
[5] Hypertext Markup Language, URL: http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html
[6] Hypertext Transfer Protocol, URL: http://www.w3.org/hypertext/WWW/Protocols/Overview.html
[7] Rohit Khare: "eText: An Interactive Hypermedia Publishing Environment", Proceedings of ACM Hypertext'93 - Demonstrations
[8] László Kovács, András Micsik: Replication within Distributed Digital Document Libraries. Proceedings of the 8th ERCIM Database Research Group Workshop on Database Issues and Infrastructure in Cooperative Information Systems, Trondheim, Norway, 1995
[9] L. McLoughlin et al.: Mirror - Mirror Packages on Remote Sites (UNIX man-pages)
[10] Oscar Nierstrasz, Gorm Haug Eriksen, Karl Guggisberg: w3mir, URL: ftp://sauce.uio.no/pub/src/w3mir
[11] WWW Names and Addresses, URIs, URLs, URNs, URL: http://www.w3.org/hypertext/WWW/Addressing/Addressing.html
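
As a supplementary illustration, the two-phase algorithm of Section IV.A and the two PHM link rules of Section III.B can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation described in the paper: the names mirror, fetch, in_scope and relative_link are hypothetical, documents are fetched through a caller-supplied function instead of a real HTTP client, links are found with a simple regular expression, and the robots.txt check, logging, retries after time-outs, and URL fragments or query strings are left out.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Naive link extraction (assumption): only double-quoted href/src attributes.
HREF = re.compile(r'(href|src)="([^"]*)"', re.IGNORECASE)


def in_scope(url, scope):
    """Mirror only HTTP URLs that lie below the scope prefix."""
    return urlsplit(url).scheme == "http" and url.startswith(scope)


def relative_link(from_url, to_url):
    """Express to_url relative to the directory of from_url (same server)."""
    from_dir = posixpath.dirname(urlsplit(from_url).path) or "/"
    return posixpath.relpath(urlsplit(to_url).path, start=from_dir)


def mirror(root_url, fetch, scope=None):
    """Two-phase mirroring; returns {absolute URL: relocated document text}.

    fetch(url) is a hypothetical callable returning the document text,
    or None when the retrieval fails.
    """
    # Default scope: the directory of the entry point.
    scope = scope or urljoin(root_url, ".")
    url_set = [root_url]      # URLs waiting for retrieval
    seen = {root_url}
    files = {}                # file set of the result

    # Phase 1: retrieval, a graph search over the link structure.
    while url_set:
        url = url_set.pop(0)  # placeholder heuristic: first in, first out
        text = fetch(url)
        if text is None:      # a real tool would log the error and retry
            continue
        files[url] = text
        for _attr, link in HREF.findall(text):
            absolute = urljoin(url, link)   # convert to absolute
            if in_scope(absolute, scope) and absolute not in seen:
                seen.add(absolute)
                url_set.append(absolute)

    # Phase 2: relocation. Inside links become relative,
    # outside links become absolute.
    def relocate(base):
        def fix(match):
            absolute = urljoin(base, match.group(2))
            if absolute in files:
                target = relative_link(base, absolute)
            else:
                target = absolute
            return f'{match.group(1)}="{target}"'
        return fix

    return {url: HREF.sub(relocate(url), text) for url, text in files.items()}
```

Relocation runs only after the URL set is empty, so the membership test against the file set can tell whether a link target was actually retrieved, which is the first advantage of postponing relocation named in Section IV.A. Passing a dictionary's get method as fetch is enough to try the sketch without network access.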
Author Information