HTTP CACHING PROXY SERVER


Synopsis
The objective of this project is to implement a caching mechanism in a web server
in order to minimize page seek latency, thereby enabling faster page downloads from
the server to the client.

What is a web cache?


In their simplest form, web caches store temporary copies of web objects. They are
designed primarily to improve the accessibility and availability of this type of data to end
users. Caching is not an alternative to increased connectivity, but instead optimises the
usage of available bandwidth.

How will a cache benefit us?


Caching minimises the number of times an identical web object is transferred from its
host server by retaining copies of requested objects in a repository or cache. Requests
for previously cached objects result in the cached copy of the object being returned to
the user from the local repository rather than from the host server. This results in little or
no extra network traffic over the external link and increases the speed of delivery.
Caches are limited by the amount of available disk space – when a cache is full, older
objects are removed and replaced with newer content. Some systems, however,
implement 'persistence' measures to preserve certain types of content at the
discretion of the administrator.

Where are web caches used?


Caches may be installed in different locations on networks for a variety of reasons:

• Local caches are the most common type; they sit on the edge of the LAN just
before the Internet connection. All outbound web requests are directed through
them in an effort to fulfil web requests locally before passing traffic over the
Internet connection.

• ISP caches are used on the networks of most Internet Service Providers
(ISPs). They provide customers with improved performance and conserve
bandwidth on their own external connections to the Internet.

• Reverse caches are used to reduce the workload of a content provider's web
servers. The cache is positioned between the web server and its Internet
connection, so that when a remote user requests a web page, the request must
first pass through the cache before reaching the web server. If the cache has a
stored copy of the requested item, it delivers it directly rather than passing the
request through to the web server.

What are the advantages of caching?

• Fast performance on cached content – if content is already in the cache it is
returned more quickly, even for multiple users wanting to access the same
content.

• Improved user perception and productivity – quicker delivery of content
means less waiting time and increased user satisfaction with the performance of
the system.

• Less bandwidth used – if content is cached locally on the LAN, web requests do
not consume Internet connection bandwidth.

• User monitoring and logging – if a cache manages all web requests
(behaving in some ways like a proxy), a centralised log can be kept of all user
access. Care must be taken that any information held is in accordance with
appropriate privacy regulations and the institution's policy.

• Caching benefits both individual end users and content providers –
ISPs and other users of the same infrastructure all benefit greatly from the
reduction in bandwidth usage.

How does a cache differ from a proxy?


A cache server is not the same as a proxy server. Cache servers have a proxy function
with regard to requests for certain content from the World Wide Web: when a client
passes all its requests for web objects via a cache, that cache is effectively acting as a
proxy server. Caching is a common function of proxy servers. Proxy servers perform a
number of other functions too, mainly centred on security and administrative control.
Broadly speaking, a proxy server sits between a number of clients and the Internet. Any
requests made to the Internet from a LAN computer are forwarded to the proxy server,
which then makes the requests itself.

The key differences between a proxy server and a cache are:


1. A proxy server handles more requests than just those for web content.
2. A proxy server does not, by default, cache any data that passes through it.
3. There are certain security benefits: a proxy server hides the other computers on
the network from the Internet, making it impossible for individual machines to be
targeted for attack.
4. The requirement for 'public' IP addresses is also removed: any number of
computers can share one public address configured on the proxy, rather than
each computer needing a unique IP address.

The response time of a WWW service often plays an important role in its success
or demise. From a user's perspective, the response time is the time elapsed from
when a request is initiated at a client to the time that the response is fully loaded
by the client. This paper presents a framework for accurately measuring the
client-perceived response time in a WWW service. Our framework provides
feedback to the service provider and eliminates the uncertainties that are
common in existing methods. This feedback can be used to determine whether
performance expectations are met, and whether additional resources (e.g. more
powerful server or better network connection) are needed. The framework can
also be used when a consolidator provides a Web hosting service, in which case the
framework provides quantitative measures to verify the consolidator's compliance
with a specified Service Level Agreement. Our approach assumes the existing
infrastructure of the Internet with its current technologies and protocols. No
modification is necessary to existing browsers or servers, and we accommodate
intermediate proxies that cache documents. The only requirement is to
instrument the documents to be measured, which can be done automatically
using a tool we provide.

The number of servers and the amount of information available on the World Wide Web
have grown exponentially in the last five years. The use of the World Wide Web as an
information retrieval mechanism has also become popular. As a consequence, popular
Web servers have been receiving an increasing number of requests. Some servers
receive up to 100 million requests daily, which works out to more than one request per
millisecond on average (100,000,000 requests / 86,400 seconds ≈ 1,157 requests per
second). Thus, for a Web server to respond at such a rate, it must reduce the overhead
of handling each request to a minimum.

Currently, the greatest fraction of server latency for document requests (excluding the
execution of CGI scripts) comes from disk accesses. When a document is requested from
a Web server, the server makes one or more file system calls to open, read
and close the requested file. These file system calls result in disk accesses and, when the
file is not on the local disk, in file transfers across the network.

Hence, it is attractive to cache files in main memory so as to reduce accesses to
the local and remote disks. Indeed, RAM is faster than magnetic disk by several orders
of magnitude. This idea has already been used in some software (for example, the
Harvest httpd accelerator). Such a caching mechanism has been called main memory
caching or document caching; in this project we shall refer to it as server caching.
Server caching might appear to have less impact on the quality of Web applications
than client caching (or proxy caching), which aims at reducing network delay by caching
remote files. This indeed seems to be true in traditional networks, where the retrieval
time of a document is dominated by transfer time over low-bandwidth interconnections.
However, even in such a situation, a significant portion of the requests to a Web server
may come from local users at academic institutions or large companies. These clients are
typically connected to the server through high-bandwidth LANs (e.g. FDDI or ATM), so
that the retrieval time is likely to be dominated by the server's latency. In the near
future, with the deployment of ATM WANs, information retrieval time is also likely to be
dominated by latency at the server.

While client caching is characterized by relatively low hit rates (varying from 20% to
40%), server caching can achieve very high hit rates due to the particularity of Web
traffic, in which a small number of a Web server's documents dominates client
requests. Analyses of request traces from several Web servers have shown that even a
small amount of main memory (512 Kbytes) can hold a significant portion (60%) of the
documents requested.

In the existing system there is no cache at the server itself; most caches are
maintained at the proxies. Moreover, caching policies in the existing system use the
Least Recently Used (LRU) page replacement algorithm in the server cache (if one is
present), but throughput under LRU is low when compared with our caching policy.

SYSTEM DESIGN
FLOW CHART FOR PROPOSED SYSTEM

The flow chart is reproduced here as a numbered sequence of steps:

1. Configure the web server. Handle the client request and identify the request.

2. Process the request to get the exact page required by the client.

3. Check for the page in the cache. If it is not found in the cache, check for the
page on the server: if it is found there, fetch it from the server and cache it for
future use (steps 6–7); if it is not found on the server either, flash a "Page not
found" message and listen for the next request.

4. If the page is present in the cache, increment its hit counter and get the time
stamp of the page in the cache.

5. Compare time stamps. If the time stamp of the cached page is less than or equal
to the time stamp of the page on the server (i.e. the server copy is at least as
recent), fetch the page from the server and cache it for future use (steps 6–7);
otherwise fetch the page from the cache (step 8).

6. Before caching the fetched page, check whether the cache is full. If it is not full,
place the page from the server in the cache.

7. If the cache is full, get the hit counts of all pages visited and select the least-hit
page. If more than one page has the least hit count, get the sizes of those pages
and replace the largest of them with the page from the server; otherwise replace
the single least-hit page. Then place the page from the server in the cache for
future use.

8. Calculate the cache penalty and the total turnaround time, dispatch the
requested page to the client, and listen for the next request.

IMPLEMENTATION DETAILS

WHAT HAS BEEN DONE:

Tomcat 4.1 is the Jakarta project's web server; it implements the Servlet 2.3 and
JavaServer Pages specifications and provides a platform for developing and deploying
web applications and web services. It is used as the web server in our project and is
responsible for handling all transactions between the client and the server. The cache
that we designed inside the server can be viewed as a middleman between the client
and the server; it can be compared with the high-speed cache in a memory hierarchy.

Client Request Processing:

In this phase we configure the servlet program to handle client requests. Tasks such as
sending pages from the server to the client using input/output streams are
implemented; pages can also contain images. Dynamic file size generation is part of the
next step and gives us the file size details, an important criterion for the coming phases.
In the next step, URL mappings are implemented along with URL navigation. The page
name is obtained from the client, the content type of the response is set in the response
header, and the objects for the input and output streams are created. The page is first
located and then fetched either from the cache or from the server (explained in the next
phase) and dispatched to the client, as in the sketch below.
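
To make this concrete, here is a minimal sketch of such a servlet against the Servlet 2.3
API used by Tomcat 4.1. The class name CachingServlet, the request parameter "page"
and the "pages/" directory are illustrative assumptions, not taken from the project's
actual code.

    import java.io.*;
    import javax.servlet.*;
    import javax.servlet.http.*;

    // Hypothetical sketch of the request-handling servlet described above.
    public class CachingServlet extends HttpServlet {

        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // The page name is obtained from the client.
            String pageName = request.getParameter("page");

            // The content type of the response is set in the response header.
            response.setContentType("text/html");

            // Locate the page under the web application's document root
            // ("pages/" is an assumed directory).
            File page = new File(getServletContext().getRealPath("/pages/" + pageName));
            if (!page.exists()) {
                response.sendError(HttpServletResponse.SC_NOT_FOUND, "Page not found");
                return;
            }

            // Stream the page bytes to the client.
            InputStream in = new FileInputStream(page);
            OutputStream out = response.getOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            in.close();
        }
    }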
Implementing the Processing Logic:

Processing logic is the main phase of the project, where the caching algorithm is
implemented. In the processing logic the following tasks are performed. First, the time
stamp of the page in the cache (main memory) is checked against its original copy on
the server (secondary memory). If the page is not found in the cache, it is fetched from
the document root (if found there) and placed in the cache. A page replacement policy
is applied to the cache contents so that the new page can be accommodated within the
same cache space, replacing a victim page whenever the cache lacks space for the new
page.

If the page is found in the cache, the following steps are still executed before delivering
the page to the client, to ensure cache consistency:

• Check the time stamp
• Increment the hit count

Then the page is written to an output stream and sent to the client.
If the page is absent from the cache, we check for the page on the server itself. When
the search is successful, the page is dispatched to the client and a copy is placed in the
cache for future use. If the search is unsuccessful, a "Page not found" message is
flashed to the client. A sketch of this lookup logic is given after this paragraph.
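
The following Java sketch illustrates one way this lookup-and-validate logic could be
written. It is only a sketch under assumed names (PageLookup, CacheEntry, fetch); the
LFU eviction step that would run when the cache is full is shown later.

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the processing logic described above.
    public class PageLookup {

        // Minimal cache entry; a fuller version appears in the cache design section.
        static class CacheEntry {
            byte[] data;
            long lastModified;
            int hitCount;
        }

        private final Map<String, CacheEntry> cache = new HashMap<String, CacheEntry>();

        // Returns the page bytes, or null when the page is found nowhere
        // (in which case "Page not found" is flashed to the client).
        public byte[] fetch(String pageName, File serverCopy) throws IOException {
            CacheEntry entry = cache.get(pageName);
            if (entry != null) {
                entry.hitCount++;                             // increment hit count
                if (entry.lastModified >= serverCopy.lastModified()) {
                    return entry.data;                        // cached copy is current
                }
                // Otherwise the cached copy is stale: fall through and re-fetch.
            }
            if (!serverCopy.exists()) {
                return null;                                  // page not found
            }
            // Fetch the page from the server's disk and cache the fresh copy.
            byte[] data = new byte[(int) serverCopy.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(serverCopy));
            in.readFully(data);
            in.close();
            CacheEntry fresh = new CacheEntry();
            fresh.data = data;
            fresh.lastModified = serverCopy.lastModified();
            fresh.hitCount = 1;
            cache.put(pageName, fresh);  // LFU eviction would run first if the cache is full
            return data;
        }
    }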
CACHE DESIGN:
As described earlier, web caches store temporary copies of web objects in order to
improve their accessibility and availability to end users. After the initial
access/download, users can access a single locally stored copy of the content rather
than repeatedly requesting the same content from the origin server.

Here we have constructed a cache of predefined size that holds web pages within this
size. The cache is volatile; that is, it exists only as long as the server runs. When the
server goes down and is started again, a fresh cache comes up and is populated again
based on user requests.

The cache here is divided into two parts: a key and a value. The key acts as the index
to the value part. The following table illustrates the cache layout.

Key (Page name)     Value (Page contents)
Index.html          0100000011110101110101…
Page2.html          1011100101110101010010…
Home.html           1010111101010001010111…

The binary data of the page is packed into a user-defined object which also contains
the following:

• Size of the page
• Time when the file was last modified

A sketch of this structure is given below.
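
A minimal Java sketch of this structure might look as follows; the class and field names
are illustrative assumptions, not the project's actual identifiers.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the cache structure: key = page name, value = a user-defined
    // object holding the page bytes and its metadata.
    public class Cache {

        static class CachedPage {
            byte[] contents;    // binary data of the page (the "value")
            long size;          // size of the page in bytes
            long lastModified;  // time when the file was last modified
            int hitCount;       // hits, used later by the LFU policy
        }

        // Key (page name) -> Value (page contents and metadata).
        private final Map<String, CachedPage> pages = new HashMap<String, CachedPage>();
    }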

CACHING POLICY:
In this system we present a cache that holds static pages, and we apply the Least
Frequently Used (LFU) page replacement algorithm as our caching policy. The cache
intercepts every page requested by the user. It first checks for the presence of the
page; if the page is found, the time stamp of the cached page is compared with that of
the same page on the server, and depending on the time stamps the page is fetched
either from the server or from the cache – that is, the more recent copy is fetched.
This check detects whether the user has modified the page contents on the server; if
so, there is no point in fetching the page from the cache, as it holds a stale copy. In this
way, whatever page the user gets from the cache is the same copy that is present on
the server. Because the cache size is fixed, we also need a caching policy to keep the
required pages in the cache. We use the LFU algorithm for page replacement. This
algorithm suits the environment well, since caching is all about hits and misses, and
LFU makes use of hit counts in its implementation. The proposed system has shown
that the cache penalty is consistently low, as is the miss rate. We have also measured
the time difference between serving a page from the cache and serving the same page
from the server.
LFU Algorithm
• Algorithm: Least Frequently Used.
• The least frequently used documents are removed first.
• Advantages: simplicity; when the goal is to reduce latency so that client requests
are served quickly, LFU is a good choice.

In outline: when the free space in the cache is smaller than S (the size of the incoming
document), repeat the following until the free cache space is at least S: remove the
least frequently used document.

LFU algorithm (pseudocode)

    LFU(req_file RF, size_of_RF) {
        // Evict least frequently used documents until there is
        // room for the requested file.
        while (free_space < size_of_RF) {
            locate the document with the smallest hit count
            remove that document from the cache
            update free_space
        }
        add RF to the cache, update free_space
        return 'document cached' status
    }

MONITORING HITS:

KEY (PAGE NAME)     VALUE (HITS)
Index.html          22
Page2.html          10
Home.html           37
Results.html        500
Admit.html          61
News.html           7

The hit counter decides the fate of a page in the cache. The number of hits is directly
proportional to the page's stay in the cache; that is, a page with more hits is more
likely to stay in the cache than a page with fewer hits.

Whenever the cache is found to be full and a new page is to be placed in it, the
following steps are taken: the hit counts of all the pages are obtained; the page with
the least hit count is selected; and this page is replaced with the incoming page. What
if two pages have the same hit count? Then we replace the larger page, as shown in
the sketch below.
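
As a rough Java sketch of this eviction rule (the names LfuEviction, Page and
selectVictim are hypothetical, not taken from the project's code):

    import java.util.Map;

    // Hypothetical sketch of the replacement step: evict the page with the
    // fewest hits; when several pages tie on hits, evict the largest of them.
    public class LfuEviction {

        static class Page {
            long size;      // size of the page in bytes
            int hitCount;   // number of cache hits
        }

        // Returns the name of the page to evict, or null if the cache is empty.
        static String selectVictim(Map<String, Page> cache) {
            String victim = null;
            Page worst = null;
            for (Map.Entry<String, Page> e : cache.entrySet()) {
                Page p = e.getValue();
                boolean fewerHits = (worst == null) || p.hitCount < worst.hitCount;
                boolean tieButLarger = (worst != null)
                        && p.hitCount == worst.hitCount && p.size > worst.size;
                if (fewerHits || tieButLarger) {
                    victim = e.getKey();
                    worst = p;
                }
            }
            return victim;
        }
    }

With the hit table above, for example, News.html (7 hits) would be the first victim,
while Results.html (500 hits) would be the most likely to stay in the cache.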

SCREEN SHOTS:

INPUT

OUTPUT

PAGE FETCHED FROM SERVER



PAGE FETCHED FROM CACHE



CACHE STATISTICS:
