
Web Content Filter: technology for social safe browsing

Ilya Tikhomirov
Institute for Systems Analysis
of the Russian Academy of Sciences
matandra@isa.ru

HTTP traffic accounts for about 50 % of information transfer on the Web. Part of this
information is inappropriate for some audiences: pornographic and extremist sites, dating sites,
social networks, free file hosting services, etc. Many categories of users, such as children,
students and employees, should not have access to the content of these sites. The main goal of
content control software, also known as censorware or Web filtering software, is to prevent
people from viewing content that is considered inappropriate.
Most Web filters use predefined ban lists containing the URLs or IP addresses of
inappropriate Web sites and Web pages. The ban lists are compiled manually by content filter
developers or network administrators. Because the Web grows exponentially and its content
changes dynamically, such lists cover inappropriate Web resources only incompletely.
HTTP proxy (Web proxy, “anonymizer”) servers have become popular for Web surfing. They
make predefined ban lists ineffective, because the banned content is requested through the
address of the proxy rather than its own. Therefore some content filters apply filtering rules
to HTTP response headers: the corresponding content is blocked if any restricted signature is
found in the HTTP response. With this approach, content is treated merely as a byte stream,
so both recall and precision are low. A new technology is therefore needed for social safe
Web browsing.
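
The weakness of ban lists can be illustrated with a minimal sketch. It assumes a filter that compares only the host name of the requested URL against a list; the function `is_banned` and the host names `banned.example` and `anonymizer.example` are hypothetical placeholders and do not come from any real product.

```python
from urllib.parse import urlsplit

# Hypothetical ban list; the host names are placeholders, not real entries.
BAN_LIST = {"banned.example"}


def is_banned(url, ban_list=BAN_LIST):
    """Return True if the host of `url` is on the ban list."""
    host = urlsplit(url).hostname or ""
    return host in ban_list


direct = "http://banned.example/page.html"
# The same page requested through a Web proxy: the host seen by the
# filter is the proxy, and the banned address is only a query parameter.
via_proxy = "http://anonymizer.example/browse?u=http%3A%2F%2Fbanned.example%2Fpage.html"

print(is_banned(direct))     # the listed host is recognized
print(is_banned(via_proxy))  # the proxied request is not recognized
```

A filter of this kind decides without looking at the page itself, which is why it cannot recognize the same content behind a different address.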
The Web Content Filter is a technology for blocking inappropriate content and for
social safe Web browsing. It uses an automatic classification method based on full-text content
analysis of Web pages: the system processes documents and denies access to content recognized
as inappropriate “on the fly”. This approach is called dynamic content filtering.

The Web Content Filter architecture is shown in Fig. 1.

[Figure: block diagram. Labels: User, LAN, PROXY server, Redirector, Web, WEB server, Web server application, Classification subsystem, URL cache, Linguistic analyzer, Requested Web page.]

Fig. 1. Web Content Filter architecture


We assume a transparent HTTP Proxy is deployed in the local network. All HTTP traffic
goes through this proxy. The Web Content Filter contains the following components:
− Redirector module of HTTP-Proxy server.
− Classification subsystem.
− Linguistic analyzer.
− URL cache.
The Classification subsystem, Linguistic analyzer and URL cache are distributed modules
that can operate in a heterogeneous environment and withstand the load of large computer
networks.
The Web Content Filter algorithm is shown in Fig. 2.

1. User’s request for the web document.
2. Is the document URL present in the URL cache? YES: go to step 6. NO: go to step 3.
3. Analyze the document text.
4. Classify the document.
5. Add the document’s URL to the URL cache.
6. Is the document inappropriate? NO: go to step 7-a. YES: go to step 7-b.
7-a. Allow access.
7-b. Deny access.

Fig. 2. Web Content Filter algorithm
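
The control flow of Fig. 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_request`, the callbacks `fetch_text` and `is_inappropriate`, and the choice to keep the classification verdict together with the URL in a plain dictionary are all hypothetical; the source does not describe the actual interfaces.

```python
def filter_request(url, cache, fetch_text, is_inappropriate):
    """Return "allow" or "deny" for a requested URL, following Fig. 2.

    cache            -- dict mapping URL -> verdict (True = inappropriate);
                        assumed layout, not taken from the source
    fetch_text       -- hypothetical callback returning the page text (step 3)
    is_inappropriate -- hypothetical callback classifying the text (step 4)
    """
    verdict = cache.get(url)              # step 2: is the URL in the cache?
    if verdict is None:                   # NO branch
        text = fetch_text(url)            # step 3: analyze the document text
        verdict = is_inappropriate(text)  # step 4: classify the document
        cache[url] = verdict              # step 5: add the URL to the cache
    return "deny" if verdict else "allow" # steps 6, 7-a and 7-b
```

With this layout a page is fetched and classified only on its first request; repeated requests for the same URL are answered from the cache.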

The automatic classification method examines Web pages according to the importance of
terms in natural language text. It uses morphological analysis to normalize terms.
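
A minimal sketch of term-based classification is given below. It is not the method used in the Web Content Filter, whose weighting scheme is not specified here. The sketch assumes smoothed log-odds term weights learned from labelled examples, and it replaces real morphological analysis with crude suffix stripping for English. All names (`normalize`, `learn_weights`, `score`, `is_inappropriate`) and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter

# Crude stand-in for morphological analysis (assumption): strip a few
# common English suffixes so that word forms map to one term.
SUFFIXES = ("ing", "ed", "es", "s")


def normalize(text):
    """Split text into lower-case terms and strip simple suffixes."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms


def learn_weights(inappropriate_docs, appropriate_docs):
    """Learning mode: weight each term by its smoothed log-odds."""
    bad = Counter(t for doc in inappropriate_docs for t in normalize(doc))
    good = Counter(t for doc in appropriate_docs for t in normalize(doc))
    vocab = set(bad) | set(good)
    bad_total = sum(bad.values()) + len(vocab)    # add-one smoothing
    good_total = sum(good.values()) + len(vocab)
    return {
        t: math.log(((bad[t] + 1) / bad_total) / ((good[t] + 1) / good_total))
        for t in vocab
    }


def score(text, weights):
    """Average weight of the terms in the text; unknown terms count as 0."""
    terms = normalize(text)
    if not terms:
        return 0.0
    return sum(weights.get(t, 0.0) for t in terms) / len(terms)


def is_inappropriate(text, weights, threshold=0.0):
    """Filtering mode: a lower threshold means stricter filtering."""
    return score(text, weights) > threshold
```

In this sketch the weights would be learned once from the prepared examples and then reused for every requested page.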
The URL cache stores frequently requested URLs. It is included in the system to reduce
the load on the network and on the automatic classification subsystem.
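
One possible shape of such a cache is sketched below. The eviction policy is not described in the text; the sketch assumes a bounded cache that discards the least recently used URL, and it stores the classification verdict next to each URL. The class name `UrlCache` and its `capacity` parameter are hypothetical.

```python
from collections import OrderedDict


class UrlCache:
    """Bounded URL -> verdict cache with least-recently-used eviction.

    LRU eviction is an assumption made for this sketch.
    """

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, url):
        """Return the stored verdict, or None if the URL is not cached."""
        if url not in self._entries:
            return None
        self._entries.move_to_end(url)  # mark as recently used
        return self._entries[url]

    def put(self, url, inappropriate):
        """Store a verdict and evict the oldest entry when over capacity."""
        self._entries[url] = inappropriate
        self._entries.move_to_end(url)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)
```

URLs that are requested often stay in the cache, so their pages do not have to be downloaded and classified again.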
Deployment of the Web Content Filter consists of the following stages:
− Forming inappropriate categories.
− Preparing learning examples.
− Running the automatic classifier in learning mode.
− System setup, testing and parameter customization.
− Running the automatic classifier in filtering mode.
The main advantages of the technology are:
− Higher precision and recall of filtering than traditional methods.
− Transparency for users: no advanced settings of Web browsers and other Web
applications are required.
− Adaptability to users’ behavior: only requested pages are examined.
− Scalability: distributed modules withstand the load of large computer networks.
− Strictness level customization.
The Web Content Filter is an intelligent solution for social safe Web browsing in:
− Educational institutions.
− State institutions.
− Companies.
− Home networks.

For more information:


Institute for Systems Analysis of the Russian Academy of Sciences
117312, Moscow, Russia
Pr. 60-letya Octiabrya, 9, www.isa.ru
Phone/fax: +7 (499) 135-04-63
