Ilya Tikhomirov
Institute for Systems Analysis
of the Russian Academy of Sciences
matandra@isa.ru
HTTP traffic accounts for about 50% of information transfer on the Web. Part of this
information is inappropriate for certain audiences: pornographic and extremist sites, dating
sites, social networks, free file hosting services, etc. Many categories of users, such as
children, students, and employees, should not have access to the content of these sites. The main
goal of content control software, also known as censorware or Web filtering software, is to
prevent people from viewing content that is considered inappropriate.
Most Web filters use predefined ban lists containing the URLs or IP addresses of
inappropriate Web sites and Web pages. The ban lists are compiled manually by content filter
developers or network administrators. Because the Web grows exponentially and its content
changes dynamically, such lists cover inappropriate Web resources only incompletely.
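The ban-list approach can be illustrated with a minimal Python sketch. The host names, the URL prefix, and the function name is_banned are hypothetical and chosen only for illustration; real products maintain far larger lists and may match differently.

```python
from urllib.parse import urlsplit

# Hypothetical, manually maintained ban list (illustrative entries only).
BANNED_HOSTS = {"blocked.example"}
BANNED_URL_PREFIXES = {"http://files.example/warez/"}


def is_banned(url: str) -> bool:
    """Return True if the URL's host (or a parent domain) or a URL prefix is listed."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check the host itself and each parent domain against the list.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in BANNED_HOSTS:
            return True
    return any(url.startswith(prefix) for prefix in BANNED_URL_PREFIXES)
```

A listed site is blocked, but a mirror of the same content under a new host name passes until someone adds it to the list by hand, which is the coverage problem described above.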
HTTP proxy (Web proxy, “anonymizer”) servers have become popular for Web surfing. They
make predefined ban lists ineffective, because the banned content is requested through the
address of the proxy rather than its own. Therefore some content filters apply filtering rules
to HTTP response headers. The corresponding content is blocked if any restricted signature is
found in the HTTP response. This approach treats content merely as a byte stream and therefore
has low recall and precision. A new technology is thus needed for socially safe Web browsing.
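Why byte-level signature matching has low precision and recall can be shown with a small Python sketch. The signature list and the function name blocked_by_signature are assumptions made for this example, not the rules of any particular filter.

```python
# Hypothetical restricted signatures (illustrative only).
RESTRICTED_SIGNATURES = [b"casino", b"bet"]


def blocked_by_signature(response_bytes: bytes) -> bool:
    """Return True if any restricted signature occurs anywhere in the raw response."""
    data = response_bytes.lower()
    # Plain substring search: the response is treated as bytes, not as language.
    return any(signature in data for signature in RESTRICTED_SIGNATURES)
```

With these signatures a page about the alphabet is blocked because "alphabet" contains "bet" (a false positive), while a gambling page that says "wager" instead passes unnoticed (a false negative).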
The Web Content Filter is a technology for blocking inappropriate content and for
socially safe Web browsing. It uses an automatic classification method based on full-text content
analysis of Web pages: the system processes documents and denies access to content recognized
as inappropriate “on the fly”. This approach is called dynamic content filtering.
[Figure: system architecture. Labels: LAN, Web, Web server, Redirector application,
Classification subsystem, URL cache, Linguistic analyzer, Requested Web page; step 4: Classify
document.]
The automatic classification method examines Web pages according to the importance of
terms in natural language text. The method uses morphological analysis for term normalization.
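The idea of weighting normalized terms can be sketched in Python as follows. The text does not specify the weighting scheme or the morphological analyzer, so this example makes two assumptions: a toy suffix stripper (normalize) stands in for real morphological analysis, and relative term frequency stands in for term importance.

```python
import re
from collections import Counter

# Toy stand-in for morphological analysis: strip a few English suffixes.
SUFFIXES = ("ing", "ed", "es", "s")


def normalize(term: str) -> str:
    """Reduce a word to a crude base form (assumption: not a real analyzer)."""
    term = term.lower()
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def term_weights(text: str) -> dict:
    """Weight each normalized term by its relative frequency in the text."""
    terms = [normalize(word) for word in re.findall(r"[A-Za-z]+", text)]
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}
```

Normalization lets "Filters", "filtered" and "filtering" count as the same term, so the weight of a concept is not diluted across its word forms.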
The URL cache stores frequently requested URLs. It is included in the system to reduce
the load on the network and on the automatic classification subsystem.
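How a URL cache reduces classification load can be sketched in Python. The class UrlCache, the function is_allowed, the least-recently-used eviction policy, and the choice to cache an allow/deny verdict per URL are all assumptions of this example; the text states only that frequently requested URLs are stored.

```python
from collections import OrderedDict
from typing import Callable, Optional


class UrlCache:
    """Least-recently-used cache mapping a URL to an allow/deny verdict."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[str, bool]" = OrderedDict()

    def get(self, url: str) -> Optional[bool]:
        if url not in self._entries:
            return None
        self._entries.move_to_end(url)  # mark as most recently used
        return self._entries[url]

    def put(self, url: str, allowed: bool) -> None:
        self._entries[url] = allowed
        self._entries.move_to_end(url)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


def is_allowed(url: str, cache: UrlCache, classify: Callable[[str], bool]) -> bool:
    """Consult the cache first; run the (expensive) classifier only on a miss."""
    verdict = cache.get(url)
    if verdict is None:
        verdict = classify(url)
        cache.put(url, verdict)
    return verdict
```

Repeated requests for the same URL are answered from the cache, so the page is fetched and classified only once while its entry remains cached.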
Deployment of the Web Content Filter consists of the following stages:
− Forming inappropriate categories.
− Preparing learning examples.
− Running the automatic classifier in learning mode.
− System setup, testing, and parameter customization.
− Running the automatic classifier in filtering mode.
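The learning and filtering stages listed above can be illustrated with a deliberately simple Python sketch. The class ToyClassifier, its vocabulary-overlap score, and the threshold parameter are assumptions of this example and not the classification method of the Web Content Filter, which is not detailed here.

```python
import re


def tokens(text: str) -> list:
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class ToyClassifier:
    """Toy classifier: a category is the set of terms seen in its examples."""

    def __init__(self) -> None:
        self.category_terms = {}

    def learn(self, category: str, examples: list) -> None:
        """Learning mode: collect the vocabulary of one inappropriate category."""
        vocabulary = self.category_terms.setdefault(category, set())
        for example in examples:
            vocabulary.update(tokens(example))

    def score(self, category: str, text: str) -> float:
        """Fraction of the page's tokens that belong to the category vocabulary."""
        page = tokens(text)
        if not page:
            return 0.0
        vocabulary = self.category_terms[category]
        return sum(1 for term in page if term in vocabulary) / len(page)

    def is_inappropriate(self, text: str, threshold: float = 0.5) -> bool:
        """Filtering mode: deny if any category scores at or above the threshold."""
        return any(self.score(c, text) >= threshold for c in self.category_terms)
```

In this sketch a lower threshold means stricter filtering; choosing its value corresponds to the parameter customization stage, and a realistic system would also need appropriate examples and term weighting to avoid matching on common words.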
The main advantages of the technology are:
− Higher precision and recall of filtering than traditional methods.
− Transparency for users: no advanced settings of Web browsers and other Web
applications are required.
− Adaptability to users’ behavior: only requested pages are examined.
− Scalability: distributed modules handle the load of large computer networks.
− Strictness level customization.
The Web Content Filter is an intelligent solution for socially safe Web browsing in:
− Educational institutions.
− State institutions.
− Companies.
− Home networks.