
Web Usage Mining: Discovery and Applications of Usage

Patterns from Web Data

Jaideep Srivastava*†, Robert Cooley‡, Mukund Deshpande, Pang-Ning Tan


Department of Computer Science and Engineering
University of Minnesota
200 Union St SE
Minneapolis, MN 55455
{srivasta,cooley,deshpand,ptan}@cs.umn.edu

ABSTRACT

Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.

Keywords: data mining, world wide web, web usage mining.

1. INTRODUCTION

The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. Specifically, e-commerce activity that involves the end user is undergoing a significant revolution. The ability to track users' browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before. It is now possible for a vendor to personalize his product message for individual customers on a massive scale, a phenomenon that is being referred to as mass customization.

The scenario described above is one of many possible applications of Web Usage mining, which is the process of applying data mining techniques to the discovery of usage patterns from Web data, targeted towards various applications. Data mining efforts associated with the Web, called Web mining, can be broadly divided into three classes, i.e. content mining, usage mining, and structure mining. Web Structure mining projects such as [34; 54] and Web Content mining projects such as [47; 21] are beyond the scope of this survey. An early taxonomy of Web mining is provided in [29], which also describes the architecture of the WebMiner system [42], one of the first systems for Web Usage mining. The proceedings of the recent WebKDD workshop [41], held in conjunction with the KDD-1999 conference, provide a sampling of some of the current research being performed in the area of Web Usage Analysis, including Web Usage mining.

This paper provides an up-to-date survey of Web Usage mining, including both academic and industrial research efforts, as well as commercial offerings. Section 2 describes the various kinds of Web data that can be useful for Web Usage mining. Section 3 discusses the challenges involved in discovering usage patterns from Web data; the three phases are preprocessing, pattern discovery, and pattern analysis. Section 4 provides a detailed taxonomy and survey of the existing efforts in Web Usage mining, and Section 5 gives an overview of the WebSIFT system [31] as a prototypical example of a Web Usage mining system. Finally, Section 6 discusses privacy concerns and Section 7 concludes the paper.

2. WEB DATA

One of the key steps in Knowledge Discovery in Databases [33] is to create a suitable target data set for the data mining tasks. In Web Mining, data can be collected at the server side, client side, proxy servers, or obtained from an organization's database (which contains business data or consolidated Web data). Each type of data collection differs not only in terms of the location of the data source, but also the kinds of data available, the segment of population from which the data was collected, and its method of implementation.

There are many kinds of data that can be used in Web Mining. This paper classifies such data into the following types:

• Content: The real data in the Web pages, i.e. the data the Web page was designed to convey to the users. This usually consists of, but is not limited to, text and graphics.

• Structure: Data which describes the organization of the content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. This can be represented as a tree structure, where the <html> tag becomes the root of the tree.

*Can be contacted at jaideep@amazon.com
†Supported by NSF grant NSF/EIA-9818338
‡Supported by NSF grant EHR-9554517

SIGKDD Explorations. Jan 2000. Volume 1, Issue 2.


The principal kind of inter-page structure information is hyper-links connecting one page to another.

• Usage: Data that describes the pattern of usage of Web pages, such as IP addresses, page references, and the date and time of accesses.

• User Profile: Data that provides demographic information about users of the Web site. This includes registration data and customer profile information.

2.1 Data Sources

The usage data collected at the different sources will represent the navigation patterns of different segments of the overall Web traffic, ranging from single-user, single-site browsing behavior to multi-user, multi-site access patterns.

2.1.1 Server Level Collection

A Web server log is an important source for performing Web Usage Mining because it explicitly records the browsing behavior of site visitors. The data recorded in server logs reflects the (possibly concurrent) access of a Web site by multiple users. These log files can be stored in various formats such as Common log or Extended log formats. An example of Extended log format is given in Figure 2 (Section 3). However, the site usage data recorded by server logs may not be entirely reliable due to the presence of various levels of caching within the Web environment. Cached page views are not recorded in a server log. In addition, any important information passed through the POST method will not be available in a server log. Packet sniffing technology is an alternative method to collecting usage data through server logs. Packet sniffers monitor network traffic coming to a Web server and extract usage data directly from TCP/IP packets. The Web server can also store other kinds of usage information such as cookies and query data in separate logs.

Cookies are tokens generated by the Web server for individual client browsers in order to automatically track the site visitors. Tracking of individual users is not an easy task due to the stateless connection model of the HTTP protocol. Cookies rely on implicit user cooperation and thus have raised growing concerns regarding user privacy, which will be discussed in Section 6. Query data is also typically generated by online visitors while searching for pages relevant to their information needs. Besides usage data, the server side also provides content data, structure information and Web page meta-information (such as the size of a file and its last modified time).

The Web server also relies on other utilities such as CGI scripts to handle data sent back from client browsers. Web servers implementing the CGI standard parse the URI [1] of the requested file to determine if it is an application program. The URI for CGI programs may contain additional parameter values to be passed to the CGI application. Once the CGI program has completed its execution, the Web server sends the output of the CGI application back to the browser.

[1] Uniform Resource Identifier (URI) is a more general definition that includes the commonly referred to Uniform Resource Locator (URL).

2.1.2 Client Level Collection

Client-side data collection can be implemented by using a remote agent (such as Javascripts or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities. The implementation of client-side data collection methods requires user cooperation, either in enabling the functionality of the Javascripts and Java applets, or to voluntarily use the modified browser. Client-side collection has an advantage over server-side collection because it ameliorates both the caching and session identification problems. However, Java applets perform no better than server logs in terms of determining the actual view time of a page. In fact, they may incur some additional overhead, especially when the Java applet is loaded for the first time. Javascripts, on the other hand, consume little interpretation time but cannot capture all user clicks (such as reload or back buttons). These methods will collect only single-user, single-site browsing behavior. A modified browser is much more versatile and will allow data collection about a single user over multiple Web sites. The most difficult part of using this method is convincing the users to use the browser for their daily browsing activities. This can be done by offering incentives to users who are willing to use the browser, similar to the incentive programs offered by companies such as NetZero [9] and AllAdvantage [2] that reward users for clicking on banner advertisements while surfing the Web.

2.1.3 Proxy Level Collection

A Web proxy acts as an intermediate level of caching between client browsers and Web servers. Proxy caching can be used to reduce the loading time of a Web page experienced by users as well as the network traffic load at the server and client sides [27]. The performance of proxy caches depends on their ability to predict future page requests correctly. Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers. This may serve as a data source for characterizing the browsing behavior of a group of anonymous users sharing a common proxy server.

2.2 Data Abstractions

The information provided by the data sources described above can all be used to construct/identify several data abstractions, notably users, server sessions, episodes, click-streams, and page views. In order to provide some consistency in the way these terms are defined, the W3C Web Characterization Activity (WCA) [14] has published a draft of Web term definitions relevant to analyzing Web usage. A user is defined as a single individual that is accessing files from one or more Web servers through a browser. While this definition seems trivial, in practice it is very difficult to uniquely and repeatedly identify users. A user may access the Web through different machines, or use more than one agent on a single machine. A page view consists of every file that contributes to the display on a user's browser at one time. Page views are usually associated with a single user action (such as a mouse-click) and can consist of several files such as frames, graphics, and scripts. When discussing and analyzing user behaviors, it is really the aggregate page view that is of importance. The user does not explicitly ask for "n" frames and "m" graphics to be loaded into his or her browser; the user requests a "Web page." All of the information to determine which files constitute a page view is accessible from the Web server. A click-stream is a sequential series of page view requests. Again, the data available from the server side does not always provide enough information to reconstruct the full click-stream for a site. Any page view accessed through a client or proxy-level cache will not be "visible" from the server side. A user session is the click-stream of page views for a single user across the entire Web. Typically, only the portion of each user session that is accessing a specific site can be used for analysis, since access information is not publicly available from the vast majority of Web servers. The set of page-views in a user session for a particular Web site is referred to as a server session (also commonly referred to as a visit). A set of server sessions is the necessary input for any Web Usage analysis or data mining tool. The end of a server session is defined as the point when the user's browsing session at that site has ended. Again, this is a simple concept that is very difficult to track reliably. Any semantically meaningful subset of a user or server session is referred to as an episode by the W3C WCA.

3. WEB USAGE MINING

As shown in Figure 1, there are three main tasks for performing Web Usage Mining or Web Usage Analysis. This section presents an overview of the tasks for each step and discusses the challenges involved.

3.1 Preprocessing

Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.

3.1.1 Usage Preprocessing

Usage preprocessing is arguably the most difficult task in the Web Usage Mining process due to the incompleteness of the available data. Unless a client side tracking mechanism is used, only the IP address, agent, and server side click-stream are available to identify users and server sessions. Some of the typically encountered problems are:

• Single IP address/Multiple Server Sessions - Internet service providers (ISPs) typically have a pool of proxy servers that users access the Web through. A single proxy server may have several users accessing a Web site, potentially over the same time period.

• Multiple IP address/Single Server Session - Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses. In this case, a single server session can have multiple IP addresses.

• Multiple IP address/Single User - A user that accesses the Web from different machines will have a different IP address from session to session. This makes tracking repeat visits from the same user difficult.

• Multiple Agent/Single User - Again, a user that uses more than one browser, even on the same machine, will appear as multiple users.

Assuming each user has now been identified (through cookies, logins, or IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since page requests from other servers are not typically available, it is difficult to know when a user has left a Web site. A thirty minute timeout is often used as the default method of breaking a user's click-stream into sessions. The thirty minute timeout is based on the results of [23]. When a session ID is embedded in each URI, the definition of a session is set by the content server.

While the exact content served as a result of each user action is often available from the request field in the server logs, it is sometimes necessary to have access to the content server information as well. Since content servers can maintain state variables for each active session, the information necessary to determine exactly what content is served by a user request is not always available in the URI. The final problem encountered when preprocessing usage data is that of inferring cached page references. As discussed in Section 2.2, the only verifiable method of tracking cached page views is to monitor usage from the client side. The referrer field for each request can be used to detect some of the instances when cached pages have been viewed.

Figure 2 shows a sample log that illustrates several of the problems discussed above (the first column would not be present in an actual server log, and is for illustrative purposes only). IP address 123.456.78.9 is responsible for three server sessions, and IP addresses 209.456.78.2 and 209.456.78.3 are responsible for a fourth session. Using a combination of referrer and agent information, lines 1 through 11 can be divided into three sessions of A-B-F-O-G, L-R, and A-B-C-J. Path completion would add two page references to the first session A-B-F-O-F-B-G, and one reference to the third session A-B-A-C-J. Without using cookies, an embedded session ID, or a client-side data collection method, there is no method for determining that lines 12 and 13 are actually a single server session.

3.1.2 Content Preprocessing

Content preprocessing consists of converting the text, image, scripts, and other files such as multimedia into forms that are useful for the Web Usage Mining process. Often, this consists of performing content mining such as classification or clustering. While applying data mining to the content of Web sites is an interesting area of research in its own right, in the context of Web Usage Mining the content of a site can be used to filter the input to, or output from, the pattern discovery algorithms. For example, results of a classification algorithm could be used to limit the discovered patterns to those containing page views about a certain subject or class of products. In addition to classifying or clustering page views based on topics, page views can also be classified according to their intended use [50; 30]. Page views can be intended to convey information (through text, graphics, or other multimedia), gather information from the user, allow navigation (through a list of hypertext links), or some combination of these uses. The intended use of a page view can also filter the sessions before or after pattern discovery.

In order to run content mining algorithms on page views, the information must first be converted into a quantifiable format. Some version of the vector space model [51] is typically used to accomplish this. Text files can be broken up into vectors of words. Keywords or text descriptions can be substituted for graphics or multimedia. The content of static page views can be easily preprocessed by parsing the HTML and reformatting the information or running addi-



[Figure 1 diagram: Site Files and Raw Logs are preprocessed into Preprocessed Clickstream Data; pattern discovery produces Rules, Patterns, and Statistics; pattern analysis yields "Interesting" Rules, Patterns, and Statistics.]

Figure 1: High Level Web Usage Mining Process
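Before any session identification, entries like those shown in Figure 2 must be broken into fields. The sketch below is a minimal parser assuming one hypothetical, fixed field layout (IP, timestamp, request line, status, size, optional referrer and agent); real Common/Extended log variants differ by server configuration, so the pattern would need adjusting.

```python
import re

# Minimal parser for one Extended-log-style line, e.g.:
#   123.456.78.9 [25/Apr/1998:03:04:41 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.04 (Win95, I)
# The exact field layout is an illustrative assumption, not a normative grammar.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
    r'(?: (?P<referrer>\S+))?'   # "-" is commonly logged when the referrer is absent
    r'(?: (?P<agent>.+))?$'      # everything after the referrer is the agent string
)

def parse_log_line(line):
    """Return a dict of named fields, or None when the line is malformed."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None
```

Note that a "-" referrer is kept here as a literal field value; a production parser would normalize it to a missing value before path completion.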

 #  IP Address    Userid  Time                          Method/URI/Protocol    Status  Size  Referrer  Agent
 1  123.456.78.9  -       [25/Apr/1998:03:04:41 -0500]  "GET A.html HTTP/1.0"  200     3290  -         Mozilla/3.04 (Win95, I)
 2  123.456.78.9  -       [25/Apr/1998:03:05:34 -0500]  "GET B.html HTTP/1.0"  200     2050  A.html    Mozilla/3.04 (Win95, I)
 3  123.456.78.9  -       [25/Apr/1998:03:05:39 -0500]  "GET L.html HTTP/1.0"  200     4130  -         Mozilla/3.04 (Win95, I)
 4  123.456.78.9  -       [25/Apr/1998:03:06:02 -0500]  "GET F.html HTTP/1.0"  200     5896  B.html    Mozilla/3.04 (Win95, I)
 5  123.456.78.9  -       [25/Apr/1998:03:06:58 -0500]  "GET A.html HTTP/1.0"  200     3290  -         Mozilla/3.01 (X11, I, IRIX6.2, IP22)
 6  123.456.78.9  -       [25/Apr/1998:03:07:42 -0500]  "GET B.html HTTP/1.0"  200     2050  A.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
 7  123.456.78.9  -       [25/Apr/1998:03:07:55 -0500]  "GET R.html HTTP/1.0"  200     8140  L.html    Mozilla/3.04 (Win95, I)
 8  123.456.78.9  -       [25/Apr/1998:03:09:50 -0500]  "GET C.html HTTP/1.0"  200     1820  A.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
 9  123.456.78.9  -       [25/Apr/1998:03:10:02 -0500]  "GET O.html HTTP/1.0"  200     2270  F.html    Mozilla/3.04 (Win95, I)
10  123.456.78.9  -       [25/Apr/1998:03:10:45 -0500]  "GET J.html HTTP/1.0"  200     9430  C.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
11  123.456.78.9  -       [25/Apr/1998:03:12:23 -0500]  "GET G.html HTTP/1.0"  200     7220  B.html    Mozilla/3.04 (Win95, I)
12  209.456.78.2  -       [25/Apr/1998:05:05:22 -0500]  "GET A.html HTTP/1.0"  200     3290  -         Mozilla/3.04 (Win95, I)
13  209.456.78.3  -       [25/Apr/1998:05:06:03 -0500]  "GET D.html HTTP/1.0"  200     1680  A.html    Mozilla/3.04 (Win95, I)

Figure 2: Sample Web Server Log
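The session-identification heuristics of Section 3.1.1 — grouping requests by IP address and user agent, then splitting each click-stream at a thirty-minute gap — can be sketched as follows. The request tuples below are illustrative examples in the spirit of Figure 2, not its exact rows.

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Split requests into server sessions: group by (ip, agent), then
    break each click-stream wherever the gap between consecutive
    requests exceeds `timeout` (the thirty-minute default heuristic)."""
    streams = {}
    for ip, agent, ts, page in sorted(requests, key=lambda r: r[2]):
        streams.setdefault((ip, agent), []).append((ts, page))
    sessions = []
    for stream in streams.values():
        current = [stream[0]]
        for prev, nxt in zip(stream, stream[1:]):
            if nxt[0] - prev[0] > timeout:
                sessions.append([page for _, page in current])
                current = []
            current.append(nxt)
        sessions.append([page for _, page in current])
    return sessions

# Illustrative click-stream: one IP, two agents, one long gap.
t = lambda s: datetime.strptime(s, "%d/%b/%Y:%H:%M:%S")
log = [
    ("123.456.78.9", "Mozilla/3.04", t("25/Apr/1998:03:04:41"), "A.html"),
    ("123.456.78.9", "Mozilla/3.04", t("25/Apr/1998:03:05:34"), "B.html"),
    ("123.456.78.9", "Mozilla/3.01", t("25/Apr/1998:03:06:58"), "A.html"),
    ("123.456.78.9", "Mozilla/3.04", t("25/Apr/1998:04:10:00"), "L.html"),
]
```

On this toy log the Mozilla/3.04 stream splits at the 64-minute gap, and the lone Mozilla/3.01 request forms its own session. Real preprocessing would also apply path completion using the referrer field, which this sketch omits.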

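Section 3.1.2 notes that content must be converted into a quantifiable format, typically some version of the vector space model [51]. A deliberately simplified bag-of-words variant (plain term frequencies with a cosine comparison; no stemming, stop lists, or weighting as a real system would use) might look like:

```python
from collections import Counter
import re

def term_vector(text):
    """Reduce a page's text to a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    if not u or not v:
        return 0.0
    dot = sum(count * v[term] for term, count in u.items())
    norm = lambda vec: sum(c * c for c in vec.values()) ** 0.5
    return dot / (norm(u) * norm(v))

# Hypothetical page snippets for illustration only.
a = term_vector("electronic products and product reviews")
b = term_vector("sporting equipment product catalog")
```

Such vectors are what clustering or classification over page content would consume; keywords or text descriptions would stand in for graphics and multimedia, as described above.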


tional algorithms as desired. Dynamic page views present more of a challenge. Content servers that employ personalization techniques and/or draw upon databases to construct the page views may be capable of forming more page views than can be practically preprocessed. A given set of server sessions may only access a fraction of the page views possible for a large dynamic site. Also, the content may be revised on a regular basis. The content of each page view to be preprocessed must be "assembled", either by an HTTP request from a crawler, or a combination of template, script, and database accesses. If only the portion of page views that are accessed are preprocessed, the output of any classification or clustering algorithms may be skewed.

3.1.3 Structure Preprocessing

The structure of a site is created by the hypertext links between page views. The structure can be obtained and preprocessed in the same manner as the content of a site. Again, dynamic content (and therefore links) pose more problems than static page views. A different site structure may have to be constructed for each server session.

3.2 Pattern Discovery

Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition. However, it is not the intent of this paper to describe all the available algorithms and techniques derived from these fields. Interested readers should consult references such as [33; 24]. This section describes the kinds of mining activities that have been applied to the Web domain. Methods developed from other fields must take into consideration the different kinds of data abstractions and prior knowledge available for Web Mining. For example, in association rule discovery, the notion of a transaction for market-basket analysis does not take into consideration the order in which items are selected. However, in Web Usage Mining, a server session is an ordered sequence of pages requested by a user. Furthermore, due to the difficulty in identifying unique sessions, additional prior knowledge is required (such as imposing a default timeout period, as was pointed out in the previous section).

3.2.1 Statistical Analysis

Statistical techniques are the most common method to extract knowledge about visitors to a Web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time and length of a navigational path. Many Web traffic analysis tools produce a periodic report containing statistical information such as the most frequently accessed pages, average view time of a page or average length of a path through a site. This report may include limited low-level error analysis such as detecting unauthorized entry points or finding the most common invalid URIs. Despite lacking in the depth of its analysis, this type of knowledge can be potentially useful for improving the system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.

3.2.2 Association Rules

Association rule generation can be used to relate pages that are most often referenced together in a single server session. In the context of Web Usage Mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. These pages may not be directly connected to one another via hyperlinks. For example, association rule discovery using the Apriori algorithm [18] (or one of its variants) may reveal a correlation between users who visited a page containing electronic products and those who access a page about sporting equipment. Aside from being applicable for business and marketing applications, the presence or absence of such rules can help Web designers to restructure their Web site. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site.

3.2.3 Clustering

Clustering is a technique to group together a set of items having similar characteristics. In the Web Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in E-commerce applications or provide personalized Web content to the users. On the other hand, clustering of pages will discover groups of pages having related content. This information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs.

3.2.4 Classification

Classification is the task of mapping a data item into one of several predefined classes [33]. In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires extraction and selection of features that best describe the properties of a given class or category. Classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, etc. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast.

3.2.5 Sequential Patterns

The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns, which will be helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, or similarity analysis.

3.2.6 Dependency Modeling

Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain. As an example, one may be interested to build a model representing the different stages a visitor undergoes while shopping in an online store based on the actions chosen (i.e. from a casual visitor to a serious potential buyer). There are several probabilistic learning techniques that can be employed to model the browsing behavior of users. Such techniques include Hidden Markov Models and Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a theoretical framework for analyzing the behavior of users but is potentially useful for predicting future Web resource consumption. Such information may help develop strategies to increase the sales of products offered by the Web site or improve the navigational convenience of users.

3.3 Pattern Analysis

Pattern analysis is the last step in the overall Web Usage mining process as described in Figure 1. The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done. The most common form of pattern analysis consists of a knowledge query mechanism such as SQL. Another method is to load usage data into a data cube in order to perform OLAP operations. Visualization techniques, such as graphing patterns or assigning colors to different values, can often highlight overall patterns or trends in the data. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.

4. TAXONOMY AND PROJECT SURVEY

Since 1996 there have been several research projects and commercial products that have analyzed Web usage data for a number of different purposes. This section describes the dimensions and application areas that can be used to classify Web Usage Mining projects.

4.1 Taxonomy Dimensions

While the number of candidate dimensions that can be used to classify Web Usage Mining projects is many, there are five major dimensions that apply to every project - the data sources used to gather input, the types of input data, the number of users represented in each data set, the number of Web sites represented in each data set, and the application area focused on by the project. Usage data can either be gathered at the server level, proxy level, or client level, as discussed in Section 2.1. As shown in Figure 3, most projects make use of server side data. All projects analyze usage

Web Usage Mining in general, without extensive tailoring of the process towards one of the various sub-categories. The WebSIFT project is discussed in more detail in the next section. Chen et al. [25] introduced the concept of maximal forward reference to characterize user episodes for the mining of traversal patterns. A maximal forward reference is the sequence of pages requested by a user up to the last page before backtracking occurs during a particular server session. The SpeedTracer project [56] from IBM Watson is built on the work originally reported in [25]. In addition to episode identification, SpeedTracer makes use of referrer and agent information in the preprocessing routines to identify users and server sessions in the absence of additional client side information. The Web Utilization Miner (WUM) system [55] provides a robust mining language in order to specify characteristics of discovered frequent paths that are interesting to the analyst. In their approach, individual navigation paths, called trails, are combined into an aggregated tree structure. Queries can be answered by mapping them into the intermediate nodes of the tree structure. Han et al. [58] have loaded Web server logs into a data cube structure in order to perform data mining as well as On-Line Analytical Processing (OLAP) activities such as roll-up and drill-down of the data. Their WebLogMiner system has been used to discover association rules, perform classification and time-series analysis (such as event sequence analysis, transition analysis and trend analysis). Shahabi et al. [53; 59] have one of the few Web Usage mining systems that relies on client side data collection. The client side agent sends back page request and time information to the server every time a page containing the Java applet (either a new page or a previously cached page) is loaded or destroyed.

4.2.1 Personalization

Personalizing the Web experience for a user is the holy grail of many Web-based applications, e.g. individualized marketing for e-commerce [4]. Making dynamic recommendations to a Web user, based on her/his profile in addition to usage behavior, is very attractive to many applications, e.g. cross-sales and up-sales in e-commerce. Web usage mining is an excellent approach for achieving this goal, as illustrated in [43]. Existing recommendation systems, such as [8; 6], do not currently use data mining for recommendations, though there have been some recent proposals [16].

The WebWatcher [37], SiteHelper [45], Letizia [39], and clustering work by Mobasher et al. [43] and Yan et al. [57] have all concentrated on providing Web Site personalization based on usage information. Web server logs were used by Yan et al. [57] to discover clusters of users having sim-
d a t a and some also make use of content, structure, or profile ilar access patterns. The system proposed in [57] consists
data. The algorithms for a project can be designed to work of an offline module that will perform cluster analysis and
on inputs representing one or many users and one or many an online module which is responsible for dynamic link gen-
Web sites. Single user projects are generally involved in the eration of Web pages. Every site user will be assigned to
personalization application axea. The projects that provide a single cluster based on their current traversal pattern.
multi-site analysis use either client or proxy level input data The links that are presented to a given user axe dynami-
in order to easily access usage d a t a from more than one cally selected based on what pages other users assigned to
Web site. Most Web Usage Mining projects take single-site, the same cluster have visited. The SiteHelper project learns
multi-user, server-side usage data (Web server logs) as input. a users preferences by looking at the page accesses for each
user. A list of keywords from pages that a user has spent
4.2 Project Survey a significant amount of time viewing is compiled and pre-
As shown in Figures 3 and 4, usage patterns extracted from sented to the user. Based on feedback about the keyword
Web d a t a have been applied to a wide range of applica- list, recommendations for other pages within the site are
tions. Projects such as [31; 55; 56; 58; 53] have focused on made. WebWatcher "follows" a user as be or she browses

SIGKDD Explorations. Jan 2000. Volume 1, Issue 2 - page 17


Project                              Focus
-----------------------------------  -----------------
WebSIFT (CTS99)                      General
SpeedTracer (WYB98, CPY96)           General
WUM (SF98)                           General
Shahabi (SZAS97, ZASS97)             General
SiteHelper (NW97)                    Personalization
Letizia (Lie95)                      Personalization
WebWatcher (JFM97)                   Personalization
Krishnapuram (NKJ99)                 Personalization
Analog (YJGD96)                      Personalization
Mobasher (MCS99)                     Personalization
Tuzhilin (PT98)                      Business
SurfAid                              Business
Buchner (BM98)                       Business
WebTrends, Hitlist, Accrue, etc.     Business
WebLogMiner (ZXH98)                  Business
PageGather, SCML (PE98, PE99)        Site Modification
Manley (Man97)                       Characterization
Arlitt (AW96)                        Characterization
Pitkow (PIT97, PIT98)                Characterization
Almeida (ABC96)                      Characterization
Rexford (CKR98)                      System Improve.
Schechter (SKS98)                    System Improve.
Aggarwal (AY97)                      System Improve.

Figure 3: Web Usage Mining Research Projects and Products. (The original figure also marks, for each project, the data source used (server, proxy, or client) and the data types analyzed (structure, content, usage, profile); those marks did not survive extraction.)
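The five dimensions of Section 4.1, tabulated in Figure 3, lend themselves to a simple representation for querying a project catalog. The sketch below is illustrative only: the field names and the three sample entries are our own encoding of the figure, not part of the original system descriptions.

```python
from dataclasses import dataclass

@dataclass
class Project:
    """One Web Usage Mining project, classified along the five
    taxonomy dimensions of Section 4.1."""
    name: str
    data_source: str    # "server", "proxy", or "client"
    data_types: tuple   # subset of {"structure", "content", "usage", "profile"}
    multi_user: bool    # input represents many users?
    multi_site: bool    # input spans more than one Web site?
    application: str    # e.g. "general", "personalization", "business"

# Illustrative entries, following Figure 3's classifications.
projects = [
    Project("WebSIFT", "server", ("structure", "content", "usage"), True, False, "general"),
    Project("Letizia", "client", ("content", "usage"), False, True, "personalization"),
    Project("WUM", "server", ("usage",), True, False, "general"),
]

# Example query: single-site, multi-user, server-side projects,
# the most common configuration noted in Section 4.1.
common = [p.name for p in projects
          if p.data_source == "server" and p.multi_user and not p.multi_site]
print(common)  # -> ['WebSIFT', 'WUM']
```

The same structure extends naturally to the full Figure 3 catalog, and each predicate in the query corresponds to one taxonomy dimension.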

Web Usage Mining (general): WebSIFT, WUM, SpeedTracer, WebLogMiner, Shahabi
  - Personalization: SiteHelper, Letizia, WebWatcher, Mobasher, Analog, Krishnapuram
  - System Improvement: Rexford, Schechter, Aggarwal
  - Site Modification: Adaptive Sites
  - Business: SurfAid, Buchner, Tuzhilin
  - Usage Characterization: Pitkow, Arlitt, Manley, Almeida

Figure 4: Major Application Areas for Web Usage Mining



the Web and identifies links that are potentially interesting
to the user. The WebWatcher starts with a short description
of a user's interests. Each page request is routed through
the WebWatcher proxy server in order to easily track the
user session across multiple Web sites and mark any interesting
links. WebWatcher learns based on the particular
user's browsing plus the browsing of other users with similar
interests. Letizia is a client side agent that searches
the Web for pages similar to ones that the user has already
viewed or bookmarked. The page recommendations in [43]
are based on clusters of pages found from the server log for a
site. The system recommends pages from clusters that most
closely match the current session. Pages that have not been
viewed and are not directly linked from the current page are
recommended to the user. [44] attempts to cluster user sessions
using a fuzzy clustering algorithm, allowing a page or
user to be assigned to more than one cluster.

4.2.2 System Improvement
Performance and other service quality attributes are crucial
to user satisfaction with services such as databases, networks,
etc. Similar qualities are expected by the users of
Web services. Web usage mining provides the key to understanding
Web traffic behavior, which can in turn be used
for developing policies for Web caching, network transmission
[27], load balancing, or data distribution. Security is an
acutely growing concern for Web-based services, especially
as electronic commerce continues to grow at an exponential
rate [32]. Web usage mining can also provide patterns
which are useful for detecting intrusion, fraud, attempted
break-ins, etc.
Almeida et al. [19] propose models for predicting the locality,
both temporal and spatial, amongst Web pages
requested by a particular user or a group of users accessing
from the same proxy server. The locality measure can
then be used for deciding pre-fetching and caching strategies
for the proxy server. The increasing use of dynamic content
has reduced the benefits of caching at both the client and
server level. Schechter et al. [52] have developed algorithms
for creating path profiles from data contained in server logs.
These profiles are then used to pre-generate dynamic HTML
pages based on the current user profile in order to reduce latency
due to page generation. Using proxy information for
pre-fetching pages has also been studied by [27] and [17].

4.2.3 Site Modification
The attractiveness of a Web site, in terms of both content
and structure, is crucial to many applications, e.g. a product
catalog for e-commerce. Web usage mining provides detailed
feedback on user behavior, giving the Web site designer
information on which to base redesign decisions.
While the results of any of the projects could lead to redesigning
the structure and content of a site, the adaptive
Web site project (SCML algorithm) [48; 49] focuses on automatically
changing the structure of a site based on usage
patterns discovered from server logs. Clustering of pages is
used to determine which pages should be directly linked.

4.2.4 Business Intelligence
Information on how customers are using a Web site is critical
for marketers of e-tailing businesses. Buchner
et al. [22] have presented a knowledge discovery process in order
to discover marketing intelligence from Web data. They
define a Web log data hypercube that consolidates Web
usage data along with marketing data for e-commerce applications.
They identified four distinct steps in the customer relationship
life cycle that can be supported by their knowledge
discovery techniques: customer attraction, customer retention,
cross sales and customer departure. There are several
commercial products, such as SurfAid [11], Accrue [1], NetGenesis
[7], Aria [3], Hitlist [5], and WebTrends [13], that
provide Web traffic analysis mainly for the purpose of gathering
business intelligence. Accrue, NetGenesis, and Aria
are designed to analyze e-commerce events such as products
bought and advertisement click-through rates in addition
to straightforward usage statistics. Accrue provides a
path analysis visualization tool, and IBM's SurfAid provides
OLAP through a data cube and clustering of users in addition
to page view statistics. Padmanabhan et al. [46] use
Web server logs to generate beliefs about the access patterns
of Web pages at a given Web site. Algorithms for finding interesting
rules based on the unexpectedness of the rule were
also developed.

4.2.5 Usage Characterization
While most projects that work on characterizing the usage,
content, and structure of the Web don't necessarily consider
themselves to be engaged in data mining, there is a
large amount of overlap between Web characterization research
and Web Usage mining. Catledge et al. [23] discuss
the results of a study conducted at the Georgia Institute of
Technology, in which the Web browser XMosaic was modified
to log client side activity. The results collected provide
detailed information about the user's interaction with the
browser interface as well as the navigational strategy used to
browse a particular site. The project also provides detailed
statistics about the occurrence of various client side events,
such as clicking the back/forward buttons, saving a file,
adding to bookmarks, etc. Pitkow et al. [36] propose a model
which can be used to predict the probability distribution for
the various pages a user might visit on a given site. This model
works by assigning a value to all the pages on a site based on
various attributes of each page. The formulas and threshold
values used in the model are derived from an extensive
empirical study carried out on various browsing communities
and their browsing patterns. Arlitt et al. [20] discuss
various performance metrics for Web servers along with details
about the relationship between each of these metrics
for different workloads. Manley [40] develops a technique
for generating a custom-made benchmark for a given site
based on its current workload. This benchmark, which he
calls a self-configuring benchmark, can be used to perform
scalability and load balancing studies on a Web server. Chi
et al. [35] describe a system called WEEV (Web Ecology
and Evolution Visualization), a visualization tool for studying
the evolving relationship among web usage, content and
site topology over time.

5. WEBSIFT OVERVIEW
The WebSIFT system [31] is designed to perform Web Usage
Mining from server logs in the extended NCSA format
(which includes referrer and agent fields). The preprocessing
algorithms include identifying users and server sessions, and
inferring cached page references through the use of the referrer
field. The details of the algorithms used for these steps are
contained in [30]. In addition to creating a server session



file, the WebSIFT system performs content and structure
preprocessing, and provides the option to convert server sessions
into episodes. Each episode is either the subset of all
content pages in a server session, or all of the navigation
pages up to and including each content page. Several algorithms
for identifying episodes (referred to as transactions
in the paper) are described and evaluated in [28].
The server session or episode files can be run through sequential
pattern analysis, association rule discovery, clustering,
or general statistics algorithms, as shown in Figure 5.
The results of the various knowledge discovery tools can be
analyzed through a simple knowledge query mechanism, a
visualization tool (an association rule map with confidence-
and support-weighted edges), or the information filter (OLAP
tools such as a data cube are possible as shown in Figure 5,
but are not currently implemented). The information filter
makes use of the preprocessed content and structure information
to automatically filter the results of the knowledge
discovery algorithms for patterns that are potentially interesting.
For example, usage clusters that contain page views
from multiple content clusters are potentially interesting,
whereas usage clusters that match content clusters may not
be interesting. The details of the method the information filter
uses to combine and compare evidence from the different
data sources are contained in [31].

6. PRIVACY ISSUES
Privacy is a sensitive topic which has been attracting a lot of
attention recently due to the rapid growth of e-commerce. It is
further complicated by the global and self-regulatory nature
of the Web. The issue of privacy revolves around the fact
that most users want to maintain strict anonymity on the
Web. They are extremely averse to the idea that someone is
monitoring the Web sites they visit and the time they spend
on those sites.
On the other hand, site administrators are interested in finding
out the demographics of users as well as the usage statistics
of different sections of their Web site. This information
would allow them to improve the design of the Web site and
would ensure that the content caters to the largest population
of users visiting their site. The site administrators
also want the ability to identify a user uniquely every time
she visits the site, in order to personalize the Web site and
improve the browsing experience.
The main challenge is to come up with guidelines and rules
such that site administrators can perform various analyses
on the usage data without compromising the identity of an
individual user. Furthermore, there should be strict regulations
to prevent the usage data from being exchanged or sold
to other sites. Users should be made aware of the privacy
policies followed by any given site, so that they can
make an informed decision about revealing their personal
data. The success of any such guidelines can only be guaranteed
if they are backed up by a legal framework.
The W3C has an ongoing initiative called the Platform for
Privacy Preferences (P3P) [10; 38]. P3P provides a protocol
which allows site administrators to publish the privacy
policies followed by a site in a machine readable format.
When the user visits a site for the first time, the browser
reads the privacy policies followed by the site and compares
them with the security settings configured by the user.
If the policies are satisfactory, the browser continues requesting
pages from the site; otherwise a negotiation protocol is
used to arrive at a setting which is acceptable to the user.
Another aim of P3P is to provide guidelines for independent
organizations which can ensure that sites comply with the
policy statement they are publishing [12].
The European Union has taken the lead in setting up a regulatory
framework for Internet privacy and has issued a directive
which sets guidelines for the processing and transfer of
personal data [15]. Unfortunately, in the U.S. there is no unifying
framework in place, though the U.S. Federal Trade Commission
(FTC), after a study of commercial Web sites, has
recommended that Congress develop legislation to regulate
the personal information being collected at Web sites [26].

7. CONCLUSIONS
This paper has attempted to provide an up-to-date survey
of the rapidly growing area of Web Usage mining. With
the growth of Web-based applications, specifically electronic
commerce, there is significant interest in analyzing Web usage
data to better understand Web usage and apply the
knowledge to better serve users. This has led to a number of
commercial offerings for doing such analysis. However, Web
Usage mining raises some hard scientific questions that must
be answered before robust tools can be developed. This article
has aimed at describing such challenges, and the hope
is that the research community will take up the challenge of
addressing them.

8. REFERENCES

[1] Accrue. http://www.accrue.com.

[2] AllAdvantage. http://www.alladvantage.com.

[3] Andromedia Aria. http://www.andromedia.com.

[4] BroadVision. http://www.broadvision.com.

[5] Hit List Commerce. http://www.marketwave.com.

[6] LikeMinds. http://www.andromedia.com.

[7] NetGenesis. http://www.netgenesis.com.

[8] Net Perceptions. http://www.netperceptions.com.

[9] NetZero. http://www.netzero.com.

[10] Platform for Privacy Preferences project.
http://www.w3.org/P3P/.

[11] SurfAid Analytics. http://surfaid.dfw.ibm.com.

[12] TRUSTe: Building a web you can believe in.
http://www.truste.org/.

[13] WebTrends Log Analyzer. http://www.webtrends.com.

[14] World Wide Web Consortium web usage characterization
activity. http://www.w3.org/WCA.

[15] European Commission. The directive on the protection
of individuals with regard to the processing of personal
data and on the free movement of such data.
http://www2.echo.lu/, 1998.



Figure 5: Architecture for the WebSIFT System. (The figure is a data-flow diagram: the inputs, Site Files, Access Log, Referrer Log, Agent Log, and Registration or Remote Agent Data, feed preprocessing, which produces the Site Topology, Server Session File, and Episode File. Pattern discovery applies sequential pattern mining, clustering, association rule mining, and usage statistics, yielding Sequential Patterns, Page Clusters, User Clusters, Association Rules, and Usage Statistics, which the information filter reduces to "Interesting" Rules, Patterns, and Statistics.)



[16] Data mining: Crossing the chasm, 1999. Invited talk at
the 5th ACM SIGKDD Int'l Conference on Knowledge
Discovery and Data Mining (KDD99).

[17] Charu C. Aggarwal and Philip S. Yu. On disk caching
of web objects in proxy servers. In CIKM 97, pages
238-245, Las Vegas, Nevada, 1997.

[18] R. Agrawal and R. Srikant. Fast algorithms for mining
association rules. In Proc. of the 20th VLDB Conference,
pages 487-499, Santiago, Chile, 1994.

[19] Virgilio Almeida, Azer Bestavros, Mark Crovella, and
Adriana de Oliveira. Characterizing reference locality
in the www. Technical Report TR-96-11, Boston University,
1996.

[20] Martin F. Arlitt and Carey L. Williamson. Internet web
servers: Workload characterization and performance
implications. IEEE/ACM Transactions on Networking,
5(5):631-645, 1997.

[21] M. Balabanovic and Y. Shoham. Learning information
retrieval agents: Experiments with automated
web browsing. In On-line Working Notes of the AAAI
Spring Symposium Series on Information Gathering
from Distributed, Heterogeneous Environments, 1995.

[22] Alex Buchner and Maurice D. Mulvenna. Discovering
internet marketing intelligence through online analytical
web usage mining. SIGMOD Record, 27(4):54-61,
1998.

[23] L. Catledge and J. Pitkow. Characterizing browsing behaviors
on the world wide web. Computer Networks and
ISDN Systems, 27(6), 1995.

[24] M.S. Chen, J. Han, and P.S. Yu. Data mining: An
overview from a database perspective. IEEE Transactions
on Knowledge and Data Engineering, 8(6):866-883,
1996.

[25] M.S. Chen, J.S. Park, and P.S. Yu. Data mining
for path traversal patterns in a web environment. In
16th International Conference on Distributed Computing
Systems, pages 385-392, 1996.

[26] Roger Clarke. Internet privacy concerns confirm the case
for intervention. Communications of the ACM,
42(2):60-67, 1999.

[27] E. Cohen, B. Krishnamurthy, and J. Rexford. Improving
end-to-end performance of the web using server
volumes and proxy filters. In Proc. ACM SIGCOMM,
pages 241-253, 1998.

[28] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava.
Grouping web page references into transactions
for mining world wide web browsing patterns. In
Knowledge and Data Engineering Workshop, pages 2-9,
Newport Beach, CA, 1997. IEEE.

[29] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava.
Web mining: Information and pattern discovery
on the world wide web. In International Conference
on Tools with Artificial Intelligence, pages 558-567,
Newport Beach, 1997. IEEE.

[30] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava.
Data preparation for mining world wide web
browsing patterns. Knowledge and Information Systems,
1(1), 1999.

[31] Robert Cooley, Pang-Ning Tan, and Jaideep Srivastava.
Discovery of interesting usage patterns from web data.
Technical Report TR 99-022, University of Minnesota,
1999.

[32] T. Fawcett and F. Provost. Activity monitoring: Noticing
interesting changes in behavior. In Fifth ACM
SIGKDD International Conference on Knowledge Discovery
and Data Mining, pages 53-62, San Diego, CA,
1999. ACM.

[33] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From
data mining to knowledge discovery: An overview. In
Proc. ACM KDD, 1994.

[34] David Gibson, Jon Kleinberg, and Prabhakar Raghavan.
Inferring web communities from link topology. In
Conference on Hypertext and Hypermedia. ACM, 1998.

[35] E. H. Chi, J. Pitkow, J. Mackinlay, P. Pirolli, R. Gossweiler,
and S. K. Card. Visualizing the evolution of web
ecologies. In CHI '98, Los Angeles, California, 1998.

[36] Bernardo Huberman, Peter Pirolli, James Pitkow, and
Rajan Lukose. Strong regularities in world wide web
surfing. Technical report, Xerox PARC, 1998.

[37] T. Joachims, D. Freitag, and T. Mitchell. Webwatcher:
A tour guide for the world wide web. In The 15th International
Conference on Artificial Intelligence, Nagoya,
Japan, 1997.

[38] Joseph Reagle and Lorrie Faith Cranor. The platform
for privacy preferences. Communications of the ACM,
42(2):48-55, 1999.

[39] H. Lieberman. Letizia: An agent that assists web
browsing. In Proc. of the 1995 International Joint Conference
on Artificial Intelligence, Montreal, Canada,
1995.

[40] Stephen Lee Manley. An Analysis of Issues Facing
World Wide Web Servers. Undergraduate thesis, Harvard
University, 1997.

[41] B. Masand and M. Spiliopoulou, editors. Workshop on
Web Usage Analysis and User Profiling (WebKDD),
1999.

[42] B. Mobasher, N. Jain, E. Han, and J. Srivastava. Web
mining: Pattern discovery from world wide web transactions.
Technical Report TR 96-050, University of Minnesota,
1996.

[43] Bamshad Mobasher, Robert Cooley, and Jaideep Srivastava.
Creating adaptive web sites through usage-based
clustering of urls. In Knowledge and Data Engineering
Workshop, 1999.

[44] Olfa Nasraoui, Raghu Krishnapuram, and Anupam
Joshi. Mining web access logs using a fuzzy relational
clustering algorithm based on a robust estimator.
In Eighth International World Wide Web Conference,
Toronto, Canada, 1999.



[45] D.S.W. Ngu and X. Wu. Sitehelper: A localized agent
that helps incremental exploration of the world wide
web. In 6th International World Wide Web Conference,
Santa Clara, CA, 1997.

[46] Balaji Padmanabhan and Alexander Tuzhilin. A belief-driven
method for discovering unexpected patterns. In
Fourth International Conference on Knowledge Discovery
and Data Mining, pages 94-100, New York, New
York, 1998.

[47] M. Pazzani, L. Nguyen, and S. Mantik. Learning from
hotlists and coldlists: Towards a www information filtering
and seeking agent. In IEEE 1995 International
Conference on Tools with Artificial Intelligence, 1995.

[48] Mike Perkowitz and Oren Etzioni. Adaptive web sites:
Automatically synthesizing web pages. In Fifteenth National
Conference on Artificial Intelligence, Madison,
WI, 1998.

[49] Mike Perkowitz and Oren Etzioni. Adaptive web sites:
Conceptual cluster mining. In Sixteenth International
Joint Conference on Artificial Intelligence, Stockholm,
Sweden, 1999.

[50] Peter Pirolli, James Pitkow, and Ramana Rao. Silk
from a sow's ear: Extracting usable structures from
the web. In CHI-96, Vancouver, 1996.

[51] G. Salton and M.J. McGill. Introduction to Modern Information
Retrieval. McGraw-Hill, New York, 1983.

[52] S. Schechter, M. Krishnan, and M. D. Smith. Using
path profiles to predict http requests. In 7th International
World Wide Web Conference, Brisbane, Australia,
1998.

[53] Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi, and
Vishal Shah. Knowledge discovery from users web-page
navigation. In Workshop on Research Issues in Data
Engineering, Birmingham, England, 1997.

[54] E. Spertus. Parasite: Mining structural information on
the web. Computer Networks and ISDN Systems: The
International Journal of Computer and Telecommunication
Networking, 29:1205-1215, 1997.

[55] Myra Spiliopoulou and Lukas C. Faulstich. WUM: A
web utilization miner. In EDBT Workshop WebDB98,
Valencia, Spain, 1998. Springer Verlag.

[56] Kun-lung Wu, Philip S. Yu, and Allen Ballman. Speedtracer:
A web usage mining and analysis tool. IBM
Systems Journal, 37(1), 1998.

[57] T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal.
From user access patterns to dynamic hypertext linking.
In Fifth International World Wide Web Conference,
Paris, France, 1996.

[58] O. R. Zaiane, M. Xin, and J. Han. Discovering web
access patterns and trends by applying olap and data
mining technology on web logs. In Advances in Digital
Libraries, pages 19-29, Santa Barbara, CA, 1998.

[59] Amir Zarkesh, Jafar Adibi, Cyrus Shahabi, Reza Sadri,
and Vishal Shah. Analysis and design of server informative
www-sites. In Sixth International Conference on
Information and Knowledge Management, Las Vegas,
Nevada, 1997.

About the Authors:

Jaideep Srivastava received the B.Tech. degree in computer
science from the Indian Institute of Technology, Kanpur, India,
in 1983, and the M.S. and Ph.D. degrees in computer science
from the University of California, Berkeley, in 1985 and 1988,
respectively. Since 1988 he has been on the faculty of the Computer
Science Department, University of Minnesota, Minneapolis,
where he is currently an Associate Professor. In 1983 he was a
research engineer with Uptron Digital Systems, Lucknow, India.
He has published over 110 papers in refereed journals and conferences
in the areas of databases, parallel processing, artificial
intelligence, and multi-media. His current research is in the areas
of databases, distributed systems, and multi-media computing.
He has given a number of invited talks and participated in panel
discussions on these topics. Dr. Srivastava is a senior member of
the IEEE Computer Society and the ACM. His professional activities
have included being on various program committees, and
refereeing for journals, conferences, and the NSF.

Robert Cooley is currently pursuing a Ph.D. in computer science
at the University of Minnesota. He received an M.S. in
computer science from Minnesota in 1998. His research interests
include Data Mining and Information Retrieval.

Mukund Deshpande is a Ph.D. student in the Department of
Computer Science at the University of Minnesota. He received
an M.E. in system science & automation from the Indian Institute
of Science, Bangalore, India in 1997.

Pang-Ning Tan is currently working towards his Ph.D. in Computer
Science at the University of Minnesota. His primary research
interest is in Data Mining. He received an M.S. in Physics from
the University of Minnesota in 1996.

