
Web Mining

By-
Pawan Singh
Piyush Arora
Pooja Mansharamani
Pramod Singh
Praveen Kumar
Outline

• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Four Problems

• Finding relevant information
  o Low precision, which is due to the irrelevance of many of the search results. This makes the relevant information difficult to find.
  o Low recall, which is due to the inability to index all the information available on the web. This makes unindexed but relevant information difficult to find.

• Creating new knowledge out of the information available on the web
  o While the problem above is a query-triggered (retrieval-oriented) process, this problem is a data-triggered process.

• Personalizing the information
  o Catering to personal preferences in content and presentation (associated with the type and presentation of the information)

• Learning about the consumers
  o What does the customer want to do?
  o Using web data to effectively market products and/or services
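
The low-precision and low-recall problems listed under "Finding relevant information" can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the document IDs are assumptions, not part of the original slides. Precision is the fraction of retrieved results that are relevant; recall is the fraction of all relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 results returned, 2 of them relevant,
# while 5 relevant pages exist on the (toy) web.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here half of the results are irrelevant (low precision) and three relevant pages were never returned, for example because they were not indexed (low recall).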

Other Approaches

Web mining is NOT the only approach

• Database approach (DB)
• Information retrieval (IR)
• Natural language processing (NLP)
  o In-depth syntactic and semantic analysis
• Web document community
  o Standards, manually appended meta-information, maintained directories, etc.

Direct vs. Indirect Web Mining

• Web mining techniques can be used to solve the information overload problems:
  o Directly: attack the problem with web mining techniques
    E.g. a newsgroup agent classifies news as relevant
  o Indirectly: used as part of a bigger application that addresses the problems
    E.g. used to create index terms for a web search service

The Research

• Converging research from databases, information retrieval, and artificial intelligence (specifically NLP and machine learning)
• Focusing on research from the machine learning point of view

Web Mining: Definition

• “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
• Can be viewed as four subtasks
• Not the same as Information Retrieval
• Not the same as Information Extraction

Web Mining: Subtasks

• Resource finding
  o Retrieving intended documents
• Information selection/pre-processing
  o Select and pre-process specific information from retrieved web resources
• Generalization
  o Discover general patterns within and across web sites
• Analysis
  o Validation and/or interpretation of mined patterns

Web Mining: Not IR

• Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
• Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)

Web Mining: Not IE

• Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select relevant documents.
  o IE systems for the general Web are not feasible
  o Most focus on specific Web sites or content

IE - IR

Information Retrieval
• Automatic retrieval of relevant documents
• Primary goals:
  o Indexing text
  o Searching for useful documents in a collection
  o “Bag of unordered words”
  o “Web document classification” task is an instance of IR

Information Extraction
• Extract relevant facts from documents
• Primary goals:
  o Transform a collection of retrieved documents into information
  o Structure or representation of a document
  o “Web document classification” task is an instance of IR
  o IE has a higher level of granularity
  o Result:
    Structured database
    Compression or summary of text or documents
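
The IR goals above (indexing text, searching a collection, the "bag of unordered words" view of a document) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the tokenizer, the function names, and the three toy documents are invented here and are not from the slides.

```python
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Represent a text as unordered word counts (word order is discarded)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs):
    """Build an inverted index: term -> set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in bag_of_words(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every query term."""
    terms = list(bag_of_words(query))
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Toy collection (hypothetical documents).
docs = {
    "a": "Web mining discovers patterns in web data",
    "b": "Information retrieval selects relevant documents",
    "c": "Web usage mining studies web server logs",
}
index = build_index(docs)
print(sorted(search(index, "web mining")))
```

The search returns whole documents that match the query, which is the IR view; turning those documents into structured facts is the IE step.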

Types of IE

• IE from unstructured texts (classical)
  o Unstructured: free text, e.g. news stories
  o Basic to deep linguistic pre-processing

• IE from semi-structured texts (structural)
  o Semi-structured: HTML
  o Uses meta-information, e.g. HTML tags
  o Wrapper induction: machine learning used to build systems (semi-)automatically
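
To illustrate structural IE, the sketch below uses HTML tags as meta-information to pull records out of a table. It is a hand-written wrapper for one assumed page layout, not wrapper induction itself: a wrapper-induction system would learn rules like these from labelled example pages. The class name, the sample page, and the field names are assumptions for illustration.

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            if self._row:  # skip rows without <td> cells, e.g. the header row
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical semi-structured page.
page = (
    "<table>"
    "<tr><th>Title</th><th>Price</th></tr>"
    "<tr><td>Data Mining</td><td>40</td></tr>"
    "<tr><td>Web Search</td><td>35</td></tr>"
    "</table>"
)

parser = TableRowExtractor()
parser.feed(page)
# Turn the extracted cells into structured records.
records = [{"title": title, "price": int(price)} for title, price in parser.rows]
print(records)
```

The output is a small structured database of facts, which is the result IE aims for, in contrast to IR returning the page itself.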

Web Mining and Machine Learning

• Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
• Web mining is NOT learning from the Web.
• Some applications of machine learning on the web are NOT Web Mining
• Methods used for Web Mining are NOT limited to machine learning
• There is a close relationship between web mining and machine learning

Web Mining and Machine Learning

• Machine learning techniques support web mining, as they can be applied to the processes in web mining.

• For example, recent research shows that applying machine learning techniques could improve the text classification process compared to traditional IR techniques.

• In short, web mining intersects with the application of machine learning on the web.
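
As one possible illustration of machine learning applied to text classification, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing. The slides do not name a specific algorithm, so the choice of naive Bayes, the labels, and the four training sentences are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label and documents per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability under naive Bayes."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_docs)  # log prior
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data for a news-filtering agent.
examples = [
    ("stock market prices fall", "relevant"),
    ("market shares and stock news", "relevant"),
    ("football match results", "irrelevant"),
    ("team wins football cup", "irrelevant"),
]
word_counts, label_counts = train(examples)
print(classify("stock prices news", word_counts, label_counts))
print(classify("football cup", word_counts, label_counts))
```

The classifier learns its decision rule from labelled examples rather than from a hand-written query, which is the sense in which machine learning supports a web mining task here.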

Web Mining Categories

• Web Content Mining
  o Discovering useful information from web contents/data/documents
• Web Structure Mining
  o Discovering the model underlying link structures (topology) on the Web, e.g. discovering authorities and hubs
• Web Usage Mining
  o Making sense of data generated by surfers
  o Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

Web Content Data Structure

• Unstructured – free text
• Semi-structured – HTML
• More structured – table or database generated HTML pages
• Multimedia data – receives less attention than text or hypertext

Web Structure Mining

• Interested in the structure between Web documents (not within a document)
• Example: PageRank – Google
• Application: discovering micro-communities in the Web
• Measuring the “completeness” of a Web site

Web Usage Mining
• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  o Web client data
  o Proxy server data
  o Web server data
• Two common approaches
  o Map usage data into relational tables before using adapted data mining techniques
  o Use log data directly by utilizing special pre-processing techniques
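
The first approach (mapping usage data into relational tables) might look like the sketch below. It assumes web server logs in the Common Log Format and loads them into an in-memory SQLite table; the regular expression, table name, column names, and sample log lines are illustrative assumptions, not part of the slides.

```python
import re
import sqlite3

# Assumed input: Common Log Format lines from a web server.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return (host, time, method, path, status), or None if the line is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return (match["host"], match["time"], match["method"],
            match["path"], int(match["status"]))


def load_logs(lines):
    """Load parsed log lines into an in-memory relational table named 'hits'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows)
    return conn


# Hypothetical log lines, including one that fails to parse.
sample = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:57:12 -0700] "GET /products.html HTTP/1.0" 200 5120',
    'malformed line',
]
conn = load_logs(sample)
top_pages = conn.execute(
    "SELECT path, COUNT(*) AS n FROM hits GROUP BY path ORDER BY n DESC, path"
).fetchall()
print(top_pages)
```

Once the data is relational, standard data mining techniques such as frequent-pattern or association analysis can be run over the table with ordinary queries.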
Thank you!
