
CHAPTER 5

WEB DATA MINING


Learning Objectives

1. Explain what web mining is all about
2. Define the relevant web terminology
3. Explain what web content mining is and how it may be used for discovering useful information from the web
4. Analyze the structure of the web using the HITS algorithm designed for web structure mining
5. Understand user behavior in interacting with the web or with a web site in order to improve the quality of service

5.1 INTRODUCTION

Definition: Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from Web data. It is normally expected that either the hyperlink structure of the Web or the Web log data or both have been used in the mining process.

Web mining can be divided into three categories:

1. Web content mining: It deals with discovering useful information or knowledge from Web page contents. In contrast to Web usage mining and Web structure mining, it focuses on the Web page content rather than the links.
2. Web structure mining: It deals with discovering and modeling the link structure of the Web. Work has been carried out to model the Web based on the topology of the hyperlinks. This can help in discovering similarity between sites, in discovering important sites for a particular topic or discipline, or in discovering Web communities.
3. Web usage mining: It deals with understanding user behavior in interacting with the Web or with a Web site. One of the aims is to obtain information that may assist Web site reorganization or assist site adaptation to better suit the user.

The mined data often includes the logs of users' interactions with the Web: Web server logs, proxy server logs, and browser logs. The logs include information about the referring pages, user identification, the time a user spends at a site and the sequence of pages visited. The three categories above are not independent, since Web structure mining is closely related to Web content mining and both are related to Web usage mining.

Mining the Web is different from mining a collection of conventional text documents in several respects:

1. Hyperlinks: Conventional text documents do not have hyperlinks, while links are very important components of Web documents. In hard copy documents in a library, the documents are usually structured (e.g. books) and they have been catalogued by cataloguing experts. No linkage between these documents is identified except that two documents may have been catalogued in the same classification and therefore deal with similar topics.
2. Types of information: As noted above, Web pages differ widely in structure, quality and usefulness. Web pages can consist of text, frames, multimedia objects, animation and other types of information quite different from text documents, which mainly consist of text but may have some other objects like tables, diagrams, figures and some images.
3. Dynamics: Text documents do not change unless a new edition of a book appears, while Web pages change frequently because the information on the Web, including linkage information, is updated all the time (although some Web pages are out of date and never seem to change!) and new pages appear every second. Finding a previous version of a page is almost impossible on the Web, and links pointing to a page may work today but not tomorrow.
4. Quality: Text documents are usually of high quality since they go through some quality control process because they are very expensive to produce. In contrast, much of the information on the Web is of low quality; compared to the size of the Web, it may be that less than 10% of Web pages are really useful and of high quality.

5. Huge size: Although some libraries are very large, the Web in comparison is much larger; its size is perhaps approaching 100 terabytes, which is equivalent to about 200 million books.
6. Document use: Compared to the use of conventional documents, the use of Web documents is very different. Web users tend to pose short queries, browse perhaps the first page of the results and then move on.

5.2 WEB TERMINOLOGY AND CHARACTERISTICS

The World Wide Web (WWW) is the set of all the nodes which are interconnected by hypertext links. A link expresses one or more relationships between two or more resources. Links may also be established within a document by using anchors. A Web page is a collection of information, consisting of one or more Web resources, intended to be rendered simultaneously, and identified by a single URL. A Web site is a collection of interlinked Web pages, including a homepage, residing at the same network location.

A Uniform Resource Locator (URL) is an identifier for an abstract or physical resource, for example a server and a file path or index. URLs are location dependent and each URL contains four distinct parts, namely the protocol type (usually http), the name of the Web server, the directory path and the file name. If a file name is not specified, index.html is assumed. A Web server serves Web pages using http to client machines so that a browser can display them. A client is the role adopted by an application when it is retrieving a Web resource.

A proxy is an intermediary which acts as both a server and a client for the purpose of retrieving resources on behalf of other clients. Clients using a proxy know that the proxy is present and that it is an intermediary. A domain name server (DNS) is a distributed database of name-to-address mappings. When a DNS server looks up a computer name, it either finds it in its list, or asks another DNS server which knows more names. A cookie is data sent by a Web server to a Web client, to be stored locally by the client and sent back to the server on subsequent requests.

Obtaining information from the Web using a search engine is called information pull, while information sent to users is called information push. For example, users may register with a site and then information is sent (pushed) to such users without their requesting it.

Graph Terminology

A directed graph is a set of nodes (pages) denoted by V and edges (links) denoted by E. Thus a graph is (V, E) where all edges are directed, just like a link that points from one page to another, and each edge may be considered an ordered pair of nodes, namely the nodes that they link. An undirected graph is also represented by nodes and edges (V, E) but the edges have no direction specified. Therefore an undirected graph is not like the pages and links on the Web unless we assume the possibility of traversal in both directions. The back button on browsers does provide the possibility of back traversal once a link has been traversed in one direction, but in general both-way traversal of links is not possible on the Web.

A graph may be searched either by a breadth-first search or by a depth-first search. The breadth-first search first visits all the nodes that can be reached from the node where the search is starting; once these nodes have been searched, it searches the nodes at the next level that can be reached from those nodes, and so on.

The depth-first search, in contrast, visits a node and then searches any unvisited descendants of that node before visiting any brother (sibling) nodes. Essentially, the search goes down before going across to a brother node. The diameter of the graph is defined as the maximum of the minimum distances between all possible ordered node pairs (u, v); that is, it is the maximum number of links that one would need to follow starting from any page u to reach any page v, assuming that the best path has been followed.

A well-known study of the Web graph found that it has a bow-tie structure made up of the following components.

The Strongly Connected Core (SCC): This part of the Web was found to consist of about 30% of the Web, which is still very large given more than four billion pages on the Web in 2004. This core may be considered the heart of the Web and its main property is that pages in the core can reach each other following directed edges (i.e. hyperlinks).

The IN Group: This part of the Web was found to consist of about 20% of the Web. The main property of the IN group is that pages in the group can reach the SCC but cannot be reached from it.

The OUT Group: This part of the Web was found to consist of about 20% of the Web. The main property of the OUT group is that pages in the group can be reached from the SCC but cannot reach the SCC.

Tendrils: This part of the Web was found to consist of about 20% of the Web. The main property of pages in this group is that they cannot be reached from the SCC and cannot reach the SCC. This does not imply that these pages have no linkages to pages outside the group, since they could well have links from the IN group and links to the OUT group.

The Disconnected Group: This part of the Web was found to be less than 10% of the Web and is essentially disconnected from the rest of the Web. These pages could include, for example, personal pages at many sites that link to no other page and have no links to them.
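To make the graph traversal ideas above concrete, the following is a minimal sketch (our own illustration, not from the text) of a breadth-first search over a directed graph stored as an adjacency list of out-links. The set of pages reachable from a starting page by following links is the notion used informally above in describing the SCC, IN and OUT groups; the page names and links below are hypothetical.

```python
from collections import deque

def reachable(graph, start):
    """Breadth-first search: return the set of pages reachable from start
    by following directed links (hyperlinks)."""
    visited = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for neighbour in graph.get(page, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return visited

# A tiny hypothetical Web graph given as an adjacency list of out-links.
web = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["D"],
    "D": [],
    "E": ["A"],   # E can reach the other pages but cannot be reached from them
}

print(reachable(web, "A"))   # {'A', 'B', 'C', 'D'}
print(reachable(web, "E"))   # {'A', 'B', 'C', 'D', 'E'}
```

A depth-first search would differ only in using a stack (or recursion) in place of the queue, going down before going across.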

Size of the Web

The deep Web includes information stored in searchable databases that is often inaccessible to search engines. This information can often only be accessed by using the interface of each Web site, and some of it may be available only to subscribers. The shallow Web (the indexable Web) is the information on the Web that a search engine can access without querying the Web databases.

In many cases use of the Web makes good sense. For example, it is better to put even short announcements in an enterprise on the Web rather than send them by email, since emails sit in many mail boxes wasting disk storage, while putting information on the Web can be more effective as well as help in maintaining a record of communications. If such uses grow, which appears likely, then a very large number of Web pages with a short life span and low connectivity to other pages are likely to be generated each day.

The large number of Web sites that disappear every day does create enormous problems on the Web. Links from well known sites do not always work. Not all results returned by a search engine are guaranteed to work. The URLs cited in scholarly publications also cannot be relied upon to still be available. A study of papers presented at the WWW conferences found that links cited in them had a decay rate that grew with the age of the papers. Abandoned sites therefore are a nuisance. To overcome these problems, it may become necessary to categorize Web pages. The following categorization is one possibility:

1. a Web page that is guaranteed not to change ever
2. a Web page that will not delete any content, may add content/links, but will not disappear
3. a Web page that may change content/links but will not disappear
4. a Web page without any guarantee

Web Metrics

There have been a number of studies that have tried to measure the Web, for example, its size and its structure. There are a number of other properties of the Web that are useful to measure.

Some of the measures are based on distances between the nodes of the Web graph. It is possible to define how well connected a node is by using the concept of the centrality of a node. Out-centrality is based on distances measured from the node to other nodes using the out-links, while in-centrality is based on distances measured from other nodes that are connected to the node using the in-links. Based on these metrics, it is possible to define the concept of compactness, which varies from 0 for a completely disconnected Web graph to 1 for a fully connected Web graph.

5.3 LOCALITY AND HIERARCHY IN THE WEB

A Web site of any enterprise usually has the homepage as the root of a tree, as in any hierarchical structure. For example, if one looks at a typical university Web site, the homepage will provide some basic information about the institution and then provide links, for example, to:

Prospective students
Staff
Research
Information for current students
Information for current staff

The Prospective students node will have a number of links, for example, to:

Courses offered
Admission requirements
Information for international students
Information for graduate students
Scholarships available
Semester dates

A similar structure would be expected for other nodes at this level of the tree. Many Web sites fetch information from a database to ensure that the information is accurate and timely. A recent study found that almost 40% of all URLs fetched information from a database.

It is possible to classify Web pages into several types:

1. Homepage or head page: These pages represent an entry point for the Web site of an enterprise, a section within the enterprise, or an individual's Web page.
2. Index page: These pages assist the user to navigate through the enterprise's Web site. A homepage in some cases may also act as an index page.
3. Reference page: These pages provide some basic information that is used by a number of other pages. For example, each page in a Web site may have a link to a page that provides the enterprise's privacy policy.
4. Content page: These pages only provide content and have little role in assisting a user's navigation. Often these pages are larger in size, have few out-links, and are well down the tree in a Web site. They are the leaf nodes of the tree.

A number of principles have been developed to help design the structure and content of a Web site. For example, three basic principles are:

1. Relevant linkage principle: It is assumed that links from a page point to other relevant resources. This is similar to the assumption made for citations in scholarly publications, where it is assumed that a publication cites only relevant publications. Links are often assumed to reflect the judgment of the page creator; by providing a link to another page, the creator is assumed to be making a recommendation for that other, relevant page.
2. Topical unity principle: It is assumed that Web pages that are co-cited (i.e. linked from the same pages) are related. Many Web mining algorithms make use of this assumption as a measure of mutual relevance between Web pages.
3. Lexical affinity principle: It is assumed that the text and the links within a page are relevant to each other. Once again, it is assumed that the text on a page has been chosen carefully by the creator to be related to a theme.

5.4 WEB CONTENT MINING

Web content mining deals with discovering useful information from the Web. Normally when we need to search for content on the Web, we use one of the search engines like Google or a subject directory like Yahoo! Some search engines find pages based on the location and frequency of keywords on the page, although some now use the concept of page rank.

One approach to discovering information from Web page contents is to extract relations, for example pairs of authors and book titles, starting from a small sample provided by the user. The algorithm proposed by Brin for this task is called Dual Iterative Pattern Relation Extraction (DIPRE). It works as follows:

1. Sample: Start with a sample S of tuples provided by the user.
2. Occurrences: Find occurrences on the Web of tuples, starting with those in S. Once tuples are found, the context of every occurrence is saved. Let this set of occurrences be O.
3. Patterns: Generate patterns based on the set of occurrences O. This requires generating patterns with similar contexts.
4. Match patterns: The Web is now searched for the patterns, and the new tuples found are added to the sample.
5. Stop if enough matches have been found, else go to step 2.

Web document clustering

Web document clustering is another approach to finding relevant documents on a topic or about query keywords. The popular search engines often return a huge, unmanageable list of documents which contain the keywords that the user specified. Finding the most useful documents in a large list is usually tedious, often impossible. The user could instead apply clustering to the set of documents returned by a search engine in response to a query, with the aim of finding semantically meaningful clusters rather than a list of ranked documents.

Among cluster analysis techniques, we discussed in particular the K-means method and the agglomerative method. These methods can be used for Web document cluster analysis as well, but they assume that each document has a fixed set of attributes that appear in all documents.

Similarity between documents can then be computed based on these values. One could, for example, take a set of words and their frequencies in each document and then use those values for clustering.

Suffix Tree Clustering (STC) is an approach that takes a different path and is designed specifically for Web document cluster analysis; it uses a phrase-based clustering approach rather than single word frequencies. In STC, the key requirements of a Web document clustering algorithm include the following:

1. Relevance: This is the most obvious requirement. We want clusters that are relevant to the user query and that group similar documents together.
2. Browsable summaries: The cluster must be easy to understand. The user should be able to quickly browse the description of a cluster and work out whether the cluster is relevant to the query.
3. Snippet tolerance: The clustering method should not require whole documents and should be able to produce relevant clusters based only on the information that the search engine returns.
4. Performance: The clustering method should be able to process the results of the search engine quickly and provide the resulting clusters to the user.

Finding similar Web pages

There is a proliferation of similar or identical documents on the Web. It has been found that almost 30% of all Web pages are very similar to other pages and about 22% are virtually identical to other pages. There are many reasons for identical pages. For example:

1. A local copy may have been made to enable faster access to the material.
2. FAQs on important topics are duplicated, since such pages may be used frequently locally.

3. Online documentation of popular software like Unix or LaTeX may be duplicated for local use.
4. There are mirror sites that copy highly accessed sites to reduce traffic (e.g. to reduce international traffic from India or Australia).

In some cases, documents are not exactly identical because different formatting might be used at different sites. There may be some customization or use of templates at different sites. A large document may be split into smaller documents, or a composite document may be joined together from smaller ones to build a single document. Copying a single Web page is often called replication; copying an entire Web site, on the other hand, is called mirroring.

The discussion here is focused on content-based similarity, which is based on comparing the textual content of Web pages. Web pages also have non-text content, but we will not consider it. We define two concepts:

1. Resemblance: The resemblance of two documents is defined to be a number between 0 and 1, with 1 indicating that the two documents are virtually identical and any value close to 1 indicating that the documents are very similar.
2. Containment: The containment of one document in another is also defined as a number between 0 and 1, with 1 indicating that the first document is completely contained in the second.

There are a number of ways by which the similarity of documents can be assessed. One brute force approach is to compare two documents using software like the tool diff available in the Unix operating system, which essentially compares the two documents as files. Other string comparison algorithms can be used to find how many characters need to be deleted, changed or added to transform one document into the other, but these approaches are very expensive if one wishes to compare millions of documents.

There are other issues that must be considered in document matching. Firstly, if we are looking to compare millions of documents then the storage requirement of the method should not be large. Secondly, documents may be in HTML, PDF, PostScript, FrameMaker, TeX, PageMaker or MS Word formats; they need to be converted to text for comparison, and the conversion can introduce some errors. Finally, the method should be robust, that is, it should not be possible to circumvent the matching process with modest changes to a document.

Fingerprinting

An approach for comparing a large number of documents is based on the idea of fingerprinting documents. A document may be divided into all possible substrings of length L. These substrings are called shingles. Based on the shingles one can define resemblance R(X,Y) and containment C(X,Y) between two documents X and Y as follows, where S(X) and S(Y) are the sets of shingles for documents X and Y respectively:

R(X,Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|
C(X,Y) = |S(X) ∩ S(Y)| / |S(X)|

The following algorithm may be used to find similar documents:

1. Collect all the documents that one wishes to compare.
2. Choose a suitable shingle width and compute the shingles for each document.
3. Compare the shingles for each pair of documents.
4. Identify those documents that are similar.

Full fingerprinting: The Web is very large and this algorithm requires enormous storage for the shingles and a very long processing time to finish the pairwise comparisons for, say, even 100 million documents. This approach is called full fingerprinting.
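A minimal sketch of the shingling idea is given below (our own illustration; the function names are not from the text). It builds word-level shingles and computes the resemblance and containment measures exactly as defined above; the two sample documents are the ones used in Example 5.1, which follows.

```python
def shingles(text, width=3):
    """Return the set of word-level shingles of the given width."""
    words = text.split()
    return {" ".join(words[i:i + width]) for i in range(len(words) - width + 1)}

def resemblance(x, y, width=3):
    """R(X,Y) = |S(X) & S(Y)| / |S(X) | S(Y)|"""
    sx, sy = shingles(x, width), shingles(y, width)
    return len(sx & sy) / len(sx | sy)

def containment(x, y, width=3):
    """C(X,Y) = |S(X) & S(Y)| / |S(X)|, i.e. how much of X is contained in Y."""
    sx, sy = shingles(x, width), shingles(y, width)
    return len(sx & sy) / len(sx)

doc1 = "the Web continues to grow at a fast rate"
doc2 = "the Web is growing at a fast rate"
print(resemblance(doc1, doc2))   # 2 common shingles out of 11 distinct ones
print(containment(doc2, doc1))   # 2 of document 2's 6 shingles occur in document 1
```

Full fingerprinting simply applies the resemblance computation to every pair of documents in the collection, which is why its storage and processing costs grow so quickly.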

Example 5.1

Consider a simple example in which we wish to find if two documents with the following content are similar:

Document 1: the Web continues to grow at a fast rate
Document 2: the Web is growing at a fast rate

The first step involves making a set of shingles from the content of each document. If we assume the shingle length to be three words, we obtain the shingles in the following table.

Table 5.1 Shingles of length 3 in both documents

Shingles in Document 1      Shingles in Document 2
the Web continues           the Web is
Web continues to            Web is growing
continues to grow           is growing at
to grow at                  growing at a
grow at a                   at a fast
at a fast                   a fast rate
a fast rate

Comparing the two sets of shingles we find that only two of them are identical. Thus, for this simple example, the documents are not very similar.

The number of letters in each shingle is shown in Table 5.2.

Table 5.2 Number of letters in the shingles

Shingles in Document 1    Number of letters    Shingles in Document 2    Number of letters
the Web continues         17                   the Web is                10
Web continues to          16                   Web is growing            14
continues to grow         17                   is growing at             13
to grow at                10                   growing at a              12
grow at a                 9                    at a fast                 9
at a fast                 9                    a fast rate               11
a fast rate               11

For comparison, we select the three shortest shingles. For the first document, these are "to grow at", "grow at a" and "at a fast". For the second document the shortest shingles are "the Web is", "at a fast" and "a fast rate". There is only one match among the three shingles, giving a resemblance ratio of 0.33. False positives against the original document would be obtained for documents like "the Australian economy is growing at a fast rate". False negatives would be obtained for strings like "the Web is growing at quite a rapid rate". It has been found that short shingles cause many false positives while long shingles result in more false negatives.

A number of issues were ignored in the example above but need to be addressed when comparing documents using fingerprinting:

1. How long should the shingle be for good performance?
2. Should shingle length be measured in words or in characters?
3. How much storage would we need to store the shingles?
4. Should upper-case and lower-case letters be treated differently?
5. Should spaces between words be removed?
6. Should punctuation marks be ignored?
7. Should end-of-line markers be ignored?
8. Should end-of-paragraph markers be ignored?
9. Should stop words like "to", "a" and "the" be removed?
10. Should stemming be carried out to transform "growing" into "grow"?

This is one approach to finding similar pages.

5.5 WEB USAGE MINING

Web usage mining deals with understanding user behavior by analyzing Web log data. Even a simple analysis of the logs of a Web site can provide information such as:

1. Number of hits: the number of times each page in the Web site has been viewed
2. Number of visitors: the number of users who came to the site

3. Visitor referring Web site: the URL of the Web site the user came from
4. Visitor exit Web site: the URL of the Web site the user went to when he/she left the Web site
5. Entry point: which page of the Web site the user entered from
6. Visitor time and duration: the time and day of the visit and how long the visitor browsed the site
7. Path analysis: a list of the path of pages that the user took
8. Visitor IP address: this helps in finding which part of the world the user came from
9. Browser type
10. Platform
11. Cookies

Even this simple information about a Web site and the pages within it can assist an enterprise to achieve the following:

1. Shorten the paths to high visit pages
2. Eliminate or combine low visit pages
3. Redesign some pages, including the homepage, to help user navigation
4. Redesign some pages so that the search engines can find them
5. Help evaluate the effectiveness of an advertising campaign

Web usage mining may also involve collecting much more information than has been listed above. For example, it may be desirable to collect information on:

1. Paths traversed: What paths do the customers traverse? What are the most commonly traversed paths through a Web site? These patterns need to be interpreted, analyzed, visualized and acted upon.
2. Conversion rates: What are the look-to-click, click-to-basket and basket-to-buy rates for each product? Are there significant differences in these rates for different products?
3. Impact of advertising: Which banners are pulling in the most traffic? What is their conversion rate?
4. Impact of promotions: Which promotions generate the most sales? Is there a particular level in the site where promotions are most effective?

5. Web site design: Which links do the customers click most frequently? Which links do they buy from most frequently? Are there some features of these links that can be identified?
6. Customer segmentation: What are the features of customers who abandon their trolley without buying? Where do the most profitable customers come from?
7. Enterprise search: Which customers use enterprise search? Are they more likely to purchase? What do they search for? How frequently does the search engine return a failed result? How frequently does the search engine return too many results?

Log data analysis has been investigated using the following techniques:

Using association rules
Using composite association rules
Using cluster analysis
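As a small illustration of the kind of processing involved (a sketch only, assuming an already-cleaned log reduced to hypothetical (user, page) pairs rather than a raw Web server log), the following counts hits per page and counts how often pairs of pages are viewed by the same user, which is the raw material for the association rule and cluster analysis techniques listed above.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical simplified log: (user, page) pairs extracted from a server log.
log = [
    ("u1", "/home"), ("u1", "/courses"), ("u1", "/admissions"),
    ("u2", "/home"), ("u2", "/courses"),
    ("u3", "/home"), ("u3", "/research"),
]

hits = Counter(page for _, page in log)           # number of hits per page
pages_by_user = defaultdict(set)
for user, page in log:
    pages_by_user[user].add(page)

pair_counts = Counter()                           # pages viewed by the same user
for pages in pages_by_user.values():
    for pair in combinations(sorted(pages), 2):
        pair_counts[pair] += 1

print(hits.most_common(3))
print(pair_counts.most_common(3))   # e.g. ('/courses', '/home') occurs for two users
```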

5.6 WEB STRUCTURE MINING

The aim of Web structure mining is to discover the link structure or the model that is assumed to underlie the Web. The model may be based on the topology of the hyperlinks. This can help in discovering similarity between sites, in discovering authority sites for a particular topic or discipline, or in discovering overview or survey sites that point to many authority sites (such sites are called hubs). Link structure is only one kind of information that may be used in analyzing the structure of the Web.

The HITS (Hyperlink-Induced Topic Search) algorithm has two major steps:

1. Sampling step: It collects a set of relevant Web pages given a topic.
2. Iterative step: It finds hubs and authorities using the information collected during sampling.

The HITS method uses the following algorithm.

Step 1: Sampling step

The first step involves finding a subset of nodes, or a subgraph S, which is rich in relevant authoritative pages. To obtain such a subgraph, the algorithm starts with a root set of, say, 200 pages selected from the result of searching for the query in a traditional search engine. Let the root set be R. Starting from the root set R, we wish to obtain a set S that has the following properties:

1. S is relatively small
2. S is rich in relevant pages given the query
3. S contains most (or many) of the strongest authorities

The HITS algorithm expands the root set R into a base set S as follows:

1. Let S = R.
2. For each page in S, do steps 3 to 5.
3. Let T be the set of all pages that S points to.
4. Let F be the set of all pages that point to S.
5. Let S = S + T + some or all of F (some if F is large).
6. Delete all links between pages with the same domain name.
7. This S is returned.

Step 2: Finding hubs and authorities

The algorithm for finding hubs and authorities works as follows:

1. Let a page p have a non-negative authority weight xp and a non-negative hub weight yp. Pages with relatively large weights xp will be classified as the authorities (similarly for the hubs, with large weights yp).
2. The weights are normalized so that their squared sum for each type of weight is 1, since only the relative weights are important.
3. For a page p, the value of xp is updated to be the sum of yq over all pages q that link to p.
4. For a page p, the value of yp is updated to be the sum of xq over all pages q that p links to.

5. Continue with step 2 unless a termination condition has been reached.
6. On termination, the output of the algorithm is a set of pages with the largest xp weights, which can be assumed to be the authorities, and those with the largest yp weights, which can be assumed to be the hubs.

Kleinberg provides examples of how the HITS algorithm works, and it is shown to perform well.
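The iterative step can be written compactly as repeated weight updates followed by normalization. The sketch below is our own illustration, not Kleinberg's code, and the small graph of out-links is hypothetical: authority weights xp are updated from the hub weights of the pages that link to p, hub weights yp from the authority weights of the pages p links to, and both are normalized so that their squared sums are 1.

```python
import math

def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to (its out-links)."""
    pages = list(graph)
    auth = {p: 1.0 for p in pages}   # x_p
    hub = {p: 1.0 for p in pages}    # y_p
    for _ in range(iterations):
        # Authority update: sum of hub weights of pages that point to p.
        new_auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        # Hub update: sum of the updated authority weights of pages p points to.
        new_hub = {p: sum(new_auth[q] for q in graph[p]) for p in pages}
        # Normalize so that the squared weights of each type sum to 1.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

# Hypothetical base set: page H links to two content pages and acts as a hub.
graph = {"H": ["P1", "P2"], "P1": [], "P2": [], "Q": ["P1"]}
authorities, hubs = hits(graph)
print(sorted(authorities, key=authorities.get, reverse=True))  # P1 is the top authority
print(sorted(hubs, key=hubs.get, reverse=True))                # H is the top hub
```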

Theorem: The sequences of weights xp and yp converge.

Proof: Let G = (V, E). The graph can be represented by an adjacency matrix A in which element (i, j) is 1 if there is an edge from vertex i to vertex j, and 0 otherwise. The weight updates can then be written as the operations x = A^T y and y = Ax. Therefore x = A^T A x and, similarly, y = A A^T y. The iterations therefore converge to the principal eigenvectors of A^T A and A A^T respectively.

Problems with the HITS Algorithm

There has been much research on evaluating the HITS algorithm, and it has been shown that while the algorithm works well for most queries, it does not work well for some others. There are a number of reasons for this:

1. Hubs and authorities: A clear-cut distinction between hubs and authorities may not be appropriate, since many sites are hubs as well as authorities.
2. Topic drift: Certain clusters of tightly connected documents, perhaps due to mutually reinforcing relationships between hosts, can dominate the HITS computation. These documents in some instances may not be the most relevant to the query that was posed. It has been reported that in one case when the search term was "jaguar" the HITS algorithm converged to a football team called the Jaguars. Other examples of topic drift have been found on topics like gun control, abortion, and movies.
3. Automatically generated links: Some links are computer generated and represent no human judgement, but HITS still gives them equal importance.

4. Non-relevant documents: Some queries can return non-relevant documents among the highly ranked documents, and this can lead to erroneous results from the HITS algorithm.
5. Efficiency: The real-time performance of the algorithm is not good, given the steps that involve finding the sites that point to, and are pointed to by, pages in the root set.

A number of proposals have been made for modifying HITS. These include:

More careful selection of the base set, which will reduce the possibility of topic drift.
Modifying the HITS algorithm so that the hub and authority weights are updated based only on the best hubs and the best authorities.
Giving more weight to the in-link information than to the out-link information, since a hub can become important simply by pointing to a lot of authorities.

Web Communities

A Web community is generated by a group of individuals who share a common interest. It manifests on the Web as a collection of Web pages with the common interest as the theme. These could, for example, be communities about a sub-discipline, a religious group, a sport or a sports team, a hobby, an event, a country or a state. The communities include newsgroups and portals, and the large ones may include directories in sites like Yahoo! The HITS algorithm finds authorities and hubs for a specified broad topic; the idea of cyber communities is to find all such Web communities, not just those on broad topics.

5.7 WEB MINING SOFTWARE

Many general purpose data mining software packages include Web mining modules; for example, Clementine from SPSS includes Web mining modules. The following list includes a variety of Web mining software.

123LogAnalyzer, from a company with the same name, is low-cost Web mining software that provides an overview of a Web site's performance and statistics about Web server activity, including the number of visitors, the number of unique IPs, Web pages viewed, files downloaded, directories that were accessed, the number of hits broken down by day of the week and hour of the day, and images that were accessed.

Analog claims to be ultra-fast, scalable, highly configurable, to work on any operating system, and is free.

Azure Web Log Analyzer from Azure Desktop claims to find the usual Web information, including the most popular pages, the number of visitors and where they are from, and what browser and computer they used.

Click Tracks, from a company with the same name, is Web mining software offering a number of modules, including Analyzer, Optimizer and Pro, that use log files to provide Web site analysis. It allows desktop data mining.

Datanautics G2 and Insight 5 from Datanautics are Web mining software for data collection, processing, analysis and reporting.

LiveStats.NET and LiveStats.BIZ from DeepMetrix provide Web site analysis, data visualization and statistics on distinct visitors, repeat visits, popular entry and exit pages, time spent on pages, geographic reports which break down visits by country and continent, click paths, keywords by search engine and more.

NetTracker Web analytics from Sane Solutions claims to analyze log files (from Web servers, proxy servers and firewalls), data gathered by JavaScript page tags, or a hybrid of both.

Nihuo Web Log Analyzer from LogAnalyser provides reports on how many visitors came to the Web site, where they came from, which pages they viewed and how long they spent on the site.

WebAnalyst from Megaputer is based on the PolyAnalyst text mining software.

WebLog Expert 3.5, from a company with the same name, produces reports that include the following information: activity statistics, accessed files, paths through the site, information about referring pages, search engines, browsers, operating systems and more.

WebTrends 7 from NetIQ is a collection of modules that provide a variety of Web data including navigation analysis, customer segmentation and more.

WUM (Web Utilization Miner) is an open source project. WUMprep is a collection of Perl scripts for data preprocessing tasks such as sessionizing, robot detection and mapping of URLs onto concepts. WUM is integrated Java-based Web mining software for log file preparation, basic reporting, discovery of sequential patterns and visualization.

CONCLUSION

The World Wide Web has become an extremely valuable resource for a large number of people all around the world. During the last decade, the Web revolution has had a profound impact on the way we search for and find information at home and at work. Although information resources like libraries have been available to the public for a long time, the Web provides instantaneous access to a huge variety of information. From its beginning in the early 1990s the Web has grown to perhaps more than eight billion Web pages, which are accessed all over the world every day. Millions of Web pages are added every day and millions of others are modified or deleted. The Web is an open medium with no controls on who puts up what kind of material. This openness has meant that the Web has grown exponentially, which is both its strength and its weakness. The strength is that one can find information on just about any topic. The weakness is the problem of abundance of information.

REVIEW QUESTIONS

1. Define the three types of Web mining. What are their major differences?
2. Define the following terms: (a) browser (b) uniform resource locator (c) domain name server (d) cookie
3. Describe three major differences between conventional textual documents and Web documents.
4. What is Lotka's Inverse-Square Law regarding scholarly publications? What relation does it have to the power-law distributions of in-links and out-links of Web pages?
5. Describe the bow-tie structure of the Web. What percentage of pages form the Strongly Connected Core? What is the main property of the Core?
6. What is the difference between the deep and shallow Web? Why do we need the deep Web?
7. How can clustering be used in Web usage mining?
8. What is the basis of Kleinberg's HITS algorithm?
9. Provide a step-by-step description of Kleinberg's HITS algorithm for finding authorities and hubs for the topic "data mining".
10. Discuss the advantages and disadvantages of the HITS algorithm.
11. What is a Web community? How do you discover Web communities?
12. Use the HITS algorithm to find hubs and authorities from the following five Web pages: Page A (out-links to B, C, D), Page B (out-links to A, C, D), Page C (out-links to D), Page D (out-links to C, E), Page E (out-links to B, C, D).
13. What are the major differences between classical information retrieval and Web search?

CHAPTER 6

SEARCH ENGINES
Learning Objectives

1. Understand the role of search engines in Web content mining.
2. Explain the characteristics of Web search.
3. Describe typical search engine architecture.
4. Describe a page ranking algorithm.
5. Discuss the characteristics of widely used search engines.

6.1 INTRODUCTION

The Web is a very large collection of documents, perhaps more than four billion in mid-2004, with no catalogue. Search engines, directories, portals and indexes are the Web's catalogues, allowing a user to carry out the task of searching the Web for the information that he or she requires. Google accounted for about 40% of search engine visits, which makes it the largest search engine in the global market, Yahoo! for about 25% of the search engine visits and msn.com for about 15%. On the other hand, 60% of portal visits were to Yahoo!, which clearly dominates, with msn.com accounting for about 25% of the portal visits.

Web search is very different from a normal information retrieval search of printed or text documents because of the following factors.

Bulk: The Web is very large, much larger than any set of documents used in information retrieval applications.

Diversity: The Web is very diverse. It consists of text, images, movies, audio, animation and other multimedia content. The authors can be from anywhere, with any background. The pages are built using a variety of software and are stored on a variety of machines all over the world.

Growth: The Web continues to grow in size exponentially. The number of new pages added to the Web each day is estimated to be between one million and seven million.

Dynamics: The Web changes significantly with time. One estimate suggests that about one third of the Web changes each year, while another estimate suggests that 40% of the dot com pages change every day. The link structure of the Web itself changes quickly as new links are established and some current ones removed. In another study, 50% of pages were found to disappear within 50 days.

Demanding users: Users are very impatient. They demand immediate results, otherwise they abandon the search and move on to something else. They rarely read more than one or two screens of results returned by a search engine, even when the query they have posed is short, ambiguous and not carefully thought out. More and more users are looking to search engines for all kinds of information, including the latest news.

Duplication: It has been estimated that as much as 30% of the content on the Web is duplicated.

Hyperlinks: A normal text document contains citations, but Web documents contain hypertext links to other Web documents. It has been estimated that about half of the traffic on the Web is users navigating using links.

Index pages: Many search results return index pages from various sites, providing little content but many links.

Queries: Search engine queries are generally short (1 to 3 words) and quite ambiguous due to synonyms and homonyms.

6.2 CHARACTERISTICS OF SEARCH ENGINES

In a library, every book is indexed. A more automated system is needed for the Web, given its volume of information. Web search engines follow two approaches, although the line between the two appears to be blurring: they either build directories (e.g. Yahoo!) or they build full-text indexes (e.g. Google) to allow searches.

There are also some meta-search engines that do not build and maintain their own databases but instead search the databases of other search engines to find the information the user is searching for. Search engines do not limit themselves to search mechanisms alone; they often have subject directories available as well.

In contrast to search engines, directories like Yahoo! are not compiled using Web crawlers; generally they depend on humans for compiling entries. Often people submit details of their Web site to a directory, or pay to put the Web site on the directory, but given that entries do not depend on Web crawlers for updating, a Web site could change while the directory entry does not.

For a business to use a search engine to maximum advantage, a good Web site and appropriate advertising are essential. Advertising on a search engine requires careful analysis of options. Advertising is driven by keywords, that is, the advertisement will appear when a particular keyword, which has been bought by the company, is in the user query.

Search engines are huge databases of Web pages, as well as software packages for indexing and retrieving the pages, that enable users to find information of interest to them.

Problems with Searching Using a Search Engine

Querying a search engine involves the user specifying a (small) number of keywords that are used by the search engine to search its indexes and find relevant documents. One of the major problems with this approach is that specifying query keywords can be quite a tricky business. For example, if I wish to find what the search engines try to do, I could use many different keyword combinations, including the following:

Objectives of search engines
Goals of search engines
What search engines try to do

Other problems can arise because of the nature of the English language, in which the same word can have more than one meaning (homonyms). For example, the following words have more than one meaning, and the use of such words in a query is likely to result in a wide variety of documents:

Gates
Windows
Bush
Current
Screen
Slight
Muffler
Bore

Wide variety of users

The Web is searched for a wide variety of information. Although the use is changing with time, the following topics were found to be among the most common in 1999:

About computers
About business
Individual or family information including public figures
Related to education
About medical issues
About entertainment
About politics and government
Shopping and product information
About hobbies
Searching for images
News
About travel and holidays
About finding jobs

The most common categories of queries, with their percentages and major sub-categories, were:

Category     Percentage    Major sub-category         Percentage
Arts         14.6          Arts: Music                6.1
Computers    13.8          Computers: Software        3.4
                           Computers: Internet        3.2
Regional     10.3          Regional: North America    5.3
                           Regional: Europe           1.8
Society      8.7
Adult        8.0           Adult: Image Galleries     4.4
Recreation   7.3
Business     7.2           Business: Industries       2.3

Queries are also typically very short. The distribution of query lengths was reported as:

0 term queries     21%
1 term queries     26%
2 term queries     26%
3 term queries     15%
4+ term queries    12%

The goals of Web search

There has been considerable research into the nature of search engine queries. A recent study deals with the information needs of the user making a query. It has been suggested that the information needs of the user may be divided into three classes:

1. Navigational: The primary information need in these queries is to reach a Web site that the user has in mind. The user may know that such a site exists (for example, a user knows that there is a Web site for Air India) or may have visited the site earlier but does not know the site URL.
2. Informational: The primary information need in such queries is to find a Web site that provides useful information about a topic of interest. The user does not have a particular Web site in mind. For example, a user may wish to find out the difference between an ordinary digital camera and a digital SLR camera, or a user may want to investigate IT career prospects in Kolkata.
3. Transactional: The primary need in such queries is to perform some kind of transaction. The user may or may not know the target Web site. An example is buying a book; the user may wish to go to amazon.com or to a local Web site if there is one. Another user may be interested in downloading some images or some other multimedia content from a Web site.

The study estimated the percentage of queries in each of these categories to be in the following ranges:

Navigational queries     20-25%
Informational queries    40-50%
Transactional queries    30-35%

The Quality of Search Results

The results from a search engine should ideally satisfy the following quality requirements:

1. Precision: Only relevant documents should be returned.
2. Recall: All the relevant documents should be returned.
3. Ranking: A ranking of documents providing some indication of their relative relevance should be returned.
4. First screen: The first page of results should include the most relevant results.
5. Speed: Results should be provided quickly, since users have little patience.

Precision deals with questions like "what percentage of the retrieved documents are relevant?" and "what rankings are the most relevant documents being given?". Recall deals with questions like "what percentage of the relevant documents are returned by the search engine?". The requirement of a high level of recall is generally in conflict with precision and correct ranking. If we wish to improve recall, precision is always going to go down unless some user feedback mechanism is used, which adds complexity and cost to the search process. In the information retrieval field, it is known that precision and recall usually have a trade-off relationship that is often shown by a precision-recall graph. It is well known that in this trade-off relationship:

1. Precision decreases faster than recall increases
2. It is much more difficult to improve precision than to improve recall

Definition: Precision and Recall

Precision is the proportion of retrieved items that are relevant:

Precision = number of relevant items retrieved / total number of items retrieved
          = |Retrieved ∩ Relevant| / |Retrieved|

Recall is the proportion of relevant items that are retrieved:

Recall = number of relevant items retrieved / total number of relevant items
       = |Retrieved ∩ Relevant| / |Relevant|
Figure 6.1 Recall and precision in information retrieval (the figure shows the set of retrieved items and the set of relevant items as overlapping regions; the overlap is the set of relevant items retrieved)

Although recall and precision are appropriate concepts for evaluating an information retrieval system, the major problem with using these two metrics for Web search is that recall is impossible to measure. Furthermore, the recall and precision criteria require that each item be labeled either relevant or irrelevant, and often there are documents that cannot be put in either of these two categories. Finally, it should be noted that the quality of users' query formulations and of the search results obtained does not appear to be high. As many as 80% of users have tried another query after viewing the results of a prior query posed by them.
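To make the definitions concrete, here is a small sketch (our own illustration, with made-up document identifiers) that computes precision and recall for a single query given the set of documents retrieved and the set of documents known to be relevant.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    relevant_retrieved = retrieved & relevant
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4", "d5"}   # what the engine returned
relevant = {"d2", "d4", "d6", "d7"}          # what was actually relevant

p, r = precision_recall(retrieved, relevant)
print(p, r)   # precision 0.4 (2 of 5 retrieved are relevant), recall 0.5 (2 of 4 relevant retrieved)
```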

6.3 SEARCH ENGINE FUNCTIONALITY

A search engine is a rather complex collection of software modules. It carries out a variety of tasks, including the following:

1. Collecting information: A search engine would normally collect Web pages, or information about them, by Web crawling or by human submission of pages.
2. Evaluating and categorizing information: In some cases, for example when Web pages are submitted to a directory, it may be necessary to evaluate and decide whether a submitted page should be selected. In addition, it may be necessary to categorize information based on some ontology used by the search engine.
3. Creating a database and creating indexes: The information collected needs to be stored either in a database or in some kind of file system. Indexes must be created so that the information may be searched efficiently.
4. Computing ranks of the Web documents: A variety of methods are used to determine the rank of each page retrieved in response to a user query. The information used may include the frequency of keywords, the value of in-links and out-links of the page and the frequency of use of the page. In some cases human judgment may be used.
5. Checking queries and executing them: Queries posed by users need to be checked, for example, for spelling errors and whether the words in the query are recognizable. Once checked, a query is executed by searching the search engine database.
6. Presenting results: How the search engine presents the results to the user is important. The search engine must determine what results to present and how to display them.
7. Profiling the users: To improve search performance, search engines carry out user profiling, which deals with the way users use the search engines.

6.4 SEARCH ENGINE ARCHITECTURE No two search engines are exactly the same in terms of size, indexing techniques, page ranking algorithms, or speed of search. It has been found that if the same query is posed to two search engines, the results are often different although both results usually contain some common items.

A typical search engine architecture, as shown in Figure 6.2, consists of many components, including the following three major components which carry out its main functions:

1. The crawler and the indexer: they collect pages from the Web, and create and maintain the index.
2. The user interface: it allows a user to submit queries and enables result presentation.
3. The database and query server: it stores information about the Web pages, processes queries and returns results.

All search engines include a crawler, an indexer and a query server.

Figure 6.2 Architecture of a search engine (the figure shows the Web and users' queries feeding into components for Web crawling, page submission, page evaluation, query checking, strategy selection, user profiling, representation and indexing, query execution, result presentation and history maintenance)

The crawler

The crawler is an application program that carries out a task similar to graph traversal. It is given a set of starting URLs that it uses to automatically traverse the Web, retrieving pages, initially from the starting set. Some search engines use a number of distributed crawlers. Crawlers tend to return to each site on a regular basis, say once in two months, to look for changes. Often search engines have a list of sites that change frequently, for example newspaper sites, and visit them much more frequently, perhaps every few hours.

Since a crawler has a simple task, it is not CPU-bound; it is bandwidth-bound, and in crawling, bandwidth can become a bottleneck. A Web crawler must take into account the load (bandwidth, storage) on the search engine machines and also on the machines being traversed in guiding its traversal. For example, when a crawler is being run, the resources used by it should not be so high that it disrupts the use of the Web by other users. Each page found by the crawler is not stored as a separate file, otherwise four billion pages would require managing four billion files; usually lots of pages are stuffed into one file.

Crawlers follow an algorithm like the following:

1. Find base URLs: a set of known and working hyperlinks is collected.
2. Build a queue: put the base URLs in the queue and add new URLs to the queue as more are discovered.
3. Retrieve the next page: retrieve the next page in the queue, process it and store it in the search engine database.
4. Add to the queue: check whether the out-links of the current page have already been processed, and add the unprocessed out-links to the queue of URLs.
5. Continue the process until some stopping criteria are met.

The indexer

Building an index requires document analysis and term extraction. Term extraction may involve extracting all the words from each page, elimination of stop words and stemming, to help build, for example, the inverted file structure. It may also involve analysis of hyperlinks.

In automatic indexing of Web documents, many parts of the documents are difficult to use for indexing; this applies, for example, to multimedia content. Keywords also need to be extracted from a wide variety of document formats, including PDF, PostScript and Microsoft Word. A good indexing approach is to create an index that relates keywords to pages. Clearly every page has many keywords, and an index should allow efficient access to the pages based on any of these keywords. An inverted index, rather than listing keywords for each page, lists pages for each keyword: the roles of keywords and pages have been inverted.
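A minimal sketch of building such an inverted index is shown below (an illustration only; a real indexer would also deal with stop words, stemming, markup removal and ranking information, and the page URLs and text here are hypothetical). Each keyword is mapped to the set of pages that contain it, so pages matching a query term can be found without scanning every page.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages maps a URL to its extracted text; returns keyword -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical pages with their extracted text.
pages = {
    "example.org/a": "web mining uses data mining techniques",
    "example.org/b": "search engines index the web",
}
index = build_inverted_index(pages)
print(index["web"])     # both pages contain the word 'web'
print(index["mining"])  # only the first page
```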

6.5 RANKING OF WEB PAGES

The Web consists of a huge number of documents that have been published without any quality control. It is therefore important to devise methods for determining the relative importance and quality of pages, given a topic of interest or a query. Search engines differ significantly in size, and so the number of documents that they index may be quite different. Also, no two search engines will have exactly the same pages on a given topic, even if they are of similar size. Various search engines follow different algorithms for ranking pages, although they may still return the same pages at the top of their lists. Even if search engines return good documents at the top of the rankings, it is not uncommon for irrelevant documents to make it into the high rankings and appear in the first page or two of results. When ranking pages, some search engines give importance to the location and frequency of keywords, and some may consider the meta-tags. Some search engines may give more importance to pages that have the required keywords close to the top of the page or in the title.

Page Rank Algorithm

Google has the most well-known ranking algorithm, called the Page Rank algorithm, which has been claimed to supply top-ranking pages that are relevant. The Page Rank algorithm is based on using the hyperlinks as indicators of a page's importance. It is almost like vote counting in an election: every unique page is assigned a Page Rank, and if a lot of pages vote for a page by linking to it, then the page that is being pointed to will be considered important.

Page Rank was originally developed based on a probability model of a random surfer visiting a Web page. The probability of a random surfer clicking on a link may be estimated based on the number of links on that page: the probability of the surfer clicking any particular link is assumed to be inversely proportional to the number of links on the page. The more links a page has, the less chance that the surfer will click on a particular link. For example, if we assume that the Web has a total of 100 pages and a page has a Page Rank value of 2, this is supposed to indicate that the random surfer would reach that page on average twice if he/she surfed the Web separately 100 times. Therefore the probability of the random surfer reaching a page is the sum of the probabilities of the random surfer following links to that page.

The probability model also introduces a damping factor to take into account that the random surfer sometimes does not click on any link and jumps to another page at random. If the damping factor is d (assumed to be between 0 and 1), then the probability of the surfer jumping off to some other page is assumed to be 1-d. The higher the value of d, the more likely the surfer is to follow one of the links. Given that the surfer has a 1-d probability of jumping to some random page, every page has a minimum Page Rank of 1-d.

The algorithm essentially works as follows. Let A be the page whose Page Rank PR(A) is required. Let page A be pointed to by pages T1, T2, etc., and let C(T1) be the number of links going out of page T1. The Page Rank of A is then given by:

PR(A) = (1-d) + d (PR(T1)/C(T1) + PR(T2)/C(T2) + ...)

The constant d is the damping factor. The Page Rank of a page is therefore essentially a count of its votes. The strength of a vote depends on the Page Rank of the voting page and the number of out-links from the voting page. If a page has many out-links, each link has quite a low weight when compared to the weight of a link from a page of similar Page Rank that has only one or two out-links. To compute the Page Rank of a page A we therefore need to know the Page Rank of each page that points to it (i.e. votes for it) and the number of out-links from each of those pages. Let us consider a simple example followed by a more complex one.
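The formula can be applied iteratively, starting from arbitrary Page Ranks, until the values settle. The sketch below is our own illustration, not Google's implementation; applied to the three pages of Example 6.1 below with d = 0.8, it converges to approximately 1.19, 1.15 and 0.66, the values obtained by solving the equations directly.

```python
def page_rank(out_links, d=0.8, iterations=100):
    """out_links maps each page to the list of pages it links to."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}                 # arbitrary starting Page Ranks
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum PR(T)/C(T) over every page T that links to p.
            votes = sum(pr[t] / len(out_links[t])
                        for t in pages if p in out_links[t])
            new_pr[p] = (1 - d) + d * votes
        pr = new_pr
    return pr

# The three pages of Example 6.1: A links to B, B links to A and C, C links to A.
print(page_rank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))
# approximately {'A': 1.19, 'B': 1.15, 'C': 0.66}
```

Note that, as remarked in Example 6.1, the Page Ranks computed this way sum to the number of pages (3.0 here).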

Example 6.1 A Simple Three Page Example

Let us consider a simple example of only three pages. We are given the following information:

1. The damping factor d is 0.8.
2. Page A has an out-link to B.
3. Page B has an out-link to A and another to C.
4. Page C has an out-link to A.
5. The starting Page Rank for each page is 1.

We may show the links between the pages as in Figure 6.3.

Figure 6.3 The three Web pages in Example 6.1
The Page Rank equations may be written as:
PR(A) = 0.2 + 0.4 PR(B) + 0.8 PR(C)
PR(B) = 0.2 + 0.8 PR(A)
PR(C) = 0.2 + 0.4 PR(B)
Since these are three linear equations in three unknowns, we may solve them. Replacing PR(A) by a, PR(B) by b and PR(C) by c, we first write them as follows:
a - 0.4b - 0.8c = 0.2
b - 0.8a = 0.2
c - 0.4b = 0.2
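As a quick check, this small system can also be solved numerically. A minimal sketch using NumPy follows; the matrix layout (one row of coefficients per equation) is our own.

import numpy as np

# coefficients of a, b and c in the three equations above, one row per equation
coeffs = np.array([[ 1.0, -0.4, -0.8],
                   [-0.8,  1.0,  0.0],
                   [ 0.0, -0.4,  1.0]])
rhs = np.array([0.2, 0.2, 0.2])

a, b, c = np.linalg.solve(coeffs, rhs)
print(round(a, 2), round(b, 2), round(c, 2))   # 1.19 1.15 0.66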

The solution of the above equations is given by:
a = PR(A) = 1.19
b = PR(B) = 1.15
c = PR(C) = 0.66
Note that the total of the three Page Ranks is 3.0.
Example 6.2 A Five-Page Example
Consider five pages A, B, C, D and E whose Page Ranks have been arbitrarily assigned and whose out-links are known. As we compute Page Ranks, we will find that the initial Page Ranks do not have much influence on the final Page Ranks. Starting from the initial Page Rank allocations, we recompute the Page Ranks at each iteration. We are given the following information about the pages:
1. Page A has a Page Rank of 1 and has one link to B.
2. Page B has a Page Rank of 2 and has two links to C and D.

3. Page C has a Page Rank of 3 and has three links to B, D and E.
4. Page D has a Page Rank of 2 and has four links to A, B, C and E.
5. Page E has a Page Rank of 3 and has three links to B, C and D.
The links are displayed in Figure 6.4 below. These pages would normally be connected to the rest of the Web, but to simplify our calculation of Page Ranks we proceed as if they form an island.
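The in-links needed by the Page Rank formula can be derived mechanically from the out-links. A small sketch follows, assuming the graph is stored as a Python dictionary of out-links (a representation of our own choosing); its output matches Table 6.3 below.

out_links = {
    'A': ['B'],
    'B': ['C', 'D'],
    'C': ['B', 'D', 'E'],
    'D': ['A', 'B', 'C', 'E'],
    'E': ['B', 'C', 'D'],
}

# invert the out-link lists to obtain, for every page, the pages that point to it
in_links = {p: [] for p in out_links}
for source, targets in out_links.items():
    for target in targets:
        in_links[target].append(source)

for page in sorted(in_links):
    print(page, ':', ', '.join(sorted(in_links[page])))
# A : D
# B : A, C, D, E
# C : B, D, E
# D : B, C, E
# E : C, D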

Figure 6.4 Links between the five pages in Example 6.2
Given the out-links above, we first make a list of the in-links. The list is shown in Table 6.3.
Table 6.3 In-links of the pages in Example 6.2
--------------------------------------------------------
Page name      In-links
--------------------------------------------------------
A              D
B              A, C, D and E
C              B, D and E
D              B, C and E
E              C and D
--------------------------------------------------------

We will assume a damping factor d of 0.9, which essentially ensures that in most cases the Page Rank is slightly damped, to allow for the chance that at any page the random surfer does not click on a link but goes off to some random page. One result of this factor is that if a page is two links away from a very highly ranked page, the highly ranked page cannot pass all of its Page Rank to the page that is two links away. We now find d * PR(X)/C(X), which is what a page X can pass on to each page that X points to:
1. Page A has one out-link, to B; its Page Rank is 1 and therefore PR(A)/C(A) is 1. Since d = 0.9, A's vote is worth 0.9 and it passes on 0.9 to page B.
2. Page B has two out-links, to C and D; its Page Rank is 2 and therefore PR(B)/C(B) is 1. B therefore also passes on 0.9 to each of C and D.
3. Page C has three out-links, to B, D and E; its Page Rank is 3 and therefore PR(C)/C(C) is again 1. C passes on 0.9 to each of B, D and E.
4. Page D has four out-links, to A, B, C and E; its Page Rank is 2, and therefore PR(D)/C(D) is 0.5. D passes on 0.45 to each of A, B, C and E.
5. Page E has three out-links, to B, C and D; its Page Rank is 3, and therefore PR(E)/C(E) is 1. E passes on 0.9 to each of B, C and D.
Now we can carry out an iteration and revise the Page Ranks:
PR(A) = (1-d) + d(PR(D)/C(D)) = 0.1 + 0.45 = 0.55
PR(B) = (1-d) + d(PR(A)/C(A) + PR(C)/C(C) + PR(D)/C(D) + PR(E)/C(E)) = 0.1 + 0.9 + 0.9 + 0.45 + 0.9 = 3.25
PR(C) = (1-d) + d(PR(B)/C(B) + PR(D)/C(D) + PR(E)/C(E)) = 0.1 + 0.9 + 0.45 + 0.9 = 2.35

PR(D) = (1-d) + d(PR(B)/C(B) + PR(C)/C(C) + PR(E)/C(E)) = 0.1 + 0.9 + 0.9 + 0.9 = 2.80
PR(E) = (1-d) + d(PR(C)/C(C) + PR(D)/C(D)) = 0.1 + 0.9 + 0.45 = 1.45
Although more iterations are required, on the basis of the first iteration we can already say that page B is the most important of the five pages and page A is the least important. Pages C and E had higher Page Ranks when we started.
6.6 THE SEARCH ENGINE INDUSTRY
In this section we briefly discuss the search engine market: its recent past, its present and what might happen in the future.
Recent Past: Consolidation of Search Engines
The search engine market, which is only about 12 years old in 2006, has undergone significant changes in the last few years as consolidation has taken place in the industry, as described below. Some well-known search engines of the recent past have been acquired by other companies or have essentially gone out of business, leaving only a Web site that is now powered by one of the major search engines.
1. AltaVista - Yahoo! has acquired this general-purpose search engine, which was perhaps the most popular search engine some years ago. Its Web site www.altavista.com now displays Yahoo! search results.
2. Ask Jeeves - a well-known search engine now owned by IAC. www.ask.com
3. DogPile - a meta-search engine that still exists but is not widely used. www.dogpile.com
4. Excite - a popular search engine only 5-6 years ago, but the business has essentially folded. www.excite.com
5. HotBot - now uses results from Google or Ask Jeeves, so the business has folded. www.hotbot.com
6. InfoSeek - a general-purpose search engine, widely used 5-6 years ago, now uses search powered by Google. http://infoseek.go.com

7. Lycos - another general-purpose search engine of the recent past, also widely used, now uses results from Ask Jeeves, so the business has essentially folded. www.lycos.com
Features of Google
Google is the common spelling for googol, or 10^100. The aim of Google is to build a very large scale search engine. Google can justifiably boast of being the premier search engine currently available. The total number of pages searched includes duplicate pages and pages that were crawled some time ago but no longer exist. Since Google is supported by a team of top researchers, it has been able to keep one step ahead of its competition. Google aims to update pages on a monthly basis while maintaining the largest database of Web pages, as shown below:
Indexed Web pages        about 4 billion (early 2004)
Unindexed pages          about 0.5 billion
Pages refreshed daily    about 3 million

Some of the features of the Google search engine are:
1. It has the largest number of pages indexed, and it indexes the text content of all these pages.
2. It has been reported that Google refreshes more than 3 million pages daily. It has techniques for keeping track of pages that consistently show changes over time, for example news-related pages. It has been reported that it refreshes news every 5 minutes and that a large number of news sites are refreshed.
3. It uses AND as the default between the keywords that the user specifies, that is, it searches for documents that contain all the keywords. It is possible to use OR as well.
4. It provides a variety of features in addition to the basic search engine features:
a. A calculator is available by simply putting an expression in the search box.
b. Definitions are available by entering the word define followed by the word whose definition is required.

c. Title and URL searches are possible using intitle:word and inurl:word (a few sample queries using these operators are shown after this list).
d. Advanced search allows the user to search for recent documents only.
e. Google not only searches HTML documents but also PDF, Microsoft Office, PostScript, Corel WordPerfect and Lotus 1-2-3 documents. The documents may be converted into HTML, if required.
f. It provides a facility that takes the user directly to the first Web page returned by the query.
g. It provides a news service for query keywords if they match news items.
h. There is a facility to search for US patents.
i. Like other search engines, Google provides a facility to find similar pages by searching for pages that are related to a page that has already been found.
j. Google provides a facility to restrict a search to a particular site.
k. Usually truncation is used unless the word is in quotes as a phrase.
l. A date range may be specified in advanced search, for example the past three months, the past six months or the past year.
m. It provides spell checking that looks at the spellings of the query keywords. Alternative keywords are suggested if the alternatives have many more search results.
n. Google also provides facilities to search for a variety of financial information, including stock and mutual fund data.
o. It provides US street maps, weather information and the status of US flights.
p. It provides translation into English of non-English Web pages written in a number of European languages.
q. Information about who links to a Web page may be obtained.
5. Google also provides special searches for the following:
US Government
Linux
BSD
Microsoft

Apple
Scholarly publications
Universities
Catalogues and directories
Froogle for shopping
Desktop search
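To illustrate some of the operators above, the following sketch prints a few sample queries as search URLs. The keywords and the site name are invented for the example, and the only assumption about the URL format is the standard q query parameter.

import urllib.parse

# sample queries combining operators described above (keywords are illustrative)
queries = [
    'define serendipity',               # definition of a word
    'intitle:mining inurl:lecture',     # keywords required in the title and in the URL
    '"web mining" site:example.edu',    # exact phrase, restricted to one site
    'data mining OR text mining',       # documents containing either term
]
for q in queries:
    print('http://www.google.com/search?' + urllib.parse.urlencode({'q': q}))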

Features of Yahoo!
Yahoo! is one of the early subject directories, established in 1994. It is now a search engine (search.yahoo.com), a directory (dir.yahoo.com) and a portal (www.yahoo.com). Yahoo! has acquired AlltheWeb, AltaVista, Inktomi and Overture in an attempt to obtain some of the best databases and indexing technology. Some of the features of Yahoo! Search are:
a) It is possible to search maps and weather by using these keywords followed by a location.
b) News may be searched by using the keyword news followed by words or a phrase.
c) Yellow page listings may be searched using the zip code and business type.
d) site:word allows one to find all documents within a particular domain.
e) site:abcd.com allows one to find all documents from a particular host only.
f) inurl:word allows one to find a specific keyword as part of indexed URLs.
g) intitle:word allows one to find a specific keyword as part of indexed titles.
h) Local search is possible in the USA by specifying the zip code.
i) It is possible to search for images by specifying words or phrases.
HyPursuit
HyPursuit is a hierarchical search engine designed at MIT that builds multiple coexisting cluster hierarchies of Web documents using the information embedded in the hyperlinks as well as information from the content. The designers of the system claim that clustering using both types of information is particularly useful since both contain valuable information.

Clustering in HyPursuit takes into account the number of common terms, common ancestors and descendants, as well as the number of hyperlinks between documents. Clustering is useful because page creators often do not create single independent pages but rather a collection of related pages. Large documents are often broken up into a number of Web pages; for example, the notes for a class may be provided separately for each hour of lectures. HyPursuit stores summaries of clusters, called content labels, that are stored efficiently.
Harvest
Harvest is a Web search tool that includes a number of tools for gathering information from diverse repositories, building topic-specific content indexes that are replicated, and caching documents as they are retrieved. It is designed to be very efficient in server load, network traffic and index space requirements. The focus of its design is not on search speed.
A search at Yahoo! displays the summary results first from the database of the search engine. These results are the same as or similar to those produced by clicking on Web sites. The Web pages section is powered by Google. The news is collected from Yahoo! news and the research documents from Northern Light. The page ranking at Yahoo! is very different from what Google offers since Yahoo! allows paid placements. Yahoo! charges an annual fee for inclusion in its directory, and entries in the directory influence the results of the search.
6.7 ENTERPRISE SEARCH
Imagine a large university with many degree programs and considerable consulting and research. Such a university is likely to have an enormous amount of information on the Web, including the following:
Information about the university, its location and how to contact it
Information about degrees offered, admission requirements, degree regulations and credit transfer requirements

Material designed for undergraduate and postgraduate students who may be considering joining the university
Information about courses offered, including course descriptions, prerequisites and textbooks
Information about faculties, departments, research centers and other units of the university
Information about the university management structure
Lists of academic staff, general staff and students (for example, Ph.D. students), their qualifications and expertise where appropriate, office numbers, telephone numbers, email addresses and their Web page URLs

Course material, including material archived from previous years
University timetables, calendar of semester dates, examination dates and holidays
University policies in a wide variety of areas, for example equal opportunity, plagiarism and staff misconduct
Information about university research, including grants obtained, books and papers published and patents obtained
Information about consulting and commercialization
Information about graduation, in particular the names of graduates and degrees awarded
University publications, including annual reports of the university, the faculties and departments, internal newsletters and student newspapers
Internal newsgroups for employees
Press releases
Information about university facilities, including laboratories and buildings
Information about libraries, their catalogues and electronic collections
Information about human resources, including terms and conditions of employment, the enterprise agreement if the university has one, salary scales for different types of staff, procedures for promotion, and leave and sabbatical leave policies

Information about current job vacancies and how to apply for them
Information about various university committees, their memberships, their meeting dates, their agendas and minutes
Alumni news and newsletters
Information about sports facilities, the times they are open and booking procedures
Information about the student union, student clubs and societies
Information about the staff union or association

There are many differences between an intranet search and an Internet search. Some major differences are:
1. Intranet documents are created for simple information dissemination, rather than to attract and hold the attention of any specific group of users.
2. A large proportion of queries tend to have a small set of correct answers, and the unique answer pages do not usually have any special characteristics.
3. Intranets are essentially spam-free.
4. Large portions of intranets are not search-engine friendly.
The enterprise search engine (ESE) task is in some ways similar to the major search engine task, but there are significant differences. On an intranet, algorithms like the Page Rank algorithm may not be suitable for ranking the results of a query. Some typical characteristics of an enterprise search engine are:
1. The need to access information in diverse repositories, including file systems, HTTP Web servers, Lotus Notes, Microsoft Exchange, content management systems such as Documentum, as well as relational databases.
2. The need to respect fine-grained individual access control rights, typically at the document level; thus two users issuing the same search or navigation request may see differing sets of documents due to differences in their privileges (a small illustration of such result filtering follows this list).

3. The need to index and search a large variety of document types (formats), such as PDF, Microsoft Word and PowerPoint files, and different languages (such as English and other European and Asian languages).
4. The need to seamlessly and scalably combine structured (e.g. relational) as well as unstructured information in a document for search, as well as for organizational purposes (clustering, classification, etc.) and for personalization.
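As a rough illustration of point 2, the following sketch filters a ranked result list against per-document access control lists before the results are shown to the user. The data structures, names and group labels are our own and are not taken from any particular product.

# hypothetical per-document access control lists: document -> users or groups allowed to see it
ACL = {
    'salary-scales.pdf': {'hr_staff', 'dean'},
    'course-handbook.html': {'everyone'},
    'committee-minutes.doc': {'committee_members'},
}

def visible_results(ranked_results, user, user_groups):
    """Return only those ranked results the given user is entitled to see."""
    allowed = []
    for doc in ranked_results:
        acl = ACL.get(doc, set())
        if 'everyone' in acl or user in acl or user_groups & acl:
            allowed.append(doc)
    return allowed

# two users issuing the same query can see different result sets
results = ['course-handbook.html', 'salary-scales.pdf', 'committee-minutes.doc']
print(visible_results(results, 'alice', {'hr_staff'}))         # sees the first two documents
print(visible_results(results, 'bob', {'committee_members'}))  # sees the first and third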

6.8 ENTERPRISE SEARCH ENGINE SOFTWARE
The ESE software market has grown strongly. The major vendors, for example IBM and Google, also offer search engine products. The following is a list of enterprise search tools collected from various vendors' Web sites.
ASPseek is Internet search engine software developed by SWsoft and licensed as free software under the GNU General Public License (GPL). The software can index many URLs and search for words and phrases, use wildcards and perform Boolean searches.
Copernic includes three components (Agent Professional, Summarizer and Tracker). It is low-cost intranet search software that retrieves and indexes data in corporate intranets, company servers and public Web sites.
Endeca ProFind from Endeca provides intranet and portal search. Its integrated search and navigation technology allows search efficiency and effectiveness. It includes Web site search capabilities.

Extractor from Verity allows users to find patterns in content managed by a single application, such as a content management system, or in documents, databases and emails that are stored in applications and servers around the world. Extractor has integrated search, classification and recommendation facilities.

Fast Search and Transfer (FAST), from Fast Search and Transfer ASA and originally developed at the Norwegian University of Science and Technology (NTNU), allows customers, employees, managers and partners to access departmental and enterprise-wide information. FAST claims to be capable of scaling to support hundreds of terabytes of data, millions of users and thousands of queries per second.

FreeFind allows daily re-indexing, customization, phrase matching, and simple or Boolean searching.
Google Mini and Google Search Appliance: the Mini offers search of a public Web site or intranet for up to 50,000 documents, while the Appliance handles up to 15 million documents. They index all forms of content on the intranet and Web sites.

IDOL Enterprise Desktop Search from Autonomy Corporation provides search for secure corporate networks, intranets, local data sources and the Web, as well as information on the desktop such as email and office documents.

mnoGoSearch is intranet search engine software that consists of an indexer, which carries out the crawling and indexing, and a search module. It is available free for Unix and Windows under the GNU GPL. The software has a number of features for a wide range of applications, including search within an intranet, ftp archive search, MP3 search and news article search.

Webglimpse from Internet Workshop is intranet search software available free for personal and non-profit use. It includes a Web administration interface, a remote link spider, file indexing and a query system.

WebSphere Information Integrator OmniFind Edition from IBM provides enterprise search for intranets, extranets and corporate public Web sites. It delivers an enterprise search capability that finds the most relevant enterprise data for the enterprise's employees, partners and customers.

CONCLUSION
In this chapter we discussed Web content mining using search engines in much more detail. Differences between classical information retrieval and searching the Web were discussed, and some of the characteristics of search engines were described. We also outlined some problems that search engines face, given short queries from a diverse range of users. Measuring the quality of the results returned called for the introduction of the concepts of precision and recall. We considered the task that search engines try to tackle. Search engine functionality and architecture were discussed, and the roles of the crawler, indexer and query processor were described. Finally, we presented the Page Rank algorithm and illustrated it using two examples. A discussion of the state of the search engine industry was followed by a presentation of some of the important features of Google and Yahoo!. The importance of enterprise search was discussed, and this was followed by a list of some enterprise search software.

REVIEW QUESTIONS
1. What are the primary goals of Web search?
2. Discuss some problems that arise while searching the Web.
3. What are precision and recall? How do they apply to Web search?
4. What are the different components of a search engine? Describe them briefly.
5. What is a crawler? Present a basic crawling algorithm.
6. Describe the Page Rank algorithm. Using an example, show how it works.
7. What is the most important property of a Web page in determining its Page Rank?

8. Why is enterprise search important?
9. What are the major differences between enterprise search and Internet search?
10. Discuss some main features of Google and Yahoo!.
