
Working of Search Engines

A Technical Seminar Presented by Aditya Rai R. No.- 0905410006

Index

What is Search Engine? History of Search Engine Architecture of Search Engine Working of Search Engine Web Crawling Indexing Googles Page Rank Google Caffeine Google Panda

What is a search engine?


Search engines use automated software programs (called spiders, crawlers, or robots) to crawl the WWW by following links. These programs download web pages into the search engine's index (a database); often every word on a web page is indexed. Search engine results are ranked by a special algorithm so that the "best" results are listed first.

What is a search engine?

Internet search engines are special sites on the Web designed to help people find information stored on other sites. Search engines differ in how they work, but they all perform three basic tasks:
1. They search the Internet -- or select pieces of it -- based on important words.
2. They keep an index of the words they find, and where they find them.
3. They allow users to look for words or combinations of words found in that index.

History of search engines

1st Generation (1994): AltaVista, Excite, Infoseek. Ranking based on content: the more rare words two documents share, the more similar they are. Documents are treated as "bags of words" (no effort to understand the contents).
2nd Generation (1996): Lycos. Ranking based on content + structure: site popularity.
3rd Generation (1998): Google, Yahoo, Bing. Ranking based on content + structure + value: page reputation.
In the works: ranking based on the need behind the query.

Architecture of Search Engine

Working of Search Engine

A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching

Web crawling

Web search engines work by storing information about many web pages, retrieved from the WWW by a Web crawler (spider): an automated Web browser that follows every link it sees.
Googlebot is Google's web crawling robot. It functions like a web browser, sending a request to a web server for a web page, downloading the entire page, then handing it off to Google's indexer. Search engine spiders do not read pages the way a human does; instead, they tend to see only particular content and are blind to many extras (Flash, JavaScript, images) that are intended for humans.
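The fetch-and-parse step described above can be sketched with the Python standard library. This shows only the parsing half of the crawl loop (extracting links so they can be queued for later crawling); the download half would be a plain HTTP GET. The `LinkExtractor` class and the example URLs are made up for illustration, and a real crawler would also respect robots.txt, throttle its requests and deduplicate URLs.

```python
# Minimal sketch of a crawler's link-extraction step, stdlib only.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/about">About</a> <a href="https://example.org/">Ext</a>'
parser = LinkExtractor("https://example.com/index.html")
parser.feed(html)
print(parser.links)  # ['https://example.com/about', 'https://example.org/']
```

Each extracted URL would then be appended to the crawl frontier, repeating the download-parse-queue cycle.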


Indexing

Once the spiders have completed the task of finding information on Web pages (and we should note that this is a task that is never actually completed -- the constantly changing nature of the Web means that the spiders are always crawling), the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users:

1. The information stored with the data
2. The method by which the information is indexed

In the simplest case, a search engine could store just each word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether it was used once or many times, or whether the page contained links to other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.
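That "simplest case" can be sketched in a few lines of Python: an inverted index that records only which URLs each word appears on. The example pages and URLs are made up for illustration.

```python
# Naive inverted index: word -> set of URLs where the word appears.
from collections import defaultdict

# Hypothetical crawled pages (URL -> page text)
pages = {
    "https://example.com/a": "search engines crawl the web",
    "https://example.com/b": "the web is large",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

print(sorted(index["web"]))
# ['https://example.com/a', 'https://example.com/b']
```

Note that nothing here records how often or where "web" appeared on each page, so there is no basis for ranking -- exactly the limitation described above.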

Indexing

To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times the word appears on a page, and might assign a weight to each entry, with increasing values assigned to words that appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines produces different lists, with the pages presented in different orders.

Regardless of the precise combination of additional pieces of information stored by a search engine, the data will be encoded to save storage space. For example, the original Google paper describes using 2 bytes (of 8 bits each) to store weighting information -- whether the word was capitalized, its font size, position, and other information to help in ranking the hit. Each factor might take up 2 or 3 bits within the 2-byte grouping. As a result, a great deal of information can be stored in a very compact form. After the information is compacted, it is ready for indexing.
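The 2-byte encoding idea can be illustrated with bit packing. The exact field layout below (1 bit for capitalization, 3 bits for font size, 12 bits for word position) is an assumption for illustration in the spirit of the original Google paper's "hit" encoding, not the paper's actual layout.

```python
# Pack several ranking signals for one word occurrence into 2 bytes (16 bits).
# Assumed layout: [1 bit capitalized][3 bits font size][12 bits position]
def pack_hit(capitalized, font_size, position):
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_hit(True, 5, 1234)
print(hit < 2 ** 16)    # True: the whole hit fits in 2 bytes
print(unpack_hit(hit))  # (True, 5, 1234)
```

Millions of such hits can then be stored in a fraction of the space that separate fields would take, which is why compacting happens before indexing.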

Indexing

An index has a single purpose: it allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective is a hash table. In hashing, a formula is applied to attach a numerical value to each word. The formula is designed to distribute the entries evenly across a predetermined number of divisions. This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table's effectiveness.

In English, some letters begin many words while others begin few. You'll find, for example, that the "M" section of the dictionary is much thicker than the "X" section. This inequity means that finding a word beginning with a very "popular" letter could take much longer than finding a word that begins with a less popular one. Hashing evens out the difference and reduces the average time it takes to find an entry.

Hashing also separates the index from the actual entry: the hash table contains the hashed number along with a pointer to the actual data, which can be sorted in whichever way allows it to be stored most efficiently. The combination of efficient indexing and effective storage makes it possible to get results quickly, even when the user creates a complicated search.
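The bucketing idea can be sketched as follows. The hash function below is a simple djb2-style string hash chosen for illustration; real engines use stronger functions, but the principle is the same: words are spread evenly over a fixed number of buckets regardless of their first letter.

```python
# Sketch of hash-based index lookup: each word is hashed to a bucket number,
# and the bucket holds the entries (in a real engine, pointers to the data).
NUM_BUCKETS = 8

def bucket_for(word):
    # djb2-style hash: deterministic, and spreads words across buckets
    h = 5381
    for ch in word:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h % NUM_BUCKETS

index = {b: [] for b in range(NUM_BUCKETS)}
for word in ["search", "engine", "crawler", "index", "hash"]:
    index[bucket_for(word)].append(word)

print(index)
```

Looking a word up means hashing it again and scanning only its own bucket -- the "M vs. X" skew of the alphabet never enters the picture.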

Searching:
The query processor has several parts, including the user interface (search box), the engine that evaluates queries and matches them to relevant documents, and the results formatter.

Searching:

Searching through an index involves a user building a query and submitting it through the search engine. The query can be quite simple, a single word at minimum. Building a more complex query requires the use of Boolean operators that allow you to refine and extend the terms of the search. The Boolean operators most often seen are:
AND - All the terms joined by "AND" must appear in the pages or documents. Some search engines substitute the operator "+" for the word AND.
OR - At least one of the terms joined by "OR" must appear in the pages or documents.
NOT - The term or terms following "NOT" must not appear in the pages or documents. Some search engines substitute the operator "-" for the word NOT.
FOLLOWED BY - One of the terms must be directly followed by the other.
NEAR - One of the terms must be within a specified number of words of the other.
Quotation Marks - The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file.
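The first three operators map directly onto set operations over an inverted index, which is a minimal sketch of how a query processor evaluates them. The tiny index below is made up for illustration.

```python
# Boolean query evaluation over a toy inverted index (word -> set of pages).
index = {
    "web":     {"page1", "page2", "page3"},
    "crawler": {"page2", "page3"},
    "panda":   {"page3"},
}

def docs(term):
    """Pages containing the term; empty set if the term was never indexed."""
    return index.get(term, set())

# web AND crawler: both terms must appear
print(sorted(docs("web") & docs("crawler")))  # ['page2', 'page3']
# web OR panda: at least one of the terms must appear
print(sorted(docs("web") | docs("panda")))    # ['page1', 'page2', 'page3']
# web NOT panda: first term present, second term absent
print(sorted(docs("web") - docs("panda")))    # ['page1', 'page2']
```

FOLLOWED BY, NEAR and phrase queries need word positions in the index as well, which is one reason engines store position information with each hit.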

Page Rank
1. Everyone gets a page rank of 1/(number of pages). With four pages A, B, C and D, each starts at 0.25.
2. Each page gets its page rank updated based on incoming links.

In this example, A receives links from three pages, so the page rank of A, PR(A), is:
PR(A) = 0.25 + 0.25 + 0.25 = 0.75

Page Rank
Links are weighted based on the number of outgoing links: a page's rank is divided by the number of outgoing links it has, so that each page distributes its rank across the pages it links to rather than handing the full amount to each. For example, D's links are each worth 0.25/3 because D has 3 outgoing links.

With A again receiving three incoming links -- from a page with 2 outgoing links, a page with 1 and D with 3 -- the updated rank is now:
PR(A) = 0.25/2 + 0.25/1 + 0.25/3 ≈ 0.458
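One update step of this weighted scheme can be sketched in Python. The link structure below is an assumption chosen to match the slide's arithmetic (B, C and D all link to A, with 2, 1 and 3 outgoing links respectively); the real algorithm also adds a damping factor and iterates until the ranks converge.

```python
# One PageRank update step (no damping) on a hypothetical 4-page graph.
outlinks = {
    "A": ["B"],            # A's own outlinks (assumed)
    "B": ["A", "C"],       # 2 outgoing links -> each worth PR(B)/2
    "C": ["A"],            # 1 outgoing link  -> worth PR(C)/1
    "D": ["A", "B", "C"],  # 3 outgoing links -> each worth PR(D)/3
}
pr = {page: 1 / len(outlinks) for page in outlinks}  # everyone starts at 0.25

def update(pr, outlinks):
    new = {page: 0.0 for page in pr}
    for page, links in outlinks.items():
        share = pr[page] / len(links)  # rank is split across outgoing links
        for target in links:
            new[target] += share
    return new

pr = update(pr, outlinks)
print(round(pr["A"], 4))  # 0.25/2 + 0.25/1 + 0.25/3 ≈ 0.4583
```

Because every page hands out exactly its own rank, the total rank in the system stays at 1.0 after each update -- the reason for dividing by the number of outgoing links.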

Google Caffeine
In August 2009, Google announced the rollout of a new web indexing system, codenamed Caffeine. According to Google, Caffeine provided 50 percent fresher results for web searches than the previous index, and it was the largest collection of web content Google had offered. Whether it's a news story, a blog or a forum post, links to relevant content can now be found much sooner after publication than was possible before.

Google Caffeine

Google's old index had several layers, some of which were refreshed at a faster rate than others; the main layer updated every couple of weeks. To refresh a layer of the old index, Google would analyze the entire web, which meant a significant delay between finding a page and making it available. With Caffeine, Google analyzes the web in small portions and updates its search index on a continuous basis, globally. As new pages, or new information on existing pages, are found, they are added straight to the index. That means users can find fresher information than ever before -- no matter when or where it was published.

Google Panda
Google Panda is a change to Google's search results ranking algorithm, first released in February 2011. The change aimed to lower the rank of "low-quality sites" and return higher-quality sites near the top of the search results. CNET reported a surge in the rankings of news websites and social networking sites, and a drop in rankings for sites containing large amounts of advertising. This change reportedly affected the rankings of almost 12 percent of all search results.

Soon after the Panda rollout, many websites, including Google's webmaster forum, filled with complaints that scrapers and copyright infringers were getting better rankings than sites with original content. At one point, Google publicly asked for data points to help it detect scrapers better. Panda has received several updates since the original rollout in February 2011, and its effect went global in April 2011. To help affected publishers, Google published an advisory on its blog, giving some direction for self-evaluation of a website's quality.

