
Hybrid search engine ranking algorithm with reranking


Abstract: In today's electronic world, search engines play a pivotal role in information dissemination and retrieval. However, the performance of any search engine is debatable, because almost every search engine retrieves an enormous amount of information along with irrelevant results. The basic objective of the project is therefore user satisfaction and improvement of the user's search engine results. The basic fact emphasized is that it is the quality of the results, not their quantity, that plays a significant role in improving the user's experience. The objective of the project is thus to combine the ranking approaches so that each compensates for the disadvantages of the others. The results are analyzed on the basis of content, usage and links. Links guarantee popularity in the web sphere, but they ignore semantics and content. Content-based ranking ensures that semantics are taken into consideration, but it does not take user behavior into account. The usage-based ranking algorithm considers the usage log and thus reflects usage characteristics. Finally, the concept of reranking is introduced, which matches the user's query and returns only the top 10 results that fulfil the user's requirements. The major goal is thus to map the results according to the user's interest, and this reference to the user's search goals can help in meeting the user's criteria. The scope of the project revolves around combining the different ranking algorithms. This integrated ranking approach fosters improvement in the performance and quality of the search engine results. The user supplies a query in natural-language English. This query is examined for the set of links that match the corresponding query terms. The set of links is then checked for content using dictionary construction, and the usage server log is analyzed to extract the user behavior. The reranking module is based on click entropy and historical clicks. An attempt is thus made to satisfy the user, so that there is no need to modify the query term and the intelligence embedded in the system satisfies the user's requirements.

I. INTRODUCTION

Searching has become one of the most common activities of our lives. Millions of users interact with search engines on a regular basis. They follow links in the results, click on ads, spend time on pages, reformulate their queries and perform other actions. These interactions can be a valuable source of information for tuning and collecting the content. There is a massive number of pages on the web, so search engines always face the challenge of finding the best-ranked pages. Consider, on the other hand, the traditional IR scenario, where a user formulates a query and then triggers a retrieval process that results in a list of ranked documents in decreasing order of relevance and popularity.

With billions of web pages available on the Internet, search engines always have the task of finding the best ranked list for the user's query from that huge number of pages. Many of the search results that correspond to a user's query are not relevant to the user's need. Most page ranking algorithms use link-based ranking (web structure) or content-based ranking to calculate the relevance of the information to the user's need, but those ranking algorithms may not be suitable for providing a good ranked list.

This paper thus provides an insight into a hybrid ranking algorithm that combines content-based, usage-based and link-based ranking algorithms with a reranking algorithm that efficiently provides the best search engine listing of the top ten results. Section 2 presents the proposed approach of the system, Section 3 describes the working of the system, Section 4 covers performance analysis, and Section 5 concludes the paper.

II. PROPOSED SYSTEM

The idea is to combine different ranking algorithms in order to design an effective hybrid ranking approach that uses a combination of content-based, link-based and usage-based ranking algorithms. To narrow the ranked list down even further to meet user-specific needs, the set of top n ranked pages must be reordered using a reranking algorithm. This is made possible by user profile generation. The search engine is thus based on personalized user behavior, which improves user satisfaction.
Figure 1: xyz
The proposed system consists of the following modules:
1. User query generation
2. Data cleaning module
3. Search engine
4. Content-based ranking
5. Link-based ranking
6. Usage-based ranking
7. Reranking module
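
As a rough illustration of how the three ranking modules could feed a single ordering, the sketch below combines a content-based, a link-based and a usage-based score per page into one composite score and sorts pages in descending order of that score. The paper's steps describe the composite as an average of the module scores, while the composite score section speaks of a median, so the combining function is left as a parameter. The `PageScores` record, its field names, the function names and the assumption that all three scores are already on a comparable scale are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable, List, Optional, Sequence


@dataclass(frozen=True)
class PageScores:
    """Scores of one page from the three ranking modules (hypothetical record)."""
    page_id: int
    content: float  # content-based score (keyword matching)
    link: float     # link-based score
    usage: float    # usage-based score (clicks, dwell time)


def composite_score(
    page: PageScores,
    combine: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Combine the three module scores; `mean` is an assumed default."""
    return combine([page.content, page.link, page.usage])


def rank_pages(
    pages: Iterable[PageScores],
    combine: Callable[[Sequence[float]], float] = mean,
    top_n: Optional[int] = None,
) -> List[int]:
    """Return page ids in descending order of composite score."""
    ordered = sorted(
        pages,
        key=lambda p: composite_score(p, combine),
        reverse=True,
    )
    ids = [p.page_id for p in ordered]
    return ids if top_n is None else ids[:top_n]


if __name__ == "__main__":
    sample = [
        PageScores(1, content=0.9, link=0.2, usage=0.1),
        PageScores(2, content=0.5, link=0.6, usage=0.7),
        PageScores(3, content=0.1, link=0.3, usage=0.2),
    ]
    print(rank_pages(sample))
```

Passing `statistics.median` as `combine` switches the sketch to the median variant without changing the rest of the pipeline, and `top_n=10` would keep only the ten highest-scoring pages for the reranking stage.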

Figure 2: xyz
The proposed approach uses a combination of various modules to produce a set of ranked documents.

1) The user specifies the query in natural English. A set of 30-35 documents on diverse topics is kept in the backend. Apart from these static documents, data is also extracted from different web sources (search engines such as Google, MSN, Bing etc.).
2) When the user enters the query in the system API, the query is forwarded to the search engine and the top n results are extracted from the search engine.
3) Preprocessing is an important step in the process: all the dirty data in the set of documents and the query is preprocessed to form a clean, structured query and data.
4) The top n results retrieved from the search engine module are reordered using content-based ranking. A page score is assigned on the basis of keyword matching against the user query. The scores are arranged in descending order. [10]
5) The retrieved search engine results are then passed through a link-based ranking module, which assigns the page score on the basis of the enhanced normalized page ranking algorithm. [17]
6) The results of both content-based and link-based ranking are passed to the usage-based ranking module [14], which then assigns a score to each of the pages.
7) The scores from each module are averaged to form a composite score, and the pages are ranked and filtered in descending order of composite score. [14]
8) The set of ranked pages is displayed to the user. The ranked pages are then reranked by the reranking module, which uses user profile creation, and finally displayed to the user.

III. WORKING OF HYBRID RANKING ALGORITHM

1. The input to the system is a query in English; the user enters the search term in the search box. The entire system is dynamic: it collects and crawls data from various search engines using the APIs (application programming interfaces) of Bing, Google, DuckDuckGo etc.
2. The data is pre-processed and the search term is checked for stop words.
3. Dictionary construction plays a very important role in content-based ranking: a WordNet module extracts data and finds the synonyms of a given term.
4. Based on the constructed dictionary of words, the content-based ranking module assigns a content-based score on the basis of keyword matching against the user query.
5. The search engine results are then retrieved and passed to the link-based ranking module, which assigns the page score on the basis of the enhanced normalized page ranking algorithm.
6. The content-based and link-based results are then analysed and passed to the usage-based ranking module, which assigns a score according to user behavior.
7. The scores obtained through content-based, link-based and usage-based ranking are averaged to form a composite score, and the pages are ranked in descending order of score.

Figure 3: The user interface of the system

IV. QUERY GENERATION

Suppose the search term is education; the user enters it in the search box. The user may enter search terms as unigrams, bigrams or n-grams. The user may also rerank the set of results to obtain the best-quality results. There is no restriction on entering the query with or without stop words. Refer to Fig 5.2, where the search term entered is education.

Figure 4: Query Generation

A. Links and data extraction from the different search engines

Figure 5: Link extraction from different sources

The search term education is searched and crawled across various search engines such as DuckDuckGo, Google, Bing etc., and the links are then extracted. These extracted links contain title, ID, description etc.
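
One way to picture the extracted links is as small records that carry an ID, a title and a description, merged from the result lists of several engines. The sketch below is only an illustration under stated assumptions: the `LinkRecord` type, the `merge_links` helper, the sequential ID scheme and the rule of treating two results as the same link when their URLs match after trimming whitespace and a trailing slash are all hypothetical, and no real search engine API is called.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LinkRecord:
    """One extracted link (hypothetical record layout)."""
    link_id: int      # unique id assigned to the URL
    url: str
    title: str
    description: str
    source: str       # engine that returned the link first


def merge_links(
    results_by_engine: Dict[str, Iterable[Tuple[str, str, str]]],
) -> List[LinkRecord]:
    """Merge (url, title, description) tuples from several engines.

    Assumption: a URL seen more than once keeps the record of the
    engine that returned it first, and ids are assigned sequentially.
    """
    seen: Dict[str, LinkRecord] = {}
    for engine, results in results_by_engine.items():
        for url, title, description in results:
            key = url.strip().rstrip("/")  # simple, assumed normalization
            if key not in seen:
                seen[key] = LinkRecord(
                    link_id=len(seen) + 1,
                    url=url,
                    title=title,
                    description=description,
                    source=engine,
                )
    return list(seen.values())


if __name__ == "__main__":
    merged = merge_links({
        "engine_a": [("https://example.org/edu/", "Education", "Overview")],
        "engine_b": [
            ("https://example.org/edu", "Education portal", "Duplicate"),
            ("https://example.org/schools", "Schools", "Directory"),
        ],
    })
    for record in merged:
        print(record.link_id, record.url, record.source)
```

Keeping the source engine on each record makes it possible to trace later which engine contributed a link once the lists have been merged.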
B. Dictionary Construction Corresponding To Search Term

Based on the search term, the semantics of the data are explored using dictionary set construction. The dictionary set plays an important role in the content-based ranking algorithm, which constructs the synonyms of the search term education and then performs keyword matching. Based on this keyword matching, it assigns a content-based score to the set of links obtained above.

Figure 6: Dictionary construction

Here all the synonyms associated with the search term education are collected, and keyword matching is then performed. This is used to assign a content-based score. The dictionary module plays a pivotal role in determining and matching the semantics of the search term. All links that contain terms from the above dictionary set are given higher priority in terms of content-based ranking scores.

C. Composite Score Generation

Figure 7

Based on the server log, the usage-based ranking score is extracted; it is assigned on the basis of the number of clicks and the dwell time of each user. The content-based, usage-based and link-based scores are computed, and a composite score is then calculated by finding the median of the data. The links are then sorted in descending order of score.

D. Display of Results

myList: [245, 2855, 246, 719, 885, 258, 259, 260, 324, 201, 203, 204, 716, 206, 718, 209, 210, 211, 216, 1690, 1691, 219, 220, 1693, 221, 2848, 2849, 2850, 2851, 2852, 2853, 2854, 687, 879, 689, 690, 882, 883, 249, 251, 253, 254, 191, 959, 255]

These numbers denote the unique IDs corresponding to each URL.

Figure 8: Display of results

The results are then displayed to the user in descending order of composite score. The user is prompted to rerank the results.

E. Reranking of the results

Reranking narrows the search results down to the specific user's needs. In the reranking process only the top 10 results are displayed. They are selected by a reranking algorithm that uses click entropy, historical clicks and the user log to determine relevance.

Figure 9: Reranking

F. Performance Analysis

The system is evaluated to check whether the results generated by the algorithm are relevant. Two parameters, precision and recall, play a pivotal role in determining relevance. The system was checked for unigrams, bigrams and trigrams, and the results were tabulated for different keywords. Precision is the fraction of relevant results among the total number of results. All the results were compared against search engines such as Google, Bing and Yahoo.

Recall is defined as the ratio of retrieved relevant results to the total number of relevant results existing in the database. A collection of 20 relevant results corresponding to all the results was collected, and all the analysis was performed on it.
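
The two evaluation measures can be made concrete with a short sketch. It follows the definitions given in the text: precision is the share of retrieved results that are relevant, and recall is the share of all relevant results that were retrieved. The function name, the use of integer link IDs and the sample numbers are assumptions for illustration; they are not measurements from the paper.

```python
from typing import Iterable, Set, Tuple


def precision_recall(
    retrieved: Iterable[int],
    relevant: Set[int],
) -> Tuple[float, float]:
    """Compute (precision, recall) for one query.

    retrieved: ids of the results returned by the system
    relevant:  ids judged relevant for the query (ground truth)
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)  # retrieved AND relevant
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: top 10 results, 20 relevant links in total,
    # 6 of the top 10 are among the relevant ones.
    top_ten = list(range(1, 11))
    relevant_ids = {1, 2, 3, 4, 5, 6} | set(range(101, 115))
    print(precision_recall(top_ten, relevant_ids))
```

In this made-up case 6 of the 10 retrieved links are relevant and 6 of the 20 relevant links are retrieved, so precision is 6/10 and recall is 6/20.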
CONCLUSION

With the rapid growth of information sources we are drowning in data but starving for knowledge. It has therefore become necessary for the user to use information retrieval techniques and a combination of different ranking algorithms to find, extract and filter the desired information.

Many existing information retrieval systems still rely on a single approach to ranking, such as content-based ranking or link-based ranking, and a few of them use user behavior via a usage-based ranking algorithm. Unfortunately, each of those ranking algorithms still has some drawbacks in the ranked list it provides. A combination of these algorithms makes it possible to filter out the redundant results.

This algorithm can be extended in future to support ranking of links and documents in regional languages such as Hindi, Marathi, Bangla, Punjabi or Japanese. It can also be made useful for extracting and ranking images and videos dynamically from the web. The inference and analysis of user search goals can have many advantages in improving search engine relevance and user experience. This can be achieved by web usage mining, while web content mining removes the persistence problem. This work is a combination of content and usage mining, which works better than using either one alone.
