
Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing

A Hierarchical Cache Scheme for the Large-scale Web Search Engine


Sungchae Lim, Dongduk Women's University, sclim@dongduk.ac.kr
Joonseon Ahn, Korea Aerospace University, jsahn@kau.ac.kr

Abstract

Over the past decade, much research has been done to solve technical challenges regarding web search engines, such as crawling web documents, high-performance indexing, and ranking systems based on hyperlink analysis. However, the implementation details of their query processing systems are rarely dealt with in the literature. In this paper, we present a distributed architecture for a query processing system and its hierarchical cache scheme. Our paper is based on the development experience of a commercial web search engine designed to answer 5 million user queries per day against over 65 million web pages. Using the hierarchical cache scheme, we keep a portion of query results in multi-level caches so that excessive I/O or CPU time is not spent on query processing. With that scheme, it is possible to reduce around 70% of the server costs.

1. Introduction
Nowadays, the web search engine is widely used as a common way to find information of interest, and the number of indexed documents has reached the scale of multiple billions [1]. A web search engine is software that indexes web documents collected from the Internet and ranks them according to their relevancy to an entered user query. Much research has been done to solve various problems related to web search engines, such as crawling web documents [2-4], high-performance indexing [5-7], hyperlink analysis [8, 9], and topic-sensitive searching [10, 11]. However, there is not enough information about how to implement a query processing system suitable for large-scale web search engines.
This research was supported by the Internet Information Retrieval Research Center (IRC) at Korea Aerospace University. IRC is a Regional Research Center of Gyeonggi Province, designated by ITEP and the Ministry of Knowledge Economy.

Since such a system has to index a huge amount of data, the cost of producing a query result can be very high. A public web search service could therefore be infeasible due to prohibitive server costs, if there were no query processing system (QPS) that could efficiently reduce resource consumption during query processing [12].

In this paper, we mainly focus on the QPS of a large-scale web search engine. We first design a distributed architecture for the QPS. Since the amount of CPU and I/O resources used for query processing is so large, more than one server has to work in parallel even to process a single query. To make such cooperation efficient, the QPS is designed as clustered servers, and the server clusters are connected to each other via high-speed LANs.

Next, we describe the hierarchical cache scheme, which is the main topic of this paper. The cache scheme maintains cache data at four hierarchical levels. In the top-level cache, recent search result pages are stored in main memory, while the lower levels of caches reside on disk to save more query results. Using the multi-level caches, we can save 70% of the server cost for query processing. In this way, our system indexes 65 million web documents and can answer 5 million user queries against them per day at low cost.

The rest of this paper is organized as follows. We describe the architecture of the QPS in Section 2. In Section 3, the basic ideas and algorithms of the proposed multi-level cache scheme are presented. Then, we discuss the performance and effectiveness of our cache scheme in Section 4, and conclude this paper in Section 5.

2. Architecture of the QPS


Our QPS was developed to answer over 5 million queries a day against 65 million web documents that are indexed using inverted files [5, 6]. As a timing restriction, the system has to keep the average response time below 0.7 seconds even at peak time.



To fulfill these performance requirements, we make a single query be processed over multiple servers in parallel. In addition, for more capacity and system stability, a server clustering technique is employed.

To design the architecture of the QPS, it is necessary to understand how a user query is answered in the system. When a user inputs a web query, the query is sent to a web server in the system, represented by a URL that contains the user's keyword(s) and a retrieval range. Here, the retrieval range specifies the rank range of the query-matching documents to be shown in the query result page returned to the user. The retrieval range is shifted when the user clicks on the next/previous page links given in the current query result page. By parsing the query URL, the web server extracts the keyword(s) and retrieval range, and then sends them to a query processing server.

The query processing server is responsible for performing the equi-join and rank evaluation. The equi-join selects the query-matching documents with respect to the given keywords. For this, inverted files are used to identify the documents in which all of the given keywords occur at least once (see the sketch below). After the equi-join operation, rank evaluation is performed to give rank scores to the query-matching documents according to their query relevancy. At this point, our ranking system refers to various index data including hyperlink analysis results, keyword occurrence positions, and HTML-tag related information. This index data is also stored in inverted files, along with the other data used for the equi-join. From the steps of equi-join and ranking, we obtain a set of <DID, rank score> pairs, where DID stands for the document ID. By sorting them in decreasing order of rank scores, we can finally determine the set of DIDs pertaining to the user's retrieval range.

To make a query result page after the rank evaluation, it is necessary to create summarization data for the DIDs selected above. The summarization data for a DID is comprised of its URL, title, and the keyword-including passages extracted from the corresponding document's body text. Such data for a DID is called the DST (Document Summarizing Text) from now on. To make a DST, we read an HTML-tag-free body text saved on disk and select pieces of text containing many keywords. This DST generation is a very I/O-intensive and CPU-consuming task, because we have to read documents' body texts that are randomly located in a large volume of disks, and a number of string matching operations are performed over the body texts [13]. After gathering the DSTs, a single result page can be generated by assembling them and inserting proper HTML tags.

Considering these steps of query processing, we designed the architecture of the QPS as shown in Fig. 1.
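To make the equi-join step concrete, the following Python sketch intersects sorted per-keyword posting lists of DIDs. It is a minimal illustration of the operation, not the implemented system's code, and all names in it are ours.

def merge_intersect(a, b):
    """Linear merge of two sorted DID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def equi_join(posting_lists):
    """Intersect sorted DID lists; returns DIDs containing every keyword."""
    if not posting_lists:
        return []
    # Start from the shortest list to minimize comparisons.
    posting_lists = sorted(posting_lists, key=len)
    result = posting_lists[0]
    for plist in posting_lists[1:]:
        result = merge_intersect(result, plist)
        if not result:
            break
    return result

# Example: documents matching both keywords k1 and k2.
k1_dids = [3, 7, 12, 45, 88]
k2_dids = [7, 12, 30, 88, 91]
print(equi_join([k1_dids, k2_dids]))   # [7, 12, 88]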

Fig. 1. Architecture of the QPS.

The QPS is installed at an IDC (Internet Data Center) site for stable Internet connectivity. The IDC port in the figure provides 100 Mbps of network bandwidth, and a firewall resides behind it for security purposes. An L4 switch acts as a load balancer, dispatching user queries to the four web servers in round-robin fashion. The performance monitor repeatedly gathers performance statistics such as response times, the rate of entered queries, server workloads, etc. If any problem is detected, it sends a warning message to the administrator.

Below the web servers, the main components of our QPS are depicted: coordinator servers, ranker clusters, and DST servers. The coordinator server receives user queries via the web servers and performs two-phase query processing in coordination with the ranker servers and DST servers. During the first phase, the coordinator server sends a query to four ranker servers at once. In our system, the whole index information is partitioned into four index files, which are stored in the four ranker clusters, respectively. The ranker servers in the same ranker cluster hold the same partition of the index files. Therefore, by sending a query to four ranker servers of different clusters, we can expect the time for the equi-join and ranking operations to be reduced owing to parallel processing by the ranker servers. At the end of the first phase, the coordinator merges the results, in the form of <DID, rank score>, returned from the ranker servers. In the second phase, the coordinator server sends the lists of DIDs obtained in the first phase to the DST servers in parallel.


Correspondingly, the DST servers create DST data for each received DID and return it to the coordinator server. For this, each DST server stores the URLs, titles, and tag-free body texts of all the crawled web documents on disk, and uses a hash scheme to read each of them. By merging the DSTs, the coordinator server finishes the second phase.
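The following sketch illustrates how a DST server might produce a DST, assuming a hash index from a DID to a byte offset in a document file and a simple tab-separated record layout. The on-disk format and all names here are assumptions for illustration only.

def read_document(did, doc_index, doc_file):
    """doc_index: dict mapping DID -> byte offset; doc_file: open binary
    file whose records are '<url>\t<title>\t<body>\n' lines (assumed)."""
    doc_file.seek(doc_index[did])
    url, title, body = doc_file.readline().decode("utf-8").split("\t", 2)
    return url, title, body

def make_dst(did, keywords, doc_index, doc_file, window=60):
    """Build a DST: URL, title, and keyword-containing passages cut from
    the HTML-tag-free body text."""
    url, title, body = read_document(did, doc_index, doc_file)
    passages = []
    low = body.lower()
    for kw in keywords:
        pos = low.find(kw.lower())
        if pos >= 0:
            start = max(0, pos - window)
            passages.append(body[start:pos + len(kw) + window])
    return {"did": did, "url": url, "title": title, "passages": passages}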

3. Proposed Cache Scheme


In this section, we address the hierarchical multi-level cache scheme developed for the QPS. Using the cache scheme, we can considerably reduce the use of disk I/Os and CPU time in the implemented QPS.

3.1. Background

We here present the reason why a cache scheme must be employed in the QPS. For this, we need to analyze the steps of query processing in detail. Fig. 2 shows the steps performed by the coordinator server working with the other servers, that is, a web server, ranker servers, and DST servers. At Step 1, the coordinator server accepts a user query, and it finally obtains the query result at Step 7. Across these steps, two-phase query processing is performed. During the first phase of Steps 1 to 3, the coordinator server requests partial ranking results from the ranker servers and merges the results from them. In the next phase of Steps 4 to 6, DSTs are collected for the DUP_K top-ranked DIDs. At Step 6, duplication deletion is executed to remove duplicate web documents from the query result page returned to users. Although deletion of duplicate web documents is performed before we start to index the crawled ones, it is not possible to detect all of the duplicates among the crawled web documents. Therefore, deleting duplicate DSTs is essential for the quality of search results. The detection of duplicates is done based on string comparisons.

Steps 2 and 4 are executed in cooperation with the ranker servers and DST servers, respectively, and those steps become the main bottleneck points in terms of I/O bandwidth and CPU time. To get a glimpse of this, let us consider a user query having k1 and k2 as its keywords, and suppose that the number of web documents containing k1 (or k2) is 20,000 (or 30,000). In this situation, the disk bandwidth required in Steps 2 and 4 can be estimated as follows. At Step 2, the ranker server needs to read about 0.2 MB of disk data for the equi-join, because an index entry has to contain at least a 4-byte DID for every matching document, and the DID lists of k1 and k2 are each read in full at least once.

Step 1. Accept a query of keywords and a retrieval range from a web server.
Step 2. foreach s ∈ S /* S is the set of ranker servers toward which the query will be sent. */
    Send the keywords to s for equi-join and ranking operations.
    Receive the ranking results of <DID, rank score> from s, and merge them into R.
Step 3. Sort R in decreasing order of rank scores and select the ranking results with the DUP_K highest rank scores. Let Top-K be the selected subset.
Step 4. Divide Top-K into DUP_K/20 subsets, send every subset to a DST server, and receive DSTs from them.
Step 5. Merge the DSTs returned from the DUP_K/20 DST servers. Let D be the merged DSTs, which have the format of <DID, DST> for each document.
Step 6. Perform duplication detection on D. Then delete the detected duplicates from D as well as R, and let D' and R' be the modified ones, respectively.
Step 7. Make the query results using D' and R'.

Fig. 2. Query processing steps of the coordinator server.

In reality, the size of the data read is more than twice that, because an index entry always includes other information besides the DID. As a result, the equi-join consumes about 0.4 MB of disk bandwidth in general. After the equi-join, the ranker server calculates a rank score for every DID selected by the equi-join, and thus it has to read additional index data such as keyword occurrence positions, HTML-tag related data, etc. Since the disk accesses to those data always incur additional seek times, the disk usage for ranking becomes more than twice that of the equi-join on average. Note that these are performance statistics obtained from our experience of running a commercial web search engine. Therefore, the disk bandwidth consumed by a single ranker server amounts to that required to read about 1.2 MB of data in all. Such disk usage could be even larger for queries with more keywords or more query-matching documents. Since this computation is for a single ranker server, and four ranker servers take part in processing the user query, 4.8 MB of disk bandwidth is required in all.

At Step 4, a large number of disk I/Os arise in the DST servers to create the DUP_K DSTs. The DST creation demands at least one read of the disk block storing the HTML-tag-free text of a web document, and those disk blocks are randomly located on disk. Therefore, the disk bandwidth for DST creation at Step 4 is very large because of the seek time overhead. With DUP_K = 120 in our implementation, a fast SCSI disk would be fully occupied for more than one second to generate that number of DSTs. This delay is too long for our system; instead, we divide the DIDs into subsets and send them to multiple DST servers at once for parallel processing, as in Step 4. The following sketch reproduces the bandwidth arithmetic above.
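This is a back-of-the-envelope reproduction of the Step 2 estimate for the two-keyword example; only the 4-byte DID size, the two "more than twice" factors, and the four-server count come from the discussion above, the rest is plain arithmetic.

# Disk-bandwidth estimate for a query whose keywords match
# 20,000 and 30,000 documents, respectively.
docs_k1, docs_k2 = 20_000, 30_000
did_bytes = 4                               # 4-byte DID per index entry

dids = (docs_k1 + docs_k2) * did_bytes      # 200,000 B, i.e. ~0.2 MB of raw DIDs
equi_join = dids * 2                        # entries are over twice the DID size -> ~0.4 MB
ranking = equi_join * 2                     # ranking reads over twice the equi-join data -> ~0.8 MB
per_ranker = equi_join + ranking            # ~1.2 MB per ranker server
total = per_ranker * 4                      # four ranker servers -> ~4.8 MB

print(per_ranker / 1e6, total / 1e6)        # 1.2 4.8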


Due to such heavy disk use in the ranker and DST servers, a cache mechanism is unavoidable. Note that CPU time consumption is not discussed here for simplicity. The details of the cache scheme for our QPS are presented in the next sections.

3.2. Multi-level Cache

We design a hierarchical cache scheme composed of four cache levels, which we name CL1, CL2, CL3, and CL4, from the top level to the bottom level. In the hierarchical cache, cache look-ups proceed in order from CL1 to CL4.

CL1 is managed in the main memory of the web server in the QPS, and the cache records of CL1 are compressed to save memory space. For fast access to CL1, a hashing technique is used to determine the address of the cache record for a given query URL. For this, we build a hash table mapping a query URL to an address pointing to a memory bucket, in which multiple memory slots of a fixed size reside. The web server first accesses the associated memory bucket, and then explores the memory slots within it in search of a matching cache record. If a matching cache record is found, it is uncompressed for use. The uncompressed data is the HTML-coded data to be transferred to the user's web browser as the view of a query result page. With a cache hit on CL1, the user query is completed without any connection to the coordinator server. Otherwise, a request for query processing is sent to a coordinator server.

Since CL1 resides in memory, the number of cached search result pages is not large enough to support many user queries. Therefore, the caches of CL2, CL3, and CL4 are stored on disk. These caches are managed by the coordinator server to cache the execution results of Step 6 of Fig. 2. After executing Step 6, the coordinator server has two sorts of resultant data, namely the DST data and a DID list. The DST data is made of at most DUP_K DSTs obtained at Step 5, and the DID list comes from the equi-join at Step 2. This list contains the DIDs ranked behind those having DST data, as well as the DIDs having DSTs; note that the DIDs having DST data are those ranked from 1 to DUP_K. Since the DID list is already modified through the duplication deletion at Step 6, the number of DSTs is usually less than DUP_K, and the DID list does not contain any DIDs found to be duplicates.

CL2 and CL3 are used to cache the DST data, and their cache records are saved in the form of <DID, DST> for each document.

To access a record on disk, the coordinator server uses a hash table mapping a given user query to a disk address pointing to the target cache record. That hash table is stored on the same disk as the cache records.

The difference between CL2 and CL3 lies in the query sets they cache. CL2 caches the 0.3 million queries with the highest popularities among the queries entered earlier. This popular query set is determined by analyzing the query log, and its popularities do not change drastically over a short period. By caching such popular queries in CL2, we can guarantee a high cache hit rate. Moreover, since the cache data of CL2 can be built in batch mode, we can use the CL2 data to prevent severe cache misses when the current index files and data of the web search engine are replaced with new ones at data refresh times. In the case of CL3, the set of cached queries changes over time: since cache replacement is performed with an LRU (Least Recently Used) style algorithm, the cached queries vary with the time-varying user queries. Unlike CL3, the CL2 data does not change until the next data refresh time.

CL4 caches the DID list of Step 6. A cache record of CL4 saves the list of DIDs, sorted according to their rank scores. Since the rank scores themselves are not used for generating search results, only the DID data are saved in CL4. As with CL2 and CL3, CL4 has a hash table mapping a query to the target cache record. Because the size of a DID is much smaller than that of a DST, CL4 can cache a much larger number of queries than CL2 and CL3. In our implementation, a cache record in CL4 can save up to 4,000 DIDs. If a user requests a query result page containing any DIDs beyond this limit, the query is processed without cache hits in the coordinator server. Of course, the cached data in CL1 will be used if the same request is issued again.

Since a record saved in CL4 has no DST data, a request for gathering DSTs must still be sent to the DST servers even when a cache hit arises on CL4. The benefit of a CL4 hit is that the I/O and CPU costs of the ranking and duplication deletion steps (Steps 2, 3, and 6 of Fig. 2) are eliminated, and only the DSTs needed for the requested result page have to be created. Since the cost of creating those few DSTs is much less than the cost paid on a CL4 miss, we obtain a large performance improvement from CL4. As with CL3, the CL4 cache records can be swapped out according to a cache replacement algorithm, and a hash table is used for fast access. A sketch of a CL4 cache record follows; the algorithms for our multi-level caches are presented in the next section.
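This is a minimal in-memory sketch of a CL4 record: only a rank-ordered DID list per query, capped at 4,000 DIDs as in our implementation. The class and method names are illustrative assumptions, and the on-disk hash buckets are abstracted away as a dictionary.

CL4_MAX_DIDS = 4000   # per-query cap from our implementation

class CL4Cache:
    def __init__(self):
        self.records = {}                  # query key -> DIDs in rank order

    def put(self, query_key, ranked_dids):
        self.records[query_key] = ranked_dids[:CL4_MAX_DIDS]

    def get(self, query_key, rank_from, rank_to):
        """Return the DIDs for the requested retrieval range, or None on
        a miss or when the range exceeds the cached 4,000-DID window."""
        dids = self.records.get(query_key)
        if dids is None or rank_to > len(dids):
            return None                    # fall back to full query processing
        return dids[rank_from - 1:rank_to]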


3.3. Cache Algorithms

We first describe the algorithm of CL1. The memory of CL1 is initialized at data refresh time, and it is filled with answered user queries according to the algorithm of Fig. 3. In the algorithm, using the hash key h_key made from the given user query, CL1 locates the memory bucket containing the memory slot that may hold the cache record with key h_key. While searching for the target memory slot, CL1 reduces the popularity of the unmatched memory slots in line 10. Such adjustment of the popularities is needed for cache replacement.
(1)  h_key ← get_key(keywords, retrieval_range).
(2)  id ← h_key mod BUCKET_NO. /* bucket id */
(3)  Let B be the set of cache slots in the bucket with identifier id.
(4)  record ← nil.
(5)  foreach s ∈ B
(6)      if (s.h_key == h_key) then /* cache hit */
(7)          record ← pointer to s.
(8)          s.popularity += 1.
(9)      else
(10)         s.popularity -= 1.
(11) endfor
(12) if (record != nil) then
(13)     Uncompress the cache record saved in record for returning a search result page.
(14) else
(15)     Forward the user query to a coordinator server and receive the query result, saved in q_result.
(16)     Select a free slot s from B, save the compressed q_result into s.data, and set s.popularity to 50, if the current result is to be cached.
(17) endif

Fig. 3. Algorithm for the CL1 cache.

Fig. 4. Workflow of the CL2, CL3, and CL4 caches.

If there is no cache hit on CL1, the query is sent to a coordinator server in line 15, and the query result returned from that server is compressed for saving in CL1. Such saving of a new cache record is done only when the query requests one of the first five result pages; result pages beyond this range are not cached in CL1, in order to save memory space. In addition, the memory slot with the least popularity is selected as the victim in line 16. A runnable rendition of Fig. 3 is sketched below.
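In this Python rendition of Fig. 3, zlib stands in for the (unspecified) compression method, and the bucket and slot counts are illustrative assumptions; the popularity adjustments and the initial popularity of 50 come from the figure.

import zlib

BUCKET_NO = 1024           # illustrative bucket count
SLOTS_PER_BUCKET = 8       # illustrative slots per bucket

class Slot:
    __slots__ = ("h_key", "data", "popularity")
    def __init__(self):
        self.h_key, self.data, self.popularity = None, None, 0

buckets = [[Slot() for _ in range(SLOTS_PER_BUCKET)] for _ in range(BUCKET_NO)]

def cl1_lookup(h_key):
    """Return the uncompressed result page on a hit, else None."""
    hit = None
    for s in buckets[h_key % BUCKET_NO]:
        if s.h_key == h_key:               # cache hit (lines 6-8)
            s.popularity += 1
            hit = s
        else:
            s.popularity -= 1              # decay unmatched slots (line 10)
    return zlib.decompress(hit.data).decode() if hit else None

def cl1_store(h_key, page_html):
    """Cache a new result page, evicting the least popular slot (line 16)."""
    bucket = buckets[h_key % BUCKET_NO]
    victim = min(bucket, key=lambda s: s.popularity)
    victim.h_key = h_key
    victim.data = zlib.compress(page_html.encode())
    victim.popularity = 50                 # initial popularity from Fig. 3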

On the side of the coordinator server, the lower cache levels are used as shown in Fig. 4, where we give a simplified workflow due to space limitations. At the top of Fig. 4, we check whether a given query is a popular one. If so, a cache lookup is performed against CL2; otherwise, the lookup is done against CL3. If a cache hit arises on either CL2 or CL3, the resultant page for the query is sent to the web server by reading the data saved in the found cache record. Here, the result page is one that contains the DSTs of the DIDs pertaining to the user's retrieval range.

In the case of a cache miss, CL4 is looked up next. If there is a cache hit on CL4, the DSTs for the resultant page are requested from the DST servers. A cache hit at this level removes the cost of rank evaluation and of deleting duplicate DSTs. As mentioned before, these two steps account for most of the system resource consumption during query processing. Therefore, the use of CL4 significantly improves the system throughput using a rather small amount of disk space.

After merging the ranking results returned from the ranker servers, our cache scheme checks whether the current query is cached in CL2 or CL3. Note that a cache record in CL2 or CL3 cannot be used to answer the current query if the query's retrieval range is beyond the coverage of that cache record. If the query is cached in either of them, the duplication deletion process is omitted; otherwise, duplication deletion is performed so that the new query result can be saved into CL3. The overall cascade is sketched below.
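This is a hedged, simplified sketch of the Fig. 4 cascade. The cache objects are assumed to expose get/put as in the CL4 sketch above, the callbacks stand in for Steps 2 to 6 of Fig. 2, and the query object's attributes (key, rank_from, rank_to) are illustrative assumptions.

DUP_K = 120    # number of top-ranked DIDs summarized (our implementation)

def answer_query(query, is_popular, cl2, cl3, cl4,
                 rank, fetch_dsts, delete_duplicates, build_page):
    dst_cache = cl2 if is_popular else cl3          # popular queries probe CL2
    dsts = dst_cache.get(query.key, query.rank_from, query.rank_to)
    if dsts is not None:
        return build_page(dsts, query)              # CL2/CL3 hit: DSTs on hand

    dids = cl4.get(query.key, query.rank_from, query.rank_to)
    if dids is not None:
        return build_page(fetch_dsts(dids), query)  # CL4 hit: skip ranking and dedup

    ranked = rank(query)                            # Steps 2-3: equi-join + ranking
    dsts = fetch_dsts(ranked[:DUP_K])               # Steps 4-5: gather DSTs
    dsts, ranked = delete_duplicates(dsts, ranked)  # Step 6
    cl4.put(query.key, ranked)                      # refill the lower levels
    cl3.put(query.key, dsts)                        # new results go into CL3, not CL2
    return build_page(dsts, query)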

4. Performance
We here analyze the benefits of the implemented multi-level cache scheme. To this end, we use a probabilistic model. Let h1, h2, h3, and h4 be the cache hit rates of CL1, CL2, CL3, and CL4, respectively, where each hit rate is measured over the queries that miss all higher cache levels. Then the overall hit rate h is computed as follows:

h = h1 + (1 - h1) h2 + (1 - h1)(1 - h2) h3 + (1 - h1)(1 - h2)(1 - h3) h4

Since the hit rates of multi-level caches depend heavily on the characteristics of the entered user queries, it is not possible to forecast the exact cache hit rate of every web search engine. However, we believe that the operational statistics gathered from our real web search engine can serve as a guideline for similar systems. In our experience, the values of h1, h2, h3, and h4 are 0.62, 0.19, 0.24, and 0.13, respectively. From those values, we have h = 0.79. That is, around 80% of user queries are processed using cached data.

At the time of a cache hit on CL1, the cost of query processing is negligibly small, since the web server only has to uncompress the cache record saved in main memory. For the memory space of CL1, we exclusively allocate 75% of main memory, and thus around 20,000 search result pages can be cached in the CL1 of a web server with 2 GB of memory. In the cases of CL2, CL3, and CL4, although they require disk bandwidth, their cost for query processing is not greater than 3% of the cost that would be paid without any cache hit. Since the cost of query processing using cached data is negligible compared with that for non-cached queries, we can say that the multi-level cache improves the throughput of the implemented system by about 300%. Due to this performance enhancement, we could reduce the server cost of building our QPS by 70%.

For efficient disk management of CL2, CL3, and CL4, we partition the disk space into fixed-size disk areas, each of which contains multiple cache records. Such a disk area is called a disk bucket, and it is directly accessible via hash keys made from the query. We save the addresses of the cache records within a disk bucket at the head of the bucket; the disk area for this address information is called the control data block. For a cache access, the modules of CL2 to CL4 first read the control data block, using the disk address calculated from the hash key and the disk bucket size, and then check for the existence of the target cache record by using the information saved in the control data block. If it is found, the corresponding cache record is fetched into memory. Due to this mechanism, we can access the target cache record with at most two disk reads. For cache replacement in CL3 and CL4, a time stamp recording the most recent access time is updated whenever the corresponding cache record is referenced or created.
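The overall hit rate above can be checked with a few lines of Python:

# Overall hit rate from the per-level hit rates (each conditional on
# missing the levels above it).
h1, h2, h3, h4 = 0.62, 0.19, 0.24, 0.13

h = (h1
     + (1 - h1) * h2
     + (1 - h1) * (1 - h2) * h3
     + (1 - h1) * (1 - h2) * (1 - h3) * h4)
print(round(h, 2))   # ~0.80, i.e. around 80% of queries answered from cache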

5. Conclusion

In this paper, we designed the architecture of a query processing system for a large-scale web search engine. To shorten response times, we adopted the technique of server clustering. In addition, we proposed a hierarchical cache scheme suitable for our query processing system. To use the memory and disk space efficiently, the cache data are managed across caches of four different levels. Using the multi-level cache scheme, we can save around 70% of the server cost. We expect that our paper gives valuable information to those interested in the implementation techniques of large-scale search engines.

References
[1] Search Engine Report, http://www.searchenginewatch.com, 2005.
[2] Soumen Chakrabarti, Kunal Punera, and Mallela Subramanyam, Accelerated Focused Crawling through Online Relevance Feedback, In Proc. of the 11th International Conf. on World Wide Web, pp. 148-159, 2002.
[3] Sriram Raghavan and Hector Garcia-Molina, Crawling the Hidden Web, In Proc. of the VLDB Conference, pp. 129-138, 2001.
[4] Andrei Z. Broder, Marc Najork, and Janet L. Wiener, Efficient URL Caching for World Wide Web Crawling, In Proc. of the 12th International Conf. on World Wide Web, Budapest, Hungary, 2003.
[5] Maxim Lifantsev and Tzi-cker Chiueh, I/O-Conscious Data Preparation for Large-Scale Web Search Engines, In Proc. of the 28th VLDB Conf., Hong Kong, 2002.
[6] Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina, Building a Distributed Full-Text Index for the Web, In Proc. of the 10th International World Wide Web Conference, pp. 396-406, 2001.
[7] Maxim Lifantsev and Tzi-cker Chiueh, Implementation of a Modern Web Search Engine Cluster, In Proc. of the USENIX Annual Technical Conference, Texas, 2003.
[8] Arvind Arasu, et al., Searching the Web, ACM Trans. on Internet Technology, Vol. 1(1), pp. 2-43, August 2001.
[9] Larry Page, Sergey Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford Univ. Technical Report, 1998.
[10] Zheng Chen, Shengping Liu, Liu Wenyin, Geguang Pu, and Wei-Ying Ma, Building a Web Thesaurus from Web Link Structure, In Proc. of ACM SIGIR '03, pp. 48-55, Toronto, Canada, 2003.
[11] Taher H. Haveliwala, Topic-Sensitive PageRank, In Proc. of the 11th International Conf. on World Wide Web, 2002.
[12] Tiziano Fagni, Raffaele Perego, Fabrizio Silvestri, and Salvatore Orlando, Boosting the Performance of Web Search Engines: Caching and Prefetching Query Results by Exploiting Historical Usage Data, ACM Trans. on Information Systems, Vol. 24(1), pp. 51-78, 2006.
[13] Alfred V. Aho and Margaret J. Corasick, Efficient String Matching: An Aid to Bibliographic Search, Communications of the ACM, Vol. 18(6), pp. 333-340, 1975.


