Discover millions of ebooks, audiobooks, and so much more with a free trial

Only $11.99/month after trial. Cancel anytime.

Numerical Algorithms for Personalized Search in Self-organizing Information Networks
Numerical Algorithms for Personalized Search in Self-organizing Information Networks
Numerical Algorithms for Personalized Search in Self-organizing Information Networks
Ebook330 pages2 hours

Numerical Algorithms for Personalized Search in Self-organizing Information Networks

Rating: 0 out of 5 stars

()

Read preview

About this ebook

This book lays out the theoretical groundwork for personalized search and reputation management, both on the Web and in peer-to-peer and social networks. Representing much of the foundational research in this field, the book develops scalable algorithms that exploit the graphlike properties underlying personalized search and reputation management, and delves into realistic scenarios regarding Web-scale data.

Sep Kamvar focuses on eigenvector-based techniques in Web search, introducing a personalized variant of Google's PageRank algorithm, and he outlines algorithms--such as the now-famous quadratic extrapolation technique--that speed up computation, making personalized PageRank feasible. Kamvar suggests that Power Method-related techniques ultimately should be the basis for improving the PageRank algorithm, and he presents algorithms that exploit the convergence behavior of individual components of the PageRank vector. Kamvar then extends the ideas of reputation management and personalized search to distributed networks like peer-to-peer and social networks. He highlights locality and computational considerations related to the structure of the network, and considers such unique issues as malicious peers. He describes the EigenTrust algorithm and applies various PageRank concepts to P2P settings. Discussion chapters summarizing results conclude the book's two main sections.

Clear and thorough, this book provides an authoritative look at central innovations in search for all of those interested in the subject.

LanguageEnglish
Release dateSep 7, 2010
ISBN9781400837069
Numerical Algorithms for Personalized Search in Self-organizing Information Networks
Author

Sep Kamvar

Sep Kamvar is a consulting assistant professor of computational mathematics at Stanford University. From 2003 to 2007, he was the engineering lead for personalization at Google. He is the founder and former CEO of Kaltix, a personalized search engine acquired by Google in 2003.

Related to Numerical Algorithms for Personalized Search in Self-organizing Information Networks

Related ebooks

Programming For You

View More

Related articles

Reviews for Numerical Algorithms for Personalized Search in Self-organizing Information Networks

Rating: 0 out of 5 stars
0 ratings

0 ratings0 reviews

What did you think?

Tap to rate

Review must be at least 10 words

    Book preview

    Numerical Algorithms for Personalized Search in Self-organizing Information Networks - Sep Kamvar

    them.

    Chapter One

    Introduction

    Distributed, self-organizing networks such as the World Wide Web and peer-to- peer networks allow for fast access to vast quantities of diverse information for a large number of users. However, with such large scale and data diversity comes the challenge of finding relevant data from reputable sources in an efficient manner.

    This book, addresses the issues of relevance and reputation by exploiting user preference information to perform reputation management and personalized search. The issues of personalization and reputation management are highly intertwined, in terms of both the basic ideas and the underlying technologies. Personalization exploits the preferences of an individual to bias search toward that individual’s preferences, while reputation management aggregates the preferences of all individuals to bias search toward the data sources that are deemed reputable by the group.

    The ideas of reputation and personalization are powerful in conjunction. For example, a personalized Web search for the term giants would return the official site of the New York Giants to a football fan from New York, while the same query would return the official site of the San Francisco Giants to a baseball fan from San Francisco. Personalization takes advantage of the local context to return the right sports team, and reputation takes advantage of the global context to return the official site of the corresponding team, rather than some random fan page.

    In large-scale diverse data networks, a query will often have so many results that the challenge lies in finding those that are most relevant and reputable. When traditional IR keyword matching techniques are combined with the dual techniques of personalization and reputation management, the end user is likely to have to spend less time intelligently formulating a query and filtering through irrelevant data.

    This book is written in two parts, the first part focusing on the Web, and the second on peer-to-peer networks. These parts, while they share the themes of reputation and personalization, differ in style as well as application area. Part I has more mathematical proofs, while Part II relies more on simulation and experimentation. You may read them independently, but together they will give a broader view of the world. Both Part I and Part II, however, are meant for an audience comfortable with advanced concepts in math and computer science.

    1.1 WORLD WIDE WEB

    Google’s PageRank algorithm [56] revolutionized Web search by providing a reliable, spam-resistant way to find reputable web pages. The algorithm is based on the idea that a link from page i to page j confers authority on page j. Therefore, pages with many links from reputable pages are themselves reputable. Part I of this book addresses the issue of personalizing the PageRank algorithm for individual users.

    The PageRank algorithm involves the computation of the dominant eigenvector of a Markov matrix describing the behavior of a model Web surfer jumping from page to page on the Web hyperlink graph. Chapter 2 reviews the PageRank algorithm and the random surfer model. Chapters 3 and 4 introduce some mathematical properties of PageRank that guide how we proceed in algorithm design.

    It has been suggested that, by biasing the behavior of the model surfer to reflect the biases of a given user, PageRank can be personalized for each individual user [56]. However, due to the sheer size of the web matrix, doing an individual eigenvector computation for each user is prohibitively expensive, and a computationally tractable algorithm for Personalized PageRank has remained an open problem since it was suggested in 1998.

    Chapters 5 through 7 discuss techniques for accelerating PageRank in order make the idea of Personalized PageRank computationally tractable. Personalized PageRank is presented at the end of Chapter 7.

    Much of the content in Part I represents joint work with Taher Haveliwala, Glen Jeh, Chris Manning, and Gene Golub.

    1.2 P2P NETWORKS

    Part II addresses the idea of reputation and personalization in the context of file-sharing peer-to-peer networks. Due to the highly distributed nature of P2P networks, the technical challenges here are different from those described for Personalized PageRank. The first challenge is to devise an algorithm that computes and stores reputation in a distributed manner with minimal overhead and that is resistant to malicious users. Chapter 9 describes the EigenTrust algorithm for reputation management in P2P systems. Since queries in a large-scale P2P network have a limited time horizon, personalizing P2P search can be achieved by designing the topology of a P2P network such that each peer is surrounded by peers that are likely to store data of interest to that peer. In Chapter 10, a peer-level protocol is presented for the self-organization of such a P2P topology. These protocols are tested using a P2P simulator called the Query-Cycle simulator, described in Chapter 8.

    Much of the content in Part II represents joint work with Mario Schlosser, Tyson Condie, and Hector Garcia-Molina.

    1.3 CONTRIBUTIONS

    The work presented in this book offers three main contributions to research in information retrieval.

    The first is a mathematical analysis of PageRank, including convergence and stability guarantees. While convergence and stability have been observed empirically for PageRank, domain-independent guarantees are useful when proposingPageRank-like algorithms in other problem domains. Furthermore, convergence and stability analysis generally lays a foundation for future work in numerical algorithms. In this case, the convergence analysis of PageRank suggests that future algorithms should be based on the Power Method.

    The second main contribution of this work is the presentation of a scalable, personalized PageRank algorithm for Web search. In particular, we use properties of the problem and the domain to speed up the PageRank algorithm. The properties of the matrix (sparsity and large eigengap) lead us to use algorithms based on the Power Method throughout the book, and the extrapolation algorithms specifically exploit the matrix properties. The domain properties of the Web as a hierarchical dynamic system lead us to the Adaptive PageRank and BlockRank algorithms. And finally, the linearity of PageRank, another property of the problem, allows us to use all these algorithms in conjunction with Topic-Sensitive PageRank. The scalability issues have long been a bottleneck for the successful deployment of personalized search on the scale of the web, and this book addresses those issues.

    The third main contribution is bringing the ideas of reputation and personalization to search in P2P networks. As the quantity and diversity of data on P2P networks approaches that of the web, the importance of search quality in P2P networks becomes increasingly important. The recent focus of research in P2P search has been efficiency for point queries (exact-match queries). However, while efficiency for point queries is important, point queries represent only a small fraction of possible queries in today’s P2P networks. Three main ideas are presented within this contribution. The first is the understanding that the ideas behind PageRank can also be applied to search in P2P networks. The second is a method of computing the dominant eigenvector in a highly distributed and potentially subversive environment. These are the basis of the EigenTrust algorithm. The third is a recognition that the local neighborhood is more important than a differential quality score for personalization in P2P search, where queries are only broadcast across a limited time horizon. This is the basis of Adaptive P2P Topologies.

    PART I

    World Wide Web

    Chapter Two

    PageRank

    2.1 PAGERANK BASICS

    The PageRank algorithm for determining the reputation of Web pages has become a central technique in Web search [56]. The core of the PageRank algorithm involves computing the principal eigenvector of the Markov matrix representing the hyperlink structure of the Web. As the Web graph is very large, containing several billion nodes, the PageRank vector is generally computed offline, during the preprocessing of the Web crawl, before any queries have been issued. As discussed in Chapter 1, personalization requires significant advances to the standard PageRank algorithm.

    This chapter reviews the standard PageRank algorithm [56] and some of the mathematical tools that will be used in analyzing and improving the standard iterative algorithm for computing PageRank throughout the rest of this book.

    Underlying PageRank is the following basic assumption. A link from a Web page u to another page v can be viewed as evidence that v is an important page.¹ In particular, the amount of importance conferred on v by u is proportional to the importance of u and inversely proportional to the number of pages u points to. Since the importance of u is itself not known, determining the importance for every page i in the Web requires an iterative fixed-point computation.

    To allow for a more rigorous analysis of the necessary computation, we next describe an equivalent formulation in terms of a random walk on the directed Web graph G. (The graph G is the directed graph where each node represents a page on the Web, and an edge between nodes u and v represents a link from page u to page v.) Let u v denote the existence of an edge from u to v in G. Let deg(u) be the outdegree of page u in G. Consider a random surfer visiting page u at time k. In the next time step, the surfer chooses a node vi from among u’s out-neighbors {v|u v} uniformly at random. In other words, at time k + 1, the surfer lands at node vi {v|u v} with probability 1/deg(u).

    The PageRank of a page i is defined as the probability that at some particular time step k > K, the surfer is at page i. In the limit as K → ∞, this probability distribution is called the stationary probability distribution.

    With minor modifications to the random walk, this probability distribution is unique, illustrated as follows. Consider the Markov chain induced by the random walk on G, where the states are given by the nodes in G and the stochastic transition matrix describing the transition from i to j is given by Q with Qij = 1/deg(i).

    For Q to be a valid transition probability matrix, every node must have at least 1 outgoing transition; that is, Q should have no rows consisting of all zeros. This holds if G does not have any pages with outdegree 0, which is not the case for the Web graph. Q can be converted into a valid transition matrix by adding a complete set of outgoing transitions to pages with outdegree 0. In other words, we can define the new matrix P where all states have at least one outgoing transition in the following way. Let n be the n-dimensional column vector representing a uniform probability distribution over all nodes:

    be the n-dimensional column vector identifying the nodes with outdegree 0:

    Then we construct P′ as follows:

    In terms of the random walk, the effect of D

    By the ergodic theorem for Markov chains [31], the Markov chain defined by P has a unique stationary probability distribution if P is aperiodic and irreducible. In general, neither of these properties holds for the Markov chain induced by the Web graph.

    In the context of computing PageRank, the standard way of ensuring these properties is to add a new set of complete outgoing transitions, with small transition probabilities, to all nodes, creating a strongly connected (and thus irreducible) transition graph. Furthermore, adding this set of outgoing transitions will ensure that the transition graph has at least one self-loop, thus guaranteeing aperiodicity. For a more in-depth discussion of these conditions, see [31].

    In matrix notation, we construct the aperiodic, irreducible Markov matrix P′ as follows:

    E T,

    P′ = cP + (1 − c)E.

    In terms of the random walk, the effect of E is as follows. At each time step, with probability (1 − c. Artificial jumps taken because of E are referred to as teleportation.

    Algorithm 1: = A

    given in (2.1) to be nonuniform, so that D and E as the personalization vector.

    For simplicity and consistency with prior work, the remainder of the discussion will be in terms of the transpose matrix, A = (P′)T; that is, the transition probability distribution for a surfer at node i is given by row i of P′ and column i of A.

    Note that the edges artificially introduced by D and E = A can be implemented efficiently using Algorithm 1.

    (0), the probability distribution for the surfer’s location at time k (k) = Ak (0). The unique stationary distribution of the Markov chain is defined as limk→∞ x(k), which is equivalent to limk→∞ Ak(0). This is simply the principal eigenvector of the matrix A = (P′′)T [31], which is exactly the PageRank vector we would like to compute.

    (k) = A (k−1) until convergence. This is known as the Power Method, and is discussed in further detail in Section 2.3. The next section summarizes some of the notation described in the section, and introduces some new notation that will be used later on in this book.

    2.2 NOTATION AND MATHEMATICAL PRELIMINARIES

    We will use the following notation and simple mathematical preliminaries throughout this book.

    P is an n × n row-stochastic matrix. E is

    Enjoying the preview?
    Page 1 of 1