Numerical Algorithms for Personalized Search in Self-organizing Information Networks
By Sep Kamvar
About this ebook
This book lays out the theoretical groundwork for personalized search and reputation management, both on the Web and in peer-to-peer and social networks. Representing much of the foundational research in this field, the book develops scalable algorithms that exploit the graphlike properties underlying personalized search and reputation management, and delves into realistic scenarios regarding Web-scale data.
Sep Kamvar focuses on eigenvector-based techniques in Web search, introducing a personalized variant of Google's PageRank algorithm, and he outlines algorithms--such as the now-famous quadratic extrapolation technique--that speed up computation, making personalized PageRank feasible. Kamvar suggests that Power Method-related techniques ultimately should be the basis for improving the PageRank algorithm, and he presents algorithms that exploit the convergence behavior of individual components of the PageRank vector. Kamvar then extends the ideas of reputation management and personalized search to distributed networks like peer-to-peer and social networks. He highlights locality and computational considerations related to the structure of the network, and considers such unique issues as malicious peers. He describes the EigenTrust algorithm and applies various PageRank concepts to P2P settings. Discussion chapters summarizing results conclude the book's two main sections.
Clear and thorough, this book provides an authoritative look at central innovations in search for all of those interested in the subject.
Sep Kamvar
Sep Kamvar is a consulting assistant professor of computational mathematics at Stanford University. From 2003 to 2007, he was the engineering lead for personalization at Google. He is the founder and former CEO of Kaltix, a personalized search engine acquired by Google in 2003.
Book preview
Chapter One
Introduction
Distributed, self-organizing networks such as the World Wide Web and peer-to-peer networks allow for fast access to vast quantities of diverse information for a large number of users. However, with such large scale and data diversity comes the challenge of finding relevant data from reputable sources in an efficient manner.
This book addresses the issues of relevance and reputation by exploiting user preference information to perform reputation management and personalized search. The issues of personalization and reputation management are highly intertwined, in terms of both the basic ideas and the underlying technologies. Personalization exploits the preferences of an individual to bias search toward that individual’s preferences, while reputation management aggregates the preferences of all individuals to bias search toward the data sources that are deemed reputable by the group.
The ideas of reputation and personalization are powerful in conjunction. For example, a personalized Web search for the term “giants” would return the official site of the New York Giants to a football fan from New York, while the same query would return the official site of the San Francisco Giants to a baseball fan from San Francisco. Personalization takes advantage of the local context to return the right sports team, and reputation takes advantage of the global context to return the official site of the corresponding team, rather than some random fan page.
In large-scale diverse data networks, a query will often have so many results that the challenge lies in finding those that are most relevant and reputable. When traditional IR keyword matching techniques are combined with the dual techniques of personalization and reputation management, the end user is likely to have to spend less time intelligently formulating a query and filtering through irrelevant data.
This book is written in two parts, the first part focusing on the Web, and the second on peer-to-peer networks. These parts, while they share the themes of reputation and personalization, differ in style as well as application area. Part I has more mathematical proofs, while Part II relies more on simulation and experimentation. You may read them independently, but together they will give a broader view of the world. Both Part I and Part II, however, are meant for an audience comfortable with advanced concepts in math and computer science.
1.1 WORLD WIDE WEB
Google’s PageRank algorithm [56] revolutionized Web search by providing a reliable, spam-resistant way to find reputable web pages. The algorithm is based on the idea that a link from page i to page j confers authority on page j. Therefore, pages with many links from reputable pages are themselves reputable. Part I of this book addresses the issue of personalizing the PageRank algorithm for individual users.
The PageRank algorithm involves the computation of the dominant eigenvector of a Markov matrix describing the behavior of a model Web surfer jumping from page to page on the Web hyperlink graph. Chapter 2 reviews the PageRank algorithm and the random surfer model. Chapters 3 and 4 introduce some mathematical properties of PageRank that guide how we proceed in algorithm design.
It has been suggested that, by biasing the behavior of the model surfer to reflect the biases of a given user, PageRank can be personalized for each individual user [56]. However, due to the sheer size of the web matrix, doing an individual eigenvector computation for each user is prohibitively expensive, and a computationally tractable algorithm for Personalized PageRank has remained an open problem since it was suggested in 1998.
Chapters 5 through 7 discuss techniques for accelerating PageRank in order to make the idea of Personalized PageRank computationally tractable. Personalized PageRank is presented at the end of Chapter 7.
Much of the content in Part I represents joint work with Taher Haveliwala, Glen Jeh, Chris Manning, and Gene Golub.
1.2 P2P NETWORKS
Part II addresses the idea of reputation and personalization in the context of file-sharing peer-to-peer networks. Due to the highly distributed nature of P2P networks, the technical challenges here are different from those described for Personalized PageRank. The first challenge is to devise an algorithm that computes and stores reputation in a distributed manner with minimal overhead and that is resistant to malicious users. Chapter 9 describes the EigenTrust algorithm for reputation management in P2P systems. Since queries in a large-scale P2P network have a limited time horizon, personalizing P2P search can be achieved by designing the topology of a P2P network such that each peer is surrounded by peers that are likely to store data of interest to that peer. In Chapter 10, a peer-level protocol is presented for the self-organization of such a P2P topology. These protocols are tested using a P2P simulator called the Query-Cycle simulator, described in Chapter 8.
Much of the content in Part II represents joint work with Mario Schlosser, Tyson Condie, and Hector Garcia-Molina.
1.3 CONTRIBUTIONS
The work presented in this book offers three main contributions to research in information retrieval.
The first is a mathematical analysis of PageRank, including convergence and stability guarantees. While convergence and stability have been observed empirically for PageRank, domain-independent guarantees are useful when proposing PageRank-like algorithms in other problem domains. Furthermore, convergence and stability analysis generally lays a foundation for future work in numerical algorithms. In this case, the convergence analysis of PageRank suggests that future algorithms should be based on the Power Method.
The second main contribution of this work is the presentation of a scalable, personalized PageRank algorithm for Web search. In particular, we use properties of the problem and the domain to speed up the PageRank algorithm. The properties of the matrix (sparsity and large eigengap) lead us to use algorithms based on the Power Method throughout the book, and the extrapolation algorithms specifically exploit the matrix properties. The domain properties of the Web as a hierarchical dynamic system lead us to the Adaptive PageRank and BlockRank algorithms. And finally, the linearity of PageRank, another property of the problem, allows us to use all these algorithms in conjunction with Topic-Sensitive PageRank. The scalability issues have long been a bottleneck for the successful deployment of personalized search on the scale of the web, and this book addresses those issues.
The third main contribution is bringing the ideas of reputation and personalization to search in P2P networks. As the quantity and diversity of data on P2P networks approach those of the web, search quality in P2P networks becomes increasingly important. The recent focus of research in P2P search has been efficiency for point queries (exact-match queries). However, while efficiency for point queries is important, point queries represent only a small fraction of possible queries in today’s P2P networks. Three main ideas are presented within this contribution. The first is the understanding that the ideas behind PageRank can also be applied to search in P2P networks. The second is a method of computing the dominant eigenvector in a highly distributed and potentially subversive environment. These are the basis of the EigenTrust algorithm. The third is a recognition that the local neighborhood is more important than a differential quality score for personalization in P2P search, where queries are only broadcast across a limited time horizon. This is the basis of Adaptive P2P Topologies.
PART I
World Wide Web
Chapter Two
PageRank
2.1 PAGERANK BASICS
The PageRank algorithm for determining the reputation of Web pages has become a central technique in Web search [56]. The core of the PageRank algorithm involves computing the principal eigenvector of the Markov matrix representing the hyperlink structure of the Web. As the Web graph is very large, containing several billion nodes, the PageRank vector is generally computed offline, during the preprocessing of the Web crawl, before any queries have been issued. As discussed in Chapter 1, personalization requires significant advances to the standard PageRank algorithm.
This chapter reviews the standard PageRank algorithm [56] and some of the mathematical tools that will be used in analyzing and improving the standard iterative algorithm for computing PageRank throughout the rest of this book.
Underlying PageRank is the following basic assumption. A link from a Web page u to another page v can be viewed as evidence that v is an important page.¹ In particular, the amount of importance conferred on v by u is proportional to the importance of u and inversely proportional to the number of pages u points to. Since the importance of u is itself not known, determining the importance for every page i in the Web requires an iterative fixed-point computation.
To allow for a more rigorous analysis of the necessary computation, we next describe an equivalent formulation in terms of a random walk on the directed Web graph G. (The graph G is the directed graph where each node represents a page on the Web, and an edge between nodes u and v represents a link from page u to page v.) Let u → v denote the existence of an edge from u to v in G. Let deg(u) be the outdegree of page u in G. Consider a random surfer visiting page u at time k. In the next time step, the surfer chooses a node v_i from among u’s out-neighbors {v | u → v} uniformly at random. In other words, at time k + 1, the surfer lands at node v_i ∈ {v | u → v} with probability 1/deg(u).
The PageRank of a page i is defined as the probability that at some particular time step k > K, the surfer is at page i. In the limit as K → ∞, this probability distribution is called the stationary probability distribution.
With minor modifications to the random walk, this probability distribution is unique, as the following construction illustrates. Consider the Markov chain induced by the random walk on G, where the states are given by the nodes in G and the stochastic transition matrix describing the transition from i to j is given by Q, with Q_ij = 1/deg(i) if i → j and Q_ij = 0 otherwise.
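As a minimal sketch of this construction, the following builds Q for a small made-up graph. The four-page `links` dictionary and the helper name `build_q` are assumptions for illustration only, and each page is assumed to list a given target at most once.

```python
# Hypothetical 4-page Web graph: page -> list of pages it links to.
# Page 3 has no outlinks.
links = {0: [1, 2], 1: [2], 2: [0], 3: []}


def build_q(links):
    """Return Q as a list of rows, with Q[i][j] = 1/deg(i) if i -> j, else 0."""
    n = len(links)
    q = [[0.0] * n for _ in range(n)]
    for i, outs in links.items():
        for j in outs:
            q[i][j] = 1.0 / len(outs)
    return q


Q = build_q(links)
row_sums = [sum(row) for row in Q]
print(row_sums)
```

Rows for pages with at least one outlink sum to 1, while the row for page 3 sums to 0, so Q on its own is not yet a valid transition matrix for this graph.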
For Q to be a valid transition probability matrix, every node must have at least one outgoing transition; that is, Q should have no rows consisting of all zeros. This holds only if G has no pages with outdegree 0, which is not the case for the Web graph. Q can be converted into a valid transition matrix by adding a complete set of outgoing transitions to pages with outdegree 0. In other words, we can define the new matrix P, in which all states have at least one outgoing transition, in the following way. Let n be the number of nodes (pages) in the Web graph. Let v be the n-dimensional column vector representing a uniform probability distribution over all nodes:
v = [1/n]_{n×1}.    (2.1)
Let d be the n-dimensional column vector identifying the nodes with outdegree 0:
d_i = 1 if deg(i) = 0, and d_i = 0 otherwise.
Then we construct P as follows:
D = d · v^T,
P = Q + D.
In terms of the random walk, the effect of D is to modify the transition probabilities so that a surfer visiting a dangling page (that is, a page with no outlinks) jumps in the next time step to a page chosen according to the distribution given by v.
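The outdegree-0 correction can be sketched in a few lines. Here `v` holds the uniform distribution over all nodes and `d` marks the pages with outdegree 0, as described above; the example matrix and the function name `fix_dangling` are assumptions made for illustration.

```python
def fix_dangling(q):
    """Return P = Q + d v^T, replacing each all-zero row of Q with v."""
    n = len(q)
    v = [1.0 / n] * n                                  # uniform distribution
    d = [1.0 if sum(row) == 0 else 0.0 for row in q]   # 1 marks outdegree 0
    return [[q[i][j] + d[i] * v[j] for j in range(n)] for i in range(n)]


# Hypothetical Q for a 4-page graph in which page 3 has no outlinks.
Q = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
P = fix_dangling(Q)
print(P[3])
```

Only the all-zero row changes; every other row of Q is carried over unchanged, and every row of the result sums to 1.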
By the ergodic theorem for Markov chains [31], the Markov chain defined by P has a unique stationary probability distribution if P is aperiodic and irreducible. In general, neither of these properties holds for the Markov chain induced by the Web graph.
In the context of computing PageRank, the standard way of ensuring these properties is to add a new set of complete outgoing transitions, with small transition probabilities, to all nodes, creating a strongly connected (and thus irreducible) transition graph. Furthermore, adding this set of outgoing transitions will ensure that the transition graph has at least one self-loop, thus guaranteeing aperiodicity. For a more in-depth discussion of these conditions, see [31].
In matrix notation, we construct the aperiodic, irreducible Markov matrix P′ as follows:
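E = [1]_{n×1} · v^T,
P′ = cP + (1 − c)E,
where [1]_{n×1} is the n-dimensional column vector of all ones and v is the probability vector of (2.1).
In terms of the random walk, the effect of E is as follows. At each time step, with probability (1 − c), a surfer visiting any node jumps to a random Web page rather than following an outlink. The destination of the random jump is chosen according to the probability distribution given by v. Artificial jumps taken because of E are referred to as teleportation.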
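The teleportation step can be sketched as follows, assuming a small row-stochastic matrix P and a uniform jump distribution. The value c = 0.85 is a commonly used setting and is an assumption of this sketch, as are the example matrix and the function name `add_teleportation`.

```python
def add_teleportation(p, c=0.85):
    """Return P' = c P + (1 - c) E, where every row of E equals v."""
    n = len(p)
    v = [1.0 / n] * n  # uniform jump distribution (assumed here)
    return [[c * p[i][j] + (1.0 - c) * v[j] for j in range(n)]
            for i in range(n)]


# Hypothetical row-stochastic P for a 4-page graph.
P = [
    [0.00, 0.50, 0.50, 0.00],
    [0.00, 0.00, 1.00, 0.00],
    [1.00, 0.00, 0.00, 0.00],
    [0.25, 0.25, 0.25, 0.25],
]
P_prime = add_teleportation(P)
smallest = min(min(row) for row in P_prime)
print(smallest)
```

Every entry of the result is strictly positive, so every state can reach every other state in one step and every state has a self-loop, which is what the irreducibility and aperiodicity conditions require.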
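Algorithm 1: Computing y = Ax
By redefining the vector v given in (2.1) to be nonuniform, so that D and E add artificial transitions with nonuniform probabilities, the resultant PageRank vector can be biased to prefer certain kinds of pages. For this reason, we refer to v as the personalization vector.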
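To illustrate the effect of a nonuniform jump distribution, the sketch below computes PageRank twice on a made-up four-page graph, once with a uniform vector and once with a vector that favors page 1. It uses plain repeated multiplication (the Power Method discussed later in this chapter) for a fixed number of steps. The graph, the biased vector, c = 0.85, and the function name `pagerank` are all assumptions for illustration.

```python
def pagerank(links, v, c=0.85, iters=200):
    """Iterate x <- A x, with jumps distributed according to v."""
    n = len(links)
    x = [1.0 / n] * n
    for _ in range(iters):
        y = [0.0] * n
        for i, outs in links.items():
            for j in outs:
                y[j] += c * x[i] / len(outs)   # follow a real link
        w = sum(x) - sum(y)                    # mass that jumps instead
        x = [y[j] + w * v[j] for j in range(n)]
    return x


links = {0: [1, 2], 1: [2], 2: [0], 3: []}     # hypothetical graph
uniform = [0.25, 0.25, 0.25, 0.25]
biased = [0.1, 0.7, 0.1, 0.1]                  # hypothetical user favoring page 1

pr_uniform = pagerank(links, uniform)
pr_biased = pagerank(links, biased)
print(pr_uniform[1], pr_biased[1])
```

In this example page 1 receives a larger share of the probability mass under the biased vector than under the uniform one, while both results remain probability distributions.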
For simplicity and consistency with prior work, the remainder of the discussion will be in terms of the transpose matrix, A = (P′)^T; that is, the transition probability distribution for a surfer at node i is given by row i of P′ and column i of A.
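Note that the edges artificially introduced by D and E never need to be explicitly materialized, so this construction has no impact on the efficiency or the sparsity of the matrices used in the computations. In particular, the matrix-vector multiplication y = Ax can be implemented efficiently using Algorithm 1.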
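The listing of Algorithm 1 is not reproduced in this copy, so the following is only a sketch of one well-known way to compute y = Ax without materializing D and E: multiply by the sparse link structure alone, measure how much probability mass was lost, and redistribute that mass according to the jump distribution. It assumes x has nonnegative entries; the graph, c = 0.85, and both function names are assumptions for illustration. A dense reference computation is included only to check the result.

```python
def multiply_sparse(links, x, v, c=0.85):
    """Compute y = A x using only the real links of the graph."""
    n = len(x)
    y = [0.0] * n
    for i, outs in links.items():
        for j in outs:
            y[j] += c * x[i] / len(outs)
    # Mass not carried along real links: jumps taken from pages with
    # outdegree 0 plus the (1 - c) teleportation share.
    w = sum(abs(t) for t in x) - sum(abs(t) for t in y)
    return [y[j] + w * v[j] for j in range(n)]


def multiply_dense(links, x, v, c=0.85):
    """Reference: build every entry of A = (P')^T explicitly."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        outs = links[i]
        for j in range(n):
            if outs:
                p_ij = 1.0 / len(outs) if j in outs else 0.0
            else:
                p_ij = v[j]                    # row replaced for outdegree 0
            y[j] += (c * p_ij + (1.0 - c) * v[j]) * x[i]
    return y


links = {0: [1, 2], 1: [2], 2: [0], 3: []}     # hypothetical graph
v = [0.25, 0.25, 0.25, 0.25]
x = [0.4, 0.3, 0.2, 0.1]
y_sparse = multiply_sparse(links, x, v)
y_dense = multiply_dense(links, x, v)
max_diff = max(abs(a - b) for a, b in zip(y_sparse, y_dense))
print(max_diff)
```

The sparse version touches each link once, so its cost grows with the number of links rather than with n².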
Assuming that the probability distribution for the surfer’s location at time 0 is given by x(0), the probability distribution for the surfer’s location at time k is given by x(k) = A^k x(0). The unique stationary distribution of the Markov chain is defined as lim_{k→∞} x(k), which is equivalent to lim_{k→∞} A^k x(0). This is simply the principal eigenvector of the matrix A = (P′)^T [31], which is exactly the PageRank vector we would like to compute.
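One straightforward way to compute this stationary probability distribution is to start with any initial probability distribution x(0) and to compute successive iterates x(k) = A x(k−1) until convergence. This is known as the Power Method, and is discussed in further detail in Section 2.3. The next section summarizes some of the notation described in this section, and introduces some new notation that will be used later on in this book.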
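A minimal sketch of this iteration is shown below. It builds a small dense A from a made-up graph and repeats x(k) = A x(k−1) until the L1 difference between successive iterates falls below a tolerance. The graph, c = 0.85, the tolerance, and the helper names are assumptions for illustration, and the stopping rule is one common choice rather than the book's prescribed one.

```python
def power_method(a, x0, tol=1e-10, max_iters=1000):
    """Iterate x <- A x until successive iterates differ by less than tol."""
    n = len(a)
    x = list(x0)
    for k in range(1, max_iters + 1):
        x_next = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        residual = sum(abs(p - q) for p, q in zip(x_next, x))
        x = x_next
        if residual < tol:
            return x, k
    return x, max_iters


links = {0: [1, 2], 1: [2], 2: [0], 3: []}     # hypothetical graph
n, c = len(links), 0.85
v = [1.0 / n] * n


def p_prime_row(i):
    """Row i of P': real links (or v for outdegree 0) mixed with jumps."""
    outs = links[i]
    base = [1.0 / len(outs) if j in outs else 0.0 for j in range(n)] if outs else v
    return [c * base[j] + (1.0 - c) * v[j] for j in range(n)]


P_prime = [p_prime_row(i) for i in range(n)]
A = [[P_prime[i][j] for i in range(n)] for j in range(n)]   # A = (P')^T

x, steps = power_method(A, v)
print(steps, x)
```

Starting from a different initial probability distribution should change only the number of steps, not the vector the iteration settles on.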
2.2 NOTATION AND MATHEMATICAL PRELIMINARIES
We will use the following notation and simple mathematical preliminaries throughout this book.
P is an n × n row-stochastic matrix. E is