http://cs246.stanford.edu
Course topic map:
- Graph data: PageRank, SimRank, Community Detection, Spam Detection
- Infinite data: Filtering data streams, Queries on streams, Web advertising
- Machine learning: SVM, Decision Trees, Perceptron, kNN, Clustering
- Apps: Association Rules
(Figure: the Internet as a graph: routers and domains (domain1, domain2, domain3) connected through the network.)
2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 6
But: Web is huge, full of untrusted documents, random things, web spam, etc.
Two challenges of web search: (1) the Web contains many sources of information. Whom should we trust?
Trick: Trustworthy pages may point to each other!
There is large diversity in web-graph node connectivity. Let's rank the pages by their link structure!
We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
- PageRank
- Hubs and Authorities (HITS)
- Topic-Specific (Personalized) PageRank
- Web Spam Detection Algorithms
A vote from an important page is worth more: a page is important if it is pointed to by other important pages. Each link's vote is proportional to the importance of its source page: if page i with importance r_i has d_i out-links, each link gets r_i / d_i votes. Define the rank r_j for page j:

  r_j = sum_{i -> j} r_i / d_i      (d_i ... out-degree of node i)

For example, if page j is pointed to by page i with 3 out-links and page k with 4 out-links, then r_j = r_i/3 + r_k/4, and j in turn passes r_j/3 votes along each of its own 3 out-links.
Example graph on three nodes y, a, m (y links to itself and to a; a links to y and m; m links to a). Flow equations:

  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
The three flow equations have no unique solution on their own (any scalar multiple of a solution also works), so we add the constraint ry + ra + rm = 1. Gaussian elimination works for small examples like this, but we need a better method for large, web-scale graphs. We need a new formulation!
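Since Gaussian elimination does work at this scale, here is a minimal sketch (my own Python, not from the slides) that solves the y/a/m flow equations directly: two of the three flow equations plus the normalization ry + ra + rm = 1 form a non-singular 3x3 system.

```python
# Illustrative sketch (not from the slides): solve the y/a/m flow
# equations directly by Gaussian elimination.

def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# ry - ry/2 - ra/2 = 0;  ra - ry/2 - rm = 0;  ry + ra + rm = 1
A = [[0.5, -0.5, 0.0],
     [-0.5, 1.0, -1.0],
     [1.0, 1.0, 1.0]]
r = gauss_solve(A, [0.0, 0.0, 1.0])   # -> [0.4, 0.4, 0.2], i.e. (2/5, 2/5, 1/5)
```

This gives (ry, ra, rm) = (2/5, 2/5, 1/5), but the cubic cost of elimination is exactly why a different method is needed for web-scale graphs.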
Matrix formulation: define the stochastic adjacency matrix M, with M_ji = 1/d_i if i -> j and M_ji = 0 otherwise (d_i is the out-degree of node i, so each column of M sums to 1). The flow equations r_j = sum_{i -> j} r_i / d_i then become simply

  r = M · r

so the rank vector r is an eigenvector of the stochastic matrix M, with eigenvalue 1. We can now efficiently solve for r! The method is called power iteration.

Example (y, a, m as before):

          y    a    m
    y [ 1/2  1/2   0 ]        ry = ry/2 + ra/2
    a [ 1/2   0    1 ]        ra = ry/2 + rm
    m [  0   1/2   0 ]        rm = ra/2
Given a web graph with N nodes (the pages) and edges (the hyperlinks), power iteration is a simple iterative scheme:

  Initialize:  r(0) = [1/N, ..., 1/N]^T
  Iterate:     r(t+1) = M · r(t),   i.e.   r_j(t+1) = sum_{i -> j} r_i(t) / d_i
  Stop when:   |r(t+1) - r(t)|_1 < epsilon

Here |x|_1 = sum_{1<=i<=N} |x_i| is the L1 norm, and d_i is the out-degree of node i.
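The scheme above fits in a few lines of Python (an illustrative sketch, not the slides' code); M is column-stochastic, given as a list of rows.

```python
# Illustrative power-iteration sketch (my code, not the slides').

def power_iterate(M, eps=1e-10):
    """Iterate r(t+1) = M r(t) from the uniform vector until the L1
    change between successive iterates drops below eps."""
    n = len(M)
    r = [1.0 / n] * n                       # r(0) = [1/N, ..., 1/N]
    while True:
        r_new = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(r_new[j] - r[j]) for j in range(n)) < eps:
            return r_new
        r = r_new

# y/a/m example (columns y, a, m):
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iterate(M)    # converges to (2/5, 2/5, 1/5)
```

The result matches the solution of the flow equations, as it must: the fixed point of r = M r is the same vector.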
Power iteration on the y/a/m example:

          y    a    m
    y [ 1/2  1/2   0 ]        ry = ry/2 + ra/2
    a [ 1/2   0    1 ]        ra = ry/2 + rm
    m [  0   1/2   0 ]        rm = ra/2

  Iteration:   0      1      2      3     ...
  ry:         1/3    1/3    5/12   9/24   ...  ->  6/15
  ra:         1/3    3/6    1/3   11/24   ...  ->  6/15
  rm:         1/3    1/6    3/12   4/24   ...  ->  3/15
Power iteration is a method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue):

  r(1) = M · r(0),   r(2) = M · r(1) = M^2 · r(0),   ...,   r(k) = M^k · r(0)

Claim: the sequence M·r(0), M^2·r(0), ..., M^k·r(0), ... approaches the dominant eigenvector of M.

Proof sketch: assume M has n linearly independent eigenvectors x_1, x_2, ..., x_n with corresponding eigenvalues lambda_1 > lambda_2 > ... > lambda_n. The vectors x_1, ..., x_n form a basis, so we can write

  r(0) = c_1 x_1 + c_2 x_2 + ... + c_n x_n

Then

  M r(0) = c_1 (M x_1) + c_2 (M x_2) + ... + c_n (M x_n)
         = c_1 (lambda_1 x_1) + c_2 (lambda_2 x_2) + ... + c_n (lambda_n x_n)

Repeated multiplication on both sides produces

  M^k r(0) = c_1 (lambda_1^k x_1) + c_2 (lambda_2^k x_2) + ... + c_n (lambda_n^k x_n)
           = lambda_1^k [ c_1 x_1 + c_2 (lambda_2/lambda_1)^k x_2 + ... + c_n (lambda_n/lambda_1)^k x_n ]

Since lambda_1 > lambda_2 > ... > lambda_n, each ratio lambda_i/lambda_1 < 1, so (lambda_i/lambda_1)^k -> 0 as k -> infinity (for all i >= 2). Thus, provided c_1 != 0, M^k r(0) approaches c_1 (lambda_1^k x_1): the direction of the dominant eigenvector x_1.
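A numeric illustration of this convergence rate (my own sketch, not from the slides): the error shrinks per step by roughly the second-largest eigenvalue magnitude. For the y/a/m matrix, lambda_1 = 1 and the second-largest eigenvalue magnitude works out to (1 + 5**0.5)/4 ≈ 0.809.

```python
# Sketch (not from the slides): measure the geometric error decay of
# power iteration on the y/a/m matrix, whose second-largest eigenvalue
# magnitude is (1 + sqrt(5)) / 4, about 0.809.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
limit = [0.4, 0.4, 0.2]            # the fixed point r = M r
r = [1/3, 1/3, 1/3]
errors = []
for _ in range(40):
    r = [sum(M[j][i] * r[i] for i in range(3)) for j in range(3)]
    errors.append(sum(abs(r[j] - limit[j]) for j in range(3)))
ratio = errors[-1] / errors[-2]    # approaches 0.809 as faster modes die out
```

After a few dozen steps the ratio of successive errors settles near 0.809, matching the (lambda_2/lambda_1)^k analysis above.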
(Figure: page j with in-links from pages i1, i2, i3.)

  r_j = sum_{i -> j} r_i / d_out(i)

Imagine a random web surfer: at any time t the surfer is on some page i, and at time t+1 follows one of i's d_out(i) out-links uniformly at random, ending up on some page j linked from i. Let:

  p(t) ... vector whose i-th coordinate is the probability that the surfer is at page i at time t

So p(t) is a probability distribution over pages.
Where is the surfer at time t+1? He follows a link uniformly at random, so p(t+1) = M · p(t). Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk. Our original rank vector r satisfies r = M · r, so r is a stationary distribution for the random walk of the surfer. Equivalently, the iteration

  r_j(t+1) = sum_{i -> j} r_i(t) / d_i,   or   r(t+1) = M · r(t)

Does this converge? Does it converge to what we want? Are the results reasonable?
Does this converge? Example: two pages a and b that link only to each other (a -> b, b -> a). With

  r_j(t+1) = sum_{i -> j} r_i(t) / d_i

and start (ra, rb) = (1, 0):

  Iteration:  0    1    2    3   ...
  ra:         1    0    1    0   ...
  rb:         0    1    0    1   ...

The iteration oscillates forever and never converges.
Does it converge to what we want? Example: a -> b, but b has no out-links (a dead end). With the same update and start (ra, rb) = (1, 0):

  Iteration:  0    1    2    3   ...
  ra:         1    0    0    0   ...
  rb:         0    1    0    0   ...

The importance leaks out of the graph and the ranks collapse to zero.
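Both failure modes are easy to reproduce on two-node graphs (an illustrative sketch, not the slides' code):

```python
# Two-node illustrations of both failure modes (my sketch).
def step(M, r):
    """One power-iteration step r_new = M r."""
    n = len(M)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

# (1) a <-> b: M is a permutation matrix, the chain is periodic, and the
# iterates oscillate forever instead of converging.
M_cycle = [[0.0, 1.0],
           [1.0, 0.0]]
r1 = step(M_cycle, [1.0, 0.0])   # [0, 1]
r2 = step(M_cycle, r1)           # [1, 0] -- back where we started

# (2) a -> b where b is a dead end: column b is all zeros, so M is not
# stochastic and the total rank leaks away to zero.
M_dead = [[0.0, 0.0],
          [1.0, 0.0]]
r = [1.0, 0.0]
for _ in range(3):
    r = step(M_dead, r)          # [0, 1] -> [0, 0] -> [0, 0]
```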
Problem: spider traps. Suppose m now links only to itself (y -> y, a; a -> y, m; m -> m). Power iteration:

          y    a    m
    y [ 1/2  1/2   0 ]        ry = ry/2 + ra/2
    a [ 1/2   0    0 ]        ra = ry/2
    m [  0   1/2   1 ]        rm = ra/2 + rm

  Iteration:   0      1      2       3     ...
  ry:         1/3    2/6    3/12    5/24   ...  ->  0
  ra:         1/3    1/6    2/12    3/24   ...  ->  0
  rm:         1/3    3/6    7/12   16/24   ...  ->  1

The spider trap m absorbs all the importance.
The Google solution for spider traps: at each time step, the random surfer has two options:
- With probability beta, follow an out-link chosen uniformly at random
- With probability 1 - beta, jump to some random page
Common values for beta are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.
Problem: dead ends. Suppose m now has no out-links at all. Its column of M is all zeros, so M is no longer column-stochastic. Power iteration:

          y    a    m
    y [ 1/2  1/2   0 ]        ry = ry/2 + ra/2
    a [ 1/2   0    0 ]        ra = ry/2
    m [  0   1/2   0 ]        rm = ra/2

  Iteration:   0      1      2      3     ...
  ry:         1/3    2/6    3/12   5/24   ...  ->  0
  ra:         1/3    1/6    2/12   3/24   ...  ->  0
  rm:         1/3    1/6    1/12   2/24   ...  ->  0

All the importance leaks out.
Teleports: follow random teleport links with probability 1.0 from dead-ends, and adjust the matrix accordingly: the all-zero column of the dead end m becomes uniform.

          y    a    m                      y    a    m
    y [ 1/2  1/2   0 ]               y [ 1/2  1/2  1/3 ]
    a [ 1/2   0    0 ]      ->       a [ 1/2   0   1/3 ]
    m [  0   1/2   0 ]               m [  0   1/2  1/3 ]
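This dead-end adjustment can be sketched as a preprocessing step (illustrative Python, assuming a dense list-of-rows matrix):

```python
# Preprocessing sketch (my code): replace every all-zero column (a dead
# end) with 1/N in each entry, i.e. teleport with probability 1.0.
def fix_dead_ends(M):
    n = len(M)
    A = [row[:] for row in M]
    for i in range(n):                                # i indexes columns
        if all(A[j][i] == 0.0 for j in range(n)):     # dead-end column
            for j in range(n):
                A[j][i] = 1.0 / n
    return A

# y/a/m example where m is a dead end:
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
A = fix_dead_ends(M)    # column m becomes [1/3, 1/3, 1/3]
```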
The iteration r(t+1) = M · r(t) is exactly the evolution of a Markov chain:
- A set of states X
- A transition matrix P, where P_ij = P(X_t = i | X_{t-1} = j)
- A stationary distribution pi specifying the stationary probability of being at each state x in X
- Goal: find pi such that pi = P · pi

Theory of Markov chains. Fact: for any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector, as long as P is stochastic, irreducible, and aperiodic.
Stochastic: every column of the matrix must sum to 1. Dead ends break this, so replace M by

  A = M + (1/n) e a^T

where e is the all-ones vector and a is the dead-end indicator vector (a_i = 1 if node i has no out-links, else 0). For the y/a/m example with dead end m:

          y    a    m
    y [ 1/2  1/2  1/3 ]        ry = ry/2 + ra/2 + rm/3
    a [ 1/2   0   1/3 ]        ra = ry/2 + rm/3
    m [  0   1/2  1/3 ]        rm = ra/2 + rm/3
Aperiodic: a chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k. A possible solution: add (green) teleport links, which break the periodicity. (Figure: the y/a/m graph with added green links.)
Irreducible: from any state there must be a non-zero probability of reaching any other state. A possible solution: add (green) teleport links. (Figure: the y/a/m graph with added green links.)
Google's solution: at each step the surfer follows a random out-link with probability beta, or teleports to a random page with probability 1 - beta. The PageRank equation:

  r_j = sum_{i -> j} beta * r_i / d_i + (1 - beta) * 1/N

where d_i is the out-degree of node i. This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends, or explicitly follow random teleport links with probability 1.0 from dead-ends.
PageRank equation [Brin-Page, '98]:

  r_j = sum_{i -> j} beta * r_i / d_i + (1 - beta) * 1/N

The Google Matrix A:

  A = beta * M + (1 - beta) * [1/N]_{NxN}

where [1/N]_{NxN} is the N x N matrix with every entry 1/N (equivalently (1/N) e e^T, with e the vector of all 1s). A is stochastic, aperiodic, and irreducible, so power iteration r(t+1) = A · r(t) converges. What is beta? In practice beta = 0.8 or 0.85.

Example with beta = 0.8 on the spider-trap graph (m -> m):

  A = 0.8 * [ 1/2  1/2   0 ]   +   0.2 * [ 1/3  1/3  1/3 ]
            [ 1/2   0    0 ]             [ 1/3  1/3  1/3 ]
            [  0   1/2   1 ]             [ 1/3  1/3  1/3 ]
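A sketch of this example in Python (the graph and beta follow the slides; the code itself is mine): building A and power-iterating converges to the stationary ranks (7/33, 5/33, 21/33), which one can check satisfy r = A r.

```python
# Sketch: A = 0.8*M + 0.2*(1/3) for the spider-trap graph, then iterate.
beta, N = 0.8, 3
M = [[0.5, 0.5, 0.0],     # columns y, a, m;  m links only to itself
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
A = [[beta * M[j][i] + (1 - beta) / N for i in range(N)] for j in range(N)]
r = [1.0 / N] * N
for _ in range(200):
    r = [sum(A[j][i] * r[i] for i in range(N)) for j in range(N)]
# r -> (7/33, 5/33, 21/33): the spider trap m no longer absorbs everything.
```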
Computing r(t+1) = A · r(t) is easy if we have enough main memory to hold A, r_old, and r_new. But say N = 1 billion pages, and we need (say) 4 bytes for each entry: the two rank vectors alone have 2 billion entries, approximately 8 GB, while the dense matrix A has N^2 = 10^18 entries. 10^18 is a large number!
But A = beta * M + (1 - beta) * [1/N]_{NxN} is dense, while M is sparse. Suppose there are N pages, and consider page j with d_j out-links. We have M_ij = 1/|d_j| when j -> i, and M_ij = 0 otherwise. The random teleport is equivalent to:
- adding a teleport link from j to every other page, with transition probability (1 - beta)/N,
- reducing the probability of following each out-link from 1/|d_j| to beta/|d_j|.
Equivalent view: tax each page a fraction (1 - beta) of its score and redistribute it evenly.
So we can rearrange the PageRank equation r = A · r, where A_ji = beta * M_ji + (1 - beta)/N:

  r_j = sum_{i=1..N} A_ji * r_i
      = sum_{i=1..N} [ beta * M_ji + (1 - beta)/N ] * r_i
      = sum_{i=1..N} beta * M_ji * r_i + (1 - beta)/N * sum_{i=1..N} r_i
      = sum_{i=1..N} beta * M_ji * r_i + (1 - beta)/N         (since sum_i r_i = 1)

So we get:  r = beta * M · r + [(1 - beta)/N]_N, where [(1 - beta)/N]_N is a vector with all N entries equal to (1 - beta)/N.
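A quick numeric check of this identity (my own sketch): for any probability vector r, multiplying by the dense A gives the same result as the sparse form beta*M*r plus the constant (1-beta)/N.

```python
# Sketch: verify A r = beta*M r + ((1-beta)/N) * 1 when sum(r) = 1.
beta, N = 0.8, 3
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = [0.5, 0.3, 0.2]                       # any vector with sum(r) = 1
A = [[beta * M[j][i] + (1 - beta) / N for i in range(N)] for j in range(N)]
Ar  = [sum(A[j][i] * r[i] for i in range(N)) for j in range(N)]
rhs = [beta * sum(M[j][i] * r[i] for i in range(N)) + (1 - beta) / N
       for j in range(N)]                 # beta*M*r + (1-beta)/N
```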
The complete algorithm:

  Input:  graph G and parameter beta
  Output: PageRank vector r_new
  Initialize: r_old_j = 1/N for all j
  Repeat until convergence, i.e. until sum_j |r_new_j - r_old_j| < epsilon:
    For all j:  r'_new_j = sum_{i -> j} beta * r_old_i / d_i
                (r'_new_j = 0 if the in-degree of j is 0)
    Now re-insert the leaked PageRank:
    For all j:  r_new_j = r'_new_j + (1 - S)/N,   where S = sum_j r'_new_j
    r_old = r_new

Note that M here may contain dead ends; the leaked rank 1 - S accounts both for dead ends and for the (1 - beta) teleport tax, and re-inserting it uniformly implements the teleports.
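A compact Python sketch of this algorithm (illustrative, not the slides' code; the graph is assumed to be a dict mapping each source node to its list of destinations, with dead ends mapped to empty lists):

```python
# Sketch of the complete algorithm: never materializes M, and re-inserts
# the leaked PageRank (dead ends plus the 1-beta tax) uniformly.
def pagerank(links, beta=0.8, eps=1e-10):
    n = len(links)
    r = [1.0 / n] * n
    while True:
        r_new = [0.0] * n
        for i, dests in links.items():        # only nonzero entries of M
            for j in dests:
                r_new[j] += beta * r[i] / len(dests)
        leaked = 1.0 - sum(r_new)             # 1 - S
        r_new = [x + leaked / n for x in r_new]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
            return r_new
        r = r_new

# Spider-trap example, nodes 0=y, 1=a, 2=m: ranks approach (7/33, 5/33, 21/33).
ranks = pagerank({0: [0, 1], 1: [0, 2], 2: [2]})
```

Because a dead end's empty destination list contributes nothing to r_new, its rank automatically shows up in the leaked amount and is redistributed, so no preprocessing of dead ends is needed.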
Sparse matrix encoding: encode M using only its nonzero entries. For each source node, store its out-degree and the list of destination nodes; space is proportional to the number of links. (Figure: one update step with r_new held in memory while r_old and the per-source [degree, destinations] table are streamed from disk; each source i adds beta * r_old_i / d_i to r_new_j for every destination j.)
Question:
What if we could not even fit rnew in memory?
Block-based update: break r_new into k blocks that fit in memory, and scan M and r_old once for each block.

  r_new (entries 0..5, one block at a time in memory)

    src | degree | destinations
     0  |   4    | 0, 1, 3, 5
     1  |   2    | 0, 5
     2  |   2    | 3, 4

  r_old (entries 0..5, streamed from disk)
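One block-based pass can be sketched as follows (illustrative; in-memory lists stand in for the disk-resident files, using the graph 0 -> {0,1,3,5}, 1 -> {0,5}, 2 -> {3,4} from the table, where nodes 3..5 are dead ends):

```python
# Block-based pass sketch: compute r_new one block at a time; for each
# block, scan the whole edge table and keep only in-block updates.
edges = {0: [0, 1, 3, 5], 1: [0, 5], 2: [3, 4]}
n, beta, block = 6, 0.8, 2
r_old = [1.0 / n] * n
r_new = [0.0] * n
for lo in range(0, n, block):                  # one block of r_new at a time
    for src, dests in edges.items():           # full scan of the edge table
        for d in dests:
            if lo <= d < lo + block:           # keep only in-block updates
                r_new[d] += beta * r_old[src] / len(dests)
leaked = 1.0 - sum(r_new)                      # re-insert leaked PageRank
r_new = [x + leaked / n for x in r_new]
```

The cost problem is visible in the loop structure: the full edge table is scanned once per block.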
Can we do better?
Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration
Block-stripe update: break M into stripes, where each stripe contains only the destination nodes in the corresponding block of r_new. There is some additional overhead per stripe (each source node and its total out-degree are repeated in every stripe it touches), but it is usually worth it: each stripe is read only when its block of r_new is being computed.

  stripe for r_new[0, 1]:    src | degree | destinations
                              0  |   4    | 0, 1
                              1  |   3    | 0
                              2  |   2    | 1

  stripe for r_new[2, 3]:     0  |   4    | 3
                              2  |   2    | 3

  stripe for r_new[4, 5]:     0  |   4    | 5
                              1  |   3    | 5
                              2  |   2    | 4

(Degree is always the node's total out-degree, not its count within the stripe.)
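The stripe construction and the stripe-wise update can be sketched like this (illustrative Python, reusing the same small graph as the block-based example, not the figure's exact numbers):

```python
# Stripe sketch: pre-partition each node's destination list by block,
# storing the node's TOTAL out-degree with every stripe entry.
edges = {0: [0, 1, 3, 5], 1: [0, 5], 2: [3, 4]}
n, beta, block = 6, 0.8, 2

stripes = {}                         # block index -> [(src, degree, dest)]
for src, dests in edges.items():
    for d in dests:
        stripes.setdefault(d // block, []).append((src, len(dests), d))

r_old = [1.0 / n] * n
r_new = [0.0] * n
for b in sorted(stripes):            # each stripe is read exactly once
    for src, deg, d in stripes[b]:
        r_new[d] += beta * r_old[src] / deg
leaked = 1.0 - sum(r_new)            # re-insert leaked PageRank
r_new = [x + leaked / n for x in r_new]
```

Unlike the block-based pass, each edge is touched exactly once per iteration; the repeated (src, degree) pairs are the per-stripe overhead mentioned above.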