For reference, I'm using this page. I understand the original pagerank equation
but I'm failing to understand why the sparse-matrix implementation is correct. Below is their code reproduced:
def compute_PageRank(G, beta=0.85, epsilon=10**-4):
'''
Efficient computation of the PageRank values using a sparse adjacency
matrix and the iterative power method.
Parameters
----------
G : boolean adjacency matrix. np.bool8
If the element j,i is True, means that there is a link from i to j.
beta: 1-teleportation probability.
epsilon: stop condition. Minimum allowed amount of change in the PageRanks
between iterations.
Returns
-------
output : tuple
PageRank array normalized top one.
Number of iterations.
'''
#Test adjacency matrix is OK
n,_ = G.shape
assert(G.shape==(n,n))
#Constants Speed-UP
deg_out_beta = G.sum(axis=0).T/beta #vector
#Initialize
ranks = np.ones((n,1))/n #vector
time = 0
flag = True
while flag:
time +=1
with np.errstate(divide='ignore'): # Ignore division by 0 on ranks/deg_out_beta
new_ranks = G.dot((ranks/deg_out_beta)) #vector
#Leaked PageRank
new_ranks += (1-new_ranks.sum())/n
#Stop condition
if np.linalg.norm(ranks-new_ranks,ord=1)<=epsilon:
flag = False
ranks = new_ranks
return(ranks, time)
To start, I'm trying to trace the code and understand how it relates to the PageRank equation. For the line under the with statement (new_ranks = G.dot((ranks/deg_out_beta))
), this looks like the first part of the equation (the beta times M) BUT it seems to be ignoring all divide by zeros. I'm confused by this because the PageRank algorithm requires us to replace zero columns with ones (except along the diagonal). I'm not sure how this is accounted for here.
The next line new_ranks += (1-new_ranks.sum())/n
is what I presume to be the second part of the equation. I can understand what this does, but I can't see how this translates to the original equation. I would've thought we would do something like new_ranks += (1-beta)*ranks.sum()/n
.