I recently implemented the PageRank algorithm that Google uses to rank web pages by relevance. There are two ways to approach PageRank: simulating random internet surfers who browse the web (walkers), the stochastic algorithm, or iteratively updating the probability of landing on each page, the distribution algorithm.
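To make the two variants concrete, here is a minimal sketch of both over a dictionary that maps each URL to its list of outgoing links; the function names and parameters are simplifications of what I actually run, there is no damping factor, and it assumes every target URL also appears as a source with at least one outgoing link:

import random

def distribution_pagerank(graph, n_steps):
    # Start from a uniform distribution and repeatedly push each page's
    # probability mass along its outgoing links (no damping factor here).
    prob = {node: 1 / len(graph) for node in graph}
    for _ in range(n_steps):
        next_prob = {node: 0.0 for node in graph}
        for node, targets in graph.items():
            share = prob[node] / len(targets)
            for target in targets:
                next_prob[target] += share
        prob = next_prob
    return prob

def stochastic_pagerank(graph, n_steps, n_repeats):
    # Drop n_repeats walkers on random pages, let each take n_steps random
    # hops, and estimate a page's rank by how often walks end there.
    hits = {node: 0 for node in graph}
    nodes = list(graph)
    for _ in range(n_repeats):
        node = random.choice(nodes)
        for _ in range(n_steps):
            node = random.choice(graph[node])
        hits[node] += 1
    return {node: count / n_repeats for node, count in hits.items()}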
I first tried using a dictionary that stores each node as a key and the list of its outgoing edges as the value. However, this approach takes a long time to run the algorithms, especially the stochastic one. For example, on my machine the distribution algorithm takes roughly 1.5 s (30 steps), while the stochastic one takes roughly 50 s (30 steps, 100,000 repeats).
Next, I tried using the NetworkX library to improve the performance of the algorithms, but instead of running faster, the code runs slower. With the same parameters, the distribution algorithm takes around 6 s and the stochastic one about 280 s. I am wondering why this is the case, and whether there is any other way to optimize the code, either by using networkx differently or by using a different library that I am not aware of.
In both approaches, I read a file in which each line contains a node and a target separated by a space.
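For illustration, a few lines of the datafile look like this (these URLs are made up, not my real data):

http://example.com/a http://example.com/b
http://example.com/a http://example.com/c
http://example.com/b http://example.com/a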
Here is the code:
Dictionary approach:
def load_graph(args):
    """Load graph from text file

    Parameters:
    args -- arguments named tuple

    Returns:
    A dict mapping a URL (str) to a list of target URLs (str).
    """
    dictionary = {}
    # Iterate through the file line by line
    for line in args.datafile:
        # Split each line into a source URL and a target URL
        node, target = line.split()
        # Append the target to the node's adjacency list,
        # creating the list the first time the node appears
        dictionary.setdefault(node, []).append(target)
    return dictionary
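For context, args.datafile is an open file handle; a minimal way to produce such an args object with argparse (a sketch, not necessarily how my script builds it) is:

import argparse

parser = argparse.ArgumentParser()
# argparse opens the file and passes the handle through as args.datafile
parser.add_argument('datafile', type=argparse.FileType('r'))
args = parser.parse_args()
graph = load_graph(args)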
NetworkX approach:
import networkx as nx

def load_graph(args):
    """Load graph from text file

    Parameters:
    args -- arguments named tuple

    Returns:
    A directed graph object mapping all URLs (str) to their target URLs (str).
    """
    # Use networkx to load the edge list directly into a directed graph
    graph = nx.read_edgelist(args.datafile, create_using=nx.DiGraph())
    return graph
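For reference, I know networkx also ships its own PageRank implementation, which could at least serve as a speed baseline even though it is not the same as my two step-by-step variants:

# Built-in networkx PageRank on the loaded graph; alpha is the damping factor
ranks = nx.pagerank(graph, alpha=0.85)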