
I recently implemented the PageRank algorithm used by Google to rank the most relevant web pages. There are two ways to approach PageRank: simulate random internet surfers who browse the web (walkers) - the stochastic algorithm - or compute the probability of landing on a specific page - the distribution algorithm.
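For reference, the distribution variant can be sketched roughly like this - a minimal power iteration over a graph stored as a dict mapping each node to its list of outgoing targets (the damping factor `d` and the handling of dangling nodes are assumptions, not necessarily what my implementation does):

```python
def distribution_pagerank(graph, n_steps=30, d=0.85):
    """Sketch of the distribution algorithm: repeatedly propagate
    probability mass along outgoing edges.

    graph -- dict mapping a node to a list of target nodes
    """
    # Collect every node that appears as a source or a target
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    # Start from a uniform distribution
    prob = {node: 1 / len(nodes) for node in nodes}
    for _ in range(n_steps):
        # Teleportation term: every node gets a (1 - d) share
        nxt = {node: (1 - d) / len(nodes) for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's damped mass evenly among its targets
                share = d * prob[node] / len(targets)
                for target in targets:
                    nxt[target] += share
        prob = nxt
    return prob
```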

I first tried using dictionaries, storing the nodes as keys and their lists of outgoing edges as values. However, this approach takes a long time to run, especially for the stochastic algorithm. For example, on my machine the distribution algorithm takes roughly 1.5s (30 steps) while the stochastic one takes roughly 50s (30 steps, 100,000 repeats).

Next, I tried using the NetworkX library to speed up the algorithms, but instead of running faster, they run slower: with the same parameters, the distribution algorithm takes around 6s and the stochastic one about 280s. I was wondering why this is the case and whether there is another way to optimize the code, either by using NetworkX differently or by using a different library I am not aware of.

In both approaches, I read a file in which each line contains a node and a target separated by a space.
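For illustration, such a file might look like this (hypothetical URLs, not my actual data):

```
http://example.com/a http://example.com/b
http://example.com/a http://example.com/c
http://example.com/b http://example.com/a
```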

Here is the code:

Dictionary approach:

def load_graph(args):
    """Load graph from text file

    Parameters:
    args -- arguments named tuple

    Returns:
    A dict mapping a URL (str) to a list of target URLs (str).
    """
    dictionary = {}
    # Iterate through the file line by line
    for line in args.datafile:
        # Split each line into a source and a target URL
        node, target = line.split()
        # setdefault avoids the bug where a node that reappears later
        # in the file picks up (and corrupts) another node's target list
        dictionary.setdefault(node, []).append(target)
    return dictionary
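For comparison, the stochastic variant can be sketched as repeated random walks over this dictionary (the damping factor, the teleport rule for dangling nodes, and the parameter names are assumptions for illustration):

```python
import random
from collections import Counter

def stochastic_pagerank(graph, n_steps=30, n_repeats=100_000, d=0.85):
    """Sketch of the stochastic algorithm: count where random
    walkers end up after n_steps steps.

    graph -- dict mapping a node to a list of target nodes
    """
    nodes = list(set(graph) | {t for ts in graph.values() for t in ts})
    counts = Counter()
    for _ in range(n_repeats):
        node = random.choice(nodes)
        for _ in range(n_steps):
            targets = graph.get(node)
            if targets and random.random() < d:
                # Follow a random outgoing edge with probability d
                node = random.choice(targets)
            else:
                # Teleport: dangling node, or the (1 - d) jump
                node = random.choice(nodes)
        counts[node] += 1
    # Normalise counts into an empirical probability distribution
    return {n: counts[n] / n_repeats for n in nodes}
```

The inner `random.choice` calls dominate the running time, which is one reason this variant is so much slower than the distribution one.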

NetworkX approach:

import networkx as nx

def load_graph(args):
    """Load graph from text file

    Parameters:
    args -- arguments named tuple

    Returns:
    A graph object mapping all URLs (str) to their target URLs (str).
    """
    # Use networkx to load the graph
    graph = nx.read_edgelist(args.datafile, create_using=nx.DiGraph())
    return graph
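If the goal is the ranking itself rather than a hand-rolled algorithm, NetworkX also ships a built-in `nx.pagerank` that may be worth benchmarking against (the toy graph below is just for illustration):

```python
import networkx as nx

# A tiny hypothetical directed graph built from an edge list
graph = nx.DiGraph([("a", "b"), ("b", "a"), ("a", "c")])

# Built-in PageRank; alpha is the damping factor
ranks = nx.pagerank(graph, alpha=0.85)  # dict: node -> PageRank score
```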
