Raising performance of BFS in Python

Question

How can I increase speed performance of below Python code?

My code works okay which means no errors but the performance of this code is very slow.

The input data is Facebook Large Page-Page Network dataset, you can access here the dataset: (http://snap.stanford.edu/data/facebook-large-page-page-network.html)

Problem definition:

Check if the distance between two nodes are less than max_distance

My constraints:

I have to import a .txt file of which format is like sample_input
Expected ouput is like sample_output
Totall code runtime should be less than 5 secs.

Can anyone give me an advice to improve my code much better? Follow my code:

from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                self.graph[key] = self.graph.get(key, []) + [val]

    def check_distance(self, x, y, max_distance):          
        dist = self.path(x, y, max_distance)
        if dist:
            return dist - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([start])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                return False
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(list(path) + [adjacent])

Thank you for your help in advance.

Use a [deque](https://docs.python.org/3/library/collections.html#collections.deque) instead of popping at index 0 from a list. — kaya3, May 23 '20 at 12:32
you can execute graph = self.file_to_graph() this line only once. and store the graph to instance variable. by calling it every distance local method path its taking time. — Akash Pagar, May 23 '20 at 12:37
@AkashPagar Thank you for your comment. I have to solve the distance between two nodes are the same with or less than the max_distance. and the input data is Facebook Large Page-Page Network dataset (snap.stanford.edu/data/facebook-large-page-page-network.html) — boralim, May 23 '20 at 12:37
@boralim. Do you have to compare all nodes? Ex: every node with every node minus itself or only the edges size? Or. For a specific node, compare with all others? You can use a diverse of techniques to solve this — William Prigol Lopes, May 23 '20 at 12:47
1) You could pass `max_distance` to function `path` and not expand a path once it's length equals `max_distance`, 2) Is it possible to use the heap module or is that disallowed? If allowed you prioritize to expand shorter paths first in the search, 3) You should not call a local variable path inside your function path since its confusing. 4) file_to_graph should be done within` __init__` (so executed once) rather each time `check_distance is called unless you expect the graph file to change. — DarrylG, May 23 '20 at 12:48
@WilliamPrigolLopes I have to compare only two nodes given by two parameters of def check_distance. — boralim, May 23 '20 at 13:55
@boralim and those two nodes are directly connected (ex: you have only one edge between those two nodes) or could your nodes are connected by a path (ex: you have edges to connect between those nodes?) If are directly connected, you can only run a simple comparison, otherwise, I'll recommend to use the Dijkstra's Algorithm or Tarjan's algorithm. They're not hard to implement since they are variations of BFS/DFS. — William Prigol Lopes, May 23 '20 at 13:59
@WilliamPrigolLopes Thank you for your help. There can be no or many edges between those two nodes, and I must use BFS algorithm in my case..! — boralim, May 23 '20 at 14:02
@DarrylG Thanks to your comment, I edited my code like above. Could you give me a little more advice to make it much better..? — boralim, May 24 '20 at 02:57

MindOfMetalAndWheels · Accepted Answer · 2020-05-23T18:06:07.797

Several pointers:

if you call check distance more than once you have to recreate the graph
calling queue.pop(0) is inefficient on a standard list in python, use something like a deque from the collections module. see here
as DarrylG points out you can exit from the BFS early once a path has exceed the max distance

you could try

from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = self.file_to_graph()

    def file_to_graph(self):
        graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                graph[key] = graph.get(key, []) + [val]
        return graph

    def check_distance(self, x, y, max_distance):          
        path_length = self.path(x, y, max_distance)
        if path_length:
            return len(path) - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([start])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                # we have explored all paths shorter than the max distance
                return False
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(list(path) + [adjacent])

As to why pop(0) is inefficient - from the docs:

Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.

Thank you for your kind comment, but in my case, I cannot import any other modules such as collections / deque. By the way, I will study your code to use it later. Thank you again for your time! — boralim, May 23 '20 at 12:45
collections is in the standard library so unless you have been specifically told not to use it or you have a strange python installation you should be able to import it without installing any packages. — MindOfMetalAndWheels, May 23 '20 at 12:52
Thank you for your kindness, by the way I tried your code and get an error message saying "path() missing 1 required positional argument: 'max_distance'" — boralim, May 23 '20 at 13:20
I share some more errors..! (1) path() missing 1 required positional argument: 'max_distance' (2) name 'graph' is not defined (3) 'str' object has no attribute 'copy' — boralim, May 23 '20 at 14:17
In here (return len(path) - 1 <= max_distance), if you change len(path) to path_length, it perfectly works. Thank you for your help sincerely. — boralim, May 24 '20 at 01:51

William Prigol Lopes · Answer 2 · 2020-05-23T14:28:29.530

About the approach:

You create a graph and execute several times comparison from one to another element in your graph. Every time you run your BFS algorithm. This will create a cost of O|E+V| at every time or, you need to calculate every time the distances again and again. This is not a good approach.

What I recommend. Run a Dijkstra Algorithm (that get the minimum distance between 2 nodes and store the info on an adjacency matrix. What you will need to do is only get the calculated info inside this adjacency matrix that will contains all minimal distances on your graph and what you will need is consume the distances calculated on a previous step

About the algorithms

I recommend you to look for different approaches of DFS/BFS.

If you're looking to compare all nodes I believe that Dijkstra's Algorithm will be more efficient in your case because they mark visited paths.(https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). You can modify and call the algo only once.

Other thing that you need to check is. Your graph contains cycles? If yes, you need to apply some control on cycles, you'll need to check about Ford Fulkerson Algorithm (https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm)

As I understood. Every time that you want to compare a node to another, you run again your algorithm. If you have 1000 elements on your graph, your comparison, at every time will visit 999 nodes to check this.

If you implement a Dijkstra and store the distances, you run only once for your entire network and save the distances in memory.

The next step is collect from memory the distances that you can put in an array.

You can save all distances on an adjacency matrix (http://opendatastructures.org/versions/edition-0.1e/ods-java/12_1_AdjacencyMatrix_Repres.html) and only consume the information several times without the calculation debt at every time.

Raising performance of BFS in Python

2 Answers2

About the approach:

About the algorithms