
I've developed a graph clustering model for assigning destinations to vehicle routes, but my implementation is very slow. It takes about two days to process a graph with 400k nodes.

My current implementation in Python is as follows:

Input data is a sparse graph:
       Edges are roads
       Nodes are vehicle destinations and road intersections

Create a minimum spanning tree (MST) using Prim's algorithm

For every edge in the MST:
      Perform a depth-first search on the two subgraphs on each side of the edge:
              Sum up road lengths for each edge
      If total road length for one of the subgraphs is within a defined range, then remove the edge
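
To make the steps above concrete, here is a minimal sketch rather than my actual code. It assumes the graph is connected, that nodes are mutually comparable (ints or strings, needed for heap tie-breaking), and that the subgraph totals are all measured on the intact MST with the candidate edge itself excluded. Under those assumptions the per-edge depth-first search can be replaced by a single traversal that accumulates subtree totals. The names `prim_mst`, `edges_to_cut`, `lo`, and `hi` are hypothetical.

```python
import heapq
from collections import defaultdict


def prim_mst(adj, start):
    """Return MST edges (parent, child, length) for the component containing start.

    adj maps node -> list of (neighbor, length) pairs of an undirected graph.
    """
    visited = {start}
    heap = [(w, start, v) for v, w in adj[start]]
    heapq.heapify(heap)
    mst = []
    while heap:
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        mst.append((u, v, w))
        for nxt, w2 in adj[v]:
            if nxt not in visited:
                heapq.heappush(heap, (w2, v, nxt))
    return mst


def edges_to_cut(mst, root, lo, hi):
    """Return MST edges whose removal leaves one side with total length in [lo, hi].

    Assumption: totals are measured on the intact MST, and the candidate
    edge is not counted on either side.
    """
    children = defaultdict(list)
    for u, v, w in mst:  # prim_mst emits edges oriented parent -> child
        children[u].append((v, w))

    # Iterative DFS gives a parent-before-child order without recursion limits.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(child for child, _ in children[node])

    # Walk the order backwards so every child is summed before its parent.
    below = defaultdict(float)  # total edge length inside each node's subtree
    for node in reversed(order):
        for child, w in children[node]:
            below[node] += below[child] + w

    total = below[root]
    cuts = []
    for u, v, w in mst:
        child_side = below[v]
        parent_side = total - child_side - w
        if lo <= child_side <= hi or lo <= parent_side <= hi:
            cuts.append((u, v, w))
    return cuts
```

Calling `edges_to_cut(prim_mst(adj, root), root, lo, hi)` visits each MST edge a constant number of times, instead of running a traversal per edge. If edges are removed as the loop proceeds and later totals should reflect those removals, this sketch does not apply as written.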

Any recommendations to make this implementation faster? Could using NetworkX or Neo4j speed this up?

  • I suspect there might be a fast implementation based on dynamic trees, but I would be surprised if it were built into a convenient library. – David Eisenstat Mar 23 '23 at 20:49
  • I have added some performance test results to my answer. MST performance depends on many things, not just the node count. I have included varying edge : vertex ratios. How do your results compare? – ravenspoint Mar 24 '23 at 16:24
  • @ravenspoint The MST shouldn't ever be the bottleneck here if I understand the algorithm correctly. – David Eisenstat Mar 24 '23 at 19:29
  • @DavidEisenstat The question has too few details to know for sure. In general DFS goes very fast; however, the MST runs once but a DFS on half the graph runs twice for every edge! This does not change the answer: using a C++ library implementation of DFS will be many times faster than hand-coded Python. It would be helpful to know what the metric is, along with many other details missing from the question. – ravenspoint Mar 24 '23 at 19:42
  • Sorry about the lack of information initially, it's been a while since I've posted on Stack Overflow. I've added some more detail to my original question if that helps! – fume_hood_geologist Mar 25 '23 at 18:57

1 Answer

Could using NetworkX or Neo4j speed this up?

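Not necessarily. NetworkX is written in pure Python and Neo4j is written in Java, so neither gives you C++ speed automatically. What helps is moving the heavy lifting into compiled code, which is many times faster than Python (the usual quote is approximately 50 times faster).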

Personally, I would recommend moving to C++ entirely. Python is fine for toy applications, but large graphs need the performance of a compiled language.

Here is the C++ code I use to find minimum spanning trees:

https://github.com/JamesBremner/PathFinder/blob/50b89a0ff57e13cb34b0348b073d698a22ede406/src/GraphTheory.cpp#L180-L251

Here are some timing tests on randomly generated graphs

Vertex count   Run time (seconds),   Run time (seconds),
               1 edge/vertex         3 edges/vertex
1,000          0.1                   0.1
10,000         4                     17
100,000        410                   1900
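
For anyone wanting to produce comparable measurements for their own Python code, the following is a rough sketch of such a test. It is not the harness behind the figures above. It assumes uniform random edge weights and a heap-based Prim's algorithm, and the names `random_connected_graph`, `mst_weight`, and `time_mst` are hypothetical.

```python
import heapq
import random
import time


def random_connected_graph(n, edges_per_vertex, seed=0):
    """Build a random connected graph as {node: [(neighbor, weight), ...]}.

    A random spanning tree guarantees connectivity; extra random edges are
    then added until the graph has about n * edges_per_vertex edges.
    """
    rng = random.Random(seed)
    adj = {v: [] for v in range(n)}
    edges = set()

    def add_edge(u, v):
        key = (min(u, v), max(u, v))
        if u == v or key in edges:
            return  # skip self-loops and duplicate edges
        edges.add(key)
        w = rng.random()
        adj[u].append((v, w))
        adj[v].append((u, w))

    for v in range(1, n):
        add_edge(v, rng.randrange(v))  # attach v to an earlier vertex
    target = max(n - 1, int(n * edges_per_vertex))
    target = min(target, n * (n - 1) // 2)  # cannot exceed a complete graph
    while len(edges) < target:
        add_edge(rng.randrange(n), rng.randrange(n))
    return adj


def mst_weight(adj, start=0):
    """Total weight of the MST found by Prim's algorithm with a binary heap."""
    visited = {start}
    heap = [(w, v) for v, w in adj[start]]
    heapq.heapify(heap)
    total = 0.0
    while heap:
        w, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        total += w
        for nxt, w2 in adj[v]:
            if nxt not in visited:
                heapq.heappush(heap, (w2, nxt))
    return total


def time_mst(n, edges_per_vertex):
    """Seconds spent computing the MST, excluding graph generation."""
    adj = random_connected_graph(n, edges_per_vertex)
    t0 = time.perf_counter()
    mst_weight(adj)
    return time.perf_counter() - t0


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, time_mst(n, 1), time_mst(n, 3))
```

Timings from a sketch like this depend on the machine, the random seed, and the graph generator, so they are only loosely comparable with the table.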
ravenspoint