
In a C++ project, we are working with a large traffic-related Boost Graph, on which we need to launch many simulations of Dijkstra to find the shortest path between two nodes.

The graph has 18 000 vertices and 40 000 edges. Loading the graph takes roughly 200ms, and one Dijkstra run takes about 50ms.

But runtime is starting to become an issue, so we want to lower those times. We are looking at several options:

  • Use variants of Dijkstra's algorithm, such as:

    • Bi-directional Dijkstra
    • A* search
  • Pre-processing the graph with clustering operations, to reduce the number of vertices and edges loaded, and thus the loading time:

    • Hierarchical clustering
    • Markov cluster algorithm

So the question is in two parts :

  • What is the best/easiest method to implement graph clustering? (If it uses the Boost library, it would be easier for us to integrate.) Do you have references, or examples of code we could use?

  • What is the best Dijkstra-like algorithm to use in this kind of scenario?

If you have any information about those two questions, it would be much appreciated.

Thank you.

Emmanuel Jay
  • Have you seen Parallel Boost Graph Library? – sehe Jun 09 '15 at 10:13
  • I did not, but the idea here is to use fewer resources to get a good approximation. If I had several processors, I would surely take that solution... – Emmanuel Jay Jun 09 '15 at 10:56
  • I'd look at implicit graphs and e.g. connected components/articulation points and betweenness. It's really hard to tell anything without actual problem domain knowledge (and I'd pretty soon be out of my depth on the graph theoretical side of things) – sehe Jun 09 '15 at 11:07
  • minimum spanning tree would be a good start for a heuristic shortest path algorithm. – pbible Jun 09 '15 at 11:28
  • @sehe Thank you, I already looked into that, but I will again. It is not very simple to understand, so after a bit of research, I called for help... I struggle every day with the BGL. I am still not familiar with the concepts. – Emmanuel Jay Jun 09 '15 at 14:07
  • @pbible Thank you for your proposition, but minimum spanning tree seems to apply to undirected graphs, and will remove a lot of edges without reducing the number of vertices. – Emmanuel Jay Jun 09 '15 at 14:10
  • @ravenspoint Yes, I am. I am currently working with a 20 000 vertices and 40 000 edges graph. And just one simulation of Dijkstra takes me 100ms, and a lot of CPU. I would like to reach a computation in 20ms (operating on a smartly reduced graph, or with a smarter algorithm). – Emmanuel Jay Jun 09 '15 at 14:15
  • 100ms for 20000 vertices and 40,000 edges seems about right. This code takes 128ms including setup https://gist.github.com/JamesBremner/fd7e253b3d42c61a3c8d – ravenspoint Jun 09 '15 at 14:31
  • 100ms is a fairly short time, as these things go. Presumably you need to do this hundreds or thousands of times, so this is becoming a problem. Can you provide some insight into what you are doing? In particular, is the graph constant, or different for every run of dijkstra? Perhaps the weights are different but the topology the same? – ravenspoint Jun 09 '15 at 14:36
  • @EmmanuelJay You should update your questions with more specifics including your estimated graph size, time constraints, and performance goals. A more specific question is more likely to get up-voted. – pbible Jun 09 '15 at 14:58
  • @ravenspoint indeed you are right, the problem comes from the number of simulations... – Emmanuel Jay Jun 15 '15 at 11:47
  • @pbible I tried to update it the best that I could, tell me if it is still not enough – Emmanuel Jay Jun 15 '15 at 11:48
  • @EmmanuelJay thanks, that is a lot better. Does your graph change often or not? I'll try to address some of your questions with an answer. – pbible Jun 16 '15 at 12:38
  • @pbible No, it does not :) Thank you :) – Emmanuel Jay Jun 16 '15 at 12:41

1 Answer


Clustering

There are many ways to cluster nodes in a graph. Any distance metric over node representations can be used for clustering. Boost doesn't have out-of-the-box clustering support other than in a few limited cases, such as betweenness clustering.

The Micans package has a very simple and fast program for Markov clustering, which can easily be called from the command line. Markov clustering (MCL) works by repeated matrix multiplication to simulate random walks. It would not be hard to implement yourself if you switch your graph to a matrix representation.
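For intuition, here is a minimal dense-matrix sketch of the MCL expand/inflate loop. This is illustrative only (all names are mine, not from the micans package); the real `mcl` program uses sparse matrices with pruning and is far faster:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Normalize each column so it sums to 1 (column-stochastic matrix).
void normalize_columns(Matrix& m) {
    std::size_t n = m.size();
    for (std::size_t j = 0; j < n; ++j) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += m[i][j];
        if (sum > 0.0)
            for (std::size_t i = 0; i < n; ++i) m[i][j] /= sum;
    }
}

// Expansion step: matrix squaring simulates longer random walks.
Matrix expand(const Matrix& m) {
    std::size_t n = m.size();
    Matrix r(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                r[i][j] += m[i][k] * m[k][j];
    return r;
}

// Inflation step: raise entries to a power and renormalize; this
// strengthens intra-cluster walks and weakens inter-cluster ones.
void inflate(Matrix& m, double power) {
    for (auto& row : m)
        for (double& x : row) x = std::pow(x, power);
    normalize_columns(m);
}

// A few MCL iterations on an adjacency matrix (self-loops added first).
Matrix mcl(Matrix m, double inflation = 2.0, int iterations = 20) {
    for (std::size_t i = 0; i < m.size(); ++i) m[i][i] += 1.0;
    normalize_columns(m);
    for (int it = 0; it < iterations; ++it) {
        m = expand(m);
        inflate(m, inflation);
    }
    return m;
}
```

After convergence, each column's mass concentrates on a few "attractor" rows; columns sharing attractors belong to the same cluster.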

If your graph is not changing, then clustering could be performed offline. R is a much better environment for clustering because many methods are already implemented, but you must provide the distance values.

Do you need clustering?

Will clustering help your shortest path calculations? My intuition (often wrong) leads me to believe that clustering would result in only local optima. An optimal solution on a subset of your graph does not guarantee that it would be useful for the global solution. The traffic problem makes me think of flow algorithms such as boykov_kolmogorov_max_flow.

Other suggestions

I'm not sure of your setup. If you are running one simulation per process, then the load time could be an issue. You could try running more than one simulation per process so the load time isn't repeated as often.

For example, currently you have 200ms for the load and 50ms for Dijkstra. If you run two simulations back to back, each reloading the graph, that is 2 × (200ms + 50ms) = 500ms for two runs.

You could instead load the graph once in 200ms, then run 6 simulations in the same process back to back. That way you pay 200ms for the load plus 6 × 50ms for the simulations, i.e. the same 500ms for 6 runs. This gives you a 3x increase in the number of simulations you can run, even without parallel processing.

pbible
  • Thank you very much! The loading problem comes from the fact that we load a specific file for each simulation (depending on day and hour), so 200ms is only for one file, and we have a lot of them. For now we run some thousands of simulations back to back like you say, but the CPU needed is too much for our poor server, thus the clustering! ... ;) Thank you for your response! It is highly appreciated! – Emmanuel Jay Jun 16 '15 at 13:26
  • @EmmanuelJay in that case, you could consider using the files to create your edge weight maps. Instead of recreating the graph each time, use each file to initialize a weight map rather than a new graph. It may save you some time vs creating new graphs each time. – pbible Jun 16 '15 at 14:40