I have code that builds Minimum Spanning Trees (MSTs) for many sets of points (about 25000 data sets, each containing 40-10000 points), and this is obviously taking a while. I am using the MST algorithm from scipy.sparse.csgraph.

I have been told that the MST is a subset of the Delaunay Triangulation, so it was suggested I speed up my code by finding the DT first and finding the MST from that.

Does anyone know how much difference this would make? Also, if this makes it quicker, why is it not part of the algorithm in the first place? If it is quicker to calculate the DT and then the MST, then why would scipy.sparse.csgraph.minimum_spanning_tree do something else instead?

Please note: I am not a computer whizz, some people may say I should be using a different language but Python is the only one I know well enough to do this sort of thing, and please use simple language in your answers, no jargon please!

FJC

1 Answer

NB: this assumes we're working in 2-d

I suspect that what you are doing now is feeding all point-to-point distances to the MST library. There are on the order of N^2 of these distances, and the asymptotic runtime of Kruskal's algorithm on such an input is N^2 * log N.
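To make that guess concrete, here is a minimal sketch of what I assume the current approach looks like. The function name dense_mst is made up for illustration, and the question doesn't show the real code, so treat this as a stand-in rather than a copy of it.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree


def dense_mst(points):
    """MST built from every pairwise distance (hypothetical baseline)."""
    points = np.asarray(points, dtype=float)
    # pdist gives N*(N-1)/2 distances; squareform expands them to a full N x N matrix.
    distances = squareform(pdist(points))
    # Every nonzero entry is treated as an edge, so the graph has ~N^2 edges.
    return minimum_spanning_tree(distances)
```

For 10000 points the distance matrix alone holds 100 million entries, which is where most of the time and memory goes.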

Most algorithms for Delaunay triangulation take N log N time. Once the triangulation has been computed, only the edges in the triangulation need to be considered (since the Euclidean MST is always a subset of the triangulation). There are O(N) such edges (at most 3N - 6 in 2-d), so Kruskal's algorithm in scipy.sparse.csgraph should run in N log N time on them. Overall, this brings you to an asymptotic time complexity of N log N.
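The approach above could be sketched roughly as follows. This is an illustration, not a tuned implementation: delaunay_mst is a name I made up, and it assumes 2-d points that are distinct and not all on one line (otherwise the triangulation can fail or drop points).

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree


def delaunay_mst(points):
    """MST of 2-d points using only Delaunay edges (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Each row of simplices holds the three point indices of one triangle.
    simplices = Delaunay(points).simplices
    # Collect the three sides of every triangle as (i, j) index pairs.
    edges = np.vstack([simplices[:, [0, 1]],
                       simplices[:, [1, 2]],
                       simplices[:, [0, 2]]])
    # Sort each pair and drop duplicates, since neighbouring triangles share sides.
    edges = np.unique(np.sort(edges, axis=1), axis=0)
    # Euclidean length of each remaining edge.
    lengths = np.linalg.norm(points[edges[:, 0]] - points[edges[:, 1]], axis=1)
    # Sparse graph with O(N) edges instead of ~N^2.
    graph = coo_matrix((lengths, (edges[:, 0], edges[:, 1])), shape=(n, n)).tocsr()
    return minimum_spanning_tree(graph)
```

The result has the same form as before (a sparse matrix whose nonzero entries are the tree's edges and their lengths), so the rest of your code should not need to change. Timing both versions on a few of your larger data sets is the simplest way to see the real difference.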

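The reason that scipy.sparse.csgraph doesn't incorporate Delaunay triangulation is that the algorithm works on arbitrary input: any graph with weighted edges, not only points in Euclidean space. The Delaunay shortcut is only valid when the edge weights are straight-line distances between points.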

I'm not quite sure how much this will help you in practice but that's what it looks like asymptotically.

mrmcgreg