
Let's say I created a minimum spanning tree (MST) out of a graph with M nodes. Is there an algorithm to create N clusters from it?

I'm looking to cut some of the links so that I end up with N clusters and can label them, i.e. given a node X I can query which cluster it belongs to.


What I think is: once I have the MST, I cut the N-1 heaviest edges of the MST, and I will get N clusters?

Is my logic correct?
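In case it helps, here's a minimal sketch of what I mean (the MST edge list is a made-up example). Removing k edges from a tree leaves k+1 components, so cutting the N-1 heaviest edges leaves N clusters, and a union-find over the kept edges gives the labels:

```python
# Sketch: given a precomputed MST as a (weight, u, v) edge list, drop the
# n_clusters - 1 heaviest edges, then label each node with its component.

def cluster_from_mst(mst_edges, num_nodes, n_clusters):
    """mst_edges: list of (weight, u, v). Returns a label per node."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Keep all but the n_clusters - 1 heaviest edges; union their endpoints.
    kept = sorted(mst_edges)[: len(mst_edges) - (n_clusters - 1)]
    for _, u, v in kept:
        parent[find(u)] = find(v)

    # Relabel roots as dense ids 0..n_clusters-1.
    labels = {}
    return [labels.setdefault(find(x), len(labels)) for x in range(num_nodes)]

# Toy MST on 6 nodes (weights invented): cutting the two heaviest edges
# (weights 9 and 7) should give 3 clusters.
mst = [(1, 0, 1), (2, 1, 2), (9, 2, 3), (3, 3, 4), (7, 4, 5)]
print(cluster_from_mst(mst, 6, 3))  # → [0, 0, 0, 1, 1, 2]
```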

sten

2 Answers


That seems a good way to me. You ask whether it's "correct" -- that I can't say, since I don't know what other unstated criteria you have in mind. All you have actually stated is that you want to create N clusters -- which you could also achieve by throwing away the MST, putting vertex 1 in the first cluster, vertex 2 in the second, ..., vertex N-1 in the (N-1)th, and all remaining vertices in the Nth.

If you're using Kruskal's algorithm to build the MST, you can achieve what you're suggesting by simply stopping the algorithm early, as soon as only N components remain.
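A minimal sketch of that early-stopping idea, assuming the graph is available as a plain edge list (the example graph is invented). The union-find's `find(x)` then answers "which cluster does node X belong to":

```python
# Kruskal's algorithm, stopped early: merge components cheapest-edge-first
# and halt as soon as only n_clusters components remain.

def kruskal_clusters(edges, num_nodes, n_clusters):
    """edges: list of (weight, u, v). Returns a label per node."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = num_nodes
    for _, u, v in sorted(edges):          # cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
            if components == n_clusters:   # stop early: N clusters left
                break

    labels = {}
    return [labels.setdefault(find(x), len(labels)) for x in range(num_nodes)]

# Invented graph on 6 nodes; asking for 3 clusters stops the merging
# before the two expensive edges (weights 7 and 9) are ever used.
edges = [(1, 0, 1), (2, 1, 2), (3, 3, 4), (4, 1, 4), (7, 4, 5), (9, 2, 3)]
print(kruskal_clusters(edges, 6, 3))  # → [0, 0, 0, 1, 1, 2]
```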

j_random_hacker
  • I see, that's also an option, and it will save time & memory if I stop early. I'm looking into all 3 basic algos ... whichever is easiest to parallelize! – sten May 18 '21 at 04:24

A tree is a (very sparse) subset of the edges of a graph; if you cut based on it alone, you are not taking into consideration the (possibly) vast majority of edges in your graph.

Based on the fact that you want to use a M(inimum)ST algorithm to create clusters, it would seem you want to minimize the total weight of the edges that lie in the N-way cut induced by your clustering. Using an MST as a proxy on a graph whose edges have very similar weights will likely produce terrible results.

Graph clustering is a heavily studied topic, have you considered using an existing library to accomplish this? If you insist on implementing your own algorithm, I would recommend spectral clustering as a starting point as it will produce decent results without much effort.
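For illustration, here is a bare-bones spectral bisection (the 2-cluster case) using only numpy. Real spectral clustering would normalize the Laplacian and run k-means on several eigenvectors, so treat this as a sketch, not a reference implementation; the graph below is invented:

```python
# Spectral bisection sketch: split nodes by the sign of the Fiedler vector
# (the eigenvector of the graph Laplacian with the second-smallest eigenvalue).
import numpy as np

def spectral_bisect(adj):
    """adj: symmetric (n, n) weighted adjacency matrix. Returns 0/1 labels."""
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                 # second-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Two dense triangles joined by one weak edge: the bisection should
# recover the two triangles (which side gets label 0 is arbitrary).
adj = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    adj[u, v] = adj[v, u] = 1.0
adj[2, 3] = adj[3, 2] = 0.1
print(spectral_bisect(adj))
```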


Edit based on feedback in comments:

If your main bottleneck is the similarity matrix then the following should be considered:

  1. Investigate a sparse matrix/graph representation while implementing something like spectral clustering, which will probably give much more robust results than single-linkage clustering.

  2. Investigate pruning edges from the similarity matrix which you think are unimportant. If pruning is combined with a sparse representation of the similarity matrix, this should yield comparable performance to the MST approach while giving a smooth continuum to tune performance vs quality.
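A sketch of point 2, assuming Euclidean data (the function name and parameters are made up): keep only each point's k nearest neighbors, so the similarity structure stays sparse instead of materializing the full n x n matrix:

```python
# Prune the similarity graph to k nearest neighbors per point, storing it
# as a sparse edge list instead of a dense n x n similarity matrix.
import numpy as np

def knn_similarity_edges(points, k, sigma=1.0):
    """Return a sparse edge list [(i, j, similarity)], k neighbors per point."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    edges = []
    for i in range(n):
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf                        # no self-edge
        for j in np.argsort(dists)[:k]:          # k nearest neighbors only
            sim = float(np.exp(-dists[j] ** 2 / sigma ** 2))
            edges.append((i, int(j), sim))
    return edges

pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]]
edges = knn_similarity_edges(pts, k=2)
print(len(edges))  # 5 points * 2 neighbors = 10 edges, not 25
```

The O(n^2) neighbor search here is only for illustration; at millions of points you would compute the neighbor lists with a k-d tree or an approximate-nearest-neighbor index.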

ldog
  • Cutting MST edges is equivalent to single-linkage clustering. – j_random_hacker May 17 '21 at 23:33
  • I have to implement the clustering myself .. the stock ones in scipy, etc. can't handle millions of datapoints, primarily because they use similarity matrices, which would require too much RAM ... that's why I opted to build the MST first; those algorithms seem more feasible – sten May 18 '21 at 04:18
  • I still stand by my comment that results will likely be garbage with single-linkage clustering unless you have a very special type of data. It never hurts to try (other than possibly wasting time & effort). – ldog May 18 '21 at 19:12