
I am testing out a few clustering algorithms on a dataset of text documents (with word frequencies as features). Running some of the scikit-learn clustering methods one after the other, below is how long they take on ~50,000 files with 26 features per file. There are big differences in how long each takes to converge, and the differences become more extreme the more data I put in; some of the algorithms (e.g. MeanShift) simply stop working once the dataset grows beyond a certain size.

(Times given below are totals from the start of the script, i.e. KMeans took 0.004 minutes, MeanShift (2.56 - 0.004) minutes, etc.)

shape of input: (4957, 26)

KMeans:    0.00491824944814
MeanShift:     2.56759268443
AffinityPropagation:     4.04678163528
SpectralClustering:     4.1573699673
DBSCAN:     4.16347868443
Gaussian:     4.16394021908
AgglomerativeClustering:     5.52318491936
Birch:     5.52657626867
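
(For reference, here is a minimal sketch of how per-algorithm wall times could be measured instead of cumulative totals; the random array is just a stand-in for my real word-frequency matrix:)

    import time
    import numpy as np
    from sklearn.cluster import KMeans, MeanShift, DBSCAN

    X = np.random.rand(4957, 26)  # stand-in for the real word-frequency matrix

    for name, algorithm in [("KMeans", KMeans(n_clusters=8)),
                            ("MeanShift", MeanShift()),
                            ("DBSCAN", DBSCAN())]:
        start = time.time()
        algorithm.fit(X)
        print("%s: %.3f s" % (name, time.time() - start))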

I know that some clustering algorithms are inherently more computationally intensive (e.g. the chapter here outlines that k-means' demand is linear in the number of data points, while hierarchical models are O(m² log m)). So I was wondering:

  • How can I determine how many data points each of these algorithms can handle, and are the number of input files and the number of input features equally relevant in this equation?
  • How much does the computational cost depend on the clustering settings, e.g. the distance metric in k-means or the eps in DBSCAN?
  • Does clustering success influence computation time? Some algorithms such as DBSCAN finish very quickly, maybe because they don't find any clustering in the data; MeanShift does not find clusters either and still takes forever. (I'm using the default settings here.) Might that change drastically once they discover structure in the data?
  • How much is raw computing power a limiting factor for this kind of algorithm? Will I be able to cluster ~300,000 files with ~30 features each on a regular desktop computer, or does it make sense to use a computer cluster for this kind of task?

Any help is greatly appreciated! The tests were run on a Mac mini (2.6 GHz, 8 GB RAM). The data input is a numpy array.

patrick

1 Answer


This question is too broad.

In fact, most of these questions are unanswered.

For example, k-means is not simply linear, O(n): because the number of iterations needed until convergence tends to grow with data set size, it is more expensive than that (if run until convergence).

Hierarchical clustering can be anywhere from O(n log n) to O(n^3) mostly depending on the way it is implemented and on the linkage. If I recall correctly, the sklearn implementation is the O(n^3) algorithm.

Some algorithms have parameters that make them stop early, before they are actually finished! For k-means, you should use tol=0 if you want the algorithm to really run to completion; otherwise it stops early as soon as the relative improvement is less than this factor, which can be much too early. MiniBatchKMeans never converges at all: because it only looks at a random part of the data each time, it would just go on forever unless you choose a fixed number of iterations.
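
A hedged sketch of what this looks like in sklearn (the parameter values are illustrative, not recommendations):

    from sklearn.cluster import KMeans, MiniBatchKMeans

    # Disable the relative-improvement early stop so Lloyd's algorithm
    # runs until actual convergence (or until max_iter, whichever comes first).
    km = KMeans(n_clusters=8, tol=0, max_iter=1000)

    # MiniBatchKMeans sees only a random batch per step, so its runtime
    # is bounded by a fixed iteration budget rather than a convergence test.
    mbk = MiniBatchKMeans(n_clusters=8, max_iter=100, batch_size=1000)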

Never try to draw conclusions from small data sets. You need to go to your limits, i.e. what is the largest data set you can still process within, say, 1, 2, 4, and 12 hours with each algorithm? To get meaningful results, your runtimes should be on the order of hours, except if the algorithms simply run out of memory before that; then you might be interested in predicting how far you could scale until you run out of memory. Assuming you had 1 TB of RAM, how large would the data be that you could still process?
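
A rough harness along these lines (sizes, algorithm choice, and the time budget are all illustrative) can locate the point where an algorithm stops being tractable:

    import time
    import numpy as np
    from sklearn.cluster import MeanShift

    BUDGET = 4 * 3600  # seconds; repeat with your 1-, 2-, 4-, and 12-hour budgets

    for n in [1000, 5000, 25000, 125000]:
        X = np.random.rand(n, 26)  # better: growing subsamples of the real data
        start = time.time()
        MeanShift().fit(X)
        elapsed = time.time() - start
        print("n=%d: %.1f s" % (n, elapsed))
        if elapsed > BUDGET:
            break  # this size already exceeds the budget; stop scaling up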

The problem is that you can't simply use the same parameters for data sets of different sizes. If you do not choose the parameters well (e.g. DBSCAN puts everything into noise, or everything into one cluster), you cannot draw conclusions from that either.
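
To illustrate (the eps values are hypothetical and depend entirely on your data's scale): sweeping DBSCAN's eps and inspecting the label distribution shows whether a setting degenerates into all-noise or one-cluster results:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(5000, 26)  # stand-in data

    # Too small an eps labels everything as noise (-1);
    # too large an eps merges everything into a single cluster.
    for eps in [0.1, 0.5, 1.0, 2.0]:
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        print("eps=%.1f: %d clusters, %d noise points"
              % (eps, n_clusters, n_noise))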

And then there might simply be an implementation issue. DBSCAN in sklearn has become a lot faster recently, while it is still the same algorithm. So most benchmark results from two years ago were simply wrong, because the sklearn implementation of DBSCAN was bad. Now it is much better, but is it optimal? Probably not. And similar problems may lurk in any of these algorithms!

Thus, doing a good benchmark of clustering is really difficult. In fact, I have not seen a good benchmark in a long time.

Has QUIT--Anony-Mousse
  • Thanks, that is very helpful! If I understand you correctly, going on with trial & error is then the best I can do. And regarding the "good benchmark" you mentioned, where might one find something like that? Thanks! – patrick Apr 28 '16 at 21:06
    First of all, you should be concerned with getting out useful results. Most likely only one (or none) produces a useful result with carefully chosen parameters. Then if you are super lucky, the same parameters work for multiple files... – Has QUIT--Anony-Mousse Apr 28 '16 at 21:39
  • Alright sounds good / encouraging. – patrick Apr 28 '16 at 21:49