I am testing a few clustering algorithms on a dataset of text documents (with word frequencies as features). I ran some of the scikit-learn clustering methods one after the other; below is how long they take on ~5,000 files with 26 features per file (see the input shape in the output). There are big differences in how long each takes to converge, and they get more extreme the more data I put in; some of them (e.g. MeanShift) just stop working once the dataset grows beyond a certain size.
(Times given below are cumulative totals in minutes from the start of the script, i.e. KMeans took about 0.005 minutes, MeanShift (2.568 - 0.005) minutes, etc.)
shape of input: (4957, 26)
KMeans: 0.00491824944814
MeanShift: 2.56759268443
AffinityPropagation: 4.04678163528
SpectralClustering: 4.1573699673
DBSCAN: 4.16347868443
Gaussian: 4.16394021908
AgglomerativeClustering: 5.52318491936
Birch: 5.52657626867
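Since the numbers above are running totals, here is a minimal sketch of how they can be turned into per-algorithm durations, plus a small helper for timing each step on its own in future runs. The function names `cumulative_to_individual` and `time_each` are made up for this illustration; the values are copied from the output above.

```python
import time


def cumulative_to_individual(cumulative):
    """Convert running totals (in run order) into per-step durations."""
    individual = {}
    previous = 0.0
    for name, total in cumulative.items():
        individual[name] = total - previous
        previous = total
    return individual


def time_each(tasks):
    """Run each zero-argument callable once and return {name: seconds}."""
    timings = {}
    for name, task in tasks.items():
        start = time.perf_counter()
        task()
        timings[name] = time.perf_counter() - start
    return timings


# Running totals in minutes, in the order the script executed them.
totals = {
    "KMeans": 0.00491824944814,
    "MeanShift": 2.56759268443,
    "AffinityPropagation": 4.04678163528,
    "SpectralClustering": 4.1573699673,
    "DBSCAN": 4.16347868443,
    "Gaussian": 4.16394021908,
    "AgglomerativeClustering": 5.52318491936,
    "Birch": 5.52657626867,
}

per_algorithm = cumulative_to_individual(totals)
for name, minutes in per_algorithm.items():
    print(f"{name}: {minutes:.4f} min")
```

With `time_each`, each estimator could be wrapped in a lambda, e.g. `{"KMeans": lambda: KMeans(n_clusters=8).fit(X)}`, so that every entry reports its own duration rather than a running total (here `X` stands for the input array and `n_clusters=8` is only a placeholder value).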
I know that some clustering algorithms are inherently more computationally intensive (e.g. the chapter here outlines that KMeans' cost is linear in the number of data points, while hierarchical models are O(m^2 log m)). So I was wondering:
- How can I determine how many data points each of these algorithms can handle; and are the number of input files / input features equally relevant in this equation?
- How much does the computational cost depend on the clustering settings, e.g. the distance metric in KMeans or eps in DBSCAN?
- Does clustering success influence computation time? Some algorithms such as DBSCAN finish very quickly, maybe because they don't find any clusters in the data; MeanShift does not find clusters either and still takes forever. (I'm using the default settings here.) Might that change drastically once they discover structure in the data?
- How much is raw computing power a limiting factor for this kind of algorithm? Will I be able to cluster ~300,000 files with ~30 features each on a regular desktop computer? Or does it make sense to use a computer cluster for this kind of thing?
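To make the last question more concrete, here is a rough back-of-the-envelope sketch. It rests on two assumptions of mine that I have not verified for each estimator: that some methods materialize a dense n x n float64 matrix (similarities or distances), and that runtime follows a simple power law in the number of samples. The helper names and the exponents are purely illustrative.

```python
def dense_matrix_gib(n_samples, bytes_per_value=8):
    """Memory in GiB for a dense n x n matrix (float64 by default)."""
    return n_samples ** 2 * bytes_per_value / 2 ** 30


def extrapolate(measured, n_measured, n_target, exponent):
    """Scale a measured runtime assuming cost grows like n ** exponent."""
    return measured * (n_target / n_measured) ** exponent


n_now, n_goal = 4957, 300_000

# Memory needed if an algorithm stores all pairwise values at once.
for n in (n_now, n_goal):
    print(f"n={n}: {dense_matrix_gib(n):.2f} GiB for a dense n x n matrix")

# Runtime guess for a step that took 1.36 minutes on the current data
# (AgglomerativeClustering above), under linear and quadratic assumptions.
for exponent in (1, 2):
    minutes = extrapolate(1.36, n_now, n_goal, exponent)
    print(f"exponent {exponent}: about {minutes:.0f} minutes")
```

Under the dense-matrix assumption, 300,000 rows would need roughly 670 GiB, far beyond the 8 GB available, so for such methods memory rather than CPU time would probably be the first limit. The number of features does not enter that estimate at all, only the number of rows.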
Any help is greatly appreciated! The tests were run on a Mac mini, 2.6 GHz, 8 GB RAM. The input data is a NumPy array.