10

I've been trying to cluster a larger dataset consisting of 50,000 measurement vectors with dimension 7. I'm trying to generate about 30 to 300 clusters for further processing.

I've been trying the following clustering implementations with no luck:

  • Pycluster.kcluster (gives only 1-2 non-empty clusters on my dataset)
  • scipy.cluster.hierarchy.fclusterdata (runs too long)
  • scipy.cluster.vq.kmeans (runs out of memory)
  • sklearn.cluster.hierarchical.Ward (runs too long)

Are there any other implementations that I might have missed?

Has QUIT--Anony-Mousse
tisch

5 Answers

13

50,000 instances in 7 dimensions isn't really big and should not kill an implementation.

Although it doesn't have Python bindings, give ELKI a try. The benchmark set they use on their homepage has 110,250 instances in 8 dimensions, and they apparently run k-means on it in 60 seconds and the much more advanced OPTICS in 350 seconds.

Avoid hierarchical clustering; it's really only for small datasets. As commonly implemented with matrix operations it is O(n^3), which is really bad for large datasets, so I'm not surprised these two timed out for you.

DBSCAN and OPTICS are O(n log n) when implemented with index support; implemented naively, they are O(n^2). K-means is really fast, but the results are often not satisfactory (because it always splits in the middle). It should run in O(n * k * iter), and it usually converges in not too many iterations (iter << 100). But it only works with Euclidean distance and just doesn't work well on some data (high-dimensional, discrete, binary, clusters of different sizes, ...).
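
As a rough illustration of the index-supported case (a sketch, not the ELKI implementation recommended above), here is what DBSCAN looks like with scikit-learn, which can back its neighborhood queries with a ball tree; the eps and min_samples values are placeholders you would have to tune for your data:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.randn(50000, 7)  # stand-in for the 50k x 7 measurement matrix

# eps/min_samples are illustrative guesses; algorithm='ball_tree' requests an indexed neighbor search
db = DBSCAN(eps=0.5, min_samples=10, algorithm='ball_tree').fit(X)
labels = db.labels_                                   # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)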

Has QUIT--Anony-Mousse
7

Since you're already trying scikit-learn: sklearn.cluster.KMeans should scale better than Ward and supports parallel fitting on multicore machines. MiniBatchKMeans is better still, but won't do random restarts for you.

>>> import numpy as np
>>> from sklearn.cluster import MiniBatchKMeans
>>> X = np.random.randn(50000, 7)
>>> %timeit MiniBatchKMeans(30).fit(X)
1 loops, best of 3: 114 ms per loop
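
If you want full k-means with random restarts on top of this, a minimal sketch along the same lines (n_clusters=30 and n_init=10 are just illustrative values; n_init controls the number of restarts):

>>> from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=30, n_init=10).fit(X)  # keeps the restart with the lowest inertia
>>> km.labels_.shape
(50000,)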
Fred Foo
  • Thanks for the hint. KMeans and especially MiniBatchKMeans run a lot faster than Ward. However, I still get awfully few non-empty clusters for my dataset. I would expect clusters of very different sizes: a few very large ones (1-5) and a lot of very small ones (70-200). However, the algorithm gives only 2-25 non-empty clusters. Is there a way to force the algorithm to generate the desired number (30-300) of non-empty clusters? – tisch Jun 20 '12 at 10:02
  • What about 3M data points with ~100 dimensions and 10000+ clusters? That makes sklearn suffer. Any Python suggestions? – Wajih Apr 15 '14 at 22:22
4

My package milk handles this problem easily:

import milk
import numpy as np
data = np.random.rand(50000,7)
%timeit milk.kmeans(data, 300)
1 loops, best of 3: 14.3 s per loop

I wonder whether you meant to write 500,000 data points, because 50k points is not that much. If so, milk takes a while longer (~700 s) but still handles it well, as it does not allocate any memory other than your data and the centroids.
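
For completeness, a sketch of how the call above is typically consumed; I'm assuming milk.kmeans returns the per-point assignments together with the centroids, so check the milk docs for your version:

import milk
import numpy as np

data = np.random.rand(50000, 7)
# assumed return convention: (cluster assignments, centroids)
cluster_ids, centroids = milk.kmeans(data, 300)
print(centroids.shape)  # expected (300, 7)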

luispedro
  • How do I do feature selection and normalization before using the k-means from the `milk` package? – alvas Mar 15 '14 at 16:36
1

The real answer for actually large-scale situations is to use something like FAISS, Facebook Research's library for efficient similarity search and clustering of dense vectors.

See https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization
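
A minimal sketch of FAISS k-means on data of this shape, based on the clustering building block described at that link (niter=20 and the cluster count are placeholder values):

import numpy as np
import faiss

x = np.random.randn(50000, 7).astype('float32')  # FAISS expects float32
ncentroids = 300

kmeans = faiss.Kmeans(x.shape[1], ncentroids, niter=20, verbose=True)
kmeans.train(x)

# assign each point to its nearest centroid
_, assignments = kmeans.index.search(x, 1)
print(kmeans.centroids.shape)  # (300, 7)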

Jules G.M.
0

OpenCV has a k-means implementation, Kmeans2.

Expected running time is on the order of O(n**4): for an order-of-magnitude estimate, see how long it takes to cluster 1,000 points, then multiply that by seven million (50**4, rounded up).
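
In the Python bindings this is exposed as cv2.kmeans; a minimal sketch for data of this shape (the termination criteria and number of attempts are illustrative, not tuned):

import numpy as np
import cv2

data = np.random.rand(50000, 7).astype(np.float32)  # cv2.kmeans requires float32

# stop after 100 iterations or when the centers move by less than 1.0
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 1.0)
compactness, labels, centers = cv2.kmeans(data, 30, None, criteria, 5, cv2.KMEANS_RANDOM_CENTERS)
print(centers.shape)  # (30, 7)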

Hugh Bothwell