
From the documentation of sklearn KMeans

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1)

and SciPy kmeans

scipy.cluster.vq.kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True)

it is clear that the number of parameters differs, and sklearn appears to expose more of them.

Have any of you tried one versus the other and would you have a preference for using one of them in a classification problem?
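For concreteness, here is a hypothetical side-by-side sketch (not from the question; the data and parameter values are made up for illustration) calling both implementations on the same synthetic data:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

# SciPy: a plain function that returns the codebook (centroids) and the
# mean distortion; cluster assignments need a separate call to vq().
centroids, distortion = kmeans(X, 3)
scipy_labels, _ = vq(X, centroids)

# sklearn: an estimator object with the usual fit/predict interface and
# fitted attributes such as labels_ and cluster_centers_.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sklearn_labels = km.labels_
```

Beyond the raw parameter count, the practical difference is the interface: SciPy gives you a bare function, while sklearn gives you an estimator that plugs into pipelines, grid search, and the rest of the toolkit.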

pepe
  • Without trying it, I would always prefer sklearn: better documentation (including user guides) and many more tools you would likely use too, like cross-validation/grid search. But that's just my opinion. – sascha May 13 '16 at 14:29
  • The scipy implementation gives you the option to set your own centroids, which can be nice. Also note that for most applications you'll want to use [kmeans2](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.vq.kmeans2.html), not the one you quote. Besides that, I can't say. – patrick May 13 '16 at 16:06
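To illustrate the custom-centroids point from the comment above (a hypothetical sketch with made-up data): passing an array as `k_or_guess` makes SciPy start from exactly those centroids. Note that sklearn's `init` parameter also accepts an ndarray, so strictly this is possible in both libraries.

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))

# When k_or_guess is an array rather than an int, SciPy runs a single
# k-means pass starting from exactly these centroids.
guess = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
centroids, distortion = kmeans(X, guess)
```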

1 Answer


Benchmark.

And you will never touch the scipy one again.

Has QUIT--Anony-Mousse
  • It seems difficult to compare one to the other -- the SciPy params do not map perfectly onto the sklearn ones: for example, sklearn defaults to 10 initializations (`n_init=10`), while SciPy isn't explicit about this. Using 100 centroids for both and the other params at their defaults, SciPy is faster, but faster doesn't mean better. – pepe May 15 '16 at 01:53
  • Disable all the extras. `n_init=1`, `tol=thresh=0`, `max_iter=iter=100000` (you want the final result, not an interim result). Use a *large* data set. – Has QUIT--Anony-Mousse May 15 '16 at 07:22
  • Scipy has less overhead. A major advantage when running on small datasets. – Michael Mezher Jan 23 '23 at 19:52
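The matched-settings comparison suggested in these comments can be sketched roughly as follows. This is a hypothetical benchmark with made-up sizes: following patrick's pointer, it uses `kmeans2` (whose `iter` is an iteration cap, unlike `kmeans`, where `iter` is a restart count), one initialization per side, matched random-point init, and `tol=0` on the sklearn side so neither stops early. Timings will vary by machine and library version.

```python
import time
import numpy as np
from scipy.cluster.vq import kmeans2
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(20_000, 10))
k = 50

# SciPy: one run, 50 Lloyd iterations, random observations as initial centroids.
t0 = time.perf_counter()
sp_centroids, sp_labels = kmeans2(X, k, iter=50, minit='points')
t_scipy = time.perf_counter() - t0

# sklearn: matched settings -- one init, same iteration cap, tol=0 so the
# tolerance criterion never triggers an early stop.
t0 = time.perf_counter()
km = KMeans(n_clusters=k, init='random', n_init=1, max_iter=50, tol=0,
            random_state=42).fit(X)
t_sklearn = time.perf_counter() - t0

print(f"scipy kmeans2: {t_scipy:.2f}s   sklearn KMeans: {t_sklearn:.2f}s")
```

On larger data sets like this, sklearn's optimized solvers tend to win; as the last comment notes, SciPy's lower call overhead can still make it competitive on small inputs.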