21

I ran a clustering algorithm and want to evaluate the result using the silhouette score in scikit-learn. However, scikit-learn needs to compute the pairwise distance matrix to do so: distances = pairwise_distances(X, metric=metric, **kwds)

My data set has on the order of 300K samples and my machine has only 2GB of memory, so the computation runs out of memory and I cannot evaluate the clustering result.

Does anyone know how to overcome this problem?

Martin Thoma
Thien Bao

1 Answer

27

Set the sample_size parameter in the call to silhouette_score to some value smaller than 300K. With this parameter set, silhouette_score draws a random subset of that many datapoints from X and calculates the score on those instead of the entire array, so distances are only computed among the sampled points.
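
A minimal sketch of such a call, in case it helps. The data here is synthetic (make_blobs) and KMeans is only a stand-in for whatever clustering was actually run; neither comes from the question. The sample_size of 1000 is an arbitrary choice for illustration, and random_state is passed only to make the subsample reproducible.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data (assumption: the question does not show X).
X, _ = make_blobs(n_samples=5000, centers=4, n_features=5, random_state=0)

# Placeholder clustering; any algorithm that yields one label per row works.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Distances are computed only among the 1000 randomly sampled points.
score = silhouette_score(X, labels, sample_size=1000, random_state=0)
print(score)
```

The result is an estimate of the full silhouette score, so it will vary somewhat with the subsample that is drawn.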

mwv
  • 1
    Thank you for your reply. I think this is a good solution. I will run several iterations and then take the mean of the score. – Thien Bao May 08 '13 at 12:04
  • 1
    This works for silhouette_score, but silhouette_samples has no such sample_size parameter. – Keith Dec 07 '17 at 19:02
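
The two comments above could be sketched roughly as follows. Both helper functions (mean_sampled_silhouette and sampled_silhouette_samples) are hypothetical names made up for this illustration, not part of scikit-learn, and the data and KMeans clustering are again synthetic placeholders. The first helper repeats the sampled score with different seeds and averages it; the second works around the missing sample_size in silhouette_samples by subsampling manually before the call.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic placeholder data and clustering (assumptions, not from the question).
X, _ = make_blobs(n_samples=5000, centers=3, n_features=4, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)


def mean_sampled_silhouette(X, labels, sample_size, n_repeats, seed=0):
    """Hypothetical helper: average silhouette_score over several random subsamples."""
    rng = np.random.RandomState(seed)
    scores = [
        silhouette_score(
            X, labels,
            sample_size=sample_size,
            random_state=int(rng.randint(2**31 - 1)),  # fresh seed per repeat
        )
        for _ in range(n_repeats)
    ]
    return float(np.mean(scores)), float(np.std(scores))


def sampled_silhouette_samples(X, labels, sample_size, seed=0):
    """Hypothetical helper: per-point silhouette values on a manual random subset.

    The values are computed relative to the subset only, not the full data.
    """
    rng = np.random.RandomState(seed)
    idx = rng.choice(X.shape[0], size=sample_size, replace=False)
    return idx, silhouette_samples(X[idx], labels[idx])


mean_score, std_score = mean_sampled_silhouette(X, labels, sample_size=1000, n_repeats=5)
idx, per_point = sampled_silhouette_samples(X, labels, sample_size=1000)
print(mean_score, std_score, per_point.shape)
```

The standard deviation across repeats gives a rough sense of how much the estimate depends on the particular subsample.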