Efficient k-means evaluation with silhouette score in sklearn

Question

I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various k, and now want to evaluate the different results with the silhouette score implemented in sklearn. Attempting to run it with no sampling seems unfeasible and takes a prohibitively long time, so I assume I need to use sampling, i.e.:

metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean',sample_size=???)

I don't have a good sense of what an appropriate sampling approach is, however. Is there a rule of thumb for what size sample to use given the size of my matrix? Is it better to take the largest sample my analysis machine can handle, or to take the average of more smaller samples?

I ask in large part because my preliminary test (with sample_size=10000) has produced some really really unintuitive results.

I'm also open to alternative, more scalable evaluation metrics.

Editing to visualize the issue: the plot shows, for varying sample sizes, the silhouette score as a function of the number of clusters.

[Figure: silhouette score vs. number of clusters, one curve per sample size]

What's not weird is that increasing the sample size seems to reduce noise. What is weird, given that I have 1 million very heterogeneous vectors, is that 2 or 3 is the "best" number of clusters. In other words, what's unintuitive is that I find a more-or-less monotonic decrease in silhouette score as I increase the number of clusters.

Define unintuitive results, and try rerunning that test multiple times with different sample sizes. — Fred Foo, May 15 '14 at 20:14
Running code to generate a clarifying plot. Will edit and post asap. — moustachio, May 15 '14 at 20:25
Those silhouette scores are pretty low. Data with strong cluster structure will give you silhouette scores above 0.7 or so. Have you tried using the Gap Statistic to estimate the proper number of clusters? Another possibility is that some of the 100 features are adding noise and are hiding clusters. You might try PCA to get rid of some of the noise. — Daniel Watkins, Aug 28 '14 at 19:12
I've also encountered a similar problem. When I increased the number of clusters, the silhouette score computed by `sklearn.metrics.silhouette_score` decreased monotonically, and I haven't figured out why this happens — LittleLittleQ, Jan 25 '15 at 13:19
@AnnabellChan did you ever figure out what was going on with sklearn.metrics.silhouette_score? I have the same problem of monotonically decreasing values with larger k. — asado23, Feb 10 '15 at 22:19
@asado23 not yet, but I read a paper discussing the main internal validation measures, see [Understanding the Internal Clustering Validation Measures](https://web.njit.edu/~yl473/papers/ICDM10CLU.pdf) and replaced `silhouette score` with `SDbw`, which was demonstrated to be the most robust index in this paper — LittleLittleQ, Feb 11 '15 at 15:01
All things being equal, the silhouette score will decrease if you increase the number of clusters, or increase the number of features used as anchors for the model. Another thing to keep in mind is, just like correlation scores, from a real life application standpoint, suggesting that 0.7 and above are the best scores, is not realistic. — BlackHat, Sep 13 '18 at 21:05

score 6 · Answer 1 · answered May 02 '17 at 02:46

Other metrics

Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off. (a good description is here https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously if you have k == number of data points, you can explain 100% of the variance. The question is where do the improvements in variance explained start to level off.
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. for the AICc, it just balances the increase in likelihood as you increase K with the increase in the number of parameters you need. In practice all you do is choose the K that minimises the AICc.
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, like DBSCAN. I haven't seen this approach used to estimate K, though, and it is likely inadvisable to rely on it like this. However, if DBSCAN also gave you a small number of clusters here, then there's likely something about your data that you might not be appreciating (i.e. not as many clusters as you're expecting).
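As a rough sketch of the first two ideas above (not the asker's actual setup): the snippet below computes the fraction of variance explained by k-means for each K, and, as a stand-in for "a likelihood for a given K", the BIC of a spherical Gaussian mixture. The synthetic `make_blobs` data, the range of K, and the choice of a spherical `GaussianMixture` are all assumptions made for illustration. On ~1 million rows you would probably swap in `MiniBatchKMeans` or fit on a subsample.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the real feature matrix (assumption: 4 well-separated blobs).
X, _ = make_blobs(n_samples=2000, n_features=10, centers=4, random_state=0)

# Total sum of squares around the global mean.
total_ss = float(((X - X.mean(axis=0)) ** 2).sum())

explained = {}  # k -> fraction of variance explained by k-means
bic = {}        # k -> BIC of a spherical Gaussian mixture (lower is better)
for k in range(1, 9):
    # inertia_ is the within-cluster sum of squares for the best of n_init runs.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    explained[k] = 1.0 - km.inertia_ / total_ss

    # k-means itself has no likelihood, so a spherical GMM is used here instead.
    gmm = GaussianMixture(
        n_components=k, covariance_type="spherical", n_init=3, random_state=0
    ).fit(X)
    bic[k] = gmm.bic(X)

for k in explained:
    print(f"k={k}  variance explained={explained[k]:.3f}  BIC={bic[k]:.0f}")
print("k with lowest BIC:", min(bic, key=bic.get))
```

For the elbow method you would plot `explained` against K and look for where the curve flattens; for the information-theoretic route you would pick the K with the lowest BIC (or AIC/AICc).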

How much to sample

It looks like you've answered this with your plot: no matter the sample size, you get the same pattern in silhouette score. So that pattern seems very robust to the sampling choice.

score 1 · Answer 2 · answered May 30 '17 at 14:34

k-means converges to local minima, so the starting positions play a crucial role in which number of clusters looks optimal. It is often a good idea to reduce the noise and the number of dimensions using PCA or another dimensionality reduction technique before running k-means.
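A minimal sketch of that suggestion, assuming synthetic data in place of the real matrix: standardize, reduce with PCA, then run k-means with several restarts (`n_init`) so a single unlucky initialization matters less. The 90% variance threshold and `n_clusters=5` are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ~100-feature matrix (assumption: 5 blobs).
X, _ = make_blobs(n_samples=3000, n_features=100, centers=5, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),
    # Keep as many components as needed to explain 90% of the variance.
    PCA(n_components=0.9, svd_solver="full"),
    # n_init=10 runs k-means from 10 different starts and keeps the best one.
    KMeans(n_clusters=5, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

n_components_kept = pipeline.named_steps["pca"].n_components_
print("PCA components kept:", n_components_kept)
print("distinct labels:", len(set(labels)))
```

If you then compute a silhouette score, it probably makes sense to compute it on the PCA-reduced features rather than the raw ones, since that is the space the clustering was done in.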

Just to add, for the sake of completeness: it might be a good idea to find the optimal number of clusters with "partitioning around medoids" (PAM). It is equivalent to using the silhouette method.

One reason for the weird observations could be different starting points for the different-sized samples.

Having said all the above, it is important to evaluate the clusterability of the dataset at hand. One tractable way to do so is the Worst Pair ratio, as discussed here: Clusterability.

score 0 · Answer 3 · answered Feb 12 '20 at 03:32

Since there is no widely accepted best approach to determining the optimal number of clusters, all evaluation techniques, including Silhouette Score, Gap Statistic, etc., fundamentally rely on some form of heuristic or trial-and-error argument. So to me, the best approach is to try out multiple techniques and to NOT develop over-confidence in any of them.

In your case, the ideal and most accurate score would be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest possible sample size your machine can handle. The rationale is the same as getting as many data points as possible out of the population of interest.

One more thing: the sklearn implementation of Silhouette Score uses random (non-stratified) sampling. You can repeat the calculation multiple times using the same sample size (say sample_size=50000) to get a sense of whether the sample size is large enough to produce consistent results.
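A sketch of that repeated-sampling check, under the assumption that synthetic data and a small `sample_size` stand in for the real thing. The helper `repeated_silhouette` is a hypothetical name, not part of sklearn; it just calls `silhouette_score` several times with different `random_state` values and reports the mean and standard deviation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data and cluster labels (assumption).
X, _ = make_blobs(n_samples=5000, n_features=20, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)


def repeated_silhouette(X, labels, sample_size, n_repeats=5, seed=0):
    """Hypothetical helper: silhouette on n_repeats random subsamples."""
    scores = [
        silhouette_score(
            X, labels, metric="euclidean",
            sample_size=sample_size, random_state=seed + i,
        )
        for i in range(n_repeats)
    ]
    return float(np.mean(scores)), float(np.std(scores))


mean_score, std_score = repeated_silhouette(X, labels, sample_size=1000)
print(f"silhouette: mean={mean_score:.3f}, std={std_score:.3f}")
```

If the standard deviation across repeats is small relative to the differences between the scores for different k, the sample size is probably large enough; if not, increase `sample_size` or average over more repeats.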
