What are the performance metrics for Clustering Algorithms?

Question

I'm working on K-means clustering, but unlike in supervised learning, I cannot figure out which performance metrics apply to clustering algorithms. How can I measure accuracy after training on the data?

I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). — desertnaut, Jun 03 '21 at 08:07

hafiz031 · Accepted Answer · 2021-06-03T19:39:45.253


For k-means, you can check the fitted model's inertia_ attribute, which gives you an idea of how well the algorithm has worked.

from sklearn.cluster import KMeans

kmeans = KMeans(...)
# Assuming the model has already been fitted, e.g. with kmeans.fit(X).
kmeans.inertia_  # lower is better

Alternatively, call the score() method, which returns the same quantity with a negative sign. By convention a bigger score means a better model, but for k-means a lower inertia_ is better, so score() negates the value to stay consistent with that convention.

# Call score with data X
kmeans.score(X) # greater is better
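To make the relationship concrete, here is a minimal sketch on synthetic data. The dataset from make_blobs and the parameter values (n_clusters=3, n_init=10, random_state=0) are assumptions chosen only for illustration; they are not from the original answer. It also recomputes the inertia by hand from the fitted centroids.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (assumed for illustration): 300 points around 3 centres.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

inertia = kmeans.inertia_   # sum of squared distances, lower is better
score = kmeans.score(X)     # negated inertia on X, greater is better

# Recompute the inertia by hand: squared distance from each point
# to the centroid of the cluster it was assigned to.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)

print("inertia_      :", inertia)
print("score(X)      :", score)
print("manual inertia:", manual_inertia)
```

On the training data the three numbers should agree up to sign and floating-point rounding.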

This is the most basic way to analyze k-means performance. In practice, if you set the number of clusters too high, score() will increase (in other words, inertia_ will decrease), because inertia_ is simply the sum of the squared distances from each point to the centroid of the cluster it is assigned to. As the number of clusters grows, every point ends up with a centroid very close to it, so the sum keeps shrinking even though the clustering itself is poor. For a better analysis, compute the silhouette score or, better still, plot a silhouette diagram.
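A rough sketch of that effect, assuming synthetic data with four well-separated blobs (the centres, cluster_std, the range of k, and the random seeds are all illustrative choices, not from the notebook). It prints inertia_ next to sklearn.metrics.silhouette_score for each k, so you can see inertia keep falling while the silhouette score peaks.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated blobs (assumed for illustration).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=1.0,
    random_state=42,
)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = kmeans.inertia_
    # Mean silhouette coefficient over all samples, in [-1, 1]; higher is better.
    silhouettes[k] = silhouette_score(X, kmeans.labels_)
    print(f"k={k}  inertia={inertias[k]:10.1f}  silhouette={silhouettes[k]:.3f}")

# Unlike inertia, the silhouette score does not keep improving as k grows.
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

On real data the peak is usually less clear-cut, which is where a per-cluster silhouette diagram helps more than the single averaged number.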

You will find all of the implementations in this notebook: 09_unsupervised_learning.ipynb

The book corresponding to this repository is: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. It is a great book to learn all of these details.

edited Jun 03 '21 at 19:39

answered Jun 03 '21 at 07:45

hafiz031


Thanks Hafiz. How can I connect with you? – Komali Jun 03 '21 at 10:27
Should scaling be done first, or should PCA be applied first? – Komali Jun 03 '21 at 10:28

Scaling should be done first, see here: https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html#standardizing – hafiz031 Jun 03 '21 at 10:47
How do I know which clustering algorithm works well? Suppose I use three clustering algorithms: how do I know which one performs better? – Komali Jun 03 '21 at 11:22
Theoretically speaking, it depends on your data: if your clusters are roughly the same size and spherical, try 'K-Means'; if the cluster blobs are ellipsoidal, use a Gaussian Mixture Model; if you think the clusters can have arbitrary shapes but form continuous regions of high density, use 'DBSCAN'. – hafiz031 Jun 03 '21 at 11:54
For this kind of prototyping, the `PyCaret` library is super helpful. It requires less code, which speeds up prototyping with various algorithms. For clustering see: https://pycaret.readthedocs.io/en/latest/tutorials.html#clustering You will also find very good tutorials on it on `YouTube`. – hafiz031 Jun 03 '21 at 12:01
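As a rough follow-up to the comments above, one way to compare several algorithms on the same data is to score each labelling with the silhouette score. This is only a sketch under assumptions: the toy data, eps=0.3, min_samples=5, and the choice of three clusters are illustrative, and the silhouette score itself favours compact, convex clusters, so it can understate DBSCAN on irregular shapes. The data is scaled first, in line with the scaling advice given earlier.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Toy data (assumed for illustration): three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 0], [0, 8]],
    cluster_std=0.8,
    random_state=0,
)
X = StandardScaler().fit_transform(X)  # scale before clustering

labelings = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "GaussianMixture": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
}

scores = {}
for name, labels in labelings.items():
    mask = labels != -1  # DBSCAN labels noise points as -1; leave them out
    if len(set(labels[mask])) < 2:
        scores[name] = float("nan")  # silhouette needs at least 2 clusters
    else:
        scores[name] = silhouette_score(X[mask], labels[mask])
    print(f"{name:16s} silhouette = {scores[name]:.3f}")
```

Treat the numbers as one signal among several; inspecting the clusters visually or against domain knowledge is still worthwhile.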
