How to find the optimal number of clusters with K-Means clustering in Python

Question

I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. Every user rated at least one movie. Each rating is 1 for good, 0 for bad, or blank if the annotator made no choice.

I want to cluster similar users based on their reviews. The idea is that users who rated the same movies as good are likely to agree on a movie that nobody in their cluster has rated yet. I used the cosine similarity measure with k-means clustering. The CSV file is shown below:

  UserID    M1   M2   M3   ...   M200
  user1     1    0               0
  user2     0    1    1
  user3     1    1               1
    .
    .
    .
  user100   1    0    1

The problem I am facing is that I don't know how to find the optimal number of clusters for this dataset and then draw a graph of those clusters. Clustering with k-means itself works without issue, but I want to know the most stable or optimal number of clusters for this dataset.

I would appreciate some help.

Usually you do this using the Bayesian Information Criterion. — user3684792, Feb 01 '21 at 10:46
@user3684792, can you please provide an example of how? — ToBeEXP, Feb 01 '21 at 10:48

score 6 · Accepted Answer · answered Feb 01 '21 at 16:28

Clustering is part of the unsupervised machine learning methods. Contrary to supervised methods, in unsupervised methods there is not a straightforward approach to determine the "best" model among a set of models that were trained on a certain dataset.

Nonetheless, there are some quantitative measures. Most of them are based on the question "how much more similar are the points in a cluster to each other than to the points in different clusters?" I suggest you take a look at the scikit-learn documentation on clustering evaluation, in particular at all the techniques that do not require labels_true (i.e. all the unsupervised techniques). Once you have a quantitative measure of the "goodness" of a clustering, you usually observe how this quantity evolves as you change the number of clusters; this approach is called the Elbow Method.

Here is some code that runs the K-Means algorithm for every K value from 2 to 30, calculates various scores for each K value, and stores all scores in a DataFrame.

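import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score,
                             silhouette_score)

seed_random = 1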

fitted_kmeans = {}
labels_kmeans = {}
df_scores = []
k_values_to_try = np.arange(2, 31)
for n_clusters in k_values_to_try:
    
    #Perform clustering.
    kmeans = KMeans(n_clusters=n_clusters,
                    random_state=seed_random,
                    )
    labels_clusters = kmeans.fit_predict(X)
    
    #Insert fitted model and calculated cluster labels in dictionaries,
    #for further reference.
    fitted_kmeans[n_clusters] = kmeans
    labels_kmeans[n_clusters] = labels_clusters
    
    #Calculate various scores, and save them for further reference.
    silhouette = silhouette_score(X, labels_clusters)
    ch = calinski_harabasz_score(X, labels_clusters)
    db = davies_bouldin_score(X, labels_clusters)
    tmp_scores = {"n_clusters": n_clusters,
                  "silhouette_score": silhouette,
                  "calinski_harabasz_score": ch,
                  "davies_bouldin_score": db,
                  }
    df_scores.append(tmp_scores)

#Create a DataFrame of clustering scores, using `n_clusters` as index, for easier plotting.
df_scores = pd.DataFrame(df_scores)
df_scores.set_index("n_clusters", inplace=True)

This code assumes that all your numerical features are in a DataFrame X. All clustering performance metrics are stored in df_scores DataFrame. You can easily use the elbow method by plotting columns from df_scores; for instance, if you want to see the elbow graph of the Silhouette Score, you can use df_scores["silhouette_score"].plot().
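As a rough sketch of how such a table of scores can be read: higher is better for the silhouette and Calinski-Harabasz scores, while lower is better for the Davies-Bouldin score. The snippet below rebuilds a small df_scores on synthetic data and reports the K that each metric prefers. The make_blobs data and the range 2 to 10 are assumptions made only so the example runs on its own; substitute your own X.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for X (an assumption for illustration only).
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

rows = []
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    rows.append({
        "n_clusters": k,
        "silhouette_score": silhouette_score(X, labels),
        "calinski_harabasz_score": calinski_harabasz_score(X, labels),
        "davies_bouldin_score": davies_bouldin_score(X, labels),
    })
df_scores = pd.DataFrame(rows).set_index("n_clusters")

# Higher is better for silhouette and Calinski-Harabasz;
# lower is better for Davies-Bouldin.
best_k = {
    "silhouette_score": df_scores["silhouette_score"].idxmax(),
    "calinski_harabasz_score": df_scores["calinski_harabasz_score"].idxmax(),
    "davies_bouldin_score": df_scores["davies_bouldin_score"].idxmin(),
}
print(best_k)
```

The metrics do not always agree with each other, so it is usually worth looking at the full curves as well as the single best value.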

Thanks a lot @Enrico_Gandini. I will check it with the example I provided in the question. Since I am clustering users based on similar ratings, I assume that the numerical features are the rating values given to movies: 0, 1, and null for no rating. — ToBeEXP, Feb 01 '21 at 18:05
Yes @ToBeEXP, in your scenario, the numerical features are the columns containing 0 and 1. In general, clustering algorithms need the data to be complete, so you cannot have null values. You may decide to fill your null values with -1, and remember that -1 corresponds to "answer not given for this movie by a certain user". My answer was about how to evaluate clustering algorithms and find the best number of clusters in general. Considering your specific dataset, I am not even sure that KMeans is a good idea! Maybe there are more specific algorithms. — Enrico Gandini, Feb 01 '21 at 19:46
Thanks Enrico. Why do you think that K-means is not a good idea in my case? Could you comment on this last point if you have time? — ToBeEXP, Feb 01 '21 at 20:37
@ToBeEXP, I think K-means, and other clustering algorithms, are meant to be used on continuous numerical features. K-means in particular defines clusters by calculating a Euclidean distance between the points, and I do not think that Euclidean distance is meaningful on your kind of data. I am also not sure that the basic assumptions of K-means still hold if you change the distance metric (you mentioned you wanted to use cosine distance). In my opinion, you should try looking into other kinds of algorithms, such as Association Rules or Sequential Patterns, but I am not an expert on those! — Enrico Gandini, Feb 02 '21 at 00:00
Thank you Enrico. I am still a beginner and it's not exactly my area of research, but I have to do it. The information you provided is very helpful and I will look into the other possibilities. As for your point that there can't be nulls: I have already replaced them. — ToBeEXP, Feb 02 '21 at 10:14
import numpy as np ; from sklearn.cluster import KMeans ; from sklearn.metrics import silhouette_score ; from sklearn.metrics import calinski_harabasz_score ; from sklearn.metrics import davies_bouldin_score ; from sklearn.datasets import load_iris ; import pandas as pd ; iris = load_iris() ; X = iris.data[:, :2] ; — Tom J, Oct 08 '22 at 07:24
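For the 0/1/blank rating matrix discussed in the comments above, a minimal sketch of the "fill nulls with -1" step could look like this. The three-user CSV text is made up to mirror the layout in the question, and whether -1 is a sensible encoding for "no rating" is itself an assumption to check against your data.

```python
import io

import pandas as pd

# Tiny made-up sample in the question's layout (hypothetical values).
csv_text = """UserID,M1,M2,M3
user1,1,0,
user2,0,1,1
user3,1,1,
"""

ratings = pd.read_csv(io.StringIO(csv_text), index_col="UserID")

# Blank cells are read as NaN; replace them with -1 ("no rating given").
X = ratings.fillna(-1).astype(int)
print(X)
```

With a real file, the io.StringIO object would be replaced by the path to the CSV.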

Morten Jensen · Answer 2 · 2021-02-01T20:54:46.103

It's pretty common to start with visualizing the data. Sometimes it is obvious graphically, that there are N classes/clusters. Other times you may be able to see if it's <5, <10, or <100 classes. It depends on your data really.
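One possible way to visualize data with many columns is to project it to two dimensions first. The sketch below uses PCA for the projection, which is my choice for illustration and not something prescribed above. The random 0/1 matrix stands in for a users-by-movies table, and project_to_2d and plot_clusters are hypothetical helper names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def project_to_2d(X, random_state=0):
    """Return an (n_samples, 2) array with the first two principal components."""
    return PCA(n_components=2, random_state=random_state).fit_transform(X)


def plot_clusters(coords, labels, path="clusters.png"):
    """Save a scatter plot of the 2-D coordinates, coloured by cluster label."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=20)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    fig.savefig(path)
    plt.close(fig)


# Random 0/1 matrix standing in for 100 users x 200 movies (made-up data).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 200))

coords = project_to_2d(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

Calling plot_clusters(coords, labels) would then write the scatter plot to clusters.png (this step needs matplotlib). On purely random data like this no visible groups are expected; the point is only the mechanics.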

Another common approach is to use the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).

The main takeaway is that many clustering problems yield a trivially perfect fit if, for example, you have as many clusters as you have inputs: every input fits perfectly in its own cluster.

BIC and AIC penalize more complex solutions, based on the insight that simpler models are often better and more stable, i.e. they generalize better and overfit less.

From Wikipedia:

When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
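As a hedged illustration of the idea: scikit-learn's KMeans does not expose BIC or AIC, so this sketch swaps in a Gaussian mixture model (GaussianMixture), which does provide bic() and aic() methods. The synthetic make_blobs data and the candidate range 1 to 6 are assumptions for the example only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (an assumption for illustration only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

candidates = range(1, 7)
bic_scores = []
aic_scores = []
for k in candidates:
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_scores.append(gmm.bic(X))  # lower is better
    aic_scores.append(gmm.aic(X))  # lower is better

best_k_bic = candidates[int(np.argmin(bic_scores))]
best_k_aic = candidates[int(np.argmin(aic_scores))]
print("BIC prefers", best_k_bic, "components; AIC prefers", best_k_aic)
```

Because the BIC penalty per parameter grows with the logarithm of the sample size, BIC tends to favour fewer components than AIC on larger datasets.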

Thanks a lot @Morten for the details. I will look into the approaches you mentioned. Since I am a novice at clustering, I will need some practical examples for a better explanation, which I will try to search for. — ToBeEXP, Feb 02 '21 at 10:10

score 0 · Answer 3 · answered Feb 01 '21 at 10:49

You can use the Gini index as a metric, and then do a Grid Search based on this metric. Tell me if you have any other question.

pfrodedelaforet

I mentioned in my question that I am new to clustering, and I have no idea what you just mentioned. Can you please give an example, maybe for the dataset I described above? Thanks – ToBeEXP Feb 01 '21 at 10:52

score 0 · Answer 4 · answered Feb 01 '21 at 10:57

You could use the elbow method.

The basic idea of K-Means is to cluster the data points such that the total within-cluster sum of squares (WSS) is minimized. Hence you can vary k from 2 to n, calculate the WSS for each k, and plot the resulting curve. The location of the bend in that curve can be considered a good number of clusters.
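A minimal sketch of the elbow method described above, using the inertia_ attribute of a fitted scikit-learn KMeans (the within-cluster sum of squared distances). The synthetic data and the range 2 to 10 are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption for illustration only).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Within-cluster sum of squares for each candidate k.
wss = {}
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = kmeans.inertia_

for k, value in wss.items():
    print(k, round(value, 1))
```

Plotting wss against k (for example with pandas.Series(wss).plot()) shows the curve; the elbow is the point after which adding clusters reduces the WSS only slightly.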

Sahil_Angra

Can you please provide an example? – ToBeEXP Feb 01 '21 at 11:02
