I have a problem. I am working with k-means and would like to find the optimal number of clusters. Unfortunately, my data set is too large to compute the silhouette score. Is there a way to adapt this code and replace the silhouette with the inertia?

Minimal example:

from sklearn.cluster import KMeans
import numpy as np
from sklearn.metrics import silhouette_score
import matplotlib as mpl
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [10, 2], [10, 4], [10, 0],
              [1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [10, 2], [10, 4], [10, 0],
              [1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [10, 2], [10, 4], [10, 0],
              [1, 2], [1, 4], [1, 0],])

kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(X)
                for k in range(1, 10)]
inertias = [model.inertia_ for model in kmeans_per_k]

silhouette_scores = [silhouette_score(X, model.labels_)
                     for model in kmeans_per_k[1:]]


from sklearn.metrics import silhouette_samples
from matplotlib.ticker import FixedLocator, FixedFormatter

plt.figure(figsize=(11, 9))

for k in (3, 4, 5, 6):
    plt.subplot(2, 2, k - 2)
    
    y_pred = kmeans_per_k[k - 1].labels_
    silhouette_coefficients = silhouette_samples(X, y_pred)

    padding = len(X) // 30
    pos = padding
    ticks = []
    for i in range(k):
        coeffs = silhouette_coefficients[y_pred == i]
        coeffs.sort()

        color = mpl.cm.Spectral(i / k)
        plt.fill_betweenx(np.arange(pos, pos + len(coeffs)), 0, coeffs,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ticks.append(pos + len(coeffs) // 2)
        pos += len(coeffs) + padding

    plt.gca().yaxis.set_major_locator(FixedLocator(ticks))
    plt.gca().yaxis.set_major_formatter(FixedFormatter(range(k)))
    if k in (3, 5):
        plt.ylabel("Cluster")
    
    if k in (5, 6):
        plt.gca().set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
        plt.xlabel("Silhouette Coefficient")
    else:
        plt.tick_params(labelbottom=False)

    plt.axvline(x=silhouette_scores[k - 2], color="red", linestyle="--")
    plt.title("$k={}$".format(k), fontsize=16)

#save_fig("silhouette_analysis_plot")
plt.show()

What I want, but with inertia instead of the silhouette coefficient: [image]

Test
  • Have you tried using mini-batch k-means to help with large datasets? https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html – Muhammad Pathan Jun 07 '22 at 15:15
  • @MuhammadPathan The problem is not that `KMeans` in general takes a long time, but that the `silhouette` metric takes a long time to calculate. – Test Jun 08 '22 at 06:21
  • Is a different metric an option? Maybe [Gap statistics](https://towardsdatascience.com/k-means-clustering-and-the-gap-statistics-4c5d414acd29) are an option in your case. – code-lukas Jun 11 '22 at 09:03
  • Another metric would be possible. I would like to use the code to graphically illustrate the values of a different metric, `Inertia` or another one. – Test Jun 11 '22 at 12:44
  • Well, you can visualize both the intra-cluster distance as well as the gap statistics – code-lukas Jun 11 '22 at 14:32

1 Answer

First of all, I suggest calculating the silhouette score on a subset of the data using the `sample_size` and `random_state` arguments of `silhouette_score` (the latter for reproducibility). This can save you some time while still giving a fairly comprehensive picture (see the `silhouette_score` documentation for usage). But as you know, there are plenty of other options for measuring and visualizing clustering quality. The one you mentioned is the elbow method (inertia), which can be used like this:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, n_features=2,
                  random_state=0)
# Fit k-means for k = 2..11 and record the inertia of each model
ks = list(range(2, 12))
scores = [KMeans(n_clusters=k).fit(X).inertia_
          for k in ks]
# Recent seaborn versions require x and y as keyword arguments
sns.lineplot(x=ks, y=scores)
plt.xlabel('Number of clusters')
plt.ylabel("Inertia")
plt.title("Inertia of k-Means versus number of clusters")
plt.show()

[image: line plot of inertia versus number of clusters]
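The subsampling suggestion from the start of this answer might look roughly like the sketch below. The synthetic data, the range of `k`, and the sample size of 1000 are assumptions chosen for illustration; only `sample_size` and `random_state` are the actual `silhouette_score` parameters referred to above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a large data set (assumption: replace with your own X).
X, _ = make_blobs(n_samples=5000, centers=4, n_features=2, random_state=0)

ks = list(range(2, 8))
sampled_scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Only 1000 randomly drawn points enter the pairwise-distance computation,
    # so the cost no longer grows quadratically with the full data set size.
    score = silhouette_score(X, labels, sample_size=1000, random_state=42)
    sampled_scores.append(score)

best_k = ks[int(np.argmax(sampled_scores))]
print(dict(zip(ks, np.round(sampled_scores, 3))), "best k:", best_k)
```

The sampled score is an estimate, so it is worth repeating it with a few different `random_state` values before trusting small differences between neighbouring values of `k`.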

This article introduces several useful yet simple techniques for assessing clustering quality.
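If the goal is a diagram shaped like the silhouette plot from the question but driven by inertia, one possible approach is to plot each sample's squared distance to its assigned centroid, grouped by cluster. These per-sample values sum to `inertia_`, and they are cheap to compute because no pairwise distances are needed. This is only a sketch under assumptions: the synthetic data and the helper `squared_distances` are illustrative and not part of scikit-learn.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption: replace with your own X).
X, _ = make_blobs(n_samples=600, centers=4, n_features=2, random_state=0)

def squared_distances(model, X):
    """Squared distance of each sample to its assigned centroid.

    Hypothetical helper; the values sum to model.inertia_.
    """
    return np.sum((X - model.cluster_centers_[model.labels_]) ** 2, axis=1)

fig, axes = plt.subplots(2, 2, figsize=(11, 9))
for ax, k in zip(axes.ravel(), (3, 4, 5, 6)):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sq_dist = squared_distances(model, X)

    padding = len(X) // 30
    pos = padding
    ticks = []
    for i in range(k):
        # Sorted per-sample contributions of cluster i to the inertia
        contrib = np.sort(sq_dist[model.labels_ == i])
        color = plt.cm.Spectral(i / k)
        ax.fill_betweenx(np.arange(pos, pos + len(contrib)), 0, contrib,
                         facecolor=color, edgecolor=color, alpha=0.7)
        ticks.append(pos + len(contrib) // 2)
        pos += len(contrib) + padding

    ax.set_yticks(ticks)
    ax.set_yticklabels([str(i) for i in range(k)])
    # Red line: mean squared distance, i.e. inertia divided by sample count
    ax.axvline(x=model.inertia_ / len(X), color="red", linestyle="--")
    ax.set_title(f"k={k}, inertia={model.inertia_:.1f}")
    ax.set_xlabel("Squared distance to centroid")
    ax.set_ylabel("Cluster")

fig.tight_layout()
# Call plt.show() or fig.savefig(...) to view the result.
```

Unlike the silhouette coefficient, these values are not bounded or normalized, and inertia always decreases as `k` grows, so the panels are better suited to spotting loose or unbalanced clusters than to picking `k` on their own.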

meti
  • Thank you. However, I would like to plot the `Inertia` metric instead of the `silhouette`, and this is my question: how can I modify the diagram / chart to show the `Inertia` metric? – Test Jun 13 '22 at 06:42