I have this code to choose the best number of clusters using the silhouette method:

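import os
from typing import Tuple

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# `logger` is assumed to be configured elsewhere in the module.


def kmeans_silhouette(data) -> Tuple[np.ndarray, np.ndarray]:
    """
    Performs silhouette method to choose the best result for kMeans clustering.

    :param data: data to be clustered.
    :return: cluster labels and centroids of the best-scoring clustering.
    """
    logger.info(len(data))
    # A single sample cannot be clustered; return it as its own centroid.
    if len(data) == 1:
        return np.array([0]), data

    range_n_clusters = [2, 3, 4, 5, 6]
    labels = None
    centroids = None
    silhouette = -999

    for n_clusters in range_n_clusters:
        kmeans = KMeans(n_clusters=n_clusters, random_state=0)
        cluster_labels = kmeans.fit_predict(data)

        # This call raises the ValueError when only one distinct label comes back.
        silhouette_avg = silhouette_score(data, cluster_labels)
        if silhouette_avg > silhouette:
            silhouette = silhouette_avg
            labels = cluster_labels
            centroids = kmeans.cluster_centers_

        # .get() avoids a KeyError when DEBUG is not set in the environment.
        if os.environ.get("DEBUG"):
            logger.info(
                f"For n_clusters = {n_clusters}, the average silhouette_score is {silhouette_avg}"
            )

    return labels, centroids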

However, sometimes this error pops up:

  File "/home/paula/.local/lib/python3.6/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "/home/paula/.local/lib/python3.6/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 228, in silhouette_samples
    check_number_of_labels(len(le.classes_), n_samples)
  File "/home/paula/.local/lib/python3.6/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 35, in check_number_of_labels
    "to n_samples - 1 (inclusive)" % n_labels)
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

This should happen when only one cluster is found, because the silhouette method requires at least two clusters (see "ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score").
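The bound quoted in the error message can be checked before calling silhouette_score. This is only a minimal sketch: has_valid_label_count is a hypothetical helper name, not part of scikit-learn, and it simply mirrors the "2 to n_samples - 1 (inclusive)" rule from the message.

```python
def has_valid_label_count(labels) -> bool:
    """Return True if the labelling satisfies the bound from the error message.

    Hypothetical helper (not a scikit-learn function): checks that
    2 <= number of distinct labels <= n_samples - 1.
    """
    n_samples = len(labels)
    n_labels = len(set(labels))  # distinct cluster ids actually assigned
    return 2 <= n_labels <= n_samples - 1
```

Inside the loop, a labelling that fails this check could be skipped with continue instead of being passed to silhouette_score.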

So I checked the unique cluster labels, and KMeans is returning just one:

logger.info(np.unique(kmeans.labels_))

INFO [0]

But I specified that the minimum number of clusters I want is 2. I wonder whether it makes any sense for KMeans to take a parameter specifying the number of clusters and yet return fewer clusters than requested.
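One situation where this can plausibly happen is duplicate points: if data contains fewer distinct rows than n_clusters, there are not enough different points to fill every cluster, and scikit-learn, as far as I can tell, reports this with a ConvergenceWarning about the number of distinct clusters found. The sketch below assumes data is a 2-D sequence of rows. Both function names are hypothetical helpers, not scikit-learn API, and they only filter the candidate values of n_clusters.

```python
def count_distinct_points(data) -> int:
    """Count distinct rows in a 2-D sequence (assumes each row is iterable)."""
    return len({tuple(row) for row in data})


def usable_cluster_counts(data, candidates=(2, 3, 4, 5, 6)):
    """Keep only the candidate n_clusters values that the data can support.

    A candidate k is kept when there are at least k distinct points and
    k does not exceed n_samples - 1 (the upper bound for the silhouette score).
    """
    n_distinct = count_distinct_points(data)
    n_samples = len(data)
    return [k for k in candidates if k <= n_distinct and k <= n_samples - 1]
```

If usable_cluster_counts(data) comes back empty, the data cannot be split into two or more non-empty clusters, which would match the single label [0] seen in the log.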

pceccon