I am trying to calculate silhouette score as I find the optimal number of clusters to create, but get an error that says:

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I am unable to understand the reason for this. Here is the code that I am using to cluster and calculate the silhouette score.

I read the CSV that contains the text to be clustered and run K-Means for each candidate number of clusters. What could be the reason I am getting this error?

# Create clusters using K-Means
# Only creates graph
import matplotlib
#matplotlib.use('Agg')
import re
import os
import nltk, math, codecs
import csv
from nltk.corpus import stopwords
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import silhouette_score

model_name = checkpoint_save_path
loaded_model = Doc2Vec.load(model_name)

#Load the test csv file
data = pd.read_csv(test_filename)
overview = data['overview'].astype('str').tolist()
overview = filter(bool, overview)
vectors = []

def split_words(text):
  return ''.join([x if x.isalnum() or x.isspace() else " " for x in text ]).split()

def preprocess_document(text):
  sp_words = split_words(text)
  return sp_words

for i, t in enumerate(overview):
  vectors.append(loaded_model.infer_vector(preprocess_document(t)))

sse = {}
silhouette = {}


for k in range(1,15):
  km = KMeans(n_clusters=k, max_iter=1000, verbose = 0).fit(vectors)
  sse[k] = km.inertia_
  #FOLLOWING LINE CAUSES ERROR
  silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')

best_cluster_size = 1
min_error = float("inf")

for cluster_size in sse:
    if sse[cluster_size] < min_error:
        min_error = sse[cluster_size]
        best_cluster_size = cluster_size

print(sse)
print("====")
print(silhouette)
seralouk
Suhail Gupta

4 Answers

38

The error occurs because you loop over different numbers of clusters. During the first iteration, n_clusters is 1, so all(km.labels_ == 0) is True.

In other words, you have only one cluster with label 0 (thus, np.unique(km.labels_) prints array([0], dtype=int32)).


silhouette_score requires at least 2 distinct cluster labels. With a single label it raises this error, and the error message states the valid range.


Example:

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

km = KMeans(n_clusters=3)
km.fit(X,y)

# check how many unique labels there are
np.unique(km.labels_)
# array([0, 1, 2], dtype=int32)

We have 3 different clusters/cluster labels.

silhouette_score(X, km.labels_, metric='euclidean')
0.38788915189699597

The function works fine.


Now, let's cause the error:

km2 = KMeans(n_clusters=1)
km2.fit(X,y)

silhouette_score(X, km2.labels_, metric='euclidean')
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
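
If the loop has to tolerate labelings with a single cluster, one option is to check the number of distinct labels before calling silhouette_score. The following is a minimal sketch, not part of the original answer; the helper name safe_silhouette is hypothetical, and the fixed random_state and n_init values are assumptions made only for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score


def safe_silhouette(X, labels):
    """Hypothetical helper: return the silhouette score, or None when it is undefined."""
    n_labels = len(np.unique(labels))
    n_samples = len(labels)
    # Same validity range as in the error message: 2 <= n_labels <= n_samples - 1.
    if n_labels < 2 or n_labels > n_samples - 1:
        return None
    return silhouette_score(X, labels, metric="euclidean")


X = load_iris().data
one_cluster = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
three_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

score_one = safe_silhouette(X, one_cluster.labels_)      # undefined, so None
score_three = safe_silhouette(X, three_clusters.labels_)  # a value in [-1, 1]
print(score_one, score_three)
```

Returning None instead of raising keeps a parameter sweep running; the caller then has to skip the None entries when comparing scores.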
seralouk
4

From the documentation,

Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1

So one way to solve this problem is to start the iteration from k = 2: use for k in range(2,15) instead of for k in range(1,15). That works for me.
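
A minimal sketch of that loop, assuming synthetic data from make_blobs in place of the Doc2Vec vectors from the question (the data, random_state and n_init values are assumptions for illustration). Inertia is still recorded for k = 1, while the silhouette score is only computed from k = 2 onward:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the document vectors (assumption, not the asker's data).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

sse = {}
silhouette = {}

for k in range(1, 15):
    km = KMeans(n_clusters=k, n_init=10, max_iter=1000, random_state=0).fit(X)
    # Inertia is defined for any k, including k = 1.
    sse[k] = km.inertia_
    # The silhouette score needs at least 2 distinct labels, so skip k = 1.
    if k >= 2:
        silhouette[k] = silhouette_score(X, km.labels_, metric="euclidean")

# Pick the k with the highest silhouette score.
best_k = max(silhouette, key=silhouette.get)
print(best_k, silhouette[best_k])
```

Selecting on the highest silhouette score is a deliberate choice here: inertia generally keeps shrinking as k grows, so picking the smallest SSE, as the question's code does, tends to return the largest k tried.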

Don
Yuan
1

Try changing min_samples, and also the algorithm and metric.

For the valid list of metrics and algorithms, see sklearn.neighbors.VALID_METRICS.

ram4189
  • Please consider further explaining why this could solve the problem as well as providing links to referenced external documentation. – vlizana Aug 01 '20 at 23:07
  • Apologies. The min_samples suggestion is for DBSCAN. I also got the same error as above with DBSCAN but fixed it. Coming to the error: with for k in range(1,15), the first iteration has k=1, so len(set(kmeans.labels_)) is 1, i.e. only one cluster. The silhouette coefficient measures how close the points inside a cluster are to each other relative to points from other clusters. Its basic definition therefore requires at least 2 clusters, meaning you should choose the cluster range (2,15) rather than (1,15). – ram4189 Aug 03 '20 at 17:01
  • Please go through https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html . See the usage range_n_clusters = [2, 3, 4, 5, 6] – ram4189 Aug 03 '20 at 17:02
  • I believe that `silhouette_score` uses simple random sampling underneath, which can effectively lead to only one cluster label within the sample. Imagine sampling from two clusters of data - a huge one and a minor one. – Martin Fridrich Sep 08 '20 at 09:49
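
The sampling scenario in the last comment can be checked with a small experiment. As far as I know, silhouette_score only subsamples when sample_size is passed (it defaults to None), so this sketch sets it explicitly. The imbalanced data below is hypothetical: 1000 points in one cluster and 2 in another.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)

# Hypothetical imbalanced data: a huge cluster and a minor one.
big = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
small = rng.normal(loc=20.0, scale=1.0, size=(2, 2))
X = np.vstack([big, small])
labels = np.array([0] * 1000 + [1] * 2)

# Default sample_size=None: every point is used, so both labels are present.
full_score = silhouette_score(X, labels)

# With a tiny subsample, the draw will often contain only label 0,
# which raises the same "Number of labels is 1" ValueError.
failures = 0
for seed in range(20):
    try:
        silhouette_score(X, labels, sample_size=5, random_state=seed)
    except ValueError:
        failures += 1

print(full_score, failures)
```

If sample_size is needed for speed, a larger sample or a stratified subsample chosen by hand reduces the chance of losing the minor cluster.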
0

Try increasing your eps value. I was getting the same error, but when I chose a higher eps value, the error went away.
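
eps is a DBSCAN parameter, so this applies to DBSCAN rather than K-Means. A plausible explanation, sketched below on synthetic data (the blob layout, eps values and the helper name count_labels are all assumptions for illustration), is that a very small eps leaves every point marked as noise (label -1), which again gives a single label:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three tight, well-separated blobs (synthetic data, assumption).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=0.3,
    random_state=0,
)


def count_labels(eps):
    """Hypothetical helper: number of distinct DBSCAN labels, noise (-1) included."""
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    return len(np.unique(labels))


# eps far smaller than the point spacing: no core points, everything is noise.
n_labels_small_eps = count_labels(0.001)
# eps on the order of the blob size: separate clusters are found.
n_labels_large_eps = count_labels(1.0)

print(n_labels_small_eps, n_labels_large_eps)
```

Checking the number of distinct labels before calling silhouette_score shows quickly whether a given eps produces a usable clustering.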