
I am trying to make a scatter plot of k-means output that clusters sentences on the same topic together. The problem I am facing is giving the points that belong to each cluster their own color.

sentence_list = ["Hi how are you", "Good morning", ...]  # I have 10 sentences
km = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
# with 5 clusters, I want 5 different colors
km.fit(vectorized)
km.labels_ # [0,1,2,3,3,4,4,5,2,5]

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(sentence_list).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1])

km.fit(X)
centers2D = pca.transform(km.cluster_centers_)
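# plt.hold(True) is no longer needed: it was removed in matplotlib 3.0, and axes keep earlier plots by default
labels = np.array([km.labels_])  # note: this is 2-D with shape (1, n_samples), not 1-D
print(labels)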

My problem is in the plt.scatter() call at the bottom: what should I pass for the parameter c?

1. When I use c=labels, I get this error:

number in rbg sequence outside 0-1 range

2. When I set c=km.labels_ instead, I get this error:

ValueError: Color array must be two-dimensional

plt.scatter(centers2D[:,0], centers2D[:,1], 
            marker='x', s=200, linewidths=3, c=labels)
plt.show()
jxn

3 Answers

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

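# Fit k-means on the feature matrix X (e.g. the TF-IDF matrix from the question)
model = KMeans(n_clusters=5).fit(X)

# Visualize it. `data` holds the 2-D coordinates of the same rows as X
# (data2D in the question); numeric labels are mapped to colors by the colormap.
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0], data[:, 1], c=model.labels_.astype(float))
plt.show()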

Each cluster is now drawn in a different color: `scatter` maps the numeric values in `c` to colors through the current colormap.
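For reference, here is a minimal end-to-end sketch of the pipeline from the question, assuming scikit-learn, NumPy and matplotlib are installed. The sentences, `n_clusters = 3`, `random_state=0` and the `tab10` colormap are illustrative choices, not taken from the original post. The point it tries to show is that each `c` must have one entry per plotted point: the 1-D `labels_` array for the sentences, and one value per cluster for the centers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative sentences (assumption: not the asker's data)
sentences = [
    "Hi how are you", "Good morning", "Good morning to you",
    "How are you today", "The weather is nice", "Nice weather this morning",
]
n_clusters = 3

# TF-IDF returns a sparse matrix; PCA needs a dense array
X = TfidfVectorizer().fit_transform(sentences).toarray()
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

# Project both the points and the cluster centers onto the same 2-D plane
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
centers2D = pca.transform(km.cluster_centers_)

labels = km.labels_  # 1-D array, one integer label per sentence
fig, ax = plt.subplots()
# Numeric c values are mapped to colors through the colormap;
# fixing vmin/vmax keeps label i on the same color in both calls
points = ax.scatter(data2D[:, 0], data2D[:, 1], c=labels,
                    cmap="tab10", vmin=0, vmax=9)
# One value per center, so c has n_clusters entries here
ax.scatter(centers2D[:, 0], centers2D[:, 1], c=np.arange(n_clusters),
           cmap="tab10", vmin=0, vmax=9, marker="x", s=200, linewidths=3)
```

To view the figure interactively, drop the `matplotlib.use("Agg")` line and call `plt.show()` at the end.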

Zhenye Na

The color= or c= property should be a matplotlib color, as mentioned in the documentation for plot.

To map an integer label to a color, just do

LABEL_COLOR_MAP = {0 : 'r',
                   1 : 'k',
                   ....,
                   }

label_color = [LABEL_COLOR_MAP[l] for l in labels]
plt.scatter(x, y, c=label_color)

If you don't want to use the builtin one-character color names, you can use other color definitions. See the documentation on matplotlib colors.
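If typing one entry per label becomes tedious, the mapping can also be built from a fixed palette, reusing colors when there are more labels than palette entries. This is only a sketch: `PALETTE` and `build_label_color_map` are hypothetical names introduced here, not part of matplotlib, and the labels are made up.

```python
from itertools import cycle

# Matplotlib's one-character color names (white omitted: invisible on a white background)
PALETTE = ["r", "g", "b", "c", "m", "y", "k"]

def build_label_color_map(labels, palette=PALETTE):
    """Return a dict mapping each distinct label to a color, cycling the palette if needed."""
    return dict(zip(sorted(set(labels)), cycle(palette)))

labels = [0, 1, 2, 3, 3, 4, 4, 0, 2, 1]  # made-up cluster labels
label_color_map = build_label_color_map(labels)
label_color = [label_color_map[l] for l in labels]
print(label_color)
# ['r', 'g', 'b', 'c', 'c', 'm', 'm', 'r', 'b', 'g']
```

The resulting list can be passed as `c=label_color` exactly as above. Note that with more than seven labels some clusters will share a color.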

Hannes Ovrén
  • Instead of manually typing in a color for each new cluster, how do we use a **colormap** so that if I change the number of clusters in the future, I don't have to add a new color again? – jxn Jan 30 '15 at 10:43
  • Or use the built-in color maps in `mpl.colors` – tacaswell Jan 30 '15 at 15:09
  • @tcaswell That is an option. But I guess you 1) might want to have the mapping between label ID and color explicit, and 2) must know that your label IDs are not greater than the number of colors in the colormap. – Hannes Ovrén Jan 30 '15 at 15:12
  • You just need to scale them all between 0 and 1 for the continuous color maps. If you have so many labels that you can no longer resolve the difference on the continuous color maps, you have too many labels – tacaswell Jan 30 '15 at 15:16
  • Yeah, I think you are right. Just wanted to point out potential pitfalls :) – Hannes Ovrén Jan 30 '15 at 15:20
  • Thanks, @HannesOvrén for specifying the color names here. Found the other colors and color palettes on https://matplotlib.org/stable/tutorials/colors/colors.html. I required them for ease of coding – MItrajyoti Nov 25 '22 at 06:34

This should work:

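import matplotlib.cm;
import matplotlib.pyplot as plt;
from sklearn.cluster import KMeans;

# M is the data matrix; its first two columns are plotted
cluster = KMeans(10);
cluster.fit(M);

cluster.labels_;

# `spectral` was renamed `nipy_spectral` in later matplotlib releases
plt.scatter(M[:,0],M[:,1], c=[matplotlib.cm.nipy_spectral(float(i) /10) for i in cluster.labels_]);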
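The division by 10 above hard-codes the number of clusters. A sketch that scales by the actual cluster count instead is shown below, assuming NumPy and matplotlib are installed. `labels_to_colors` is a hypothetical helper name, and the labels and the `nipy_spectral` colormap are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import Normalize

def labels_to_colors(labels, n_clusters, cmap_name="nipy_spectral"):
    """Map integer labels 0..n_clusters-1 to RGBA rows via a continuous colormap."""
    cmap = plt.get_cmap(cmap_name)
    # Scale labels into [0, 1]; max(..., 1) avoids a zero-width range for one cluster
    norm = Normalize(vmin=0, vmax=max(n_clusters - 1, 1))
    return cmap(norm(np.asarray(labels)))

labels = np.array([0, 1, 2, 3, 3, 4, 4, 0, 2, 1])  # made-up labels for illustration
colors = labels_to_colors(labels, n_clusters=5)  # array of shape (10, 4)
```

The returned array can be passed directly as `c=colors` to `plt.scatter`. Passing `c=labels` together with `cmap=...` should give an equivalent result while letting matplotlib do the scaling.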
  • I like the idea, but semicolons in Python? –  Oct 14 '18 at 23:37
  • Using semicolons is a pythonic way of obtaining columns or rows. Here is getting the first column with all the rows (a column vector) and the second column respectively. – mgokhanbakal Sep 01 '20 at 19:40