kmeans scatter plot: plot different colors per cluster

Question

I am trying to make a scatter plot of k-means output that clusters sentences on the same topic together. The problem I am facing is plotting the points that belong to each cluster in a distinct color.

sentence_list = ["Hi how are you", "Good morning" ...]  # I have 10 sentences
km = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
# with 5 clusters, I want 5 different colors
km.fit(vectorized)
km.labels_ # [0,1,2,3,3,4,4,5,2,5]

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(sentence_list).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1])

km.fit(X)
centers2D = pca.transform(km.cluster_centers_)
plt.hold(True)
labels=np.array([km.labels_])
print labels

My problem is in the plt.scatter() call at the bottom: what should I pass for the parameter c?

1. When I use c=labels in the code, I get this error:

number in rbg sequence outside 0-1 range

2. When I set c=km.labels_ instead, I get this error:

ValueError: Color array must be two-dimensional

plt.scatter(centers2D[:,0], centers2D[:,1], 
            marker='x', s=200, linewidths=3, c=labels)
plt.show()

score 32 · Answer 1 · answered Nov 24 '17 at 13:26

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

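# Fit k-means on the feature matrix X
model = KMeans(n_clusters=5).fit(X)

# Visualize it: data is assumed to be a 2-D projection of X with
# shape (n_samples, 2), e.g. data2D from the question
plt.figure(figsize=(8, 6))
plt.scatter(data[:,0], data[:,1], c=model.labels_.astype(float))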

Now each cluster is drawn in a different color.
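For context, here is a minimal end-to-end sketch of how the pieces could fit together, assuming a TF-IDF + PCA pipeline like the one in the question. The sentences, the choice of 3 clusters, the "tab10" colormap, and the output filename are all illustrative assumptions, not part of the original answer.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the question's sentence_list.
sentences = [
    "Hi how are you", "Good morning", "Good evening to you",
    "How is the weather today", "The weather is sunny",
    "I like my morning coffee", "Coffee or tea today",
    "See you tomorrow morning", "How are you doing", "It is raining today",
]

# TF-IDF returns a sparse matrix; convert it to a dense array for PCA.
X = TfidfVectorizer().fit_transform(sentences).toarray()

# Cluster in the full feature space, then project to 2-D only for plotting.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)
data = pca.transform(X)                          # shape (n_samples, 2)
centers = pca.transform(model.cluster_centers_)  # shape (n_clusters, 2)

fig, ax = plt.subplots(figsize=(8, 6))
# c takes one label per point (a 1-D array); cmap maps labels to colors.
ax.scatter(data[:, 0], data[:, 1], c=model.labels_, cmap="tab10")
ax.scatter(centers[:, 0], centers[:, 1], marker="x", s=200, linewidths=3, c="black")
fig.savefig("clusters.png")  # hypothetical filename; use plt.show() interactively
```

Passing a qualitative colormap such as "tab10" is one way to keep the clusters visually distinct; any other named colormap can be substituted.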

Zhenye Na


Please explain your answer. – Mazz Nov 24 '17 at 13:32
Beauty! Weirdly, I can't get the pandas shorthand to work for this case (i.e. using data.plot(...) throws a 'c=' too many color element error). – rocksteady Feb 12 '19 at 21:38
Bang. Great trick. Thanks for that. – Muhammad Younus Aug 06 '19 at 11:20
Does this guarantee unique colour per cluster no matter how big `n_cluster` is? – gokul_uf Jul 01 '20 at 23:36
Where does data come from in your answer? Did you mean X? Even if it is X, how do you know X's shape is (n, 2)? – mgokhanbakal Sep 01 '20 at 11:42
Works OK. For me it sets the colours along a greyscale between white and black, meaning one cluster appears as white and isn't visible. It can also be hard to distinguish between different shades of grey. – c_m_conlan Jul 14 '21 at 10:37

score 16 · Accepted Answer · answered Jan 30 '15 at 09:01

The color= or c= property should be a matplotlib color, as mentioned in the documentation for plot.

To map an integer label to a color, just do:

LABEL_COLOR_MAP = {0 : 'r',
                   1 : 'k',
                   # ... one entry per remaining label
                   }

label_color = [LABEL_COLOR_MAP[l] for l in labels]
plt.scatter(x, y, c=label_color)

If you don't want to use the builtin one-character color names, you can use other color definitions. See the documentation on matplotlib colors.
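As one example of such other color definitions, the mapping can be built from a colormap instead of typed by hand. The sketch below is only an illustration: the points and labels are made-up data, and "tab10" is an assumed choice that provides 10 distinct colors, so it only suits label IDs 0 through 9.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical inputs: 2-D points and one integer cluster label per point.
rng = np.random.default_rng(0)
points = rng.normal(size=(10, 2))
labels = np.array([0, 1, 2, 3, 3, 4, 4, 0, 2, 1])

# Calling a listed colormap with an integer i returns its i-th color
# as an RGBA tuple, so each label ID gets a fixed, explicit color.
cmap = plt.get_cmap("tab10")
label_color_map = {int(l): cmap(int(l)) for l in np.unique(labels)}

# Same lookup pattern as with the hand-written dictionary.
label_color = [label_color_map[int(l)] for l in labels]

fig, ax = plt.subplots()
ax.scatter(points[:, 0], points[:, 1], c=label_color)
```

This keeps the label-to-color mapping explicit while still adapting when the number of clusters changes, as long as it stays within the size of the chosen colormap.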

Hannes Ovrén


Instead of manually typing in a color for each new cluster, how do we use a **colormap** so that if I change the number of clusters in the future, I don't have to add a new color again? – jxn Jan 30 '15 at 10:43
Or use the built in color maps in `mpl.colors` – tacaswell Jan 30 '15 at 15:09
@tcaswell That is an option. But I guess you 1) might want to have the mapping between label ID and color explicit, and 2) must know that your label IDs are not greater than the number of colors in the colormap. – Hannes Ovrén Jan 30 '15 at 15:12
you just need to scale them all between 0 and 1 for the continuous color maps. If you have so many labels that you stop being able to resolve the difference on the continuous color maps, you have too many labels – tacaswell Jan 30 '15 at 15:16
Yeah, I think you are right. Just wanted to point out potential pitfalls :) – Hannes Ovrén Jan 30 '15 at 15:20
Thanks, @HannesOvrén for specifying the color names here. Found the other colors and color palettes on https://matplotlib.org/stable/tutorials/colors/colors.html. I required them for ease of coding – MItrajyoti Nov 25 '22 at 06:34

score 3 · Answer 3 · answered Mar 15 '17 at 02:58

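This should work:

import matplotlib
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans;
cluster = KMeans(10);
cluster.fit(M);

cluster.labels_;

# cm.spectral from older matplotlib releases is now named nipy_spectral
plt.scatter(M[:,0],M[:,1], c=[matplotlib.cm.nipy_spectral(float(i) /10) for i in cluster.labels_]);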
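The division by 10 above matches the 10 clusters, so that every label lands in the 0 to 1 range a continuous colormap expects. A hedged variant that ties the scaling to the cluster count might look like the following; the random data in M, the choice of 4 clusters, and the "nipy_spectral" colormap are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical data standing in for M: 60 points in 2-D.
rng = np.random.default_rng(0)
M = rng.normal(size=(60, 2))

n_clusters = 4
cluster = KMeans(n_clusters, n_init=10, random_state=0).fit(M)

# Scale integer labels into [0, 1) so the colormap lookup stays valid
# regardless of how many clusters are requested.
fractions = cluster.labels_ / n_clusters
colors = plt.get_cmap("nipy_spectral")(fractions)  # array of RGBA rows

fig, ax = plt.subplots()
ax.scatter(M[:, 0], M[:, 1], c=colors)
```

Changing n_clusters then needs no other edits, although neighbouring colors become harder to tell apart as the count grows.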

user3805442


I like the idea, but semicolons in Python? – Oct 14 '18 at 23:37
Using semicolons is a pythonic way of obtaining columns or rows. Here is getting the first column with all the rows (a column vector) and the second column respectively. – mgokhanbakal Sep 01 '20 at 19:40
