How to determine which `x` argument to use for K-means and scatter plots?

Question

I'm trying to implement and visualize the K-means algorithm in Python. I created a dataset with make_blobs, fit K-means to it, and visualized the results with matplotlib.pyplot.scatter.

Here's my code:

Importing and data creation step

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

n_samples = 3000
random_state = 1182

X, y = make_blobs(n_samples=n_samples, random_state=random_state)
# X.shape = (3000, 2)
# y.shape = (3000,) -> y's values range from 0 to 2.

Scatter plot of original data

plt.scatter(X[:, 0], X[:, 1])
plt.title("Original Dataset Scatter Plot")
plt.xlabel("X[:, 0]")
plt.ylabel("X[:, 1]")
plt.show()

K-Means training and visualization

kmeans_model = KMeans(n_clusters=3, random_state=1)
kmeans_model.fit(X)

colors = { 0: 'r',
           1: 'b',
           2: 'g'}

label_color = [colors[l] for l in y]
plt.scatter(X[:, 0], kmeans_model.labels_, c=label_color)
plt.title("K-Means Scatter Plot")
plt.xlabel("X[:, 0]")
plt.ylabel("Labels")
plt.show()

My question is: when I call plt.scatter with X[:, 1] instead of X[:, 0] (which the code above uses), I get a different plot, albeit with the same clusters.

Would this still be considered a correct implementation and usage of K-means and scatter plots? If so, is there a particular reason that one should choose certain x values over others?

score 2 · Answer 1 · answered Dec 16 '18 at 17:08

That's a very strange way of visualising clustering. If you want to see how well your model did, you just have to plot all the blobs as you did in the first diagram and then supply a colouring sequence label_color.

plt.scatter(X[:,0], X[:,1], c=label_color)

The question of whether to use X[:,0] or X[:,1] is not well posed. Both of these dimensions represent the data, so either diagram would be correct in some sense, but neither would be interpretable on its own.
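One rough way to see why a single coordinate is hard to read is to print the range each cluster covers along each coordinate. This is only a sketch: `n_init=10` and the `ranges` dictionary are assumptions added for illustration, not part of the original code. Whether the ranges overlap depends on the data; where they do, a plot of labels against that coordinate alone cannot show where one cluster ends and the next begins.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=3000, random_state=1182)

# n_init is set explicitly (an assumption) so the result does not depend on
# the default, which differs between scikit-learn versions.
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(X)

# For every coordinate and every cluster, record the interval the cluster spans.
ranges = {}
for dim in range(X.shape[1]):
    for k in np.unique(labels):
        values = X[labels == k, dim]
        ranges[(dim, int(k))] = (values.min(), values.max())
        print(f"X[:, {dim}], cluster {k}: "
              f"min={values.min():.2f}, max={values.max():.2f}")
```

If two clusters share part of their interval along `X[:, 0]`, they can still be cleanly separated once `X[:, 1]` is taken into account, which is why both coordinates belong on the axes.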

tidylobster


Hello, thanks for the answer. Is the plot that you provided the correct K-means plot for my code then? It's actually what I had originally thought of, but I thought I needed to include the labels of K-means in the plot. – Sean Dec 16 '18 at 17:40

Yes, that's the plot for your code. I've used your `label_color` for ground truth labels (although, this could be replaced with `y` for simplicity - `plt.scatter(X[:,0], X[:,1], c=y)`). You can use `plt.scatter(X[:,0], X[:,1], c=kmeans_model.labels_)` to plot labels, predicted by your model. – tidylobster Dec 16 '18 at 17:47
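The two colourings mentioned in this comment can be compared side by side. The following is a minimal sketch, assuming the same data and model as in the question; the two-panel layout, the axis names `ax_true` and `ax_pred`, and `n_init=10` are illustrative choices, not part of the original code.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=3000, random_state=1182)

# n_init is set explicitly (an assumption) to avoid version-dependent defaults.
kmeans_model = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X)

fig, (ax_true, ax_pred) = plt.subplots(1, 2, figsize=(10, 4),
                                       sharex=True, sharey=True)

# Left panel: colour each point by the blob it was generated from.
ax_true.scatter(X[:, 0], X[:, 1], c=y, s=10)
ax_true.set_title("Ground truth (y)")

# Right panel: colour each point by the cluster K-means assigned to it.
ax_pred.scatter(X[:, 0], X[:, 1], c=kmeans_model.labels_, s=10)
ax_pred.set_title("K-means labels")

for ax in (ax_true, ax_pred):
    ax.set_xlabel("X[:, 0]")
    ax.set_ylabel("X[:, 1]")

fig.tight_layout()
```

Calling `plt.show()` afterwards displays the figure. The cluster numbers that K-means assigns are arbitrary, so a blob may get a different colour in each panel even when the grouping of points is the same.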

Dinari · Accepted Answer · 2018-12-16T17:55:58.947

Your K-means model takes both X[:,0] and X[:,1] into account; the clustering is done in two dimensions.
The correct way to present K-means is to display both dimensions and use coloring (as you did).

Regarding your question: the plots differ because in one graph you place the points according to their [:,0] coordinate, and in the other according to their [:,1] coordinate.
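That the clustering uses both coordinates can be checked numerically. The sketch below assumes the data and model from the question; `n_init=10` and the manual distance computation are additions for illustration only. It looks at the shape of the fitted centroids and then assigns each point to its nearest centroid by Euclidean distance over both columns of `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=3000, random_state=1182)

# n_init is set explicitly (an assumption) to avoid version-dependent defaults.
kmeans_model = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X)

# One centroid per cluster, with one value per column of X.
centers = kmeans_model.cluster_centers_
print(centers.shape)  # shape is (n_clusters, n_features)

# Euclidean distance from every point to every centroid, using both coordinates.
distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
nearest = distances.argmin(axis=1)

# Fraction of points whose nearest centroid matches the model's own assignment.
agreement = np.mean(nearest == kmeans_model.predict(X))
print(agreement)
```

Each centroid has two coordinates, one per column of `X`, so a plot that uses only one column shows a one-dimensional projection of a two-dimensional result.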

The correct way is to use both coordinates and coloring and, if possible, to add the cluster centroids, which is always nice:

Altering your code:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

n_samples = 3000
random_state = 1182

X, y = make_blobs(n_samples=n_samples, random_state=random_state)

kmeans_model = KMeans(n_clusters=3, random_state=1)
kmeans_model.fit(X)

colors = { 0: 'r',
           1: 'b',
           2: 'g'}

label_color = [colors[l] for l in y]
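# Plot both coordinates; label_color colors each point by its true blob label y.
plt.scatter(X[:, 0], X[:, 1], c=label_color, s=10)
# Mark the fitted cluster centroids with large '+' markers.
plt.scatter(kmeans_model.cluster_centers_[:, 0], kmeans_model.cluster_centers_[:, 1], s=300, marker='+', c='y')
plt.title("K-Means Scatter Plot")
plt.xlabel("X[:, 0]")
plt.ylabel("X[:, 1]")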
plt.show()

Will produce:

Note that I added a line that plots the cluster centroids.

Hi, thanks for the answer. I had a couple of questions, if that's okay. 1) Did you make this plot using the data that I gave in my question? 2) I was actually thinking it would be good to include both dimensions of `X`, but I'm not sure how to do that. Is there a way to include both when making a scatter plot? Thank you. – Sean, Dec 16 '18 at 17:39

That was a k-means image I had lying around from an earlier use. `plt.scatter(X[:,0],X[:,1],c=label_color,s=30)` will work for you, where `s=30` is the marker size of each sample. I will update the answer with the full code. – Dinari, Dec 16 '18 at 17:46

Thank you for the tip and the edit. I was also wondering how to plot the cluster centroids. :) – Sean, Dec 16 '18 at 17:58
