How do I color clusters after k-means and TSNE in either seaborn or matplotlib?

Question

I have a dataframe that looks something like this:

transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[-true_k:, 0]
df["comp-2"] = transformed_centroids[-true_k:, 1]

The "y" column holds the k-means labels I want to color by, and "comp-1" and "comp-2" are the results from the TSNE model. When I try to plot like this:

sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df['y'])
plt.show()

It gives me this error:

ValueError: Length of values (2) does not match length of index (35104) (from this line: df["comp-1"] = transformed_centroids[-true_k:, 0])

This happens even if I do this:

sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df.y.astype('category').cat.codes)
plt.show()

I've tried several other pieces of code scattered around random tutorials and here, but I haven't found a solution. I feel silly having successfully completed the clustering but failing on the colors.

EDIT: I realized I was using the wrong plot points. The updated code and error are below:

df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:, 0]
df["comp-2"] = transformed_centroids[:, 1]

ValueError: Length of values (35106) does not match length of index (35104)

I'm not sure where the two dropped data-points are being... dropped.

EDIT2: Here is the TSNE code:

centroids = model.cluster_centers_
tweets_df2['labels'] = model.labels_
everything = np.concatenate((X.todense(), centroids))

tsne_init = 'pca'  # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
model2 = TSNE(n_components=2, random_state=0, init=tsne_init, perplexity=tsne_perplexity,
              early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()

I took this code from another Stack Overflow post and adapted it to my data, so I can't explain it 100%. I just know I needed to use TSNE to make my data points plottable in 2D, since the data is words vectorized using TF-IDF.

But what is `transformed_centroids[-true_k:, 0]` supposed to be? Based on the error, `true_k` is 2, so `transformed_centroids[-true_k:, 0]` is only an array of length 2, and you're trying to put it into a column of length 35104. — tdy, Mar 23 '22 at 03:26
I just updated the question; I was accidentally trying to color the centroids and not the data points. In the new code, transformed_centroids[:, 0] and transformed_centroids[:, 1] are the plot points; the other code is for the centroids. The code that comes before it is now posted above. — Savanah Marisa Barnes, Mar 23 '22 at 03:27
_"I'm not sure where the two dropped data points are being dropped."_ It's not that 2 points got dropped. It's that `everything` is the concatenation of your data + 2 centroids, so the transformed values have 2 extra values compared to your labels. I'm not too familiar with this type of analysis, so I'm not sure why they concatenated those 2 extra values in the code you found. — tdy, Mar 23 '22 at 03:42
Oh my gosh... I was working with more clusters than 2 and changed it to 2 so it would run faster so I never put 'two and two' together (see what I did there). The concatenation was a specific solution I looked up so that I could graph both the clusters and the data points at the same time. Assuming they are added to the end, I should be able to append two dummy values to the end and just plot the clusters last to cover it up... I'll go try that :) (and I'll change a tag!) — Savanah Marisa Barnes, Mar 23 '22 at 03:45

score 1 · Accepted Answer · edited Mar 23 '22 at 03:59

With help from @tdy, I realized that one of the solutions I tried a little while ago was the one I needed. My main problem was related to my edit 2: I wasn't graphing the right set of data. I changed the df to this:

df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-2, 0]
df["comp-2"] = transformed_centroids[:-2, 1]

Of course, for my 2-cluster code this is the same as:

df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-true_k, 0]
df["comp-2"] = transformed_centroids[:-true_k, 1]

where true_k is the variable representing how many k-means clusters I have. I had this solution earlier but changed it because I thought getting rid of true_k would solve my 2-variable problem, and I never reverted it. I just needed to use the right transformed_centroids[] slice, and everything should run smoothly in 7 minutes when it's done melting my CPU... :)
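For anyone landing here with the same length mismatch, here is a minimal sketch of the slicing and coloring, not a definitive implementation. It assumes the centroids were concatenated after the data rows before running TSNE, as in the question. The helper names split_embedding and plot_clusters are hypothetical, and the random arrays in the demo only stand in for real k-means labels and TSNE output.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def split_embedding(embedded, labels, n_clusters):
    """Separate data-point rows from the centroid rows appended at the end.

    Assumes `embedded` is the 2D embedding of [data rows, then centroid rows].
    """
    embedded = np.asarray(embedded)
    n_points = embedded.shape[0] - n_clusters
    if n_points != len(labels):
        raise ValueError(
            f"expected {len(labels)} data rows, found {n_points}"
        )
    points = pd.DataFrame({
        "y": labels,
        "comp-1": embedded[:n_points, 0],
        "comp-2": embedded[:n_points, 1],
    })
    centroids = embedded[n_points:]
    return points, centroids


def plot_clusters(points, centroids):
    """Scatter the data points colored by label, then the centroids on top."""
    fig, ax = plt.subplots()
    ax.scatter(points["comp-1"], points["comp-2"],
               c=points["y"], cmap="tab10", marker="x", s=15)
    ax.scatter(centroids[:, 0], centroids[:, 1],
               c="black", marker="o", s=80)
    ax.set_xlabel("comp-1")
    ax.set_ylabel("comp-2")
    return ax


if __name__ == "__main__":
    # Synthetic stand-ins (assumption): 100 points plus true_k centroid rows.
    rng = np.random.default_rng(0)
    true_k = 2
    labels = rng.integers(0, true_k, size=100)
    embedded = rng.normal(size=(100 + true_k, 2))

    points, centroids = split_embedding(embedded, labels, true_k)
    plot_clusters(points, centroids)
    plt.show()
```

With real data you would pass transformed_centroids, model.labels_ and true_k to split_embedding. If you prefer seaborn, the data points can be drawn with sns.scatterplot(data=points, x="comp-1", y="comp-2", hue="y") instead of the first ax.scatter call.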
