I have a dataframe that looks something like this:

transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[-true_k:, 0]
df["comp-2"] = transformed_centroids[-true_k:, 1]

The 'y' column holds the k-means labels I want to color by, and "comp-1" and "comp-2" are the results from the TSNE model. When I try to plot like this:

sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df['y'])
plt.show()

It gives me this error:

ValueError: Length of values (2) does not match length of index (35104) (from this line: df["comp-1"] = transformed_centroids[-true_k:, 0])

This happens even if I do this:

sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df.y.astype('category').cat.codes)
plt.show()

I've tried several other pieces of code scattered around random tutorials and here, but I haven't found a solution. I feel silly having successfully completed the clustering but failing on the colors.

EDIT: I realized I was using the wrong plot points. The updated code and error are below:

df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:, 0]
df["comp-2"] = transformed_centroids[:, 1]

ValueError: Length of values (35106) does not match length of index (35104)

I'm not sure where the two dropped data points are being... dropped.

EDIT2: Here is the TSNE code:

centroids = model.cluster_centers_
tweets_df2['labels'] = model.labels_
everything = np.concatenate((X.todense(), centroids))

tsne_init = 'pca'  # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
model2 = TSNE(n_components=2, random_state=0, init=tsne_init, perplexity=tsne_perplexity,
              early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()

I took this code from another Stack Overflow post and adapted it to my data, so I can't explain it 100%. I just know I needed to use TSNE to make my data points plottable in 2D, since the data is words vectorized using TF-IDF.

  • But what is `transformed_centroids[-true_k:, 0]` supposed to be? Based on the error, `true_k` is 2, so `transformed_centroids[-true_k:, 0]` is only an array of length 2, and you're trying to put it into a column of length 35104. – tdy Mar 23 '22 at 03:26
  • I just updated the question; I was accidentally trying to color the centroids and not the data points. In the new code, transformed_centroids[:, 0] and transformed_centroids[:, 1] are the plot points. The other code is for the centroids. The code that comes before it is now posted above. – Savanah Marisa Barnes Mar 23 '22 at 03:27
  • _"I'm not sure where the two dropped data points are being dropped."_ It's not that 2 points got dropped. It's that `everything` is the concatenation of your data + 2 centroids, so the transformed values have 2 extra values compared to your labels. I'm not too familiar with this type of analysis, so I'm not sure why they concatenated those 2 extra values in the code you found. – tdy Mar 23 '22 at 03:42
  • Oh my gosh... I was working with more clusters than 2 and changed it to 2 so it would run faster so I never put 'two and two' together (see what I did there). The concatenation was a specific solution I looked up so that I could graph both the clusters and the data points at the same time. Assuming they are added to the end, I should be able to append two dummy values to the end and just plot the clusters last to cover it up... I'll go try that :) (and I'll change a tag!) – Savanah Marisa Barnes Mar 23 '22 at 03:45

1 Answer


With help from @tdy, I realized that one of the solutions I tried a little while ago was the one I needed. My main problem was my edit 2: I wasn't graphing the right set of data. I changed the df to this:

df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-2, 0]
df["comp-2"] = transformed_centroids[:-2, 1]

Of course, for my 2-cluster code this is the same as:

df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-true_k, 0]
df["comp-2"] = transformed_centroids[:-true_k, 1]

where true_k is the variable representing how many k-means clusters I have. I had this solution but changed it because I thought getting rid of the true_k would solve my 2-variable problem and I never reverted it. I just needed to do this with the right transformed_centroids[] slice and everything should run smoothly in 7 minutes when it's done melting my CPU... :)
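
For anyone else who hits the same length mismatch, here is a minimal sketch of the slicing on a tiny made-up array. It assumes the centroid rows were stacked after the data rows, as in `np.concatenate((X.todense(), centroids))`. The helper name `split_points_and_centroids` and the toy values are hypothetical, only there for illustration:

```python
import numpy as np

def split_points_and_centroids(embedded, n_clusters):
    """Split a stacked embedding into (data points, centroids).

    Assumes the centroid rows were concatenated AFTER the data rows.
    """
    if not 0 < n_clusters < len(embedded):
        raise ValueError("n_clusters must be between 1 and len(embedded) - 1")
    # Everything except the last n_clusters rows is data;
    # the last n_clusters rows are the centroids.
    return embedded[:-n_clusters], embedded[-n_clusters:]

# Toy stand-in for the t-SNE output: 5 data points followed by 2 centroids.
embedded = np.arange(14, dtype=float).reshape(7, 2)
labels = np.array([0, 1, 0, 1, 1])  # one k-means label per data point

points, centers = split_points_and_centroids(embedded, n_clusters=2)

# The data slice now lines up with the labels, so both can go into one DataFrame.
assert len(points) == len(labels)
print(points.shape, centers.shape)  # (5, 2) (2, 2)
```

Note that `embedded[:n_clusters]` (without the minus sign) would instead return the first `n_clusters` data rows, which is not what is wanted here.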

tdy