I have a dataframe that looks something like this:
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[-true_k:, 0]
df["comp-2"] = transformed_centroids[-true_k:, 1]
The 'y' column holds the k-means labels I want to color by, and "comp-1" and "comp-2" are the outputs of the TSNE model. When I try to plot like this:
sns.scatterplot(x=transformed_centroids[:-true_k, 0], y=transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(x=df['comp-1'], y=df['comp-2'], marker='o', hue=df['y'])
plt.show()
It gives me this error:
ValueError: Length of values (2) does not match length of index (35104) (from this line: df["comp-1"] = transformed_centroids[-true_k:, 0])
This happens even if I do this:
sns.scatterplot(x=transformed_centroids[:-true_k, 0], y=transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(x=df['comp-1'], y=df['comp-2'], marker='o', hue=df.y.astype('category').cat.codes)
plt.show()
I've tried several other pieces of code from various tutorials and from this site, but I haven't found a solution. I feel silly having successfully completed the clustering but failing on the colors.
EDIT: I realized I was using the wrong plot-points. The updated code and error are below:
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:, 0]
df["comp-2"] = transformed_centroids[:, 1]
ValueError: Length of values (35106) does not match length of index (35104)
I'm not sure where the two dropped data-points are being... dropped.
EDIT2: Here is the TSNE code:
centroids = model.cluster_centers_
tweets_df2['labels'] = model.labels_
everything = np.concatenate((X.todense(), centroids))
tsne_init = 'pca' # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
model2 = TSNE(n_components=2, random_state=0, init=tsne_init, perplexity=tsne_perplexity,
early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
I took this code from another Stack Overflow post and adapted it to my data, so I can't explain it 100%. I just know I needed TSNE to make my data points plottable in 2D, since the data is text vectorized with TF-IDF.
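One possible reading of the length mismatch, sketched under assumptions rather than confirmed: the two errors suggest true_k is 2, and `everything` stacks the 35104 data rows with the 2 centroid rows, so the TSNE output would have 35106 rows while model.labels_ has only 35104. If that is right, no points were dropped; two centroid rows were added. The snippet below reproduces the arithmetic on tiny made-up data. The sizes, the random arrays, and the names points_df and centroids_df are hypothetical stand-ins, and a random array takes the place of model2.fit_transform(everything).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real data: 6 points, 2 clusters, 4 features.
n_points, true_k, n_features = 6, 2, 4
rng = np.random.default_rng(0)
X_dense = rng.random((n_points, n_features))   # stands in for X.todense()
centroids = rng.random((true_k, n_features))   # stands in for model.cluster_centers_
labels = np.array([0, 1, 0, 1, 1, 0])          # stands in for model.labels_

# Same stacking as in the question: data rows first, centroid rows appended last.
everything = np.concatenate((X_dense, centroids))

# Stand-in for model2.fit_transform(everything): one 2-D row per input row.
transformed = rng.random((everything.shape[0], 2))

# The embedding has true_k more rows than there are labels.
extra_rows = transformed.shape[0] - labels.shape[0]

# Split the embedding back into data points and centroids.
embedded_points = transformed[:-true_k]
embedded_centroids = transformed[-true_k:]

# The labels line up with the data points only, so they go in this frame.
points_df = pd.DataFrame({
    "y": labels,
    "comp-1": embedded_points[:, 0],
    "comp-2": embedded_points[:, 1],
})

# The centroids get their own frame; their cluster ids are simply 0..true_k-1.
centroids_df = pd.DataFrame({
    "y": np.arange(true_k),
    "comp-1": embedded_centroids[:, 0],
    "comp-2": embedded_centroids[:, 1],
})

print(extra_rows, len(points_df), len(centroids_df))
```

With the two frames separated like this, each could be passed to its own plotting call, for example sns.scatterplot(data=points_df, x="comp-1", y="comp-2", hue="y") for the points and a second call with a different marker for centroids_df.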