
My dataset shape is (248857, 11). This is how it looks before StandardScaler; I scaled the features because clustering algorithms such as K-means need feature scaling before the data is fed to the algorithm. (screenshot of the data before scaling)

After scaling: (screenshot of the scaled data)
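
For context, this is roughly how I produced the cluster labels (a simplified sketch; df stands for the original 11-column DataFrame and the exact parameters may differ):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# scale the 11 features so that K-Means treats them on a comparable scale
df_scale = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# fit K-Means with three clusters and keep the labels in a new column
kmeans = KMeans(n_clusters=3, random_state=0)
df_scale['clusters'] = kmeans.fit_predict(df_scale)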

I performed K-Means with three clusters, and I am trying to find a way to show these clusters. I found t-SNE as a solution, but I am stuck. This is how I implemented it:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# save the cluster labels into a variable l and drop them from the features
l = df_scale['clusters']
d = df_scale.drop("clusters", axis = 1)
standardized_data = StandardScaler().fit_transform(d)

# pick the first 100,000 points for t-SNE (it is slow on the full dataset)
data_points = standardized_data[0:100000, :]
labels_100k = l[0:100000]

model = TSNE(n_components = 2, random_state = 0)
tsne_data = model.fit_transform(data_points)

# create a new DataFrame that combines the two t-SNE dimensions with the labels
tsne_data = np.vstack((tsne_data.T, labels_100k)).T
tsne_df = pd.DataFrame(data = tsne_data,
                       columns = ("Dimension1", "Dimension2", "Clusters"))

# plot the t-SNE result, coloured by cluster
sns.FacetGrid(tsne_df, hue = "Clusters", height = 6).map(
    plt.scatter, 'Dimension1', 'Dimension2').add_legend()

plt.show()

(resulting t-SNE scatter plot; the clusters largely overlap)

As you can see, the result is not good. How can I visualize these clusters better?

  • Based on the visualization, it seems your data comprises only one cluster. Could you please explain what is behind your intuition about the existing 3 clusters in your data? – OmG Jun 22 '22 at 11:12
  • Thank you for your reply. I used the elbow method and the silhouette score to decide the number of clusters (see the sketch after these comments). According to the results, 3 was a good choice. And actually I know it should be 3 clusters, because I just removed the labels. – linuxpanther Jun 22 '22 at 11:22
  • According to the author of t-SNE (https://lvdmaaten.github.io/tsne/), I tried to find a solution, but I am also afraid that maybe my data is just not suitable for visualization? I thought maybe there is another way to visualize it, or to plot it in 3 dimensions? – linuxpanther Jun 22 '22 at 11:24
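
This is a simplified sketch of how the elbow and silhouette checks looked (X stands for the scaled feature matrix; the k range and sample_size are illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# inertia for the elbow plot, silhouette score (on a sample, for speed) for each candidate k
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    score = silhouette_score(X, km.labels_, sample_size=10000, random_state=0)
    print(k, km.inertia_, score)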

1 Answer


It seems you need to tune the perplexity hyper-parameter, which is:

a tunable parameter that says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures.

Read more about it in this post and more specifically, here.
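
For illustration, a rough sketch of comparing a few perplexity values on a subsample (the values 5/30/100 and the subsample size are arbitrary choices; data_points and labels_100k refer to the variables from the question):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# compare t-SNE maps for several perplexity values on a small subsample
sample = data_points[:5000]
sample_labels = labels_100k[:5000]

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, perp in zip(axes, [5, 30, 100]):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(sample)
    ax.scatter(emb[:, 0], emb[:, 1], c=sample_labels, s=3)
    ax.set_title("perplexity = %d" % perp)
plt.show()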

OmG
  • Great post, it helped me understand t-SNE better. It did not solve my problem, but I don't think it is related to the parameters now. The plot looks better, but as you said, it is clustered mostly into one group, and that is why the separation is not clear. – linuxpanther Jun 22 '22 at 15:23
  • It looks like the default perplexity is too small relative to your dataset size. You could try applying t-SNE on, say, 1000 data points and see whether the t-SNE map shows better cluster separation (see the sketch below). – James LI Jun 22 '22 at 22:22
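
A rough sketch of that suggestion (the sample size of 1000 and the random seed are illustrative; standardized_data and l refer to the variables from the question):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# draw a random sample of 1000 points instead of the first 100,000
rng = np.random.default_rng(0)
idx = rng.choice(len(standardized_data), size=1000, replace=False)

emb = TSNE(n_components=2, random_state=0).fit_transform(standardized_data[idx])

plt.scatter(emb[:, 0], emb[:, 1], c=l.to_numpy()[idx], s=5)
plt.xlabel("Dimension1")
plt.ylabel("Dimension2")
plt.colorbar(label="Clusters")
plt.show()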