
My dataset shape is (248857, 11). This is how it looks before StandardScaler; I scaled the features because clustering algorithms such as K-means need feature scaling before the data is fed to the algorithm. (screenshot of the data before scaling)

After scaling: (screenshot of the scaled data)
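
For context, this is roughly how I produced the cluster labels (a simplified sketch; df stands for the original 11-column DataFrame and the exact parameters may differ):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# scale the 11 features so that K-Means treats them on a comparable scale
df_scale = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# fit K-Means with three clusters and keep the labels in a new column
kmeans = KMeans(n_clusters=3, random_state=0)
df_scale['clusters'] = kmeans.fit_predict(df_scale)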

I performed K-Means with three clusters, and I am trying to find a way to show these clusters. I found t-SNE as a solution, but I am stuck. This is how I implemented it:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# save the cluster labels into a variable l and drop them from the features
l = df_scale['clusters']
d = df_scale.drop("clusters", axis = 1)
standardized_data = StandardScaler().fit_transform(d)

# pick the first 100,000 points for t-SNE (it is slow on the full dataset)
data_points = standardized_data[0:100000, :]
labels_100k = l[0:100000]

model = TSNE(n_components = 2, random_state = 0)
tsne_data = model.fit_transform(data_points)

# create a new DataFrame that combines the two t-SNE dimensions with the labels
tsne_data = np.vstack((tsne_data.T, labels_100k)).T
tsne_df = pd.DataFrame(data = tsne_data,
                       columns = ("Dimension1", "Dimension2", "Clusters"))

# plot the t-SNE result, coloured by cluster
sns.FacetGrid(tsne_df, hue = "Clusters", height = 6).map(
    plt.scatter, 'Dimension1', 'Dimension2').add_legend()

plt.show()

(resulting t-SNE scatter plot; the clusters largely overlap)

As you can see, the result is not good. How can I visualize these clusters better?

  • Based on the visualization, it seems your data comprises only one cluster. Could you please explain what is behind your intuition about the existing 3 clusters in your data? – OmG Jun 22 '22 at 11:12
  • Thank you for your reply. I used the elbow method and the silhouette score to decide the number of clusters (see the sketch after these comments). According to the results, 3 was a good choice. And actually I know it should be 3 clusters, because I just removed the labels. – linuxpanther Jun 22 '22 at 11:22
  • According to the author of t-SNE (https://lvdmaaten.github.io/tsne/), I tried to find a solution, but I am also afraid that maybe my data is just not suitable for visualization? I thought maybe there is another way to visualize it, or to plot it in 3 dimensions? – linuxpanther Jun 22 '22 at 11:24
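
This is a simplified sketch of how the elbow and silhouette checks looked (X stands for the scaled feature matrix; the k range and sample_size are illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# inertia for the elbow plot, silhouette score (on a sample, for speed) for each candidate k
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    score = silhouette_score(X, km.labels_, sample_size=10000, random_state=0)
    print(k, km.inertia_, score)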

1 Answer


It seems you need to tune the perplexity hyper-parameter, which is:

a tunable parameter that says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures.

Read more about it in this post and more specifically, here.
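
For illustration, a rough sketch of comparing a few perplexity values on a subsample (the values 5/30/100 and the subsample size are arbitrary choices; data_points and labels_100k refer to the variables from the question):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# compare t-SNE maps for several perplexity values on a small subsample
sample = data_points[:5000]
sample_labels = labels_100k[:5000]

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, perp in zip(axes, [5, 30, 100]):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(sample)
    ax.scatter(emb[:, 0], emb[:, 1], c=sample_labels, s=3)
    ax.set_title("perplexity = %d" % perp)
plt.show()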

OmG
  • Great post, it helped me understand t-SNE better. It did not solve my problem, but I don't think it is related to the parameters now. The plot looks better, but as you said, it is clustered mostly into one group, and that is why the separation is not clear. – linuxpanther Jun 22 '22 at 15:23
  • It looks like the default perplexity is too small relative to your dataset size. You could try applying t-SNE on, say, 1000 data points and see whether the t-SNE map shows better cluster separation (see the sketch below). – James LI Jun 22 '22 at 22:22
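
A rough sketch of that suggestion (the sample size of 1000 and the random seed are illustrative; standardized_data and l refer to the variables from the question):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# draw a random sample of 1000 points instead of the first 100,000
rng = np.random.default_rng(0)
idx = rng.choice(len(standardized_data), size=1000, replace=False)

emb = TSNE(n_components=2, random_state=0).fit_transform(standardized_data[idx])

plt.scatter(emb[:, 0], emb[:, 1], c=l.to_numpy()[idx], s=5)
plt.xlabel("Dimension1")
plt.ylabel("Dimension2")
plt.colorbar(label="Clusters")
plt.show()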