I am attempting a binary classification task on a dataset of around 20000 samples and 40 features. I have manually curated the dataset: each feature is a topic, and the value of that feature for a given sample is the sentiment associated with the topic. The topics are found with Latent Dirichlet Allocation. To visualize the separability and classification potential of the data, I have used scikit-learn's t-SNE implementation and plotted the transformed data in both 2D and 3D.

The two classes overlap almost completely in both the 2D and 3D plots. However, when I try different classifiers on the original (pre-t-SNE) data, I get 5-fold cross-validation scores of 75-80%, so I assume the models are not overfitting the data.

I have tried perplexity values from 5 to 200, and the level of overlap in the plot has not changed.
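A perplexity sweep of this kind can be sketched roughly as follows. This is only an illustration: the synthetic data from `make_classification`, the reduced sample count, and the helper name `sweep_perplexity` are assumptions, not the original dataset or code.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE


def sweep_perplexity(X, perplexities, random_state=0):
    """Return a dict mapping each perplexity to its 2D t-SNE embedding."""
    embeddings = {}
    for perplexity in perplexities:
        # Perplexity must be smaller than the number of samples.
        tsne = TSNE(n_components=2, perplexity=perplexity,
                    random_state=random_state)
        embeddings[perplexity] = tsne.fit_transform(X)
    return embeddings


# Synthetic stand-in data (assumption): 200 samples, 40 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=40,
                           n_informative=10, random_state=0)
embeddings = sweep_perplexity(X, perplexities=[5, 30, 50])
```

Each embedding can then be scattered with `c=y` to compare how much the classes overlap at each setting.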

My t-SNE code:

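import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# t-SNE is unsupervised: fit_transform ignores the labels passed as y
tsne = TSNE(n_components=2).fit_transform(data, data_target)
fig = plt.figure()
ax = fig.add_subplot(111)
# Unpack the (n_samples, 2) embedding into x and y; color by class label
ax.scatter(*zip(*tsne), c=data_target)
plt.show()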

Is it possible that t-SNE can make classifier performance worse? Or does it indicate that there is something wrong with my data or my implementation if my classifiers work fine but the t-SNE plot indicates a total overlap? Should I just stick to classifying my data without any dimensionality reduction?
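One way to probe the first question empirically is to cross-validate the same classifier on the original features and on the 2D embedding. The sketch below is hedged throughout: the synthetic data, the choice of `LogisticRegression`, and the variable names are assumptions for illustration, and no particular outcome is implied.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 300 samples, 40 features, 2 classes.
X, y = make_classification(n_samples=300, n_features=40,
                           n_informative=10, random_state=0)

# Embed all samples at once; scikit-learn's TSNE has no separate transform()
# for unseen points.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

# Classifier choice is an assumption; any estimator could be substituted.
clf = LogisticRegression(max_iter=1000)
scores_original = cross_val_score(clf, X, y, cv=5)
scores_embedded = cross_val_score(clf, X_2d, y, cv=5)

print("original features:", scores_original.mean())
print("t-SNE embedding:  ", scores_embedded.mean())
```

Comparing the two mean scores shows how much class information the 2D embedding retains for that particular classifier.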

Thanks!

  • If you are asking whether t-SNE loses information when mapping `N>2` dimensions to `2` dimensions for plotting, then sure: that's basic math (try numbering points on a circle (2D), then mapping them to a number line (1D) while keeping the Euclidean distances!). This is an ill-conditioned problem and there will always be problems. t-SNE is therefore a heuristic, one that works quite well in general (to "somewhat" map the structure). – sascha Apr 06 '19 at 14:02
  • @sascha thanks for the response. I'd read that a large or full overlap in the t-SNE plot indicates low classification potential, and I was wondering if this indicated any potential problems with my dataset that I'd have to look into, hence the question. What I'd read made it seem like it is uncommon to get a result like mine. – Floating Apr 06 '19 at 14:13
  • Try reducing the features from 40 to 10-20 with PCA first, then apply t-SNE, and let me know if it changes anything. Also try [UMAP](https://github.com/lmcinnes/umap) – Shihab Shahriar Khan Apr 06 '19 at 15:19
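
The PCA-then-t-SNE suggestion from the last comment could look roughly like the sketch below. The synthetic data and the choice of 15 components (a value inside the suggested 10-20 range) are assumptions; UMAP lives in a separate package and is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in data (assumption): 300 samples, 40 features, 2 classes.
X, y = make_classification(n_samples=300, n_features=40,
                           n_informative=10, random_state=0)

# Step 1: reduce 40 features to 15 principal components.
X_pca = PCA(n_components=15, random_state=0).fit_transform(X)

# Step 2: run t-SNE on the PCA-reduced data to get a 2D embedding.
X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X_pca)
```

`X_embedded` can be plotted the same way as in the question, with the class labels as colors.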

0 Answers