I am attempting a binary classification task on a dataset of around 20,000 samples and 40 features. I curated the dataset manually: each feature is a topic (found with Latent Dirichlet Allocation), and the value of that feature for a given sample is the sentiment associated with that topic. To visualize the separability and classification potential of the data, I used Scikit-Learn's t-SNE implementation and plotted both 2D and 3D projections of the transformed data.
The two classes overlap almost completely in both the 2D and 3D plots. However, when I try different classifiers on the original (pre-t-SNE) data, I get 5-fold cross-validation scores between 75% and 80%, so I assume the models are not overfitting.
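For context, the cross-validation is set up roughly like this (the classifier here is just a placeholder; I tried several):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder classifier - the point is the 5-fold CV setup, not this specific model
clf = RandomForestClassifier()
scores = cross_val_score(clf, data, data_target, cv=5)
print(scores.mean())  # roughly 0.75-0.80 for the classifiers I tried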
I have also tried varying the perplexity, using values from 5 to 200, and it has not changed the degree of overlap in the plot (a sketch of that sweep is below, after my t-SNE code).
My t-SNE code:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# t-SNE is unsupervised, so the target is not passed to fit_transform
tsne = TSNE(n_components=2).fit_transform(data)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(tsne[:, 0], tsne[:, 1], c=data_target)  # colour points by class label
plt.show()
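The perplexity experiments were essentially the same call in a loop, roughly like this (only a few of the values from the 5-200 range I tried are shown):

for perplexity in [5, 30, 50, 100, 200]:
    # Re-run the embedding with a different perplexity and plot the result
    embedded = TSNE(n_components=2, perplexity=perplexity).fit_transform(data)
    fig, ax = plt.subplots()
    ax.scatter(embedded[:, 0], embedded[:, 1], c=data_target)
    ax.set_title("perplexity = %d" % perplexity)
    plt.show()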
Can t-SNE make classifier performance worse? Or does it indicate that something is wrong with my data or my implementation if the classifiers work fine but the t-SNE plot shows total overlap? Should I just stick to classifying my data without any dimensionality reduction?
Thanks!