Problem Statement - Classify a product review
Classes - Travel, Hotel, Cars, Electronics, Food, Movies
I am approaching this as a standard text classification problem. The feature set is prepared using the default Doc2Vec model from gensim, and for classification I am using one-vs-rest Logistic Regression from sklearn.
For every class I feed 10,000 reviews to Doc2Vec (I am following this Doc2Vec tutorial). In this way the model learns a vector for each sentence. From the resulting vectors, 80% from each class are given to LogisticRegression for training and 20% are used for testing. The accuracy of the classifier is 98%, but on unseen data the accuracy is just 17%. Also, when I plotted a PCA projection of all sentence vectors in a 2D graph, the result was one dense cluster.

What I conclude from the graph is that the data is inseparable, but then how did the classifier reach an accuracy of 98%? Why is the accuracy on unseen data so low? And how can I evaluate/validate my results?
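
For reference, a minimal sketch of the evaluation described above. It assumes scikit-learn and NumPy are installed and uses random vectors as a stand-in for the real Doc2Vec output, so the printed numbers say nothing about the actual reviews. The sizes (`n_per_class`, `dim`) and variable names are illustrative and not taken from the original setup. It shows the stratified 80/20 split, the one-vs-rest Logistic Regression, the chance-level accuracy for six balanced classes (1/6, about 16.7%) as a comparison point, and how much variance a 2D PCA projection keeps.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

CLASSES = ["Travel", "Hotel", "Cars", "Electronics", "Food", "Movies"]

# Assumption: random vectors stand in for the Doc2Vec sentence vectors.
# Replace X and y with the real vectors and integer class labels.
rng = np.random.default_rng(0)
n_per_class, dim = 200, 50  # illustrative sizes, not the real 10,000 per class
X = rng.normal(size=(len(CLASSES) * n_per_class, dim))
y = np.repeat(np.arange(len(CLASSES)), n_per_class)

# 80/20 split; stratify=y keeps the 80/20 ratio inside every class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# One-vs-rest Logistic Regression, one binary classifier per class.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, clf.predict(X_test))

# Accuracy of always guessing a single class on a balanced six-class set.
chance_accuracy = 1.0 / len(CLASSES)

# 2D PCA projection, plus the share of total variance those 2 axes keep.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
variance_kept = pca.explained_variance_ratio_.sum()

print(f"held-out accuracy: {test_accuracy:.3f}")
print(f"chance accuracy:   {chance_accuracy:.3f}")
print(f"variance kept by 2 PCA components: {variance_kept:.3f}")
```

With the real vectors in place of `X` and `y`, the same `clf` could also be scored on vectors of reviews that Doc2Vec never saw during training, which is the setting where the 17% figure was observed.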