Problem Statement - Classify a product review

Classes - Travel, Hotel, Cars, Electronics, Food, Movies

I am treating this as a standard text classification problem. The feature set is prepared with the default Doc2Vec model from gensim, and for classification I am using one-vs-rest Logistic Regression from sklearn.

For every class I feed 10000 reviews to Doc2Vec (I am following this Doc2Vec tutorial). In this way the model learns a vector for each sentence. Of the resulting vectors, 80% from each class are given to LogisticRegression for training and 20% are used for testing. The accuracy of the classifier on this test split is 98%, but on unseen data the accuracy is just 17%. Also, when I plotted a 2D PCA projection of all the sentence vectors, the result was one dense cluster.

What I conclude from the plot is that the data is inseparable, but then how did the classifier reach an accuracy of 98%? And why is the accuracy on unseen data so low? How can I evaluate/validate my results?
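To make the evaluation question concrete, here is a minimal sketch of a stratified 80/20 split with cross-validation and a confusion matrix for a one-vs-rest logistic regression. The feature matrix is synthetic placeholder data standing in for the Doc2Vec vectors, and the sizes (`n_per_class`, `dim`) are arbitrary assumptions, so the printed numbers say nothing about the real reviews. It also only measures generalization if the vectors for the held-out reviews are produced the same way as they would be for truly unseen reviews.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.multiclass import OneVsRestClassifier

CLASSES = ["Travel", "Hotel", "Cars", "Electronics", "Food", "Movies"]

# Placeholder data (assumption): random vectors standing in for Doc2Vec output.
rng = np.random.default_rng(0)
n_per_class, dim = 200, 50
centers = rng.normal(scale=2.0, size=(len(CLASSES), dim))
X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(len(CLASSES)), n_per_class)

# Stratified 80/20 split keeps the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training part gives a spread, not one number.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy")

# Fit once on the full training part and score the untouched 20%.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=np.arange(len(CLASSES)))

print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
print("Held-out accuracy: %.3f" % test_accuracy)
print(cm)
```

A large gap between the cross-validated score and the score on genuinely new reviews would suggest that the held-out vectors are not being generated the same way as the new ones, or that the model is overfitting.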

Rashmi Singh
  • Accuracy of 17% with 6 classes implies that your model has no predictive power and outputs purely random results. I suspect it must be some sort of program logic error and not directly related to ML. – Dennis Sakva Jan 11 '17 at 15:16
  • While I suspect @DennisSakva is right, strong accuracy on your training set but worthless results on a held-out test set can also be a symptom of *overfitting*: your process is essentially memorizing idiosyncrasies of your data to make its predictions, not learning generalizable patterns. If that's the case after fixing any other problems, using *smaller* models – lower dimensional features – may help. – gojomo Jan 19 '17 at 00:36
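The "purely random" reading of 17% in the first comment can be checked with a quick baseline calculation. The sketch below assumes the unseen set is roughly balanced across the six classes. The `baseline_accuracies` helper and the label list are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def baseline_accuracies(labels):
    """Return (uniform-chance accuracy, majority-class accuracy) for a label list."""
    counts = Counter(labels)
    chance = 1.0 / len(counts)                      # guess uniformly among the classes
    majority = max(counts.values()) / len(labels)   # always predict the most common class
    return chance, majority


CLASSES = ["Travel", "Hotel", "Cars", "Electronics", "Food", "Movies"]

# Hypothetical balanced label set: 10000 reviews per class, as in the question.
labels = [c for c in CLASSES for _ in range(10000)]

chance, majority = baseline_accuracies(labels)
print(f"chance: {chance:.3f}, majority: {majority:.3f}")
# prints "chance: 0.167, majority: 0.167"
```

With six balanced classes, both baselines are 1/6, about 16.7%, so an observed 17% is indistinguishable from guessing.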
