My dataset has 86 COPD documents as positive data and 20 malaria + 20 diarrhea + 20 elephantiasis documents as negative data, so the dataset has 146 documents in total (86 positive, 60 negative). I use a training:testing ratio of 3:1 and an ngram range of (1, 1), and I also remove all numeric features from the feature list. I take the tf-idf of the features as input and use the Naive Bayes algorithm for training and testing. Accuracy = 89%, Precision = 84%, Recall = 100%.

Now I am testing on new documents from outside my dataset: 20 positive documents (COPD) and 20 negative documents (about a disease that is not in my dataset). The classifier now predicts almost all documents as positive; in other words, the accuracy drops sharply.

My question is: what am I doing wrong here? Why is my classifier not working well on new documents? Any help will be appreciated.
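For reference, here is a minimal sketch of the setup described above. This assumes scikit-learn (`TfidfVectorizer`, `MultinomialNB`), which the question does not name explicitly, and uses a tiny toy corpus in place of the real 146 documents; the `token_pattern` shown is just one way to drop purely numeric tokens.

```python
# Hedged sketch of the described pipeline, assuming scikit-learn.
# Toy corpus stands in for the real 86 positive / 60 negative documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["copd breathing airflow", "copd lung obstruction",
        "malaria fever mosquito", "diarrhea dehydration fluid"]
labels = [1, 1, 0, 0]  # 1 = positive (COPD), 0 = negative

# ngram_range=(1, 1) as in the question; the token pattern keeps only
# alphabetic tokens, i.e. it removes numeric features.
vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r"\b[a-zA-Z]+\b")
X = vec.fit_transform(docs)  # tf-idf features as classifier input

clf = MultinomialNB()
clf.fit(X, labels)

# Predict on a new document (in practice, split train:test 3:1 first).
pred = clf.predict(vec.transform(["copd airflow obstruction"]))
```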
1 Answer
You are clearly overfitting on your training set. You need regularization to make your model generalize well to new data.
You can go for either the L2 norm or the dropout technique to prevent overfitting.
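To make the suggestion concrete, here is a hedged sketch, assuming the pipeline uses scikit-learn. Note that `MultinomialNB` itself exposes no L2 penalty (and dropout applies to neural networks), so the nearest practical knobs in this setup are Laplace smoothing via `alpha`, or switching to an L2-penalized linear model such as `LogisticRegression`; the toy count matrix below is invented for illustration.

```python
# Sketch (assumes scikit-learn): two ways to regularize this kind of model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy term-count matrix: rows are documents, columns are features.
X = np.array([[2, 0, 1],
              [3, 1, 0],
              [0, 2, 2],
              [1, 3, 1]], dtype=float)
y = [1, 1, 0, 0]

# (a) Larger alpha = stronger Laplace smoothing of per-class word probabilities.
nb = MultinomialNB(alpha=1.0)
nb.fit(X, y)

# (b) L2-penalized linear classifier; smaller C = stronger regularization.
lr = LogisticRegression(penalty="l2", C=0.1)
lr.fit(X, y)
```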

Akshay Bahadur
I am new to this field and do not know much about these overfitting techniques. But I am using only the transform() function when computing term frequency and tf-idf. To my knowledge, if I use fit_transform() on the testing data it will cause overfitting; am I right? I am very thankful for your concern @Akshay_Bahadur. If regularization is the only solution, can you please suggest a link for it? – Tanushree Tanu Apr 25 '18 at 18:06
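For what it's worth, the fit/transform split mentioned in the comment above can be sketched as follows (assuming scikit-learn's `TfidfVectorizer`; the documents are invented). Fitting on the training text only keeps the test data out of the vocabulary and idf statistics; calling fit_transform() on the test data would leak that information.

```python
# Sketch: fit the vectorizer on training text only, then transform test text.
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["copd airflow obstruction", "malaria fever mosquito"]
test_docs = ["copd cough", "dengue fever"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learns vocabulary + idf from train only
X_test = vec.transform(test_docs)        # reuses train vocabulary; unseen words are dropped

# X_train and X_test share the same feature space (same column count).
```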
I would recommend you search the web for that. To give you the gist, regularization adds a noise/penalty term to your loss function so that the model does not fit the training minimum too closely. – Akshay Bahadur Apr 25 '18 at 18:30