[Images: cv accuracy, cv accuracy graph, test accuracy]

I am trying to implement Naive Bayes on the Amazon Fine Food Reviews dataset. Can you review the code and tell me why there is such a big difference between the cross-validation accuracy and the test accuracy?

Conceptually, is there anything wrong with the code below?

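#BOW()

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB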
bow = CountVectorizer(ngram_range = (2,3))
bow_vect = bow.fit(X_train["F_review"].values)
bow_sparse = bow_vect.transform(X_train["F_review"].values)
X_bow = bow_sparse
y_bow = y_train



roc = []
accuracy = []
f1 = []
k_value = []  # alpha values tried, in the same order as the metric lists
for i in range(1, 50, 2):
  BNB = BernoulliNB(alpha=i)

  print("************* for alpha = ", i, "*************")
  # 10-fold cross-validation on the training set only
  x = cross_validate(BNB, X_bow, y_bow, scoring=['accuracy', 'f1', 'roc_auc'], return_train_score=False, cv=10)
  print(x["test_roc_auc"].mean())
  print("-----c------break------c-------break-------c-----------")
  roc.append(x['test_roc_auc'].mean())  # mean ROC AUC across folds
  accuracy.append(x['test_accuracy'].mean())  # mean accuracy across folds
  f1.append(x['test_f1'].mean())  # mean F1 score across folds

  k_value.append(i)


#BOW Test prediction
BNB =BernoulliNB(alpha= 1)
BNB.fit(X_bow, y_bow)
y_pred = BNB.predict(bow_vect.transform(X_test["F_review"]))
print("Accuracy Score: ",accuracy_score(y_test,y_pred))
print("ROC: ", roc_auc_score(y_test,y_pred))
print("Confusion Matrix: ", confusion_matrix(y_test,y_pred))
desertnaut
Hardik Bapna
  • For hyperparameter tuning, use [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) or [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Also change the range of alpha to something like alpha_set = [1e-3, 1e-2, 1e-1, 1e, 1e2, 1e3, 1e4] – Kalsi Sep 11 '18 at 01:34

1 Answer

Use one of the metrics to find the optimal alpha value. Then train BernoulliNB with that alpha on the training data and evaluate it on the test data.

Also, don't rely on accuracy as the performance measure: it is misleading on an imbalanced dataset, where a model can score well just by predicting the majority class.
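To see why accuracy can mislead here, consider a minimal sketch in plain Python. The 90/10 class split and the always-positive "model" are hypothetical, chosen only for illustration; they are not taken from the review dataset.

```python
# Hypothetical labels: 90 positive reviews, 10 negative ones.
y_true = [1] * 90 + [0] * 10
# A useless "model" that always predicts the majority class.
y_pred = [1] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


acc = accuracy(y_true, y_pred)
rec_pos = recall(y_true, y_pred, 1)
rec_neg = recall(y_true, y_pred, 0)
# Balanced accuracy is the mean of the per-class recalls.
balanced_acc = (rec_pos + rec_neg) / 2

print("accuracy:", acc)
print("recall on negative class:", rec_neg)
print("balanced accuracy:", balanced_acc)
```

The accuracy looks high even though no negative review is ever detected, which is why a metric such as AUC or F1 is a safer basis for choosing alpha.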

Before doing anything else, change the alpha values in the loop as Kalsi suggested in the comment.

  • Put the alpha values suggested above in a list.
  • Find the maximum AUC value and its index.
  • Use that index to look up the optimal alpha.
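The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the asker's pipeline: the toy corpus, the log-spaced `alphas` grid, the default unigram `CountVectorizer`, and `cv=3` are all stand-ins so that the snippet runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus standing in for X_train["F_review"] and y_train.
positive = ["great tasty food", "love this product", "excellent flavor and taste",
            "tasty and fresh", "great product love it", "excellent fresh food"]
negative = ["awful stale food", "hate this product", "terrible flavor and smell",
            "stale and bland", "awful product hate it", "terrible bland food"]
docs = positive + negative
y = np.array([1] * len(positive) + [0] * len(negative))

X = CountVectorizer().fit_transform(docs)

# Step 1: alpha values in a list (an assumed log-spaced grid).
alphas = [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]

# Cross-validate each alpha on the training data and record the mean AUC.
mean_auc = []
for alpha in alphas:
    scores = cross_validate(BernoulliNB(alpha=alpha), X, y,
                            scoring=["roc_auc"], cv=3)
    mean_auc.append(scores["test_roc_auc"].mean())

# Step 2: find the maximum AUC value and its index.
best_idx = int(np.argmax(mean_auc))

# Step 3: use that index to look up the optimal alpha.
best_alpha = alphas[best_idx]
print("best alpha:", best_alpha, "mean CV AUC:", mean_auc[best_idx])

# Refit on the full training set with the chosen alpha.
final_model = BernoulliNB(alpha=best_alpha).fit(X, y)
```

To score the held-out set, transform the test text with the same fitted vectorizer and pass `final_model.predict_proba(X_test)[:, 1]` to `roc_auc_score`, so that the test AUC is computed from probabilities in the same way as the cross-validated `roc_auc`.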
User1312