
I have a question about cross-validation: I'm using a Naive Bayes classifier to classify blog posts by author. When I evaluate on a single, manually held-out validation set (no k-fold cross-validation) I get an accuracy of 0.6, but when I do k-fold cross-validation, every fold yields a much higher accuracy (greater than 0.8).

For example:

(splitting manually): Validation Set Size: 1452, Training Set Size: 13063, Accuracy: 0.6033057851239669

and then

(with k-fold): Fold 0 -> Training Set Size: 13063, Validation Set Size: 1452, Accuracy: 0.8039702233250621 (all folds are over 0.8)

etc...
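For reference, the setup is roughly like the following sketch (a simplified stand-in, not my exact code: scikit-learn, MultinomialNB, and the synthetic X/y data are assumptions here):

    # Simplified sketch only: scikit-learn's MultinomialNB and synthetic count
    # features stand in for the real blog-post features and author labels.
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split, KFold
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(14515, 1000))   # 14515 "posts" as word-count vectors
    y = rng.integers(0, 10, size=14515)          # 10 "authors"

    # Manual split: hold out 1452 posts for validation, train on the remaining 13063.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1452, random_state=0)
    clf = MultinomialNB().fit(X_tr, y_tr)
    print("Hold-out accuracy:", accuracy_score(y_val, clf.predict(X_val)))

    # 10-fold cross-validation: every post is used for validation exactly once.
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    for i, (tr_idx, val_idx) in enumerate(kf.split(X)):
        fold_clf = MultinomialNB().fit(X[tr_idx], y[tr_idx])
        acc = accuracy_score(y[val_idx], fold_clf.predict(X[val_idx]))
        print(f"Fold {i} -> accuracy: {acc:.4f}")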

Why does this happen?

  • please do not cross-post your question on multiple SE sites... http://stats.stackexchange.com/questions/138449/accuracy-increases-using-cross-validation-and-decreases-without – cel Feb 20 '15 at 07:13

1 Answer


There are a few reasons this could happen:

  1. Your "manual" split is not random, and it happens to put more hard-to-predict outliers into the validation set. How are you doing this split?

  2. What is k in your k-fold CV? I'm also not sure what you mean by Validation Set Size: in k-fold CV there is no separate validation set, only a fold size. Each fold serves as the test set exactly once while the remaining folds are used for training, so the cross-validation runs over your entire data. Are you sure you're running k-fold cross-validation correctly?

Usually one picks k = 10 for k-fold cross-validation. If you run it correctly over your entire data, you should trust its estimate over the result of the single manual split.
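For example, a correct 10-fold run over the whole data set can be as short as the following sketch (scikit-learn assumed; MultinomialNB and the synthetic X/y arrays are placeholders for your real features and author labels):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score

    # Placeholder data; substitute your real feature matrix and author labels.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(14515, 1000))
    y = rng.integers(0, 10, size=14515)

    # 10-fold CV over the entire data set: each post is tested exactly once and the
    # overall estimate is the mean accuracy across the 10 folds.
    scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring="accuracy")
    print("Per-fold accuracies:", scores)
    print("Mean accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))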

  • k = 10. It turned out that the manual split was not being done correctly and the training and test sets were not perfectly disjoint. Now the accuracy is consistent. Thank you – mesllo Feb 20 '15 at 12:05