
I have a question about cross-validation: I'm using a Naive Bayes classifier to classify blog posts by author. When I evaluate on a single, manually held-out validation set (no k-fold cross-validation) I get an accuracy of 0.6, but when I do k-fold cross-validation, every fold yields a much higher accuracy (greater than 0.8).

For example:

(splitting manually): Validation Set Size: 1452, Training Set Size: 13063, Accuracy: 0.6033057851239669

and then

(with k-fold): Fold 0 -> Training Set Size: 13063, Validation Set Size: 1452, Accuracy: 0.8039702233250621 (all folds are over 0.8)

etc...
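For reference, the setup is roughly like the following sketch (a simplified stand-in, not my exact code: scikit-learn, MultinomialNB, and the synthetic X/y data are assumptions here):

    # Simplified sketch only: scikit-learn's MultinomialNB and synthetic count
    # features stand in for the real blog-post features and author labels.
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split, KFold
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(14515, 1000))   # 14515 "posts" as word-count vectors
    y = rng.integers(0, 10, size=14515)          # 10 "authors"

    # Manual split: hold out 1452 posts for validation, train on the remaining 13063.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1452, random_state=0)
    clf = MultinomialNB().fit(X_tr, y_tr)
    print("Hold-out accuracy:", accuracy_score(y_val, clf.predict(X_val)))

    # 10-fold cross-validation: every post is used for validation exactly once.
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    for i, (tr_idx, val_idx) in enumerate(kf.split(X)):
        fold_clf = MultinomialNB().fit(X[tr_idx], y[tr_idx])
        acc = accuracy_score(y[val_idx], fold_clf.predict(X[val_idx]))
        print(f"Fold {i} -> accuracy: {acc:.4f}")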

Why does this happen?

  • please do not cross-post your question on multiple SE sites... http://stats.stackexchange.com/questions/138449/accuracy-increases-using-cross-validation-and-decreases-without – cel Feb 20 '15 at 07:13

1 Answer


There are a few reasons this could happen:

  1. Your "manual" split is not random, and it happens to put more hard-to-predict outliers into the validation set. How are you doing this split?

  2. What is k in your k-fold CV? I'm also not sure what you mean by Validation Set Size: in k-fold CV there is no separate validation set, only a fold size. Each fold serves as the test set exactly once while the remaining folds are used for training, so the cross-validation runs over your entire data. Are you sure you're running k-fold cross-validation correctly?

Usually one picks k = 10 for k-fold cross-validation. If you run it correctly over your entire data, you should trust its estimate over the result of the single manual split.
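For example, a correct 10-fold run over the whole data set can be as short as the following sketch (scikit-learn assumed; MultinomialNB and the synthetic X/y arrays are placeholders for your real features and author labels):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score

    # Placeholder data; substitute your real feature matrix and author labels.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(14515, 1000))
    y = rng.integers(0, 10, size=14515)

    # 10-fold CV over the entire data set: each post is tested exactly once and the
    # overall estimate is the mean accuracy across the 10 folds.
    scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring="accuracy")
    print("Per-fold accuracies:", scores)
    print("Mean accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))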

  • k = 10. It turned out that the manual split was not being done correctly and the training and test sets were not perfectly disjoint. Now the accuracy is consistent. Thank you – mesllo Feb 20 '15 at 12:05