
I'm working on a classification problem with:

- 2,500 rows
- 25,000 columns
- 88 different classes, unevenly distributed

And then something very strange happened:

When I run a dozen different train/test splits, I always get scores around 60%...

And when I run cross-validations, I always get scores around 50%. Moreover, it has nothing to do with the unequal class distribution, because when I pass `stratify=y` to the train/test split I stay around 60%, and when I use a `StratifiedKFold` I stay around 50%.
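Roughly, the two evaluations look like this sketch (`X` and `y` stand for my data; the classifier and the exact parameters are simplified):

```python
# Sketch of the two evaluations (real preprocessing omitted).
# X, y: feature matrix (2,500 x 25,000) and labels (88 classes);
# the classifier is assumed to be a random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

forest = RandomForestClassifier()

# A dozen independent train/test splits -> accuracy around 60%
for i in range(12):
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))

# Plain cross-validation -> accuracy around 50%
print(cross_val_score(forest, X, y))
```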

Which score should I trust? Why the difference? To me, a CV was just a succession of train/test splits with different splits each time, so nothing justifies such a difference in score.

Arnaud Hureaux
  • Please **re-read** [How to ask](https://stackoverflow.com/help/how-to-ask), as it would seem that you missed some crucial points the first time you read it, namely "***DO NOT post images of code, data, error messages, etc.** - copy or type the text into the question*" (emphasis in the original). See why [an image of your code is not helpful](http://idownvotedbecau.se/imageofcode). – desertnaut Jul 10 '20 at 16:01

2 Answers


Short answer: add `shuffle=True` to your `KFold`: `cross_val_score(forest, X, y, cv=KFold(shuffle=True))`

Long answer: the difference between a succession of `train_test_split` calls and a cross-validation with a plain `KFold` is that `train_test_split` shuffles the data before splitting it into train and test sets, while `KFold` does not by default. The difference in score is probably because your dataset is sorted in a way that makes the unshuffled folds biased. So just add `shuffle=True` to your `KFold` (or your `StratifiedKFold`) and that's all you need to do.
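For example, a sketch reusing the `forest`, `X`, `y` from the question (the fold count and `random_state` are arbitrary):

```python
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Default KFold: folds are contiguous blocks of rows, so if the dataset is
# sorted (e.g. by class), some folds can miss entire classes.
scores_plain = cross_val_score(forest, X, y, cv=KFold(n_splits=5))

# Shuffled KFold: rows are shuffled once before being assigned to folds,
# which is what train_test_split does by default.
scores_shuffled = cross_val_score(
    forest, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

# Same idea, additionally keeping class proportions similar in every fold.
scores_stratified = cross_val_score(
    forest, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print(scores_plain.mean(), scores_shuffled.mean(), scores_stratified.mean())
```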

Chaussette

Your first code sample returns 5 values: for each of the 5 folds you train on 4/5 of the data and validate on the remaining 1/5. Your second sample trains on 5/6 of the data and validates on 1/6, and its results are better.

Maybe you can try a loop like `for k in range(5)` over several train/test splits, to see if you still get a big difference in score?
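Something like this sketch (again reusing `forest`, `X`, `y` from the question; `test_size` is left at its default):

```python
from sklearn.model_selection import train_test_split

# Repeat the split with different random states and look at the spread of
# the scores; compare it with the spread of the cross-validation scores.
for k in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=k
    )
    forest.fit(X_train, y_train)
    print(f"split {k}: {forest.score(X_test, y_test):.3f}")
```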

Maug