
I'm working on a classification problem with:

- 2,500 rows
- 25,000 columns
- 88 different classes, unevenly distributed

And then something very strange happened:

When I run a dozen different train/test splits, I always get scores around 60%...

And when I run cross-validations, I always get scores around 50%. Moreover, it has nothing to do with the unequal class distribution, because when I pass `stratify=y` to the train/test split I stay around 60%, and when I use a `StratifiedKFold` I stay around 50%.
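Roughly, the two evaluations look like this sketch (`X` and `y` stand for my data; the classifier and the exact parameters are simplified):

```python
# Sketch of the two evaluations (real preprocessing omitted).
# X, y: feature matrix (2,500 x 25,000) and labels (88 classes);
# the classifier is assumed to be a random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

forest = RandomForestClassifier()

# A dozen independent train/test splits -> accuracy around 60%
for i in range(12):
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))

# Plain cross-validation -> accuracy around 50%
print(cross_val_score(forest, X, y))
```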

Which score should I trust? Why the difference? To me, a CV was just a succession of train/test splits with different splits each time, so nothing justifies such a difference in score.

Arnaud Hureaux
  • Please **re-read** [How to ask](https://stackoverflow.com/help/how-to-ask), as it would seem that you missed some crucial points the first time you read it, namely "***DO NOT post images of code, data, error messages, etc.** - copy or type the text into the question*" (emphasis in the original). See why [an image of your code is not helpful](http://idownvotedbecau.se/imageofcode). – desertnaut Jul 10 '20 at 16:01

2 Answers


Short answer: add `shuffle=True` to your `KFold`: `cross_val_score(forest, X, y, cv=KFold(shuffle=True))`

Long answer: the difference between a succession of `train_test_split` calls and a cross-validation with a plain `KFold` is that `train_test_split` shuffles the data before splitting it into train and test sets, while `KFold` does not by default. The difference in score is probably because your dataset is sorted in a way that makes the unshuffled folds biased. So just add `shuffle=True` to your `KFold` (or your `StratifiedKFold`) and that's all you need to do.
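For example, a sketch reusing the `forest`, `X`, `y` from the question (the fold count and `random_state` are arbitrary):

```python
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Default KFold: folds are contiguous blocks of rows, so if the dataset is
# sorted (e.g. by class), some folds can miss entire classes.
scores_plain = cross_val_score(forest, X, y, cv=KFold(n_splits=5))

# Shuffled KFold: rows are shuffled once before being assigned to folds,
# which is what train_test_split does by default.
scores_shuffled = cross_val_score(
    forest, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

# Same idea, additionally keeping class proportions similar in every fold.
scores_stratified = cross_val_score(
    forest, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print(scores_plain.mean(), scores_shuffled.mean(), scores_stratified.mean())
```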

Chaussette

Your first code sample returns 5 values: for each of the 5 folds you train on 4/5 of the data and validate on the remaining 1/5. Your second sample trains on 5/6 of the data and validates on 1/6, and its results are better.

Maybe you can try a loop like `for k in range(5)` over several train/test splits, to see if you still get a big difference in score?
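Something like this sketch (again reusing `forest`, `X`, `y` from the question; `test_size` is left at its default):

```python
from sklearn.model_selection import train_test_split

# Repeat the split with different random states and look at the spread of
# the scores; compare it with the spread of the cross-validation scores.
for k in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=k
    )
    forest.fit(X_train, y_train)
    print(f"split {k}: {forest.score(X_test, y_test):.3f}")
```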

Maug