
What is the difference between using the stratify argument of sklearn's train_test_split function and using StratifiedShuffleSplit? Don't they do the same thing?

desertnaut
Rohan Pinto

1 Answer


These two perform different operations.

train_test_split, as its name clearly implies, is used for splitting the data into a single training subset and a single test subset, and the stratify argument permits doing this in a stratified way.
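As a minimal sketch of that single stratified split (the toy arrays, the 80/20 class balance, and the test_size value are assumptions made up for illustration, not part of the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# One call produces exactly one train/test pair; stratify=y keeps
# the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fraction of class 1 in each subset (both should match the 20% in y).
print(len(y_train), y_train.mean())
print(len(y_test), y_test.mean())
```

The call returns the split arrays themselves, not indices and not an iterator over several splits.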

StratifiedShuffleSplit, on the other hand, provides splits for cross-validation; from the docs:

Stratified ShuffleSplit cross-validator

Provides train/test indices to split data in train/test sets.

Notice the plural sets (emphasis mine).

So, StratifiedShuffleSplit is meant to be used in place of a cross-validator such as KFold when we want to ensure the CV splits are stratified; it is not meant to replace train_test_split.
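A rough sketch of the cross-validation use, again on made-up toy data (the arrays, n_splits=5, and test_size=0.25 are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# The cross-validator yields n_splits independent, stratified
# train/test index pairs rather than one pair of arrays.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
splits = list(sss.split(X, y))

for i, (train_idx, test_idx) in enumerate(splits):
    # Each test fold keeps the class proportions of y.
    print(i, len(train_idx), len(test_idx), y[test_idx].mean())
```

An object like sss can also be passed as the cv argument of helpers such as cross_val_score, which is the usual way a cross-validator is consumed.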

desertnaut
  • If you set `n_splits=1` on `StratifiedShuffleSplit`, I would expect the same split as `train_test_split` with `stratify`, given the same `random_state`, but I am getting different results. Any idea why? – LazyEval Oct 01 '22 at 18:08
  • @LazyEval No; please open a new question on this. – desertnaut Oct 01 '22 at 18:31