What is the difference between using the `stratify` argument in the `train_test_split` function of sklearn, and the `StratifiedShuffleSplit` function? Don't they do the same thing?

desertnaut

Rohan Pinto
1 Answer
These two utilities perform different operations.

`train_test_split`, as its name clearly implies, is used for splitting the data into a single training and a single test subset, and the `stratify` argument permits doing this in a stratified way.
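To make the single-split behaviour concrete, here is a minimal sketch. The toy data (80 samples of class 0, 20 of class 1) and the variable names are made up for illustration; only `train_test_split` and its `stratify`, `test_size` and `random_state` arguments come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# One call, one split: a single training subset and a single test subset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# With stratify=y, both subsets keep the 80/20 class ratio
print(np.bincount(y_train))  # class counts in the training subset
print(np.bincount(y_test))   # class counts in the test subset
```

With these sizes the class ratio divides evenly, so the training subset should hold 60 and 15 samples of the two classes, and the test subset 20 and 5.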
`StratifiedShuffleSplit`, on the other hand, provides splits for cross-validation; from the docs:

> Stratified ShuffleSplit cross-validator
>
> Provides train/test indices to split data in train/test **sets**.

Notice the plural *sets* (emphasis mine).
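A small sketch of what the plural means in practice, reusing the same made-up toy data as an assumption: the `split()` method yields one pair of index arrays per requested split, rather than returning a single pair of subsets.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

# split() is a generator of (train_indices, test_indices) pairs
splits = list(sss.split(X, y))

for i, (train_idx, test_idx) in enumerate(splits):
    # Each split is stratified on its own: the test part keeps the 80/20 ratio
    print(i, len(train_idx), len(test_idx), np.bincount(y[test_idx]))
```

Note that it returns indices, not the data themselves, so you index `X` and `y` with them yourself.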
So, `StratifiedShuffleSplit` is there to be used instead of `KFold` when we want to ensure the CV splits are stratified, and not to replace `train_test_split`.
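As a rough illustration of that usage, the sketch below passes a `StratifiedShuffleSplit` object as the `cv` argument of `cross_val_score`, in the place where a `KFold` object could otherwise go. The toy data and the choice of `LogisticRegression` are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Hypothetical toy data: one feature in [0, 1], 80 samples of class 0, 20 of class 1
X = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# Five stratified random splits, each holding out 25% of the data
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

# The cross-validator is passed as cv; one score is returned per split
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
print(scores)
```

You get one score per split, five here, which is what you would average for a cross-validated estimate.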

desertnaut
- If you set `n_splits=1` on the `StratifiedShuffleSplit`, then I would expect the same split as `train_test_split` with `stratify`, given the same `random_state`, but I am getting different results. Any idea why? – LazyEval Oct 01 '22 at 18:08
- @LazyEval No; please open a new question on this. – desertnaut Oct 01 '22 at 18:31