2

I came across the following statement when trying to find the difference between train_test_split and StratifiedShuffleSplit.

When stratify is not None train_test_split uses StratifiedShuffleSplit internally,

I was just wondering why the StratifiedShuffleSplit from sklearn.model_selection is used when we can use the stratify argument available in train_test_split.

adiaux
  • When quoting, please always include a link to the source; such a statement cannot be found in the current version of the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). – desertnaut Mar 23 '21 at 10:23

1 Answer

2

Mainly, it is done for the sake of reusability: rather than duplicating the code already implemented in StratifiedShuffleSplit, train_test_split simply calls that class. For the same reason, when stratify=None it uses the model_selection.ShuffleSplit class (see the source code).
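As a sketch of that delegation (the identical per-class counts below follow from train_test_split handing the work to StratifiedShuffleSplit internally; the toy data here is my own):

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 4)  # imbalanced labels to stratify on

# Split via train_test_split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The same kind of split obtained directly from StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

# Both preserve the 60/40 class ratio in train and test
print(np.bincount(y_train), np.bincount(y[train_idx]))
```

Either way you end up with the same stratified allocation of samples per class.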

Please note that duplicating code is considered bad practice: not only is it assumed to inflate maintenance costs, it is also considered defect-prone, since inconsistent changes to code duplicates can lead to unexpected behavior. Here is a reference if you'd like to learn more.

Besides, although they perform the same task, they cannot always be used in the same contexts. For example, train_test_split cannot be used within a random or grid search with sklearn.model_selection.RandomizedSearchCV or sklearn.model_selection.GridSearchCV, while StratifiedShuffleSplit can. The reason is that the former is not "an iterable yielding (train, test) splits as arrays of indices", whereas the latter has a split method that yields exactly such splits. More info here (see the cv parameter).
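A minimal sketch of that second point (the synthetic dataset and parameter grid are my own choices for illustration): a StratifiedShuffleSplit instance can be passed directly as the cv argument of GridSearchCV, because GridSearchCV only needs an object whose split method yields (train, test) index arrays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

X, y = make_classification(n_samples=100, n_classes=2, random_state=0)

# Each of the 5 splits preserves the class proportions of y
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=cv,  # a cross-validation splitter, something train_test_split cannot be
)
search.fit(X, y)
print(search.best_params_)
```

There is no way to hand train_test_split itself to cv here, since it returns arrays rather than yielding index splits.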

s.dallapalma