2

I came across the following statement when trying to find the difference between train_test_split and StratifiedShuffleSplit.

When stratify is not None train_test_split uses StratifiedShuffleSplit internally,

I was just wondering why the StratifiedShuffleSplit from sklearn.model_selection is used when we can use the stratify argument available in train_test_split.

adiaux
  • When quoting, please always include a link to the source; such a statement cannot be found in the current version of the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). – desertnaut Mar 23 '21 at 10:23

1 Answer

2

Mainly, it is done for the sake of reusability: rather than duplicating the code already implemented in StratifiedShuffleSplit, train_test_split simply calls that class. For the same reason, when stratify=None it uses the model_selection.ShuffleSplit class (see the source code).
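As a sketch of that delegation (the identical per-class counts below follow from train_test_split handing the work to StratifiedShuffleSplit internally; the toy data here is my own):

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 4)  # imbalanced labels to stratify on

# Split via train_test_split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The same kind of split obtained directly from StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

# Both preserve the 60/40 class ratio in train and test
print(np.bincount(y_train), np.bincount(y[train_idx]))
```

Either way you end up with the same stratified allocation of samples per class.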

Please note that duplicating code is considered bad practice: not only is it assumed to inflate maintenance costs, it is also considered defect-prone, since inconsistent changes to code duplicates can lead to unexpected behavior. Here is a reference if you'd like to learn more.

Besides, although they perform the same task, they cannot always be used in the same contexts. For example, train_test_split cannot be used within a random or grid search with sklearn.model_selection.RandomizedSearchCV or sklearn.model_selection.GridSearchCV, while StratifiedShuffleSplit can. The reason is that the former is not "an iterable yielding (train, test) splits as arrays of indices", whereas the latter has a split method that yields exactly such splits. More info here (see the cv parameter).
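A minimal sketch of that second point (the synthetic dataset and parameter grid are my own choices for illustration): a StratifiedShuffleSplit instance can be passed directly as the cv argument of GridSearchCV, because GridSearchCV only needs an object whose split method yields (train, test) index arrays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

X, y = make_classification(n_samples=100, n_classes=2, random_state=0)

# Each of the 5 splits preserves the class proportions of y
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=cv,  # a cross-validation splitter, something train_test_split cannot be
)
search.fit(X, y)
print(search.best_params_)
```

There is no way to hand train_test_split itself to cv here, since it returns arrays rather than yielding index splits.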

s.dallapalma