I am curious how scikit-learn's train_test_split() function behaves in the following scenario:
An imaginary dataset:
id, count, size
1, 4, 8
2, 5, 9
3, 6, 0
Say I divide it into two separate sets like this (keeping 'id' in both):
id, count | id, size
1, 4 | 1, 8
2, 5 | 2, 9
3, 6 | 3, 0
And then split them both with train_test_split(), using the same random_state of 0. Would the order of both be the same with 'id' as reference? (After all, you are shuffling the same dataset, just with different columns left out.)
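To make the scenario concrete, here is a minimal sketch of the check I have in mind. It assumes the data is held in NumPy arrays, that the default test_size is used, and that no stratify argument is passed; the variable names are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The imaginary dataset: columns are id, count, size.
data = np.array([[1, 4, 8],
                 [2, 5, 9],
                 [3, 6, 0]])

# Two views of the same rows, each keeping the 'id' column.
id_count = data[:, [0, 1]]
id_size = data[:, [0, 2]]

# Split each view separately, with the same random_state.
count_train, count_test = train_test_split(id_count, random_state=0)
size_train, size_test = train_test_split(id_size, random_state=0)

# Compare the 'id' column of the resulting splits.
same_train = np.array_equal(count_train[:, 0], size_train[:, 0])
same_test = np.array_equal(count_test[:, 0], size_test[:, 0])
print(same_train, same_test)
```

As far as I can tell, the shuffle here depends only on the number of rows, the split sizes, and random_state, not on the column values, so both comparisons should come out True. I would not expect that to hold if the two sets had different numbers of rows, or if stratify were used with different labels.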
I am curious as to how this works because I have two models. The first one is trained on the dataset and adds its results to the dataset, part of which is then used to train the second model.
When testing the generalization of the second model, it is important that no data points are used that were also used to train the first model. That data has been 'seen before', so the model will already know what to do with it, and you are no longer testing generalization to new data.
It would be great if train_test_split() shuffled both sets the same way, since then one would not need to keep track of which data was used to train the first algorithm in order to prevent contamination of the test results.
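For comparison, the bookkeeping I would like to avoid looks roughly like the sketch below: split the ids once and select rows for both models from that single split. The names first_model_train and second_model_test are hypothetical, and the column layout is the same as in the imaginary dataset above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The imaginary dataset: columns are id, count, size.
data = np.array([[1, 4, 8],
                 [2, 5, 9],
                 [3, 6, 0]])

# Split the ids once; every later selection is derived from this split.
train_ids, test_ids = train_test_split(data[:, 0], random_state=0)

# Boolean mask marking the rows whose id landed in the training split.
train_mask = np.isin(data[:, 0], train_ids)

# Rows the first model may train on (id, count).
first_model_train = data[train_mask][:, [0, 1]]

# Rows held out for testing the second model (id, size).
second_model_test = data[~train_mask][:, [0, 2]]
```

This does not rely on two separate calls shuffling identically, but it does mean carrying train_ids and test_ids around between the two models.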