I am curious how the train_test_split() method of Scikit-learn will behave in the following scenario:

An imaginary dataset:

id, count, size
1, 4, 8
2, 5, 9
3, 6, 0

Say I divide it into two separate sets like this (keeping 'id' in both):

id, count      |       id, size
1, 4           |       1, 8
2, 5           |       2, 9
3, 6           |       3, 0

And split them both with train_test_split() with the same random_state of 0. Would the order of both be the same with 'id' as reference? (since you are shuffling the same dataset but with different parts left out)

I am curious as to how this works because I have two models. The first one gets trained with the dataset and adds its results to the dataset, part of which is then used to train the second model.

When doing this it's important that when testing the generalization of the second model, no data points are used which were also used to train the first model. This is because the data was 'seen before' and the model will know what to do with it, so then you are not testing the generalization to new data anymore.

It would be great if train_test_split() shuffled both sets the same way, since then one would not need to keep track of which data was used to train the first algorithm to prevent contamination of the test results.

NG.

1 Answer

They should have the same resulting indices if you use the same random_state parameter in each call, provided both sets have the same number of rows in the same order.

However, you could also just reverse your order of operations: call train_test_split() on the parent dataset, then create the two subsets from each of the resulting train and test sets.

Example:

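import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'id': [1, 2, 3], 'count': [4, 5, 6], 'size': [8, 9, 0]})
print(df)
#    id  count  size
# 0   1      4     8
# 1   2      5     9
# 2   3      6     0

dfa = df[['id', 'count']].copy()
dfb = df[['id', 'size']].copy()
rstate = 123
# Same random_state and same number of rows in both calls
traina, testa = train_test_split(dfa, random_state=rstate)
trainb, testb = train_test_split(dfb, random_state=rstate)
# Both asserts pass silently when the row selections match
assert traina.index.equals(trainb.index)
assert testa.index.equals(testb.index)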
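
The alternative mentioned above, splitting the parent dataset first and subsetting afterwards, could look roughly like the sketch below. The extra rows are made up for illustration (the question's dataset only has three), and the final checks on 'id' are one possible way to confirm that nothing the first model trained on ends up in the second model's test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: same columns as in the question, with extra made-up rows
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "count": [4, 5, 6, 7, 8, 9, 10, 11],
    "size": [8, 9, 0, 1, 2, 3, 4, 5],
})

# Split the parent frame once, then take column subsets of each part
train, test = train_test_split(df, random_state=0)
traina, testa = train[["id", "count"]], test[["id", "count"]]
trainb, testb = train[["id", "size"]], test[["id", "size"]]

# Both subsets come from the same split, so their ids line up exactly
assert traina["id"].tolist() == trainb["id"].tolist()
# No id used to train the first model appears in the second model's test set
assert set(traina["id"]).isdisjoint(testb["id"])
```

Because both subsets are derived from a single split, there is no need to rely on two separate calls producing the same shuffle.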
Brad Solomon
  • Changing the order is in my case not possible since the models do not necessarily get trained at the same time or by the same code. However, this is still the answer to my question. – NG. Dec 04 '17 at 15:04