
I am reading the Hands-On Machine Learning book. The author talks about the random seed used for the train/test split, and at one point says that over time the machine will get to see your whole dataset.

The author uses the following function to split the data into train and test sets:

import numpy as np

def split_train_test(data, test_ratio):
    # Shuffle the row indices, then carve off the first test_ratio fraction
    # as the test set and keep the rest for training.
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

The function is used like this:
>>> train_set, test_set = split_train_test(housing, 0.2)
>>> len(train_set)
16512
>>> len(test_set)
4128

Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.

Sachin Rastogi: Why and how will this impact my model performance? I understand that my model's accuracy will vary on each run, since the train set will always be different. But how will my model see the whole dataset over time?
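To make the concern concrete, here is a small simulation (purely illustrative; it reuses the split_train_test function from above on a toy DataFrame):

import pandas as pd

data = pd.DataFrame({"x": range(1000)})   # toy dataset, 1,000 rows

seen_in_test = set()
for run in range(30):                     # 30 re-runs, no fixed seed
    _, test_set = split_train_test(data, 0.2)
    seen_in_test.update(test_set.index)

# After enough runs nearly every row has landed in a test set at least once,
# i.e. the "held-out" data has effectively been exposed across runs.
print(len(seen_in_test), "of", len(data), "rows have appeared in a test set")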

The author also provides a few solutions:

One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator’s seed (e.g., np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices.
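A quick, self-contained illustration of the seed option (the array size is arbitrary, just for demonstration):

import numpy as np

np.random.seed(42)
first = np.random.permutation(10)

np.random.seed(42)                        # same seed again
second = np.random.permutation(10)

print(np.array_equal(first, second))      # True: identical shuffled indices, so identical split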

But both these solutions will break next time you fetch an updated dataset. A common solution is to use each instance’s identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier).

Sachin Rastogi: Would that be a good train/test split? I think not; the train and test sets should contain elements from across the dataset to avoid any bias in the train set.

The author gives an example:

You could compute a hash of each instance’s identifier and put that instance in the test set if the hash is lower or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset.

The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
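For reference, a sketch of what such an identifier-hash split could look like (the helper names and the choice of crc32 are illustrative assumptions, not necessarily the book's exact code):

from zlib import crc32

import numpy as np

def is_in_test_set(identifier, test_ratio):
    # Hash the identifier to a 32-bit value and assign the instance to the
    # test set iff the hash falls in the lowest test_ratio fraction of the range.
    return crc32(np.int64(identifier).tobytes()) < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    # Decide membership per row from its identifier only, so the assignment
    # does not depend on how many rows the dataset currently has.
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]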

Sachin Rastogi: I am not able to understand this solution. Could you please help?

  • Imagine you want to use approx. 50% of some dataset, uniformly randomly chosen. If the dataset grows, you want to keep the 50% idea while treating the old instances exactly as before (chosen or not chosen): hash each instance to some N-bit hash value, e.g. i0: 0110, i1: 1001, i2: 1000. Now "picking" means: the leftmost bit is 0. This is consistent and picks approximately 50% of your data. The `is lower or equal to 20% of the maximum hash value` rule is just the generalization to values other than 50%. – sascha May 29 '19 at 17:48
  • The `leftmost-bit = 0` rule is a simple decider function here, and statistics guarantees it approximates 50%. Assuming your hash is uniform over e.g. 32 bits and you target 30%, you need a decider function that picks 30% of all possible 32-bit strings (in a deterministic way). Which one does not matter (by the uniform-randomness assumption on the hash!). This is somewhat related to classic bounded-integer sampling in PRNGs. See [here](http://www.pcg-random.org/posts/bounded-rands.html) – sascha May 29 '19 at 17:54
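A tiny illustration of that decider-function idea (MD5 and the helper name are just for demonstration; any uniform hash would do):

import hashlib

def keep_instance(identifier, ratio):
    # Hash the identifier to a 64-bit integer (first 8 bytes of MD5) and keep
    # it iff the hash is below `ratio` of the maximum possible hash value.
    h = int.from_bytes(hashlib.md5(str(identifier).encode()).digest()[:8], "big")
    return h < ratio * 2**64

# "Leftmost bit is 0" is exactly the ratio = 0.5 case: h < 2**63.
picked = [i for i in range(10_000) if keep_instance(i, 0.5)]
print(len(picked) / 10_000)   # ~0.5, and the same instances are picked on every run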

1 Answer


For me, these are the answers:

  1. The point here is that you should put aside part of your data (which will constitute your test set) before training the model. Indeed, what you want to achieve is to be able to generalize well on unseen examples. By running the code that you have shown, you'll get different test sets over time; in other words, you'll always train your model on different subsets of your data (and possibly on data that you previously marked as test data). This in turn will affect training and, in the limit, there will be nothing left to generalize to.

  2. This will indeed be a solution satisfying the previous requirement (of having a stable test set), provided that no new data are added.

  3. As said in the comments to your question, by hashing each instance's identifier you can be sure that old instances always get assigned to the same subsets.

    • Instances that were put in the training set before the update of the dataset will remain there (their hash value won't change, so it will stay above 0.2*max_hash_value);
    • Instances that were put in the test set before the update of the dataset will remain there (their hash value won't change, so it will stay at or below 0.2*max_hash_value).

    The updated test set will contain 20% of the new instances plus all of the instances from the old test set, so it remains stable across dataset updates (a quick check is sketched below).
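A quick check of that stability property, reusing the split_train_test_by_id sketch from above (the toy DataFrames are just for illustration):

import pandas as pd

old = pd.DataFrame({"id": range(1000)})
old_train, old_test = split_train_test_by_id(old, 0.2, "id")

new = pd.DataFrame({"id": range(1500)})    # the same 1,000 rows plus 500 new ones
new_train, new_test = split_train_test_by_id(new, 0.2, "id")

# Old rows keep their assignment, so the old test set never leaks into the new training set.
print(set(old_test["id"]).issubset(set(new_test["id"])))     # True
print(set(old_train["id"]).issubset(set(new_train["id"])))   # True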

I would also suggest looking here for an explanation from the author: https://github.com/ageron/handson-ml/issues/71.

amiola