PySpark randomSplit vs SkLearn Train Test Split - Random Seed Question

Question

Let's say I have a pandas DataFrame and apply sklearn.model_selection.train_test_split with the random_state parameter set to 1.

Let's say I then take the exact same pandas DataFrame and create a Spark DataFrame from it with an instance of SQLContext. If I apply the PySpark randomSplit function with the seed parameter set to 1, am I guaranteed to always obtain exactly the same split as the one from sklearn?

score 3 · Accepted Answer · answered Mar 31 '19 at 05:33

In general, no.

Most "random" number generators are really deterministic functions that take some input value and generate a very long stream of bytes that can be converted into values of other types. The "randomness" comes from the fact that, given only values from this stream, however many you collect, it is very difficult to predict the next value or to recover the original input value. Given the same input value, the same generator always produces the same stream.

This input value is what we call a "seed".
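As a small illustration of that determinism, here is a sketch using Python's standard random module. It is not what sklearn or PySpark use internally, and first_values is a hypothetical helper named only for this example.

```python
import random


def first_values(seed, count=5):
    """Return the first `count` floats from a generator seeded with `seed`.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # independent generator, does not touch global state
    return [rng.random() for _ in range(count)]


same_seed_a = first_values(1)
same_seed_b = first_values(1)
other_seed = first_values(2)

print(same_seed_a == same_seed_b)  # True: same generator, same seed, same stream
print(same_seed_a == other_seed)   # False: a different seed gives a different stream
```

The guarantee only holds within one generator implementation; a different library is free to map the seed 1 to a completely different stream.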

Whether the results are the same depends not only on the seed, but also on whether sklearn and PySpark use exactly the same random number generator implementation, on how each library turns the random stream into a split, on the OS they run on, the processor architecture, and so on.
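Even with one generator and one seed, two splitting procedures can consume the random stream differently. The sketch below uses only Python's standard random module. shuffle_split and per_row_split are hypothetical helpers written for illustration, loosely modelled on a permutation-based split and a per-row sampling split. They are assumptions for the sake of the example, not the actual sklearn or PySpark implementations.

```python
import random


def shuffle_split(rows, test_fraction, seed):
    """Permute all rows, then cut at a fixed position (exact split sizes).

    Hypothetical helper, not sklearn's real code.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def per_row_split(rows, test_fraction, seed):
    """Draw one number per row and route the row by threshold (approximate sizes).

    Hypothetical helper, not PySpark's real code.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < test_fraction:
            test.append(row)
        else:
            train.append(row)
    return train, test


rows = list(range(20))
train_a, test_a = shuffle_split(rows, 0.25, seed=1)
train_b, test_b = per_row_split(rows, 0.25, seed=1)

print(sorted(test_a))
print(sorted(test_b))
```

Each function is reproducible on its own when called again with the same seed, but the two printed test sets will typically not match each other, and only the shuffle-based version fixes the test set size in advance. For reproducibility across tools, a more robust approach is to split once and store the resulting row IDs.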

Ah that makes sense. Thank you so much for the answer. I don't understand who voted me down because I feel like this is something worth checking for model reproducibility. — Odisseo, Mar 31 '19 at 06:00
