2

Let's say I have a pandas DataFrame and apply sklearn.model_selection.train_test_split with the random_state parameter set to 1.

Let's say I then take the exact same pandas DataFrame and create a Spark DataFrame from it with an instance of SQLContext. If I apply the PySpark randomSplit function with the seed parameter set to 1, am I guaranteed to always obtain the exact same split?

gmds
Odisseo

1 Answer

3

In general, no.

Most "random" number generators are really deterministic functions that take some input value and generate a very long stream of bytes that can be converted into values of other types. The "randomness" comes from the fact that, given only values from this stream, however many you observe, it is very difficult to predict the next value or to recover the original input value.

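This input value is what we call a "seed". A given generator always produces the same stream for the same seed, which is why fixing the seed makes a single library reproducible.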

Whether the results are the same depends not only on the seed, but also on whether sklearn and PySpark use exactly the same random number generator implementation and consume its output in the same way, and potentially on the OS they run on, the processor architecture...
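To make that concrete, here is a minimal sketch using only the Python standard library. It does not call sklearn or PySpark, and neither generator below is claimed to be what those libraries use internally: `mersenne_stream` wraps Python's built-in `random.Random`, and `xorshift32_stream` is a toy generator written just for this illustration. The `split` helper is also hypothetical; it simply sends a row to the test set when its random draw falls below the test fraction.

```python
import random


def mersenne_stream(seed, n):
    """n floats in [0, 1) from Python's built-in Mersenne Twister."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def xorshift32_stream(seed, n):
    """n floats in [0, 1) from a toy 32-bit xorshift generator.

    Illustrative only: a different algorithm that accepts the same seed.
    """
    state = seed & 0xFFFFFFFF
    draws = []
    for _ in range(n):
        state ^= (state << 13) & 0xFFFFFFFF
        state ^= state >> 17
        state ^= (state << 5) & 0xFFFFFFFF
        draws.append(state / 2**32)
    return draws


def split(rows, draws, test_fraction=0.25):
    """Hypothetical splitter: a row goes to test if its draw < test_fraction."""
    train = [r for r, d in zip(rows, draws) if d >= test_fraction]
    test = [r for r, d in zip(rows, draws) if d < test_fraction]
    return train, test


rows = list(range(20))

split_a = split(rows, mersenne_stream(1, len(rows)))
split_b = split(rows, mersenne_stream(1, len(rows)))
split_c = split(rows, xorshift32_stream(1, len(rows)))

print(split_a == split_b)  # True: same generator, same seed
print(split_a == split_c)  # False: different generator, same seed
```

The seed only pins down the output of one particular generator. If you need the same split in both places, the safer route is to compute the split once and reuse those exact rows, rather than relying on matching seeds.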

gmds
  • Ah that makes sense. Thank you so much for the answer. I don't understand who voted me down because I feel like this is something worth checking for model reproducibility. – Odisseo Mar 31 '19 at 06:00