8

Regarding the seeding system when running machine learning algorithms with Scikit-Learn, there are three different things usually mentioned:

  • random.seed
  • np.random.seed
  • random_state at SkLearn (cross-validation iterators, ML algorithms etc)

I am already aware of this FAQ of SkLearn about how to fix the global seeding system, and of articles which point out that this should not be simply a FAQ.

My ultimate question is how can I get absolutely reproducible results when running an ML algorithm with SkLearn?

In more detail,

  • If I only use np.random.seed and do not specify any random_state at SkLearn then will my results be absolutely reproducible?

and one question at least for the sake of knowledge:

  • How exactly are np.random.seed and the random_state of SkLearn internally related? How does np.random.seed affect the seeding system (random_state) of SkLearn and make it (at least hypothetically speaking) reproduce the same results?
Outcast

3 Answers

6

Defining a random seed makes sure that every time you run the algorithm, the random number generator produces the same numbers. IMHO, the result will always be the same as long as we use the same data and the same values for all other parameters.

As you have read in sklearn's FAQ, it makes no difference whether you define the seed globally with numpy.random.seed() or by setting the random_state parameter in all algorithms involved, provided that you use the same number in both cases.

I take an example from the sklearn docs to illustrate it.

import numpy as np
from sklearn.model_selection import train_test_split
# np.random.seed(42)
X, y = np.arange(10).reshape((5, 2)), range(5)

# 1: running this many times, Xtr will remain [[4, 5],[0, 1],[6, 7]].
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, random_state=42)

# 2: try running this line many times, you will get various Xtr
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)

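Now uncomment the third line. Run it together with #2 many times (the seed call has to be executed again before each run of #2, otherwise the global generator keeps advancing). Xtr will always be [[4, 5],[0, 1],[6, 7]].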

Calling numpy.random.seed() without an argument sets the seed to the default (None); NumPy will then try to read data from /dev/urandom (or the Windows analogue) if available, or seed from the clock otherwise. docs
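
To make the link between np.random.seed and random_state more concrete, here is a minimal sketch using sklearn.utils.check_random_state, the helper scikit-learn uses to turn a random_state argument into a generator. The variable names are only illustrative, and the sketch assumes NumPy's legacy global generator, which is what np.random.seed controls.

```python
import numpy as np
from sklearn.utils import check_random_state

# random_state=None: scikit-learn falls back to NumPy's global generator,
# i.e. the one that np.random.seed() reseeds.
np.random.seed(42)
global_rng = check_random_state(None)
from_global = global_rng.permutation(5)

# random_state=42: scikit-learn builds a separate RandomState seeded with 42,
# independent of whatever the global generator is doing.
local_rng = check_random_state(42)
from_local = local_rng.permutation(5)

# Both generators start from the same seed, so their first draws agree.
same_stream = np.array_equal(from_global, from_local)
print(same_stream)
```

So a global seed and an integer random_state of the same value start from the same stream; the difference is only whether that stream is shared with the rest of the program.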

ipramusinto
  • Yes, it is actually clearer to me now that "whenever a RandomState instance or an integer random seed is not provided as an argument, it relies on the numpy global random state, which can be set using `numpy.random.seed`". – Outcast Oct 17 '18 at 11:53
  • However, one question of mine is: if you have not set any value to `numpy.random.seed` then it simply picks one in a random way by itself? – Outcast Oct 17 '18 at 11:54
  • Yes, it will pick a value, but of course not in a random way. How do you imagine ‘random’ in a computer? So if you don’t define a seed, then it will use, for example, the current timestamp as the seed. That’s why you will get different numbers when you generate them now and later. – ipramusinto Oct 17 '18 at 12:20
  • Apparently I am talking about this kind of randomness (like picking a numpy seed value according to a "random" element like timestamps etc). However, do we know that it uses timestamps? – Outcast Oct 17 '18 at 12:33
  • @PoeteMaudit if part of the answer doesn't answer your question please address this in a comment to the author, suggesting an edit to remove it is not the correct course of action – WhatsThePoint Oct 17 '18 at 12:48
  • @PoeteMaudit , https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.RandomState.html#numpy.random.RandomState – ipramusinto Oct 17 '18 at 12:53
  • Ok cool. So if you want, remove the `pickle` part from your post (because my question is about the seeding systems) and add in a well presentable way some of the things that we have discussed here in the comments and then I will tick your answer as correct. For now +1 for your effort :) – Outcast Oct 17 '18 at 13:22
0

In scikit-learn documentation examples, for instance here, they use np.random.seed(n) which seems to work.

  • Cool but I expect a more thorough analysis of the seeding system of sklearn and numpy from an answer on StackOverflow. – Outcast Oct 17 '18 at 11:44
  • While this link may assist in your answer to the question, you can improve this answer by taking vital parts of the link and putting it into your answer, this makes sure your answer is still an answer if the link gets changed or removed :) – WhatsThePoint Oct 17 '18 at 12:23
0

I was just playing with numpy and with sklearn. Apparently, setting np.random.seed does not guarantee a fixed random state for sklearn. We need to set the random_state parameter of each sklearn function to ensure repeatability.
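
One plausible reason for this, sketched below, is that the global generator is shared: any unrelated draw between the seed call and the sklearn call shifts the numbers sklearn receives, whereas an explicit random_state is not affected. The helper split and its extra_draws parameter are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), list(range(5))

def split(random_state, extra_draws):
    """Seed globally, optionally consume some global random numbers, then split."""
    np.random.seed(42)
    if extra_draws:
        # Stand-in for unrelated code that also uses the global generator.
        np.random.rand(extra_draws)
    Xtr, _, _, _ = train_test_split(X, y, test_size=0.33, random_state=random_state)
    return Xtr

# Explicit random_state: the split ignores the global generator entirely.
explicit_stable = np.array_equal(split(42, 0), split(42, 3))
print("explicit random_state unaffected by extra draws:", explicit_stable)

# Global seed only: the split depends on how much of the global stream
# was consumed first, so the two results are not guaranteed to match.
global_stable = np.array_equal(split(None, 0), split(None, 3))
print("global seed unaffected by extra draws:", global_stable)
```

With only a global seed, reproducibility therefore depends on every piece of code between the seed call and the estimator consuming random numbers in exactly the same order on every run.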

SKPS