In my code I use the dedupe library to match records between two datasets. The library draws random numbers from Python's `random` module and from NumPy's `random` submodule, but it provides no way to set a seed for either.
In our use case it is important to have reproducible results, so I need to set a seed myself.
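I used the following code to do so:

    import os
    import random
    import numpy as np
    random.seed(seed, version=3)              # seed Python's built-in generator
    os.environ['PYTHONHASHSEED'] = str(seed)  # try to fix string hashing
    np.random.seed(seed)                      # seed NumPy's global generator
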
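
One thing that may be worth checking about the seeding code above: as far as I understand, `PYTHONHASHSEED` is read when the interpreter starts, so assigning it through `os.environ` inside a running script should not change string hashing in that same process. Below is a minimal sketch of how that could be verified, by starting fresh interpreters with the variable already set. The helper name `hash_in_child`, the test string `'dedupe'`, and the value `42` are only illustrative and not part of my actual code.

```python
import os
import subprocess
import sys


def hash_in_child(hash_seed):
    """Return hash('dedupe') as computed by a fresh interpreter.

    The child is started with PYTHONHASHSEED already present in its
    environment, which is the point at which the variable is read.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('dedupe'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Compare two separate interpreters started with the same hash seed.
first = hash_in_child(42)
second = hash_in_child(42)
print("string hashes agree across interpreters:", first == second)
```

If the hashes only agree when the variable is set before startup, then on AWS it would have to be set in the environment that launches the script rather than in the script itself.
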
On my local machine this works and I get the same result every time.
On AWS, however, when I push the same code I get almost consistent results: I will get the same result n times in a row, but the next one will be different. In addition, when the result is different, it seems to be different in a similar way each time.
To illustrate this I added a final `random.randint(1,1000000)` call at the end of the code to see whether it would return the same number. (I know that different states could produce the same number and that I should really compare `random.getstate()`. I did that as well and saw the same pattern, but the state is hard to paste and check, so I am using this as a reasonably high-resolution proxy.)
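
A more compact way to compare the full state than pasting it would be to hash it. The following is only a minimal sketch of that idea, not what I actually ran: `rng_fingerprint` is a hypothetical helper name and `12345` is a placeholder seed. Unlike the `randint` call, it does not consume a number from the generator.

```python
import hashlib
import random


def rng_fingerprint():
    """Return a short hex digest of the global `random` generator's state."""
    # getstate() returns a tuple of plain Python values, so its repr is
    # deterministic and can be hashed directly.
    state = random.getstate()
    return hashlib.sha256(repr(state).encode("utf-8")).hexdigest()[:16]


random.seed(12345)               # placeholder seed
fresh = rng_fingerprint()        # state right after seeding

random.random()                  # consume one number
advanced = rng_fingerprint()     # state after one draw

random.seed(12345)
reseeded = rng_fingerprint()     # state after seeding again

print(fresh, advanced, reseeded)
```

Logging this fingerprint at several points in a run (for example before and after each dedupe step) should show where two runs first stop agreeing.
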
The results I got were like this:
run | random number |
---|---|
1 | 8809239 |
2 | 8809239 |
3 | 8809239 |
4 | 8410961 |
5 | 8809239 |
6 | 8410961 |
7 | 8410961 |
Note that I also asked NumPy for a random number at the end, and it was always the same. So the problem seems to be with Python's built-in `random` module (or with how it interacts with cloud-based computing?).
There is no multiprocessing happening inside my code.
I guess my question is: does anyone have experience with this kind of thing, know what is causing it, or know how I can ensure reproducibility?
P.S. This is on Python 3.10.8 with dedupe 2.0.6.