I have a Python script that generates predictions using sklearn Random Forest with a fixed random_state = 0. It always produces deterministic results on one computer (system), but when I switch to another computer, the results are different.

Is there a way to make it deterministic across different systems? How can I get results on a second machine that are identical to those on the first?

The script is long and complicated, so I won't share the code, but I think the problem is in the Random Forest random_state, because when I tried KNN instead of RF, the results were identical.

EnesZ

1 Answer

sklearn.neighbors.KNeighborsClassifier uses all observations from your training data, while sklearn.ensemble.RandomForestClassifier, as the name suggests, samples the data randomly, so without a fixed random_state you can expect different results from Random Forest on each run. Making it reproducible across different systems is trickier, but you can try the following approach (though I have not tested it yet).

1). Fit a Random Forest model on your data with some random_state, say random_state = 0.

2). Import pickle and choose a file name, e.g. rf.pkl; the file will be saved in your current working directory.

3). Dump the fitted Random Forest model object into that pickle file.

import pickle
pkl = 'rf.pkl'  # file name for the pickled model
with open(pkl, 'wb') as file:
    pickle.dump(rf, file)  # rf is the fitted model from step 1

4). Share the pickle file with the other user/system.

5). Store the pickle file at some location on that system and set that location as the working directory.

6). Open Python on that system and run your Python code to read the data.

7). Instead of creating a new model, load the pickled model using the following lines of code:

import pickle
pkl = 'rf.pkl'  # same file name as on the first system
with open(pkl, 'rb') as file:
    pkl_model = pickle.load(file)

8). Test whether the pickled model works and produces the same results as it did on the original system.
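The steps above can be sketched end to end as follows. This is only a minimal sketch under stated assumptions: the data is a synthetic stand-in from make_classification, the n_estimators value and the temporary directory are arbitrary, and both the dump and the load run on one machine, so it only checks that the pickle round trip preserves the predictions, not that two different systems agree.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training data (an assumption of this sketch).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Steps 1-3: fit with a fixed seed, then dump the fitted model to disk.
rf = RandomForestClassifier(n_estimators=20, random_state=0)
rf.fit(X, y)
original_proba = rf.predict_proba(X)

pkl = os.path.join(tempfile.mkdtemp(), "rf.pkl")  # hypothetical location
with open(pkl, "wb") as file:
    pickle.dump(rf, file)

# Steps 4-7: on the other system you would copy rf.pkl over and load it
# instead of fitting a new model.
with open(pkl, "rb") as file:
    pkl_model = pickle.load(file)

# Step 8: compare the loaded model's output with the original output.
same = np.array_equal(original_proba, pkl_model.predict_proba(X))
print("identical predictions:", same)
```

To compare across machines, you would also have to save original_proba (for example with np.save) and ship it together with rf.pkl. The scikit-learn documentation advises loading a pickled model with the same scikit-learn version that created it, so the two systems should have matching library versions.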

I haven't tested this approach, but I think it is worth a try. Let me know if it works. Cheers!!

ManojK
  • Hi manojk. Thanks for your answer! RF uses data randomly, but that randomness is always the same with a fixed `random_state`, right? Your solution is very interesting, but in my case it is not practical. Behind RF in my script is a genetic algorithm that optimizes RF and some additional parameters. There are billions of different combinations, and it doesn't make sense to dump each of them. Also, this GA with RF is trained using map-reduce with pyspark on a cloud of 50-100 machines. – EnesZ Mar 12 '20 at 18:07
  • Yes, randomness can be fixed with random_state. This is the only solution I could think of. It might not be practical to store multiple objects using pickle; I suggested it in case you are fitting a single Random Forest model. – ManojK Mar 12 '20 at 18:12