
How is it possible that running the same Python program twice, with the exact same seeds and static input data, produces different results? Calling the function below repeatedly within one Jupyter notebook session yields identical results; however, after I restart the kernel, the results are different. The same applies when I run the code twice from the command line as a Python script. Is there anything else people do to make sure their code is reproducible? All the resources I found only talk about setting seeds. The randomness is introduced by ShapRFECV (from the probatus library).

This code runs on a CPU only.

MWE (in this code I generate a synthetic dataset and eliminate features using ShapRFECV, in case that's relevant):

import os, random
import numpy as np
import pandas as pd
from probatus.feature_elimination import ShapRFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

global_seed = 1234
os.environ['PYTHONHASHSEED'] = str(global_seed)
np.random.seed(global_seed)
random.seed(global_seed)

feature_names = ['f1', 'f2', 'f3_static', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10',
                 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20']

# Code from the tutorial in the probatus documentation
X, y = make_classification(n_samples=100, class_sep=0.05, n_informative=6, n_features=20,
                           random_state=0, n_redundant=10, n_clusters_per_class=1)
X = pd.DataFrame(X, columns=feature_names)

def shap_feature_selection(X, y, seed: int) -> list[str]:
    random_forest = RandomForestClassifier(random_state=seed, n_estimators=70,
                                           max_features='log2', criterion='entropy',
                                           class_weight='balanced')
    # Set to run on one thread only
    shap_elimination = ShapRFECV(clf=random_forest, step=0.2, cv=5,
                                 scoring='f1_macro', n_jobs=1, random_state=seed)

    report = shap_elimination.fit_compute(X, y, check_additivity=True, seed=seed)
    # Return the feature set with the best mean validation metric (f1_macro)
    return report.iloc[[report['val_metric_mean'].idxmax() - 1]]['features_set'].to_list()[0]

Results:

# Results from the first run
shap_feature_selection(X, y, 0)

>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']

# Running again in same session
shap_feature_selection(X, y, 0)

>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']

# Restarting the kernel and running the exact same command
shap_feature_selection(X, y, 0)

>>> ['f8', 'f1', 'f17', 'f6', 'f18', 'f20', 'f12', 'f15', 'f7', 'f13', 'f11']

Details:

  • Ubuntu 22.04
  • Python 3.9.12
  • Numpy 1.22.0
  • Sklearn 1.1.1
Dreana
  • It suggests that probatus or sklearn is using a different rand. Can you comment out calls one by one and see if the problem goes away? – tdelaney May 19 '23 at 16:41
  • I know that the randomness happens in the `fit_compute` step, which is from the probatus library. But if it uses a different rand, wouldn't it also produce different results in the same (e.g. Jupyter) session? – Dreana May 19 '23 at 17:00
  • Good question! I don't know enough about jupyter to say. Perhaps a C library is loaded and seeded on first use. – tdelaney May 19 '23 at 17:18
  • Btw it's not a Jupyter-specific thing - this also happens when running the code twice from the terminal or from my IDE – Dreana May 19 '23 at 17:50
  • I tried to find out what `sklearn` uses as a default generator, and instead found out that [it is now deprecated](https://pypi.org/project/sklearn/). – pjs May 19 '23 at 20:11
  • @JoelCrypto it does not; the question is why restarting the kernel (thus re-initializing all random number generators to the same states) still produces different results - plus you are quoting from the explanation of setting the random state to `None`, which is certainly and clearly not the case here. It would seem you don't have a clear understanding of the situation - I would kindly suggest you delete the comment now (to reduce clutter). – desertnaut May 19 '23 at 22:56
  • Since you've already narrowed it down to `probatus`, that should be in the question text. You might try setting `cv` to a specific splitter with random state set, instead of an integer (despite `probatus` [saying that should work](https://ing-bank.github.io/probatus/howto/reproducibility.html#static-data-splits); I don't think it does in sklearn...) (a sketch of this is included after the comments). – Ben Reiniger May 20 '23 at 14:35
  • I would suggest setting the seed *before* even importing any of the modules that rely on randomness - they might have created a default RNG instance at import time. – jasonharper May 21 '23 at 20:32
  • @BenReiniger This actually helped! The feature selector still doesn't return the features in the same order somehow, but I can work with that. MANY thanks. – Dreana May 21 '23 at 21:15
  • I just got a chance to try it, and I still get different results with my suggestion for `cv`. – Ben Reiniger May 22 '23 at 15:23
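
Following Ben Reiniger's comment above, here is a minimal sketch of passing an explicit, seeded splitter to `cv` instead of an integer. It assumes `ShapRFECV` accepts a scikit-learn splitter object (as the probatus reproducibility docs linked in that comment suggest) and builds on the imports from the MWE above; all other parameters are copied from the MWE:

from sklearn.model_selection import StratifiedKFold

def shap_feature_selection_fixed_cv(X, y, seed: int) -> list[str]:
    random_forest = RandomForestClassifier(random_state=seed, n_estimators=70,
                                           max_features='log2', criterion='entropy',
                                           class_weight='balanced')
    # Pin the CV folds themselves with a seeded splitter, rather than
    # letting ShapRFECV build its own splitter from cv=5
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    shap_elimination = ShapRFECV(clf=random_forest, step=0.2, cv=splitter,
                                 scoring='f1_macro', n_jobs=1, random_state=seed)
    report = shap_elimination.fit_compute(X, y, check_additivity=True, seed=seed)
    return report.iloc[[report['val_metric_mean'].idxmax() - 1]]['features_set'].to_list()[0]

Per the follow-up comments, this stabilized the selected feature set across restarts for the asker (only the ordering still varied, which `sorted(...)` on the returned list works around), while Ben Reiniger reported still seeing different results with it.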

1 Answer


This has now been fixed in probatus (the issue was a bug, apparently connected to the pandas implementation they were using, see here). For me, everything works as expected when using the latest probatus code from the repository (not the released package).
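
Since the discrepancy only shows up across interpreter restarts, a useful sanity check after upgrading is to run the selection in two fresh processes and compare the results. A minimal sketch, where `select_features.py` is a hypothetical script that calls `shap_feature_selection(X, y, 0)` and prints the sorted result:

import subprocess, sys

def run_selection_once() -> str:
    # Run the (hypothetical) selection script in a brand-new interpreter
    result = subprocess.run([sys.executable, 'select_features.py'],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

first, second = run_selection_once(), run_selection_once()
assert first == second, f'Not reproducible across processes:\n{first}\n{second}'
print('Reproducible:', first)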

desertnaut
Dreana