How to fix randomization in sklearn

Question

I am trying to fix the randomization in my code, but every time I run it I get a different best score and different best parameters. The results are not too far apart, but how can I get the same best score and parameters on every run?

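from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 27)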
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


clf = DecisionTreeClassifier(random_state=None)

parameter_grid = {'criterion': ['gini', 'entropy'],
                  'splitter': ['best', 'random'],
                  'max_depth': [1, 2, 3, 4, 5,6,8,10,20,30,50],
                  'max_features': [10,20,30,40,50]
                 }

skf = StratifiedKFold(n_splits=10, random_state=None)
skf.get_n_splits(X_train, y_train)

grid_search = GridSearchCV(clf, param_grid=parameter_grid, cv=skf, scoring='precision')

grid_search.fit(X_train, y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test,y_pred),"\n")
print(classification_report(y_test,y_pred),"\n")

desertnaut · Accepted Answer · 2021-03-09T12:05:51.890

In order to get reproducible results, every source of randomness in your code must be explicitly seeded (and even then, you must be careful that the implicit assumption of all else being equal actually holds; see "Why does the importance parameter influence performance of Random Forest in R?" for a case where it does not).

There are three parts in your code that inherently include a random element:

train_test_split
DecisionTreeClassifier
StratifiedKFold

You correctly seed the first one (using random_state=27), but you fail to do so for the other two, leaving random_state=None in both of them.

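What you should do is simply replace the two cases of random_state=None in your code with an explicit seed, as you have done for train_test_split. It doesn't have to be any specific number, or even the same one in all cases; it just needs to be explicitly set. Note that StratifiedKFold uses its random_state only when shuffle=True, and recent scikit-learn versions raise an error if a seed is passed while shuffle=False.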
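As a rough sketch of what the fully seeded version might look like: the snippet below uses a synthetic dataset from make_classification as a stand-in for your X and y (an assumption, since the original data is not shown), a reduced parameter grid to keep it quick, and a hypothetical helper run_search that wraps the whole pipeline so it can be run twice and compared. StratifiedKFold is given shuffle=True so that its seed is actually used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 27  # any fixed integer works; the value itself is arbitrary


def run_search(seed=SEED):
    """Hypothetical helper: runs the seeded pipeline once and returns the best result."""
    # Synthetic stand-in for the asker's X and y (assumption, not the original data).
    X, y = make_classification(
        n_samples=400, n_features=60, n_informative=10, random_state=0
    )

    # Source of randomness 1: the train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )

    # Source of randomness 2: the tree itself (random splitter, feature subsampling).
    clf = DecisionTreeClassifier(random_state=seed)

    # Reduced grid compared with the question, only to keep the sketch fast.
    parameter_grid = {
        "criterion": ["gini", "entropy"],
        "splitter": ["best", "random"],
        "max_depth": [2, 5, 10],
        "max_features": [10, 30, 50],
    }

    # Source of randomness 3: the CV folds. The seed only matters with shuffle=True.
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

    grid_search = GridSearchCV(
        clf, param_grid=parameter_grid, cv=skf, scoring="precision"
    )
    grid_search.fit(X_train, y_train)
    return grid_search.best_score_, grid_search.best_params_


first = run_search()
second = run_search()
print("Best score: {}".format(first[0]))
print("Best parameters: {}".format(first[1]))
print("Identical across runs: {}".format(first == second))
```

With every seed fixed, the two calls should report the same best score and parameters; changing SEED may select a different winner, which is expected.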

The random_state=False should achieve the same result according to the documentation, but apparently it doesn't. However, I tried your suggestion and it worked. Thank you very much. - Zenvega, Mar 08 '21 at 20:05
