
I am using a pipeline to perform feature selection and hyperparameter optimization using RandomizedSearchCV. Here is a summary of the code:

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

rng = 44

X_train, X_test, y_train, y_test = \
    train_test_split(data[features], data['target'], random_state=rng)


clf = RandomForestClassifier(random_state=rng)
kbest = SelectKBest()
pipe = make_pipeline(kbest,clf)

upLim = X_train.shape[1]
param_dist = {'selectkbest__k': sp_randint(upLim // 2, upLim + 1),
              'randomforestclassifier__n_estimators': sp_randint(5, 150),
              'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
              'randomforestclassifier__criterion': ["gini", "entropy"],
              'randomforestclassifier__max_features': ['auto', 'sqrt', 'log2']}
clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3, random_state=rng)
clf_opt.fit(X_train,y_train)
y_pred = clf_opt.predict(X_test)

I am using a constant random_state for train_test_split, RandomForestClassifier, and RandomizedSearchCV. However, the result of the above code is slightly different every time I run it. More specifically, I have several unit tests in my code, and these slightly different results lead to test failures. Shouldn't I obtain the same results because I am using the same random_state everywhere? Am I missing something in my code that introduces randomness somewhere?
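To narrow it down, this is the kind of check I run; it just fits the identical search twice on the same split (the loop and the prints are only for illustration):

# Repeatability check: fit the identical search twice on the same split.
# If the parameter sampling is the source of randomness, best_params_ will
# differ between the two fits despite the fixed random_state.
for run in range(2):
    search = RandomizedSearchCV(pipe, param_distributions=param_dist,
                                scoring='roc_auc', n_jobs=1, cv=3,
                                random_state=rng)
    search.fit(X_train, y_train)
    print(run, search.best_params_, search.best_score_)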


1 Answer


I ended up answering my own question; I will leave the solution here for others with a similar question:

To make sure that I avoid any randomness, I set a global random seed instead of passing random_state to the individual components. The code is as follows:

import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

# Seed NumPy's global random state once; every component below draws from it.
np.random.seed(22)

X_train, X_test, y_train, y_test = \
    train_test_split(data[features], data['target'])


clf = RandomForestClassifier()
kbest = SelectKBest()
pipe = make_pipeline(kbest,clf)

upLim = X_train.shape[1]
param_dist = {'selectkbest__k': sp_randint(upLim // 2, upLim + 1),
              'randomforestclassifier__n_estimators': sp_randint(5, 150),
              'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
              'randomforestclassifier__criterion': ["gini", "entropy"],
              'randomforestclassifier__max_features': ['auto', 'sqrt', 'log2']}
clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3)
clf_opt.fit(X_train,y_train)
y_pred = clf_opt.predict(X_test)

I hope it can help others!
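Update: as far as I can tell, the leftover randomness in my original code came from the scipy distributions. The old sklearn.grid_search sampler calls .rvs() on them without a seed, so the candidate parameters are drawn from NumPy's global state no matter which random_state I pass to RandomizedSearchCV. If the newer sklearn.model_selection module (scikit-learn 0.18+) is available, my understanding is that it passes random_state through to the distributions, so per-estimator seeds should be enough. A minimal sketch of that variant (data and features are placeholders, as in the question):

# Sketch assuming scikit-learn >= 0.18: sklearn.model_selection seeds the
# scipy distributions from random_state itself.
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

rng = 44  # one integer seed reused for every component

# data and features are defined elsewhere, as in the question
X_train, X_test, y_train, y_test = \
    train_test_split(data[features], data['target'], random_state=rng)

pipe = make_pipeline(SelectKBest(), RandomForestClassifier(random_state=rng))

upLim = X_train.shape[1]
param_dist = {'selectkbest__k': sp_randint(upLim // 2, upLim + 1),
              'randomforestclassifier__n_estimators': sp_randint(5, 150)}

# In model_selection, random_state also seeds the draws from the scipy
# distributions, not only the choice among list-valued parameters.
clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3, random_state=rng)
clf_opt.fit(X_train, y_train)

This keeps every component on its own explicitly seeded stream instead of relying on a single global np.random.seed call.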

  • While I'm not sure why the original code is not working as expected (and I'm too lazy to dig into it), I would not call this solution the perfect one. Here you are assuming that the order of operations between these three components is always the same, which should be OK with this code but can introduce trouble in more complex tasks. It's basically a switch from multiple random streams to one random stream. – sascha Jan 07 '17 at 01:15
  • @sascha: Thanks for your comment! I am still curious to find out the main cause. Do you think that the use of `scipy.stats.randint` caused the problem? – MhFarahani Jan 07 '17 at 01:25
  • Thank you for this example of how to combine RandomizedSearchCV and make_pipeline with the addition of 'randomforestclassifier__' (in this case) to the params dict. @MhFarahani for deterministic runs of a script you may need to use np.random.seed AND random.seed in case a third-party function mixes them. – DMTishler Mar 21 '18 at 16:43