
So I was messing around with different classifiers in sklearn and found that, regardless of the value of its random_state parameter, GradientBoostingClassifier always returns the same values. For example, when I run the following code:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, :2]  # use only the first two features
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

scores = []
for i in range(10):
    clf = GradientBoostingClassifier(random_state=i).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores = np.append(scores, score)
print(scores)

the output is:

[ 0.66666667  0.66666667  0.66666667  0.66666667  0.66666667  0.66666667
0.66666667  0.66666667  0.66666667  0.66666667]
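
One quick check (which the comments below also confirm) is to compare the raw predicted probabilities across the different seeds, reusing X_train and X_test from the split above; a minimal sketch:

probas = []
for i in range(10):
    clf = GradientBoostingClassifier(random_state=i).fit(X_train, y_train)
    probas.append(clf.predict_proba(X_test))

# With the default subsample=1.0, every fit yields identical probabilities,
# so the identical scores above are not just a rounding coincidence.
print(all(np.allclose(p, probas[0]) for p in probas))  # prints True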

However, when I run the same thing with another classifier, such as RandomForest:

from sklearn.ensemble import RandomForestClassifier

scores = []
for i in range(10):
    clf = RandomForestClassifier(random_state=i).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores = np.append(scores, score)
print(scores)

The output is what you would expect, i.e. the scores show slight variability:

[ 0.6         0.56666667  0.63333333  0.76666667  0.6         0.63333333
0.66666667  0.56666667  0.66666667  0.53333333]

What could be causing GradientBoostingClassifier to ignore the random state? I printed the classifier to check its parameters, but everything looks normal:

print(clf)
GradientBoostingClassifier(criterion='friedman_mse', init=None,
          learning_rate=0.1, loss='deviance', max_depth=3,
          max_features=None, max_leaf_nodes=None,
          min_impurity_split=1e-07, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          n_estimators=100, presort='auto', random_state=9,
          subsample=1.0, verbose=0, warm_start=False)

I tried messing around with warm_start and presort but it didn't change anything. Any ideas? I've been trying to figure this out for almost an hour so I figured I'd ask here. Thank you for your time!

  • Check the output of `predict_proba(X_test)`. Also check the attributes `oob_improvement_` and `train_score_`. Are they also the same for all `random_state` values? As this is an ensemble method, it may happen that the values are changing in the internal estimators, but not by enough to change the predicted class of a sample, and hence the score stays the same. – Vivek Kumar Jun 01 '17 at 07:09
  • If that doesn't work, can you change `subsample=1.0` to some other float value between 0 and 1 and check whether the same issue still exists? Also post your data samples, so that we can check the root cause. – Vivek Kumar Jun 01 '17 at 07:10
  • Hi Vivek, thanks for the reply. The output of `predict_proba(X_test)` and `train_score_` is the same for each random state. I tried `clf.oob_improvement_`, but it raised "'GradientBoostingClassifier' has no attribute 'oob_improvement_'". Changing the subsample to < 1 does introduce variability, but from what I see it does so in a way that isn't reproducible, i.e. if I run the GradientBoostingClassifier with random_state=0 and subsample=0.8 multiple times it gives me different answers each time. – Skip Jun 01 '17 at 16:55
  • The dataset I used is the Iris dataset, loaded at the top of the code block. I see your point about the ensemble method, and it's possible that's the case, but I want to say there's something else going on here. Maybe I'm just crazy haha. – Skip Jun 01 '17 at 16:56
  • As I said above, changing the `subsample` did change the scores quite a bit. Also, I tried higher values of `random_state` and that seems to change the scores too. This is the line I changed. First option: `clf = GradientBoostingClassifier(random_state=i, subsample=0.7).fit(X_train, y_train)`. Second option: `clf = GradientBoostingClassifier(random_state=i*1000).fit(X_train, y_train)` – Vivek Kumar Jun 01 '17 at 17:13
  • Did the second option work for you? It unfortunately did not for me. Yes, changing subsample did change the output scores, but it does so in a way that is unreliable. The reason I bring this up is that if someone wants to use their GradientBoosting algorithm consistently, they will get different outputs and not be able to reproduce their results. That's the whole reason a random state is essential to reproducibility. – Skip Jun 01 '17 at 17:27
  • Yes, I understand. I looked up the source code to check where the `random_state` is being used and found two locations: one is related to `subsample`, and the other is used to build the internal tree estimators. Maybe you can open an issue on the scikit-learn GitHub page; there you can get definitive answers for it. – Vivek Kumar Jun 01 '17 at 17:40
  • I'll do that then. Thank you for your help. – Skip Jun 01 '17 at 19:02
  • Oops, just realized that you can actually get reproducible results by setting subsample < 1. I forgot to give the train/test split a random_state as well, which explains why my results were not reproducible (a sketch of the working setup follows these comments). But that does help me out a bit, so thanks again. – Skip Jun 01 '17 at 21:29
  • Yes, I did not take train_test_split into consideration because the `for loop` comes after the point where the train and test data have been assigned, so for the `for loop` they are constant. And about the subsample: in my first comment I had suggested keeping it between 0 and 1, which is what I did in the `First option` of the two given above. – Vivek Kumar Jun 02 '17 at 00:50
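
For reference, here is a minimal sketch of the setup the comments converge on: with subsample below 1 the random_state actually has something to randomize, and pinning train_test_split as well makes the whole run reproducible (subsample=0.7 is simply the value from Vivek's comment above):

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
# Pin the split as well, otherwise results look non-reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

def run():
    scores = []
    for i in range(10):
        clf = GradientBoostingClassifier(subsample=0.7, random_state=i)
        scores.append(clf.fit(X_train, y_train).score(X_test, y_test))
    return scores

print(run())           # scores now vary with random_state
print(run() == run())  # True: repeated runs agree once both seeds are fixed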

0 Answers