I'm using the scikit-learn library for Python to implement Stochastic Gradient Boosting. My data set has 2700 instances and 1700 features (x), all binary. My output vector 'y' contains 0 or 1 (binary classification). My code is:

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=1000, learn_rate=1, subsample=0.5)
gb.fit(x, y)

print gb.score(x, y)

When I run it, I sometimes get an accuracy of 1.0 (100%) and sometimes an accuracy of around 0.46 (46%). Any idea why there is such a huge gap in its performance?

lostboy_19

2 Answers


First, a couple of remarks:

  • the name of the algorithm is Gradient Boosting (Regression Trees or Machines), and it is not directly related to Stochastic Gradient Descent

  • you should never evaluate the accuracy of a machine learning algorithm on your training data, otherwise you won't be able to detect over-fitting of the model. Use sklearn.cross_validation.train_test_split to split X and y into X_train, y_train for fitting and X_test, y_test for scoring instead (a short sketch follows this list).
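
For instance, a minimal sketch of that split, assuming x and y are the arrays from the question. In recent scikit-learn releases the function lives in sklearn.model_selection, and learn_rate has been renamed to learning_rate; adjust the import and the parameter name to match your installed version:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# hold out 25% of the rows for scoring; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

gb = GradientBoostingClassifier(n_estimators=1000, learning_rate=1.0, subsample=0.5)
gb.fit(x_train, y_train)
print(gb.score(x_test, y_test))  # accuracy on held-out data, not on the training set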

Now to answer your question, GBRT models are indeed non-deterministic models. To get deterministic / reproducible runs, you can pass random_state=0 to seed the pseudo-random number generator (or alternatively pass max_features=None, but this is not recommended).
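
For instance, a minimal sketch of seeding the model from the question (again assuming the current learning_rate spelling of the parameter):

gb = GradientBoostingClassifier(n_estimators=1000, learning_rate=1.0, subsample=0.5,
                                random_state=0)  # fixed seed: identical results on every run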

The fact that you observe such big variations in your training error is weird though. Maybe your output signal is very correlated with a very small number of informative features and most of the other features are just noise?

You could try to fit a RandomForestClassifier model to your data and use the computed feature_importances_ array to discard noisy features and help stabilize your GBRT models.
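
A rough sketch of that idea, assuming x and y are the arrays from the question, x is a NumPy array, and the importance threshold of 1e-3 is an arbitrary value picked for illustration:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(x, y)

# keep only the columns the forest considers informative
mask = rf.feature_importances_ > 1e-3  # arbitrary cut-off, tune it for your data
x_reduced = x[:, mask]
print("kept %d of %d features" % (mask.sum(), x.shape[1]))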

ogrisel

You should look at the training loss at each iteration; a sudden "jump" in the loss might indicate numerical difficulties:

import numpy as np
import pylab as plt

train_scores = gb.train_score_  # training loss at each boosting iteration
plt.plot(np.arange(train_scores.shape[0]), train_scores, 'b-')
plt.show()

The resulting curve should decrease gradually, much like the blue line in the left-hand figure here: http://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_regression.html .

If you see a gradual decrease but then a sudden jump, it might indicate a numerical stability problem; to avoid it, you should lower the learning rate (try 0.1, for example).

If you don't see sudden jumps and there is no substantial decrease, I strongly recommend turning off sub-sampling and tuning the learning rate first.
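
For example, a sketch of a more conservative configuration to try first (the parameter values are just starting points, not tuned for your data, and learning_rate is the current spelling of the learn_rate parameter):

import numpy as np
import pylab as plt
from sklearn.ensemble import GradientBoostingClassifier

# subsample=1.0 disables stochastic sub-sampling; a small learning_rate takes more cautious steps
gb = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.1, subsample=1.0, random_state=0)
gb.fit(x, y)

# re-check the training loss curve for sudden jumps
plt.plot(np.arange(gb.train_score_.shape[0]), gb.train_score_, 'b-')
plt.show()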

Peter Prettenhofer