I have been investigating a "hand-rolled" version of a gradient boosted regression tree. I find that its test-set R^2 agrees exactly with sklearn's GradientBoostingRegressor until I increase the number of boosting iterations beyond a certain value. I am not sure whether this is a bug in my code or a feature of the algorithm manifesting itself, so I am looking for guidance on what may be happening. My full code listing, which uses the Boston housing data, is shown below, followed by the output as I change the loop parameter.
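For reference, the loop is intended to implement the standard boosting update with a constant learning rate alpha and a zero initial model, i.e.

F_0(x) = 0
r_m = y - F_{m-1}(x)                    (residuals of the current ensemble)
F_m(x) = F_{m-1}(x) + alpha * h_m(x)    (h_m is a depth-2 tree fit to r_m)

so after m iterations the test prediction should be the sum of alpha * h_i(x) over the first m trees.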
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston  # deprecated; removed in scikit-learn 1.2

X, y = load_boston(return_X_y=True)
# splitting X and y in a single call keeps the rows paired
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

alpha = 0.5  # learning rate
loop = 44    # number of boosting iterations

yhi_1 = 0  # current training-set prediction, F_{m-1}
ypT = 0    # accumulated test-set prediction
for i in range(loop):
    dt = DecisionTreeRegressor(max_depth=2, random_state=42)
    ri = y_train - yhi_1           # residuals of the current ensemble
    dt.fit(X_train, ri)            # fit the next tree to the residuals
    hi = dt.predict(X_train)
    yhi = yhi_1 + alpha * hi       # boosting update on the training set
    ypi = dt.predict(X_test) * alpha
    ypT = ypT + ypi                # accumulate the test-set prediction
    yhi_1 = yhi

r2Loop = metrics.r2_score(y_test, ypT)
print("dtL: R^2 = ", r2Loop)

from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=loop,
                                 learning_rate=alpha, random_state=42,
                                 init="zero")
gbrt.fit(X_train, y_train)
y_pred = gbrt.predict(X_test)
r2GBRT = metrics.r2_score(y_test, y_pred)
print("GBT: R^2 = ", r2GBRT)
print("R2loop - GBT: ", r2Loop - r2GBRT)
When the parameter is loop = 44, the output is
dtL: R^2 = 0.8702681499951852
GBT: R^2 = 0.8702681499951852
R2loop - GBT: 0.0
and the two agree exactly. If I increase the loop parameter to loop = 45, I get
dtL: R^2 = 0.8726215419913225
GBT: R^2 = 0.8720222156381275
R2loop - GBT: 0.0005993263531949289
So between loop = 44 and loop = 45 the two results suddenly go from agreeing to 15-16 decimal places to differing in the fourth decimal place. Any thoughts on what might be happening?
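In case it helps with diagnosis, here is a minimal sketch I put together (assuming X_train, y_train, X_test, alpha, loop, and the fitted gbrt from the listing above are still in scope) that compares the hand-rolled cumulative test predictions with gbrt.staged_predict at every stage, to locate the first iteration at which the two diverge:

import numpy as np

# Re-run the hand-rolled loop, this time keeping the cumulative
# test-set prediction after every boosting stage.
yhi_1 = 0
ypT = 0
manual_stages = []
for i in range(loop):
    dt = DecisionTreeRegressor(max_depth=2, random_state=42)
    dt.fit(X_train, y_train - yhi_1)
    yhi_1 = yhi_1 + alpha * dt.predict(X_train)
    ypT = ypT + alpha * dt.predict(X_test)
    manual_stages.append(ypT)  # ypT is rebound each iteration, so no copy needed

# staged_predict yields the ensemble's test prediction after each stage;
# with init="zero" these should match manual_stages element for element.
for i, staged in enumerate(gbrt.staged_predict(X_test)):
    diff = np.max(np.abs(manual_stages[i] - staged))
    if diff > 0:
        print("first divergence at stage", i, "max abs diff =", diff)
        break
else:
    print("all", loop, "stages agree exactly")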