I have been investigating a "hand-rolled" version of a gradient boosted regression tree. I find that its test-set R^2 agrees exactly with sklearn's GradientBoostingRegressor until I increase the number of boosting iterations beyond a certain value. I am not sure whether this is a bug in my code or a feature of the algorithm manifesting itself, so I am looking for guidance on what may be happening. My full code listing, which uses the Boston housing data, is shown below, followed by the output as I change the loop parameter.
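For reference, the loop is intended to implement the standard boosting update with a constant learning rate alpha and a zero initial model, i.e.

F_0(x) = 0
r_m = y - F_{m-1}(x)                    (residuals of the current ensemble)
F_m(x) = F_{m-1}(x) + alpha * h_m(x)    (h_m is a depth-2 tree fit to r_m)

so after m iterations the test prediction should be the sum of alpha * h_i(x) over the first m trees.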
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston  # deprecated; removed in scikit-learn 1.2

X, y = load_boston(return_X_y=True)
# splitting X and y in a single call keeps the rows paired
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

alpha = 0.5  # learning rate
loop = 44    # number of boosting iterations

yhi_1 = 0  # current training-set prediction, F_{m-1}
ypT = 0    # accumulated test-set prediction
for i in range(loop):
    dt = DecisionTreeRegressor(max_depth=2, random_state=42)
    ri = y_train - yhi_1           # residuals of the current ensemble
    dt.fit(X_train, ri)            # fit the next tree to the residuals
    hi = dt.predict(X_train)
    yhi = yhi_1 + alpha * hi       # boosting update on the training set
    ypi = dt.predict(X_test) * alpha
    ypT = ypT + ypi                # accumulate the test-set prediction
    yhi_1 = yhi

r2Loop = metrics.r2_score(y_test, ypT)
print("dtL: R^2 = ", r2Loop)

from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=loop,
                                 learning_rate=alpha, random_state=42,
                                 init="zero")
gbrt.fit(X_train, y_train)
y_pred = gbrt.predict(X_test)
r2GBRT = metrics.r2_score(y_test, y_pred)
print("GBT: R^2 = ", r2GBRT)
print("R2loop - GBT: ", r2Loop - r2GBRT)
When the parameter is loop = 44, the output is
dtL: R^2 = 0.8702681499951852
GBT: R^2 = 0.8702681499951852
R2loop - GBT: 0.0
and the two agree exactly. If I increase the loop parameter to loop = 45, I get
dtL: R^2 = 0.8726215419913225
GBT: R^2 = 0.8720222156381275
R2loop - GBT: 0.0005993263531949289
So between loop = 44 and loop = 45 the two results suddenly go from agreeing to 15-16 decimal places to differing in the fourth decimal place. Any thoughts on what might be happening?
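In case it helps with diagnosis, here is a minimal sketch I put together (assuming X_train, y_train, X_test, alpha, loop, and the fitted gbrt from the listing above are still in scope) that compares the hand-rolled cumulative test predictions with gbrt.staged_predict at every stage, to locate the first iteration at which the two diverge:

import numpy as np

# Re-run the hand-rolled loop, this time keeping the cumulative
# test-set prediction after every boosting stage.
yhi_1 = 0
ypT = 0
manual_stages = []
for i in range(loop):
    dt = DecisionTreeRegressor(max_depth=2, random_state=42)
    dt.fit(X_train, y_train - yhi_1)
    yhi_1 = yhi_1 + alpha * dt.predict(X_train)
    ypT = ypT + alpha * dt.predict(X_test)
    manual_stages.append(ypT)  # ypT is rebound each iteration, so no copy needed

# staged_predict yields the ensemble's test prediction after each stage;
# with init="zero" these should match manual_stages element for element.
for i, staged in enumerate(gbrt.staged_predict(X_test)):
    diff = np.max(np.abs(manual_stages[i] - staged))
    if diff > 0:
        print("first divergence at stage", i, "max abs diff =", diff)
        break
else:
    print("all", loop, "stages agree exactly")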