R: how to improve gradient boosting model fit

Question

I tried fitting a gradient boosted model (weak learners are max.depth = 2 trees) to the iris data set using gbm in the gbm package. I set the number of iterations to M = 1000 with a learning rate of learning.rate = 0.001. I then compared the results to those of a regression tree (using rpart). However, it seems that the regression tree is outperforming the gradient boosted model. What's the reason behind this? And how can I improve the gradient boosted model's performance? I thought a learning rate of 0.001 should suffice with 1000 iterations/boosted trees.

library(rpart)
library(gbm)
data(iris)

train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]

learning.rate <- 0.001
M <- 1000
gbm.model <- gbm(Sepal.Length ~ ., data = train.dat, distribution = "gaussian", n.trees = M, 
    interaction.depth = 2, shrinkage = learning.rate, bag.fraction = 1, train.fraction = 1)
yhats.gbm <- predict(gbm.model, newdata = test.dat, n.trees = M)

tree.mod <- rpart(Sepal.Length ~ ., data = train.dat)
yhats.tree <- predict(tree.mod, newdata = test.dat)

> sqrt(mean((test.dat$Sepal.Length - yhats.gbm)^2))
[1] 1.209446
> sqrt(mean((test.dat$Sepal.Length - yhats.tree)^2))
[1] 0.6345438

If you change the learning rate to 0.01 the RMSE for gbm drops to 0.646 which is much closer to rpart. If you use caret train() function and allow it to split up training and and test data you can get the RMSE down to 0.33. — nmgeek, Jun 04 '23 at 15:50

score 0 · Accepted Answer · answered Apr 02 '20 at 23:10

In the iris dataset, there are 3 different species, first 50 rows are setosa, next 50 are versicolor and last 50 are virginica. So I think it's better if you mix the rows, and also make the Species column relevant.

library(ggplot2)
ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,col=Species)) + geom_point()

Secondly, you should do this over different a few replicates to see its uncertainty. For this we can use caret, and we can define the training samples before hand and also provide a fixed grid. What we are interested in, is the error during the training with cross-validation, which is similar to what you are doing:

set.seed(999)
idx = split(sample(nrow(iris)),1:nrow(iris) %% 3)
tr = trainControl(method="cv",index=idx)
this_grid = data.frame(interaction.depth=2,shrinkage=0.001,
n.minobsinnode=10,n.trees=1000)

gbm_fit = train(Sepal.Width ~ . ,data=iris,method="gbm", distribution="gaussian",tuneGrid=tg,trControl=tr)

Then we use the same samples to fit rpart:

#the default for rpart
this_grid = data.frame(cp=0.01)
rpart_fit = train(Sepal.Width ~ . ,data=iris,method="rpart",
trControl=tr,tuneGrid=this_grid)

Finally we compare them, and they are very similar:

gbm_fit$resample
       RMSE  Rsquared       MAE Resample
1 0.3459311 0.5000575 0.2585884        0
2 0.3421506 0.4536114 0.2631338        1
3 0.3428588 0.5600722 0.2693837        2

       RMSE  Rsquared       MAE Resample
1 0.3492542 0.3791232 0.2695451        0
2 0.3320841 0.4276960 0.2550386        1
3 0.3284239 0.4343378 0.2570833        2

So I suspect there's something weird in the example above. Again it always depend on your data, for some data like for example iris, rpart might be good enough because there are very strong predictors. Also for complex models like gbm, you most likely need to train using something like the above to find the optimal parameters.

This answer fits Sepal.Width but the OP asked about Sepal.Length. If you replace Width with Length in the code here, you will see that rpart still outperforms gbm. I suspect you could set a range of parameters in the tuning grid and get a better gbm model. — nmgeek, Jun 04 '23 at 15:39

R: how to improve gradient boosting model fit

1 Answers1