
I am confused about the derivation of importance scores for an xgboost model. My understanding is that xgboost (and, in fact, any gradient boosting model) examines all possible features in the data before deciding on an optimal split (I am aware that one can modify this behavior by introducing some randomness to avoid overfitting, such as by using the colsample_bytree option, but I'm ignoring this for now).

Thus, for two correlated features where one is more strongly associated with an outcome of interest, my expectation was that the more strongly associated one would be selected first. In other words, once this feature is selected, no additional useful information should be found in the other, correlated feature. This, however, does not always seem to be the case.

To put this concretely, I simulated the data below, where x1 and x2 are correlated (r = 0.8) and where Y (the outcome) depends only on x1. A conventional GLM with all the features included correctly identifies x1 as the culprit and correctly yields an OR of ~1 for x2. However, examining the importance scores (gain and SHAP values) from a (naively) trained xgboost model on the same data indicates that both x1 and x2 are important. Why is that? Presumably, x1 will be used as the primary split (i.e. the stump), since it has the strongest association with the outcome. Once this split happens (even if over multiple trees due to a low learning rate), x2 should have no additional information to contribute to the classification process. What am I getting wrong?

pacman::p_load(dplyr, xgboost,data.table,Matrix,MASS, broom, SHAPforxgboost)

# inverse logit: maps the linear predictor onto a probability
expit <- function(x){
  exp(x)/(1 + exp(x))
}

# two standard-normal features with correlation r, plus 10 binary covariates
r <- 0.8
d <- mvrnorm(n = 2000, mu = c(0, 0), Sigma = matrix(c(1, r, r, 1), nrow = 2), empirical = TRUE)
data <- data.table(d,
                   replicate(10, rbinom(n = 2000, size = 1, prob = runif(1, min = 0.01, max = 0.6))))

colnames(data)[1:2] <- c("x1", "x2")
cor(data$x1, data$x2)

# the outcome depends on x1 (but not x2) and on some of the binary covariates
data[, Y := rbinom(n = 2000, size = 1, prob = expit(-4 + 2*x1 + V2 + V4 + V6 + V8 + V3))]

# logistic regression benchmark: odds ratios for all features
model <- glm(Y ~ ., data = data, family = "binomial")
mod <- tidy(model)
mod$or <- round(exp(mod$estimate), 2)

# the same data prepared for xgboost
sparse_matrix <- sparse.model.matrix(Y ~ . - 1, data = data)
dtrain_xgb <- xgb.DMatrix(data = sparse_matrix, label = data$Y)

xgb <- xgboost(tree_method = "hist",
               booster = "gbtree",
               data = dtrain_xgb,
               nrounds = 2000,
               print_every_n = 10,
               objective = "binary:logistic",
               eval_metric = "logloss",
               maximize = FALSE)
# feature importance: mean |SHAP| per feature and gain
shap <- shap.values(xgb, dtrain_xgb)
mean_shap <- data.frame(shap$mean_shap_score)
gain <- xgb.importance(model = xgb)

head(mod, 14)    # regression (odds ratios)
head(mean_shap)  # SHAP values
head(gain)       # gain
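
For reference, which features the trees actually split on can be checked directly from the fitted booster. A minimal sketch (assuming the xgb model fitted above; xgb.model.dt.tree comes from the xgboost package already loaded above):

# tabulate which features the fitted trees split on
tree_dt <- xgb.model.dt.tree(model = xgb)

# number of splits per feature across all trees
tree_dt[Feature != "Leaf", .N, by = Feature][order(-N)]

# feature used for the root split (Node == 0) of each tree
tree_dt[Node == 0, .N, by = Feature][order(-N)]

If x2 never appeared in any split, its gain and SHAP importance would both be ~0, so any nonzero importance for x2 implies that individual trees do split on it.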
dean
  • In trees, splits are greedy, not optimal. In addition, "hist" is not exact; it is an approximation trick. SHAP, the way you apply it, is an approximation too. In the most general case, an ML model with correlated features is unstable and deemed "incorrect" unless special tricks are involved. – Sergey Bushmanov Apr 11 '23 at 05:49
  • cross-posted at https://stats.stackexchange.com/q/612542/232706 – Ben Reiniger Apr 11 '23 at 12:56
  • @Sergey - right, but the same behavior is observed even when using moderately correlated features (e.g. r=0.4) and even when changing the tree_method to "exact". This is also observed when relying on gain rather than SHAP values to derive importance. Some correlations are bound to happen in any large database, so this xgboost behavior is still not clear to me. – dean Apr 11 '23 at 18:31
  • A single split on X does not fully represent its effect. The next split might be on some other, correlated feature; similarly for the next tree. – Michael M Apr 29 '23 at 05:38
  • Right, but wouldn't you expect each tree to focus on the most correlated feature? I.e., even with a low learning rate, I would have expected the same feature to be selected across trees/splits until all the information contained in that feature is exhausted (at which point, no additional information should be gained from the other feature, which is less correlated with the outcome). – dean May 01 '23 at 14:27

0 Answers