Let's say I have a dataset with many variables (more than in the reproducible example below) and I want to build a simple and interpretable model, a GLM.
I can first fit an xgboost model and look at the variable importance (which depends on the frequency and the gain of each variable across the successive decision trees) to select the 10 most influential variables:
library(dplyr)
library(xgboost)
# data
data(mtcars)
dtrain <- xgb.DMatrix(
  data = mtcars %>% select(-am) %>% as.matrix(),
  label = mtcars$am
)
# xgboost parameters
xgb_params <- list(
  objective = "binary:logistic",
  eta = 0.1,
  max_depth = 2
)
# xgboost fit
xgb_mod <- xgb.train(
  data = dtrain,
  params = xgb_params,
  nrounds = 10,
  eval_metric = "auc",
  maximize = TRUE
)
# feature importance
xgb.importance(feature_names = dimnames(dtrain)[[2]], model = xgb_mod)
# Feature Gain Cover Frequency
# 1: wt 0.53965838 0.46589322 0.47619048
# 2: gear 0.41691383 0.37360220 0.28571429
# 3: qsec 0.03215627 0.11810252 0.19047619
# 4: hp 0.01127152 0.04240205 0.04761905
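To make the next step concrete, here is a minimal sketch of the main-effects GLM I would build from this importance table (imp, top_vars and glm_mod are just illustrative names; on this toy data the fit may warn about separation):

# build the simple GLM from the top features by Gain
imp <- xgb.importance(feature_names = dimnames(dtrain)[[2]], model = xgb_mod)
top_vars <- head(imp$Feature, 4)  # wt, gear, qsec, hp
glm_mod <- glm(reformulate(top_vars, response = "am"),
               data = mtcars, family = binomial())
summary(glm_mod)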
Question: is there a way to highlight the most significant 2d-interactions according to the xgboost model?

According to the feature importance, I can build a GLM with 4 variables (wt, gear, qsec, hp), but I would like to know whether some 2d-interaction (for instance wt:hp) would be worth adding to such a simple model.