
I have an old linear model that I want to improve using XGBoost. I have the predictions from the old model, which I want to use as a base margin. Also, due to the nature of what I'm modeling, I need to use weights. My old GLM is a Poisson regression with formula number_of_defaults/exposure ~ param_1 + param_2 and weights set to exposure (the same as the denominator in the response variable). When training the new XGBoost model on the data, I do this:

xgb_model = xgb.XGBRegressor(objective="count:poisson",  # Poisson objective, matching the GLM
                             n_estimators=25,
                             max_depth=100,
                             max_leaves=100,
                             learning_rate=0.01,
                             n_jobs=4,
                             eval_metric="poisson-nloglik")

model = xgb_model.fit(X=X_train, y=y_train, sample_weight=_WEIGHT, base_margin=_BASE_MARGIN)

where _WEIGHT and _BASE_MARGIN are the weights and the old model's predictions (both popped out of X_train). But how do I do cross-validation or out-of-sample analysis when I need to specify the weights and the base margin?

As far as I can see, I could use sklearn's GridSearchCV, but then I would need to specify the weights and base margin in XGBRegressor() (instead of in fit() as above). The closest thing to base_margin in XGBRegressor() is the base_score argument (though that is a single global intercept rather than a per-row margin), and there is no constructor argument for the weights at all.

Alternatively, I could forget about cross-validation and just use a training and a test dataset, passing the test set through the eval_set argument of fit(), but then I don't see a way of specifying which arrays are the weights and which are the base margin in the different sets.
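For that train/test route, here is a minimal sketch of one workaround (not from the question: the names weight and margin are hypothetical stand-ins for the exposure and old-model predictions, and it assumes an xgboost version whose sklearn predict() accepts base_margin). Fit on the training portion as above, apply the test portion's margin at prediction time, and weight the out-of-sample metric by hand:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# weight / margin are hypothetical stand-ins for exposure and the old GLM's predictions
X_tr, X_te, y_tr, y_te, w_tr, w_te, m_tr, m_te = train_test_split(
    X, y, weight, margin, test_size=0.25, random_state=0)

model = xgb.XGBRegressor(objective="count:poisson", n_estimators=25,
                         learning_rate=0.01, eval_metric="poisson-nloglik")
model.fit(X_tr, y_tr, sample_weight=w_tr, base_margin=m_tr)

# Apply the held-out margin at prediction time
pred_te = model.predict(X_te, base_margin=m_te)

# Exposure-weighted out-of-sample error, e.g. a weighted mean absolute error
wmae = np.average(np.abs(y_te - pred_te), weights=w_te)
print(wmae)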

Any guidance in the right direction is much appreciated!

  • Haven't used the XGBoost library much, but I can see that the DMatrix class accepts base_margin and weight parameters (https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.DMatrix) and that the xgboost.cv function receives a DMatrix (https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.cv). Perhaps there's a way you can combine both? – itscarlayall Jun 13 '22 at 14:12
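For reference, a minimal sketch of the DMatrix route that comment suggests, reusing the question's X_train, y_train, _WEIGHT and _BASE_MARGIN (the assumption here is that xgboost.cv's per-fold splitting slices the per-row weight and base_margin along with the data, which the DMatrix meta-info handling is designed to do):

import xgboost as xgb

# Attach the per-row weights and base margin directly to the DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train,
                     weight=_WEIGHT, base_margin=_BASE_MARGIN)

params = {"objective": "count:poisson",
          "eta": 0.01,
          "eval_metric": "poisson-nloglik"}

# xgboost.cv splits the DMatrix into folds, slicing the meta info row-wise
cv_results = xgb.cv(params, dtrain, num_boost_round=25, nfold=3)
print(cv_results.tail())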

1 Answer


You can use cross_val_predict with the fit_params argument, or GridSearchCV.fit with **fit_params; either way the weights and base margin are passed through to each fold's fit call.

Here is a working proof of concept:

import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import cross_val_predict, GridSearchCV
import numpy as np

# Sample dataset
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]

xgb_model = xgb.XGBRegressor(n_estimators=5)
fit_params = dict(sample_weight=np.abs(X[:, 0]), base_margin=np.abs(X[:, 1]))

# Simple fit
xgb_model.fit(X, y, **fit_params)

# cross_val_predict
y_pred = cross_val_predict(xgb_model, X, y, cv=3, fit_params=fit_params)
print(y_pred.shape, y.shape)

# grid search
grid = GridSearchCV(xgb_model, param_grid={"n_estimators": [5, 10, 15]})
grid.fit(X, y, **fit_params)

You can see what happens in the source code: here, here and here. The last link is where fit_params gets indexed to follow the cross-validation splits.
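Carried back to the question's Poisson setup, the same pattern would look roughly like this (a sketch continuing from the imports above; the grid values are placeholders):

fit_params = dict(sample_weight=_WEIGHT, base_margin=_BASE_MARGIN)

grid = GridSearchCV(
    xgb.XGBRegressor(objective="count:poisson",
                     eval_metric="poisson-nloglik",
                     learning_rate=0.01),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, 6]},
    cv=3,
)
# sklearn indexes each (n_samples,) fit param to match the CV splits
grid.fit(X_train, y_train, **fit_params)
print(grid.best_params_)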

phi