I'm using a large dataset spanning many years to tune the hyperparameters of an XGBoost model by cross-validation. The data can look different from year to year, so to reduce generalization error I would like to prevent the model from making any splits that are imbalanced with respect to year, i.e. from effectively splitting on year. For example, I could add a constraint that each side of every split must contain at least n samples from each year, or add a penalty on how far the fraction of each year's data sent to one side of the split deviates from 1/2. I don't have the timestamp as a feature, but other features would let the model reconstruct it in effect. I don't see anything in the documentation that covers this use case, but I was wondering whether some trick (e.g. with monotonicity constraints) might work.
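To make the constraint concrete, here is a minimal sketch of the check I have in mind for a single candidate split. It is plain NumPy and not something XGBoost exposes as far as I can tell; the function name `year_balance` and the `min_per_year` parameter are made up for illustration, and the year labels are used only to score the split, never as a feature.

```python
import numpy as np

def year_balance(feature, years, threshold, min_per_year=1):
    """Score how evenly the split `feature < threshold` divides each year.

    Hypothetical helper for illustration only; not part of XGBoost.
    Returns (allowed, penalty, left_fraction_by_year).
    """
    feature = np.asarray(feature)
    years = np.asarray(years)
    goes_left = feature < threshold

    left_fraction = {}
    allowed = True
    for year in np.unique(years):
        in_year = years == year
        n_left = int(np.sum(goes_left & in_year))
        n_right = int(np.sum(~goes_left & in_year))
        # Fraction of this year's rows that the split sends to the left child.
        left_fraction[year.item()] = n_left / (n_left + n_right)
        # Hard constraint: both children need at least `min_per_year` rows per year.
        if min(n_left, n_right) < min_per_year:
            allowed = False

    # Soft penalty: total deviation of each year's left fraction from 1/2.
    penalty = sum(abs(f - 0.5) for f in left_fraction.values())
    return allowed, penalty, left_fraction

# Toy data: two years, two rows each.
years = [2019, 2019, 2020, 2020]
balanced = year_balance([0.1, 0.9, 0.2, 0.8], years, threshold=0.5)
year_proxy = year_balance([0.1, 0.2, 0.8, 0.9], years, threshold=0.5)
print(balanced)
print(year_proxy)
```

The first toy feature splits each year in half, so it passes the hard constraint with zero penalty. The second feature acts as a proxy for year, sending all of 2019 left and all of 2020 right, which is exactly the kind of split I would like to forbid or penalize.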

moi

0 Answers