
I've created several GBM models to tune the parameters (trees, shrinkage, and depth) to my data, and the model performs well on the out-of-time sample. The data is credit card transactions (running into hundreds of millions of records), so I sampled 1% of the goods (non-events) and 100% of the bads (events).
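For concreteness, here is a rough sketch of the sampling scheme in Python/scikit-learn terms. The file path, the `bad` column name, the `GradientBoostingClassifier` settings, and the random seed are all illustrative placeholders, not my actual pipeline:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical input file; in practice the data comes from elsewhere.
txns = pd.read_parquet("transactions.parquet")  # binary target column 'bad' (1 = event)

goods = txns[txns["bad"] == 0].sample(frac=0.01, random_state=42)  # 1% of non-events
bads = txns[txns["bad"] == 1]                                      # 100% of events
train = pd.concat([goods, bads]).sample(frac=1.0, random_state=42) # shuffle rows

X, y = train.drop(columns=["bad"]), train["bad"]

# The three parameters I tune map onto these arguments:
gbm = GradientBoostingClassifier(
    n_estimators=500,    # trees
    learning_rate=0.05,  # shrinkage
    max_depth=5,         # depth
)
gbm.fit(X, y)
```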

However, when I increased the sample to 3% of the goods, there was a noticeable improvement in performance. My question is: how do I decide the optimal sampling rate without running several iterations and picking whichever fits best? Is there any theory around this?
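What I have effectively been doing to compare 1% vs. 3% (and what I would like to avoid scaling up) is a brute-force sweep along these lines. Again a sketch with hypothetical names: `oot_X` / `oot_y` stand for my out-of-time sample, and out-of-time AUC as the comparison metric is an assumption:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Compare out-of-time performance at several candidate sampling rates.
# txns, oot_X, oot_y as in the earlier sketch; names are placeholders.
for rate in [0.01, 0.02, 0.03, 0.05]:
    goods = txns[txns["bad"] == 0].sample(frac=rate, random_state=42)
    bads = txns[txns["bad"] == 1]
    train = pd.concat([goods, bads])
    X, y = train.drop(columns=["bad"]), train["bad"]

    gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, max_depth=5)
    gbm.fit(X, y)

    auc = roc_auc_score(oot_y, gbm.predict_proba(oot_X)[:, 1])
    print(f"sampling rate {rate:.0%}: OOT AUC = {auc:.4f}")
```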

I have about 3 million transactions in total for the 1% sample, of which roughly 380k are bads, with ~250 variables.

Karan
