
I've created several GBM models to tune the parameters (trees, shrinkage, and depth) to my data, and the model performs well on the out-of-time sample. The data is credit card transactions (running into hundreds of millions of records), so I sampled 1% of the goods (non-events) and 100% of the bads (events).
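For concreteness, here is a rough sketch of the sampling scheme in Python/scikit-learn terms. The file path, the `bad` column name, the `GradientBoostingClassifier` settings, and the random seed are all illustrative placeholders, not my actual pipeline:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical input file; in practice the data comes from elsewhere.
txns = pd.read_parquet("transactions.parquet")  # binary target column 'bad' (1 = event)

goods = txns[txns["bad"] == 0].sample(frac=0.01, random_state=42)  # 1% of non-events
bads = txns[txns["bad"] == 1]                                      # 100% of events
train = pd.concat([goods, bads]).sample(frac=1.0, random_state=42) # shuffle rows

X, y = train.drop(columns=["bad"]), train["bad"]

# The three parameters I tune map onto these arguments:
gbm = GradientBoostingClassifier(
    n_estimators=500,    # trees
    learning_rate=0.05,  # shrinkage
    max_depth=5,         # depth
)
gbm.fit(X, y)
```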

However, when I increased the sample to 3% of the goods, there was a noticeable improvement in performance. My question is: how do I decide the optimal sampling rate without running several iterations and picking whichever fits best? Is there any theory around this?
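What I have effectively been doing to compare 1% vs. 3% (and what I would like to avoid scaling up) is a brute-force sweep along these lines. Again a sketch with hypothetical names: `oot_X` / `oot_y` stand for my out-of-time sample, and out-of-time AUC as the comparison metric is an assumption:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Compare out-of-time performance at several candidate sampling rates.
# txns, oot_X, oot_y as in the earlier sketch; names are placeholders.
for rate in [0.01, 0.02, 0.03, 0.05]:
    goods = txns[txns["bad"] == 0].sample(frac=rate, random_state=42)
    bads = txns[txns["bad"] == 1]
    train = pd.concat([goods, bads])
    X, y = train.drop(columns=["bad"]), train["bad"]

    gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, max_depth=5)
    gbm.fit(X, y)

    auc = roc_auc_score(oot_y, gbm.predict_proba(oot_X)[:, 1])
    print(f"sampling rate {rate:.0%}: OOT AUC = {auc:.4f}")
```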

I have about 3 million transactions in total for the 1% sample, of which roughly 380k are bads, with ~250 variables.

Karan
