
I am using XGBoost's cv function to find the optimal number of rounds for my model. I would be very grateful if someone could confirm (or refute) that the optimal number of rounds is found like this:

    estop = 40
    # huge num_boost_round: rely on early stopping to decide when to stop
    res = xgb.cv(params, dvisibletrain, num_boost_round=1000000000, nfold=5,
                 early_stopping_rounds=estop, seed=SEED, stratified=True)

    # drop the early-stopping rounds, then scale up for the full training set
    best_nrounds = res.shape[0] - estop
    best_nrounds = int(best_nrounds / 0.8)

i.e. the total number of rounds completed is res.shape[0], so to get the optimal number of rounds we subtract the number of early-stopping rounds.

Then we scale up the number of rounds to account for the fraction of the data held out for validation in each fold (20% here, so each CV model trained on only 80% of the data). Is that correct?
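
For context, the rough plan afterwards is to retrain on all of dvisibletrain with that round count, along these lines:

    # sketch of the final step: train on the full visible data with the scaled-up round count
    final_model = xgb.train(params, dvisibletrain, num_boost_round=best_nrounds)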

Chris Parry

2 Answers


Yep, that sounds correct, provided that when you do best_nrounds = int(best_nrounds / 0.8) you take into account that your validation set was 20% of your whole training data (another way of saying that you performed 5-fold cross-validation).

The rule can then be generalized as:

    n_folds = 5
    best_nrounds = int((res.shape[0] - estop) / (1 - 1.0 / n_folds))

Or if you don't perform CV but a single validation:

    validation_slice = 0.2
    best_nrounds = int((res.shape[0] - estop) / (1 - validation_slice))

You can see an example of this rule being applied here on Kaggle (see the comments).
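
For the single-validation case, here is a minimal sketch of the same idea using xgb.train with early stopping (dtrain and dvalid are hypothetical DMatrix objects from an 80/20 split of your training data, and params/estop are as in the question; bst.best_iteration plays the role of res.shape[0] - estop):

    import xgboost as xgb

    estop = 40
    validation_slice = 0.2

    # train on the 80% slice, watching the 20% validation slice
    bst = xgb.train(params, dtrain, num_boost_round=1000000000,
                    evals=[(dvalid, 'valid')], early_stopping_rounds=estop)

    # best round on the validation slice, scaled up for the full training set
    best_nrounds = int(bst.best_iteration / (1 - validation_slice))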

Jivan
  • thanks for your answer; just to confirm, do you mean we use cv to tune params and find the best boosting iteration, convert that to a round count for the full training data according to the number of folds, and then train the model directly on the full train set with that round count? – LancelotHolmes May 06 '17 at 03:32
  • 1
    I believe the best_nrounds = res.shape[0]. How come n_fold and estop affects the number of the best iteration? I believe res only reports the values below the best iteration point. – notilas Sep 13 '17 at 17:35

You can get the best iteration number via 'res.best_iteration'.
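
If that attribute isn't exposed in your xgboost version (see the comments below), the CV history returned by xgb.cv is a DataFrame, so a sketch like this recovers the same number; the column name depends on your eval metric, and 'test-rmse-mean' is only an example:

    # pick the round with the best mean test metric across folds
    best_iteration = res['test-rmse-mean'].idxmin()  # idxmax() for metrics where higher is better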

Yaron
  • but that's the best_iteration of the cv; how can we get the best number of iteration rounds for the full training set? – LancelotHolmes May 06 '17 at 03:32
  • That's correct. That's the best iteration of the CV, and that is exactly what we are interested in. The best iteration on the training set is probably going to be the last iteration that you ran, but if the validation score stopped improving before that, you have actually started overfitting the data, which is something you don't want to do. – Yaron May 06 '17 at 03:50
  • thanks, but if I set the training num_round to a very large number, will I end up with an overfitted model? Or shall I split the train set when I train the model and evaluate on the split-off eval set with early stopping? – LancelotHolmes May 06 '17 at 05:56
  • Yes, you should split it. If you have a high 'num_round' and few training samples, you'll overfit; this is exactly the reason why you use the eval set during training (see the sketch after these comments). – Yaron May 07 '17 at 01:02
  • I cannot find such a parameter in xgb.cv in xgboost v0.6 – notilas Sep 13 '17 at 17:32
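
A minimal sketch of the split-and-early-stop setup described in the comments above (the train_test_split usage and the X, y, params, SEED names are assumptions about the data layout, not part of the original question):

    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    # hypothetical: X, y hold the full training features/labels
    X_tr, X_ev, y_tr, y_ev = train_test_split(X, y, test_size=0.2, random_state=SEED)

    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    deval = xgb.DMatrix(X_ev, label=y_ev)

    # a large num_boost_round is safe here: early stopping on the eval set
    # halts training once the eval metric stops improving for 40 rounds
    bst = xgb.train(params, dtrain, num_boost_round=100000,
                    evals=[(deval, 'eval')], early_stopping_rounds=40)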