I am trying to run xgboost on a problem with very noisy features, and I am interested in stopping the boosting rounds based on a custom eval_metric that I have defined.

Based on domain knowledge, I know that when the eval_metric (evaluated on the training data) goes above a certain value, xgboost is overfitting. I would like to take the fitted model at that specific number of rounds and not proceed further.

What would be the best way to achieve this?

It would be somewhat in line with the early stopping criteria but not exactly.

Alternatively, is it possible to get the model from an intermediate round?

Here is an example to better explain my question, using the toy example that comes with the xgboost help docs and the default eval_metric:

library(xgboost)
data(agaricus.train, package = "xgboost")
train <- agaricus.train
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2,
                     eta = 1, nthread = 2, nrounds = 5,
                     objective = "binary:logistic")

Here is the output:

[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063

Now let's say that from domain knowledge I know that once the train error goes below 0.015 (the third round in this case), any further rounds only lead to overfitting. How would I stop the training process after the third round and get hold of the trained model to use it for prediction on a different dataset?

I need to run the training process over many different datasets, and I have no sense of how many rounds it might take to get the error below a fixed number, so I can't set the nrounds argument to a predetermined value. The only intuition I have is that once the training error goes below a certain number, I need to stop further training rounds.

1 Answer

Without the code you have tried or the data you are using, I can only sketch an approach, but try something like this:

library(xgboost)
library(Metrics) # provides rmse() for computing the error

# Assume you have a training set db.train, a test set db.test,
# and the column indices of the features you are interested in

predz <- c(2, 4, 6, 8, 10, 12)
predictors <- names(db.train[, predz])

# you have some response you are interested in
outcomeName <- "myLabel"
    
# you may also want to test other parameters such as:
# eta, gamma, colsample_bytree, min_child_weight

# here we look at depths from 1 to 4 and rounds from 1 to 100, but set your own values

smallestError <- 100 # set to some sensible value depending on your eval metric

for (depth in 1:4) {
  for (rounds in 1:100) {
    # train a fresh model with this depth and number of rounds
    bst <- xgboost(data = as.matrix(db.train[, predictors]),
                   label = db.train[, outcomeName],
                   max.depth = depth,
                   nrounds = rounds,
                   eval_metric = "logloss",
                   objective = "binary:logistic",
                   verbose = 1)

    # predict probabilities on the test set (the default output for
    # binary:logistic), so they are on the same 0-1 scale as the labels
    predictions <- predict(bst, as.matrix(db.test[, predictors]))
    err <- rmse(as.numeric(db.test[, outcomeName]), predictions)

    # report every new best (depth, rounds) combination
    if (err < smallestError) {
      smallestError <- err
      print(paste(depth, rounds, err))
    }
  }
}

You could adapt this code to your own evaluation metric and change what it prints to suit your situation. Similarly, you could add a break to exit the loop as soon as the number of rounds satisfies the condition you are after.
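If retraining from scratch for every value of rounds is too slow, one possible variation is to grow a single model one round at a time and break as soon as the metric crosses the cutoff. The sketch below is only an illustration on the agaricus toy data. It assumes that the xgb_model argument of xgb.train continues training from an existing booster, and the names my_metric, threshold and max_rounds are made up for the example (my_metric stands in for your own custom metric). Argument names can differ between xgboost versions, so check the documentation for the one you have installed.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)

params <- list(objective = "binary:logistic", max_depth = 2, eta = 1, nthread = 2)

# Hypothetical custom metric: classification error at a 0.5 cutoff.
# Replace with your own function of (predictions, labels).
my_metric <- function(pred, label) mean((pred > 0.5) != label)

threshold  <- 0.015 # assumed domain-knowledge cutoff
max_rounds <- 100   # safety cap so the loop always terminates

bst <- NULL
for (round in seq_len(max_rounds)) {
  # add exactly one boosting round on top of the current model (if any)
  bst <- xgb.train(params = params, data = dtrain, nrounds = 1,
                   xgb_model = bst, verbose = 0)

  # evaluate the metric on the training data after this round
  train_err <- my_metric(predict(bst, dtrain), agaricus.train$label)
  cat(sprintf("[%d] train-error: %f\n", round, train_err))

  # stop as soon as the metric drops below the cutoff
  if (train_err < threshold) break
}
```

After the loop, bst holds the model as it stood at the round where the condition was first met (or after max_rounds if it never was), and it can be passed to predict() with a different dataset.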

Leonardo
cousin_pete
  • Thanks for your answer. If I understand correctly, your code repeatedly trains the model with an increasing number of rounds and stops when the error gets below the cutoff. Is there a way to run the model just once, calculating the eval_metric after every round and then saving the model when the eval_metric goes below a cutoff? – Swagato Acharjee Jan 25 '17 at 14:24
  • You may need to re-read the xgboost documentation. Xgboost uses an iterative approach to arrive at a useful model. Each round gets closer to some error level you are comfortable with: a balance between the time to create the model and the model complexity appropriate to your situation. I don't think there is any way to know a "correct" model in advance. Again, without any of your code or a sample dataset, it is hard for me to know what you are asking. – cousin_pete Jan 26 '17 at 07:00
  • I added an example to my question - hopefully it explains my question better – Swagato Acharjee Jan 26 '17 at 16:21
  • In the loop you can use stop(), break or return() depending on the way your code is set up. For example in the err – cousin_pete Jan 27 '17 at 06:03