
When using:

"keep_cross_validation_predictions": True
"keep_cross_validation_fold_assignment": True

in H2O's XGBoost Estimator, I am not able to map these cross-validated probabilities back to the original dataset. The documentation has an example of combining holdout predictions for R, but not for Python.

Any leads on how to do this in Python?

Abhijeet Arora

2 Answers


The cross-validated predictions are stored in two different places: once as a list of length k (for k folds) returned by model.cross_validation_predictions(), and again as a single H2OFrame with the CV predictions in the same row order as the original training frame, returned by model.cross_validation_holdout_predictions(). The latter is usually what people want (it was added later, which is why there are two versions).

Yes, unfortunately the R example that builds this frame in the "Cross-validation" section of the H2O User Guide does not have a Python version (there is a ticket to fix that), and the documentation for the keep_cross_validation_predictions argument only shows one of the two locations.

Here's an updated example using XGBoost and showing both types of CV predictions:

import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator
h2o.init()

# Import a sample binary outcome training set into H2O
train = h2o.import_file("http://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()

# use the `keep_cross_validation_predictions` boolean parameter:
# first initialize your estimator and set the nfolds parameter
xgb = H2OXGBoostEstimator(keep_cross_validation_predictions = True, nfolds = 5, seed = 1)

# then train your model
xgb.train(x = x, y = y, training_frame = train)

# print the cross-validation predictions as a list
xgb.cross_validation_predictions()

# print the cross-validation predictions as an H2OFrame
xgb.cross_validation_holdout_predictions()

The CV holdout prediction frame looks like this:

  predict         p0        p1
---------  ---------  --------
        1  0.396057   0.603943
        1  0.149905   0.850095
        1  0.0407018  0.959298
        1  0.140991   0.859009
        0  0.67361    0.32639
        0  0.865698   0.134302
        1  0.12927    0.87073
        1  0.0549603  0.94504
        1  0.162544   0.837456
        1  0.105603   0.894397

[10000 rows x 3 columns]
Erin LeDell
  • This helps a lot, can't thank you enough. Just to connect the dots: so `xgb.cross_validation_holdout_predictions()` gives the prediction from each of the 5 (nfolds) different models for its particular holdout, is that right? – Abhijeet Arora Jul 23 '18 at 13:32
  • Yep, exactly. The predictions on each row are from when that row was part of the holdout set in the CV loop. – Erin LeDell Jul 23 '18 at 15:45
  • Awesome. I think `xgb.cross_validation_holdout_predictions()` doesn't work when we have multiple classes, because the method returns a None type. – Abhijeet Arora Jul 24 '18 at 05:01
  • @AbhijeetArora Make sure that you set `keep_cross_validation_predictions = True` or it will not store the CV predictions. – Erin LeDell Jul 24 '18 at 22:20
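Since `xgb.cross_validation_holdout_predictions()` returns the predictions in the same row order as the training frame, mapping them back to the original dataset should just be a column bind, e.g. `train.cbind(xgb.cross_validation_holdout_predictions())` in H2O (add `.as_data_frame()` if you want pandas); verify this against your H2O version. As a library-agnostic sketch (not H2O-specific) of why that alignment holds, with a dummy score standing in for a real model:

```python
# Each row is scored exactly once, by the model whose fold held it out,
# and the result is written back at the row's ORIGINAL index.
n_rows, n_folds = 10, 5

# Fold assignment: what keep_cross_validation_fold_assignment records.
fold_of = [i % n_folds for i in range(n_rows)]

# Fill holdout predictions in original row order.
holdout_preds = [None] * n_rows
for fold in range(n_folds):
    held_out = [i for i in range(n_rows) if fold_of[i] == fold]
    # ...train on the other folds here; a dummy score stands in for predict()
    for i in held_out:
        holdout_preds[i] = fold / n_folds

# Every row has exactly one prediction, aligned with the original data,
# so attaching the probabilities back is a simple column bind.
assert all(p is not None for p in holdout_preds)
print(holdout_preds[:5])  # → [0.0, 0.2, 0.4, 0.6, 0.8]
```

Because the alignment is positional, no join key is needed; this is exactly why the holdout frame (rather than the list of per-fold frames) is the convenient accessor here.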

For Python there is an example of this for GBM, and it should work exactly the same way for XGBoost. According to that page, you should be able to do something like this:

model = H2OXGBoostEstimator(keep_cross_validation_predictions = True)

model.train(x = predictors, y = response, training_frame = train)

cv_predictions = model.cross_validation_predictions()
Michele Tonutti
  • This is almost correct, but not the right method -- this method will return a list of CV pred frames instead of the CV preds in a frame, like OP is asking for. – Erin LeDell Jul 21 '18 at 22:15