
Just started with H2O AutoML so apologies in advance if I have missed something basic.

I have a binary classification problem where the data are observations from K years. I want to train on the first K-1 years, then tune the models and select the best one explicitly based on the remaining year K.

If I switch off cross-validation (with nfolds=0) to avoid randomly mixing years across the N folds, and pass the year-K data as the validation_frame, then no ensemble is created (as expected according to the documentation), and I do need the ensemble.

If I train with cross-validation (default nfolds) and define the validation frame to be the year-K data

from h2o.automl import H2OAutoML
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(x=x, y=y, training_frame=k_minus_1_years, validation_frame=k_year)

then according to http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html the validation_frame is ignored "...By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored."

Is there a way to have the models tuned and the best one (ensemble or not) selected based on the year-K data only, while still getting the ensemble of models in the output?

Thanks a lot!

  • Hello, I believe you are mistaken. For instance, if you define your training dataframe as the first K-2 years, the validation dataframe as year K-1 and the test set as year K, you will have the ensemble model created. The validation dataset is another way of choosing the best hyperparameters that will be used on the test set. – BP34500 Oct 04 '20 at 23:52
  • Thanks for the suggestion but I think the validation_frame will be ignored in this case (according to the documentation). In order to make sure the model hyperparameter tuning is done on the validation_frame I have to explicitly set nfolds=0. Which in turn means that no ensemble will be created. – Dinos Bachas Oct 15 '20 at 13:46
  • Apologies, yes you are correct: you need to specify nfolds=0, and according to the documentation no ensemble will be created. However, another way to go is to train your individual models separately and then build your own custom ensemble from them. See the Python code at the bottom of the following page (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/blending_frame.html) – BP34500 Oct 15 '20 at 14:38

1 Answer


You don't want cross-validation (CV) if you are dealing with time-series (non-IID) data, since you don't want folds from the future to predict the past.

I would explicitly add nfolds=0 so that CV is disabled in AutoML:

aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
aml.train(x=x, y=y, training_frame=k_minus_1_years, validation_frame=k_year)

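To still get an ensemble, also pass a blending_frame. The stacked ensembles are then trained by blending (holdout stacking) on that frame instead of on CV predictions, so this works for time-series data too. See more info here.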
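As a rough sketch of how this could be wired up (not a definitive recipe): the helper names `split_years` and `train_with_blending` below are hypothetical, and the choice of holding out the year just before year K for blending is an assumption, not something the question or the docs prescribe. Passing the year-K frame as `leaderboard_frame` as well is also an assumption, made so that the leaderboard ranking is computed on year K.

```python
def split_years(years, n_blend=1):
    """Chronological split: oldest years train, the next n_blend years
    are held out for blending, and the newest year is used for validation."""
    if n_blend < 1:
        raise ValueError("n_blend must be at least 1")
    ordered = sorted(set(years))
    if len(ordered) < n_blend + 2:
        raise ValueError("need at least 1 training year, n_blend blending years and 1 validation year")
    return {
        "train": ordered[: -(n_blend + 1)],
        "blend": ordered[-(n_blend + 1) : -1],
        "valid": ordered[-1:],
    }


def train_with_blending(train, blend, valid, x, y, max_runtime_secs=3600, seed=1):
    """train, blend and valid are H2OFrames built from the year lists above."""
    # Imported lazily so split_years can be used without H2O installed.
    from h2o.automl import H2OAutoML

    # nfolds=0 disables cross-validation, so validation_frame is not ignored.
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=seed, nfolds=0)
    aml.train(
        x=x,
        y=y,
        training_frame=train,
        validation_frame=valid,   # year K: used for early stopping / tuning
        leaderboard_frame=valid,  # assumption: rank models on year K as well
        blending_frame=blend,     # holdout used to train the stacked ensembles
    )
    return aml


if __name__ == "__main__":
    print(split_years(range(2015, 2021)))
```

With six years of data, `split_years` assigns the four oldest years to training, the fifth to blending and the last to validation. How the three H2OFrames are then filtered out of the full data set depends on how the year is stored in your data.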

Additionally, since you are dealing with time-series data, I would recommend adding time-series transformations (e.g. lags), so that your model gets information from previous years and their aggregates (e.g. a weighted moving average).
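To make the idea concrete, here is a minimal pure-Python sketch of the two transformations mentioned above, applied to one series of yearly values. The function names `lag` and `weighted_moving_average` are illustrative, not part of H2O, and the weighting scheme (oldest to newest) is an assumption. Both features use only past values, so no information from the current or later years leaks into a row.

```python
def lag(values, k=1):
    """Value observed k steps earlier; None where no history exists yet."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [values[t - k] if t >= k else None for t in range(len(values))]


def weighted_moving_average(values, weights):
    """Weighted average of the len(weights) values strictly before each step.

    weights are ordered oldest to newest, so a larger last weight
    emphasises the most recent year.
    """
    w = list(weights)
    n, total = len(w), sum(w)
    out = []
    for t in range(len(values)):
        if t < n:
            out.append(None)  # not enough history for a full window
        else:
            window = values[t - n : t]  # excludes the current step
            out.append(sum(wi * vi for wi, vi in zip(w, window)) / total)
    return out


if __name__ == "__main__":
    yearly = [10.0, 20.0, 30.0, 40.0]
    print(lag(yearly, 1))
    print(weighted_moving_average(yearly, [1, 2]))
```

In practice these columns would be computed per entity (for example per customer or per site) before the data is converted to an H2OFrame, and the first rows with missing history can either be dropped or left as missing values.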