How do h2o models determine what columns to use for predictions (position, name, etc.)?

Question

I am using the h2o Python API to train some models and am a bit confused about how to correctly use some parts of it. Specifically: which columns should be ignored in a training dataset, and how does a model find the actual predictor features in a dataset when its predict() method is called? Also, how should weight columns be handled when the datasets used for actual prediction don't have weights?

The details of the code (I think) are not very important, but the basic training logic looks something like

drf_dx = h2o.h2o.H2ORandomForestEstimator(
    # denoting update version name by epoch timestamp
    model_id='drf_dx_v'+str(version)+'t'+str(int(time.time())), 
    response_column='dx_outcome',
    ignored_columns=[
        'ucl_id', 'patient_id', 'account_id', 'tar_id', 'charge_line', 'ML_data_begin',
        'procedure_outcome', 'provider_outcome',
        'weight'
    ],
    weights_column='weight',
    ntrees=64,
    nbins=32,
    balance_classes=True,
    binomial_double_trees=True)
.
.
.
drf_dx.train(x=X_train, y=Y_train, 
          training_frame=train_u, validation_frame=val_u, 
          max_runtime_secs=max_train_time_hrs*60*60)

(note the ignored columns) and the prediction logic just looks like

preds = model.predict(X)

where X is some (h2o) dataframe with more (or fewer) columns than the X_train used to train the model (it includes some columns for post-processing exploration in a Jupyter notebook). E.g. the X_train columns may look like

<columns to ignore (as seen in the code)> <columns to use as features for training> <outcome label>

and the X columns may look like

<columns to ignore (as seen in the code)> <EVEN MORE COLUMNS TO IGNORE> <columns to use as features for training>

My question is: is this going to confuse the model when making predictions? I.e. is the model picking the columns to use as features by column name (in which case I don't think the different dataframe width would be a problem), by column position (in which case adding more data columns to each sample would shift the positions and become a problem), or by something else? And what happens to the extra columns, given that they were not explicitly listed in the ignored_columns arg of the model constructor?

** Slight aside: should the weights_column name be in the ignored_columns list or not? The example in the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/weights_column.html#weights-column) seems to use it as a predictor feature as well, and the text seems to recommend doing so:

For scoring, all computed metrics will take the observation weights into account (for Gains/Lift, AUC, confusion matrices, logloss, etc.), so it’s important to also provide the weights column for validation or test sets if you want to up/down-weight certain observations (ideally consistently between training and testing).

but these weight values are not something that comes with the data used in actual predictions.

score 2 · Accepted Answer · edited Oct 24 '18 at 18:11

I've summarized your question into a few distinct parts, so the answers will be in a Q/A type fashion.

1) When I use my_model.predict(X), how does H2O-3 know which columns to predict with?

H2O-3 will use the columns that were treated as predictors when you built your model: either whatever you passed to the x argument of train(), or, if x was not given, all the columns in your training_frame that were not (a) ignored via ignored_columns, (b) passed as the target to the y argument, or (c) dropped because the column has a constant value. My recommendation would be to use the x argument to specify your predictors and skip the ignored_columns parameter altogether. If X, the new dataframe you are predicting on, includes columns that were not used when you were building the model, those columns will be ignored. In other words, columns are matched by name, not by position.
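Since the matching is by name, a quick sanity check before scoring is to compare the predictor names against the column names of the new frame. The following is a minimal sketch in plain Python, not part of the H2O API: check_prediction_columns and the column names are hypothetical, chosen only for illustration.

```python
def check_prediction_columns(predictors, frame_columns):
    """Compare a model's predictor names against the columns of a scoring frame.

    Returns (missing, extra): predictors absent from the frame, and frame
    columns the model was not trained on, both in their original order.
    """
    frame_set = set(frame_columns)
    predictor_set = set(predictors)
    missing = [c for c in predictors if c not in frame_set]
    extra = [c for c in frame_columns if c not in predictor_set]
    return missing, extra


# Hypothetical column names, loosely modeled on the question.
predictors = ["age", "charge_amount", "visit_count"]
# Extra columns come first and the predictors are reordered on purpose:
# only the names matter for this check, not the positions.
scoring_columns = ["patient_id", "notes", "visit_count", "age", "charge_amount"]

missing, extra = check_prediction_columns(predictors, scoring_columns)
print(missing)  # []
print(extra)    # ['patient_id', 'notes']
```

With an H2OFrame, the column names are available as X.columns, so the same check could be run as check_prediction_columns(x_names, X.columns), where x_names is whatever list you passed as x at training time. An empty missing list means every predictor is present; anything in extra is simply not used.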

2) Should the weights column name be in the ignored column list?

No. If you add the weights column to the ignored columns list, that column will not be considered in any fashion during the model building phase. In fact, if you test this out, you should get a null pointer error or something similar.

3) Why is the "weights" column specified as a predictor and as the weights_column in the following code example?

This is a great question! I've created two Jira tickets: one to update the documentation to clear up the confusion, and another to potentially add a user warning.
The short answer is that if you pass the same column to the predictors argument x and to the weights_column argument, that column will only be used as a weight. It will not be used as a feature.

4) Does the user guide recommend using the weights as a feature and as a weight?

No, in the paragraph you are pointing to, the recommendation is to ensure that the column you pass as your weights_column exists in your training frame and validation frame - not that it should also be included as a feature.
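Putting points 2 through 4 together, one way to set this up is to build the predictor list explicitly and leave ignored_columns out entirely. This is a rough sketch under a few assumptions: split_columns and train_drf are hypothetical helpers, the column names are made up, and the ntrees/nbins values are just the ones from the question. The weight column is left out of x here purely for clarity, and both frames are assumed to contain it.

```python
def split_columns(all_columns, response, weights, id_columns):
    """Return predictor names: every column that is not the response,
    the weight column, or one of the ID/bookkeeping columns."""
    excluded = {response, weights, *id_columns}
    return [c for c in all_columns if c not in excluded]


def train_drf(train_frame, valid_frame, response, weights, id_columns):
    """Train a random forest with explicit predictors and a weights column.

    Assumes train_frame and valid_frame are H2OFrames that both contain
    the weights column, and that h2o.init() has already been called.
    """
    # Imported here so split_columns can be used without h2o installed.
    from h2o.estimators import H2ORandomForestEstimator

    x = split_columns(train_frame.columns, response, weights, id_columns)
    # The weight column is named only in weights_column, never in ignored_columns.
    model = H2ORandomForestEstimator(ntrees=64, nbins=32, weights_column=weights)
    model.train(x=x, y=response,
                training_frame=train_frame, validation_frame=valid_frame)
    return model


# Hypothetical column layout, for illustration only.
all_columns = ["patient_id", "age", "charge_amount", "weight", "dx_outcome"]
x = split_columns(all_columns, response="dx_outcome",
                  weights="weight", id_columns=["patient_id"])
print(x)  # ['age', 'charge_amount']
```

Because the predictors are spelled out in x, any additional columns that show up later in a prediction frame do not need to be listed anywhere.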
