I'm using the h2o Python API to train some models and am a bit confused about how to correctly use some parts of it. Specifically, which columns should be ignored in a training dataset, and how does a model find the actual predictor features in a dataset when calling its predict() method? Also, how should weight columns be handled when the actual prediction datasets don't have weights?
The details of the code here (I think) are not majorly important, but the basic training logic looks something like
drf_dx = h2o.h2o.H2ORandomForestEstimator(
    # denoting update version name by epoch timestamp
    model_id='drf_dx_v'+str(version)+'t'+str(int(time.time())),
    response_column='dx_outcome',
    ignored_columns=[
        'ucl_id', 'patient_id', 'account_id', 'tar_id', 'charge_line', 'ML_data_begin',
        'procedure_outcome', 'provider_outcome',
        'weight'
    ],
    weights_column='weight',
    ntrees=64,
    nbins=32,
    balance_classes=True,
    binomial_double_trees=True)
.
.
.
drf_dx.train(x=X_train, y=Y_train,
             training_frame=train_u, validation_frame=val_u,
             max_runtime_secs=max_train_time_hrs*60*60)
(note the ignored columns) and the prediction logic just looks like
preds = model.predict(X)
where X is some h2o dataframe with more (or fewer) columns than the X_train used to train the model; it includes some extra columns for post-processing exploration in a Jupyter notebook. E.g., the X_train columns may look like
<columns to ignore (as seen in the code)> <columns to use as features for training> <outcome label>
and X columns may look like
<columns to ignore (as seen in the code)> <EVEN MORE COLUMNS TO IGNORE> <columns to use as features for training>
My question is: is this going to confuse the model when making predictions? I.e., does the model pick out the columns to use as features by column name (in which case I don't think the different dataframe width would be a problem), by column position (in which case adding more data columns to each sample would shift the positions and become a problem), or by something else? And what happens given that these extra columns were not explicitly listed in the ignored_columns arg in the model constructor?
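One way I could sidestep the ambiguity, whichever way the model resolves columns, would be to subset the prediction frame to the training feature names before calling predict(). A minimal sketch, assuming plain lists of column names; the helper names feature_columns and missing_features and the example column names are hypothetical, not part of h2o:

```python
def feature_columns(all_columns, ignored_columns, response_column):
    """Return the predictor names, keeping their original order."""
    excluded = set(ignored_columns) | {response_column}
    return [c for c in all_columns if c not in excluded]


def missing_features(features, frame_columns):
    """Return the training features that a prediction frame lacks."""
    present = set(frame_columns)
    return [c for c in features if c not in present]


# Hypothetical column names, for illustration only.
train_cols = ['patient_id', 'weight', 'age', 'bmi', 'dx_outcome']
features = feature_columns(train_cols, ['patient_id', 'weight'], 'dx_outcome')

# A wider prediction frame with extra columns in a different order.
predict_cols = ['notes', 'bmi', 'patient_id', 'age']
assert missing_features(features, predict_cols) == []

# With h2o frames this would be used roughly as:
#   features = feature_columns(train_u.columns, ignored, 'dx_outcome')
#   preds = model.predict(X[features])
```

Selecting by name like this makes the frame passed to predict() contain exactly the training features, so extra exploration columns and column order no longer matter on my side.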
** Slight aside: should the weights_column name be in the ignored_columns list or not? The example in the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/weights_column.html#weights-column) seems to use it as a predictor feature as well, and the docs seem to recommend providing it:
For scoring, all computed metrics will take the observation weights into account (for Gains/Lift, AUC, confusion matrices, logloss, etc.), so it’s important to also provide the weights column for validation or test sets if you want to up/down-weight certain observations (ideally consistently between training and testing).
but these weight values are not something that comes with the data used in actual predictions.