
I am currently using H2O AutoML to train a model on a binary classification problem. I have train (70%, ~200k rows), valid (10%, ~30k rows), test (10%, ~30k rows) and blend (10%, ~30k rows) datasets, all coming from a time-sensitive split of the original dataset (~300k rows).
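
For reference, the time-ordered split can be expressed as slicing an H2OFrame that is already sorted by the time column (an illustrative sketch; df and the slice boundaries stand in for my actual code):

n = df.nrow
train = df[0:int(0.7 * n), :]
valid = df[int(0.7 * n):int(0.8 * n), :]
test  = df[int(0.8 * n):int(0.9 * n), :]
blend = df[int(0.9 * n):n, :]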

When I check the training confusion matrix, I only see ~10k total cases instead of ~200k.

I create the model like this:

from h2o.automl import H2OAutoML

# Create the model
aml = H2OAutoML(
    max_runtime_secs=max_runtime_secs,
    stopping_metric=stopping_metric,    # "AUCPR"
    sort_metric=sort_metric,            # "AUCPR"
    nfolds=nfolds,                      # set to 0
    distribution=distribution,          # "bernoulli"
    verbosity=verbosity,
    balance_classes=balance_classes,    # False
    seed=seed,
)

aml.train(
    y=outcome_column,
    training_frame=train,
    validation_frame=valid,
    leaderboard_frame=test,
    blending_frame=blend,
)

# Get the best model
best_model = aml.get_best_model()

# get the performance on test
performance = best_model.model_performance(test)

# define the threshold based on the desired metric
best_threshold = best_model.find_threshold_by_max_metric(
        metric=metric_to_use, valid=True)

# inspect confusion matrix on training set using that threshold
train_confusion = best_model.confusion_matrix(
        thresholds=best_threshold, train=True)

# inspect confusion matrix on test using that threshold
test_confusion = performance.confusion_matrix(thresholds=best_threshold)

# inspect confusion matrix on validation set using that threshold
valid_confusion = best_model.confusion_matrix(
    thresholds=best_threshold, valid=True)

These are the resulting confusion matrices:

confusion matrix train: Confusion Matrix (Act/Pred) @ threshold = 0.35701837501784456
       False    True    Error    Rate
-----  -------  ------  -------  --------------
False  8589     190     0.0216   (190.0/8779.0)
True   272      904     0.2313   (272.0/1176.0)
Total  8861     1094    0.0464   (462.0/9955.0) 

confusion matrix valid: Confusion Matrix (Act/Pred) @ threshold = 0.3555305434918455
       False    True    Error    Rate
-----  -------  ------  -------  ----------------
False  23367    802     0.0332   (802.0/24169.0)
True   1486     1580    0.4847   (1486.0/3066.0)
Total  24853    2382    0.084    (2288.0/27235.0) 

confusion matrix test: Confusion Matrix (Act/Pred) @ threshold = 0.3546996890950105
       False    True    Error    Rate
-----  -------  ------  -------  ----------------
False  23399    769     0.0318   (769.0/24168.0)
True   1537     1529    0.5013   (1537.0/3066.0)
Total  24936    2298    0.0847   (2306.0/27234.0) 

We can see that the valid and test confusion matrices contain their ~30k total cases, but the train confusion matrix contains only ~10k total cases instead of the initial ~200k rows. Why?

EDIT 1: Here is the leaderboard of the models:

LEADERBOARD:
model_id                                                    aucpr       auc    logloss    mean_per_class_error      rmse        mse    training_time_ms    predict_time_per_row_ms  algo
StackedEnsemble_BestOfFamily_1_AutoML_1_20230504_164001  0.632635  0.876718   0.226965                0.260764  0.253097  0.0640579                4397                   0.041607  StackedEnsemble
GBM_1_AutoML_1_20230504_164001                           0.632514  0.876067   0.237024                0.262188  0.254668  0.0648556               24116                   0.038866  GBM
StackedEnsemble_BestOfFamily_2_AutoML_1_20230504_164001  0.631139  0.87681    0.22705                 0.262454  0.253345  0.0641838                1421                   0.049998  StackedEnsemble
GBM_4_AutoML_1_20230504_164001                           0.491181  0.810158   0.338646                0.308554  0.311599  0.0970939                 638                   0.001294  GBM
GBM_3_AutoML_1_20230504_164001                           0.374094  0.762457   0.342545                0.353879  0.312943  0.0979334                 642                   0.001166  GBM
DRF_1_AutoML_1_20230504_164001                           0.373471  0.755862   2.01974                 0.290401  0.311465  0.0970105                1735                   0.001758  DRF
GBM_2_AutoML_1_20230504_164001                           0.355587  0.758635   0.343282                0.371797  0.3132    0.0980943                 960                   0.001194  GBM
GLM_1_AutoML_1_20230504_164001                           0.330708  0.727998   0.312086                0.362253  0.298647  0.08919                 23648                   0.001599  GLM
[8 rows x 10 columns]
Guest6117
1 Answer


DeepLearning and StackedEnsemble models have a parameter score_training_samples that defaults to 10,000. It speeds up training by calculating the training metrics on only a sample of the training frame; the rationale is that users generally don't care much about training performance metrics, so an estimate on a sample is often sufficient while providing a speed-up.
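
For example, when training one of these algorithms directly (outside of AutoML), you can disable the sampling. A minimal sketch with H2ODeepLearningEstimator, reusing train, valid and outcome_column from the question:

from h2o.estimators import H2ODeepLearningEstimator

# score_training_samples=0 means "score on the full training frame";
# the default of 10000 is what produces the ~10k-row training metrics.
dl = H2ODeepLearningEstimator(score_training_samples=0, seed=42)
dl.train(y=outcome_column, training_frame=train, validation_frame=valid)
dl.confusion_matrix(train=True)  # now computed on all training rows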

You can score the whole training frame explicitly with best_model.model_performance(train) and take the confusion matrix from the resulting metrics object, the same way you already do for the test frame. More details are in the documentation.
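
Concretely, mirroring the test-frame pattern from the question:

# Score the full ~200k-row training frame explicitly instead of relying
# on the sampled training metrics stored on the model:
train_performance = best_model.model_performance(train)
train_confusion_full = train_performance.confusion_matrix(thresholds=best_threshold)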

Tomáš Frýda
  • Thanks for your answer. I edited my post to add the leaderboard of the models. There is no DeepLearning model used. I have always computed the training confusion matrix this way and never had an issue. I will try your suggestion and come back to you. – Guest6117 May 04 '23 at 15:29
  • @Guest6117 I forgot we use the same "trick" in the Stacked Ensemble, so I updated the answer (even worse, I think I was the one who added the parameter to the Stacked Ensemble). Thank you for adding the leaderboard! – Tomáš Frýda May 04 '23 at 16:20
  • OK, I understand now! Is it possible to pass this parameter to the AutoML function, down to the stacked ensemble? And if the default value is 10k, why do I have 9955 cases and not exactly 10k? – Guest6117 May 05 '23 at 08:21
  • It's not possible to easily pass this argument to AutoML. Why 9955 instead of 10k? The sampling runs in parallel: we estimate the proportion of rows we want to keep, and each worker samples that proportion from its chunk, so the number of rows in the sample is close to 10k but not guaranteed to be exactly 10k (see the sketch below). – Tomáš Frýda May 05 '23 at 09:30
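
A minimal sketch of that kind of proportional sampling (illustrative only; not H2O's actual implementation):

import random

n_rows, target = 200_000, 10_000
p = target / n_rows                     # proportion each parallel worker keeps
chunks = [n_rows // 8] * 8              # pretend the frame is split into 8 chunks
sampled = sum(sum(random.random() < p for _ in range(c)) for c in chunks)
print(sampled)                          # close to 10000, but rarely exactly 10000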