
[Python 3.5.2, h2o 3.22.1.1, JRE 1.8.0_201]

I am running a glm lambda_search and using the regularization path to select a lambda.
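For completeness, the setup assumed by the snippets below looks roughly like this (a sketch; the file name, the seed value, and the column handling are placeholders, not the actual ones used in my run):

import h2o
import numpy as np
import pandas as pd
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

# placeholders: the real data file, fold assignment and GLM_SEED are not shown in this post
param = {'GLM_SEED': 12345}
train_h2o = h2o.import_file('train.csv')            # frame containing 'label', 'fold' and the features
train_h2o['label'] = train_h2o['label'].asfactor()  # binomial family expects a categorical target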

glm_h2o = H2OGeneralizedLinearEstimator(family='binomial', alpha=1., lambda_search=True, seed=param['GLM_SEED'])
glm_h2o.train(y='label', training_frame=train_h2o, fold_column='fold')

# walk the regularization path: rebuild a model from the coefficients reported at each lambda
# and score it on the training frame
regpath_h2o = H2OGeneralizedLinearEstimator.getGLMRegularizationPath(glm_h2o)
regpath_pd = pd.DataFrame(index=np.arange(len(regpath_h2o['lambdas'])), columns=['lambda','ncoef','auc'])
for n,(lamb,coefs) in enumerate(zip(regpath_h2o['lambdas'],regpath_h2o['coefficients'])):
    mod = H2OGeneralizedLinearEstimator.makeGLMModel(model=glm_h2o, coefs=coefs)
    regpath_pd.loc[n] = [lamb, sum(1 for x in coefs.values() if abs(x)>1E-3), mod.model_performance(train_h2o).auc()]

The values in regpath_pd are as shown below:

     lambda  ncoef    auc |      lambda  ncoef    auc |      lambda  ncoef    auc
 0    0.103      1  0.500 | 10    0.041     14  0.742 | 20    0.016     54  0.794
 1    0.094      3  0.632 | 11    0.037     15  0.743 | 21    0.015     62  0.799
 2    0.085      3  0.632 | 12    0.034     18  0.749 | 22    0.013     72  0.804
 3    0.078      5  0.696 | 13    0.031     19  0.752 | 23    0.012     83  0.849
 4    0.071      5  0.696 | 14    0.028     20  0.754 | 24    0.011     90  0.813
 5    0.065      6  0.697 | 15    0.026     26  0.766 | 25    0.010    110  0.816
 6    0.059      7  0.702 | 16    0.023     31  0.770 | 26    0.009    123  0.819
 7    0.054      8  0.707 | 17    0.021     34  0.774 | 27    0.008    147  0.822
 8    0.049     10  0.729 | 18    0.019     41  0.777 | 28    0.008    165  0.825
 9    0.045     13  0.740 | 19    0.018     50  0.791 | 29    0.007    190  0.828

I was expecting that as the lambda penalty decreases, ncoef and auc would increase (or at least be non-decreasing). That holds most of the time, with one exception: at index 23 the AUC jumps up a fair bit and then drops again at index 24. Is there an explanation for that? Do I need to set some tolerance parameter, or ...? In this run nlambdas = 100 (the default). When I set it to 50, the lambda, ncoef and auc values are monotonic.
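
For reference, the nlambdas = 50 run mentioned above is the same call with that parameter set explicitly, nothing else changed (a sketch reusing the objects above):

# same training call, but with a 50-step lambda sequence instead of the default 100
glm_h2o_50 = H2OGeneralizedLinearEstimator(family='binomial', alpha=1., lambda_search=True, nlambdas=50, seed=param['GLM_SEED'])
glm_h2o_50.train(y='label', training_frame=train_h2o, fold_column='fold')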

FYI - for the purposes of this post I have truncated lambda and auc values to 3 decimal places. None of those values are truncated in the actual run.

[plot of the regularization-path values from the table above]

UPDATE

Following the code here, I rewrote the loop so that the model is retrained for every lambda. That works correctly and monotonicity is maintained, but it obviously takes much longer to run. The approach I ended up with is to identify the indices that have an issue and train a full model only for those indices. FWIW, here is that part of the code:

# indices n where the AUC drops going from row n to row n+1 (beyond a small tolerance);
# row n is the suspect one
auc_diff = regpath_pd['auc'][1:].values - regpath_pd['auc'][:-1].values
arg_bad = np.argwhere(auc_diff < -1E-3).ravel()

# retrain a full model at each problematic lambda and overwrite that row of regpath_pd
for n in arg_bad.tolist():
    lamb = regpath_h2o['lambdas'][n]
    mod = H2OGeneralizedLinearEstimator(family='binomial', alpha=1., lambda_search=False, Lambda=lamb, seed=param['GLM_SEED'])
    mod.train(y='label', training_frame=train_h2o, fold_column='fold')
    regpath_pd.loc[n] = [lamb, sum(1 for x in mod.coef().values() if abs(x)>1E-3), mod.model_performance(train_h2o).auc()]
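
To double-check a single suspect index, the two scorings can also be compared side by side (a sketch reusing the objects above; index 23 is simply the row flagged in the table):

# compare the regularization-path scoring against a full retrain for one index
n = 23
lamb = regpath_h2o['lambdas'][n]

# AUC from the coefficients reported by getGLMRegularizationPath
path_mod = H2OGeneralizedLinearEstimator.makeGLMModel(model=glm_h2o, coefs=regpath_h2o['coefficients'][n])
auc_path = path_mod.model_performance(train_h2o).auc()

# AUC from a model retrained at the same fixed lambda
full_mod = H2OGeneralizedLinearEstimator(family='binomial', alpha=1., lambda_search=False, Lambda=lamb, seed=param['GLM_SEED'])
full_mod.train(y='label', training_frame=train_h2o, fold_column='fold')
auc_full = full_mod.model_performance(train_h2o).auc()

print(n, lamb, auc_path, auc_full)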

The resulting graph is shown below (on a different scale). It looks like the problem is with getGLMRegularizationPath.

[plot of the corrected regularization path]

ironv

1 Answer


We suspect there may be a bug in the AUC code. We replicated your code with our data but could not reproduce the issue. The best way forward would be for you to provide us with your dataset, if possible. Once we can reproduce it, we'll work on a fix. Thanks!

Joe