h2o error when run on a subset of the data but runs perfectly on the original data

Question

The error I am getting is shown below. The subset (~100k examples) of my data has exactly the same number of columns as the original dataset (400k examples), yet training runs perfectly on the original dataset and fails on the subset.

Traceback (most recent call last)
<ipython-input-14-35cf02055a2e> in <module>()
     15 from h2o.estimators.gbm import H2OGradientBoostingEstimator
     16 gbm_cv3 = H2OGradientBoostingEstimator(nfolds=2)
---> 17 gbm_cv3.train(x=x, y=y, training_frame=train)
     18 ## Getting all cross validated models
     19 all_models = gbm_cv3.cross_validation_models()



error_count = 2
    http_status = 412
    msg = u'Illegal argument(s) for GBM model: 
GBM_model_python_1533214798867_179.  Details: ERRR on field: 
_response: Response cannot be constant.'
    dev_msg = u'Illegal argument(s) for GBM model: 
GBM_model_python_1533214798867_179.  Details: ERRR on field: 
_response: Response cannot be constant.'

score 5 · Accepted Answer · edited Aug 02 '18 at 20:09

This is a user error.

The "response" is the y column, and in the subset of data you have given, every row has the same value for y. You cannot train a supervised machine learning model when every y value is the same, because there is nothing for the model to learn.

This can happen if you have a rare outcome: when you randomly split the data, you might get a partition in which only one value is represented. To check how many unique values the response column has in Python, run train[y].unique(). If it returns a single value, that partition cannot be used for training.
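As a rough illustration of the point above, here is a minimal sketch in plain Python (not the H2O API) of how a subset of data with a rare outcome can end up with a constant response, and how sampling per class avoids it. The helper names is_constant and stratified_subset, the toy data, and the 25% fraction are all assumptions made up for this example. With an H2OFrame, the check itself is the train[y].unique() call mentioned above.

```python
import random
from collections import defaultdict


def is_constant(values):
    """Return True when the response has fewer than two distinct values."""
    return len(set(values)) < 2


def stratified_subset(rows, label_of, fraction, seed=0):
    """Sample `fraction` of the rows from each class, keeping at least one row per class.

    Hypothetical helper for illustration only; not part of H2O.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    subset = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))
        subset.extend(rng.sample(members, k))
    return subset


def label_of(row):
    """The response is the second element of each toy row."""
    return row[1]


# Toy data: 4000 rows, only 4 of which are positive (a rare outcome).
rows = [(i, 1 if i % 1000 == 0 else 0) for i in range(4000)]

# A contiguous slice that happens to miss every positive row.
naive = rows[1:1000]
# A per-class sample that keeps both outcomes represented.
stratified = stratified_subset(rows, label_of, fraction=0.25)

print(is_constant([label_of(r) for r in naive]))
print(is_constant([label_of(r) for r in stratified]))
```

Running a check like this before calling train() surfaces the problem during data preparation instead of as an HTTP 412 from the server.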

edited Aug 02 '18 at 20:09

Erin LeDell

answered Aug 02 '18 at 14:22

TomKraljevic

How is this a user error? Sure, there's nothing for the model to learn, but the fact that the library throws exceptions like this makes it unnecessarily hard to embed in a large automated application. It would be much more user-friendly if it just built a model which predicted that same constant class. Instead we have to catch these exceptions and put in workarounds in our code for no good reason. Is this an edge case? Sure. But the problem is still not ill-defined, it's just super easy. – nirvana-msu Aug 13 '18 at 10:07
I guess the best way I can respond to that is to say no data scientist I've talked to in the last five years would intentionally give a constant Y column. It's a sign that the data was prepared incorrectly. So the software treats it as an error. And that's what this persona of user would expect and likes in my experience. – TomKraljevic Aug 14 '18 at 02:57
