I built a GLM model using H2O (version 3.14) in R. Note that the training data contains integer columns and many NAs, which I handle with MeanImputation.

glm <- h2o.glm(
    training_frame = train.truth,
    x = getColNames(train.truth),
    y = "isFemale",
    family = "binomial",
    missing_values_handling = "MeanImputation",
    seed = 1000000)

I then used a validation data set to check the performance, and the precision looks good to me:

h2o.performance(glm, newdata = valid.truth) %>% h2o.confusionMatrix()

Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.529384526696015:
           0     1    Error         Rate
0      41962   300 0.007099   =300/42262
1        863 13460 0.060253   =863/14323
Totals 42825 13760 0.020553  =1163/56585
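
For reference, the precision and recall implied by this matrix can be computed from its counts. This is a small sketch in base R with the four counts copied by hand from the output above; the variable names are only illustrative. It gives a precision of about 0.978 and a recall of about 0.940 for class 1.

```r
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn <- 41962  # actual 0, predicted 0
fp <- 300    # actual 0, predicted 1
fn <- 863    # actual 1, predicted 0
tp <- 13460  # actual 1, predicted 1

# Precision: share of predicted 1s that are truly 1.
precision <- tp / (tp + fp)

# Recall: share of actual 1s that were predicted as 1.
recall <- tp / (tp + fn)

print(c(precision = precision, recall = recall))
```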

I then saved the model as a MOJO:

h2o.download_mojo(glm, path="models/mojo", get_genmodel_jar=TRUE)

I exported the validation DF to a CSV file:

dt.valid <- data.table(as.data.frame(valid.truth))
write.table(dt.valid, row.names = F, na="", file="models/test.csv")
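
One detail that may be worth ruling out here (an assumption, not a confirmed cause of the problem): write.table separates fields with a single space unless sep is given, so the file written above is not comma-separated despite its .csv name. The sketch below uses a toy data frame with made-up columns, not the real validation data, to compare the header line written with the default separator against one written with sep = ",". I have not verified how PredictCsv in this H2O version treats a space-separated file.

```r
# Toy stand-in for the validation frame (hypothetical columns, not the real data).
dt <- data.frame(age = c(31L, NA, 45L), city = c("NYC", NA, "SF"))

# Same arguments as the export above: sep is left at its default, a single space.
f_space <- tempfile(fileext = ".csv")
write.table(dt, row.names = FALSE, na = "", file = f_space)

# Same call with an explicit comma separator.
f_comma <- tempfile(fileext = ".csv")
write.table(dt, row.names = FALSE, na = "", sep = ",", file = f_comma)

# Compare the header lines of the two files.
header_space <- readLines(f_space, n = 1)
header_comma <- readLines(f_comma, n = 1)
print(header_space)
print(header_comma)
```

Opening models/test.csv in a text editor and looking at the first line would show which of the two layouts the real export has.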

I then tried to use the saved MOJO to make the same predictions by running this in my Linux shell:

java -cp h2o-genmodel.jar hex.genmodel.tools.PredictCsv \
    --decimal --mojo GLM_model_R_1511161743608_15.zip \
    --input ../test.csv --output output.csv

However, the results are terrible: every record is predicted as 0, which is very different from what I got when I ran the model in R.

I have been stuck on this for a day and can't figure out what went wrong. Can anyone shed some light on this?

Patrick Ng
  • This kind of issue is very detail oriented, and needs a reproducible example. I recommend trying to use the MOJO on a single row using Java code, and single-stepping it in a Java debugger, to see where it behaves differently from what you expect. – TomKraljevic Nov 20 '17 at 13:52
  • Here is a link to the source code: https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/tools/PredictCsv.java – TomKraljevic Nov 20 '17 at 14:01
  • After more experimenting, I suspect the problem might be related to the strange behavior I saw in h2o.predict, as described in another post: https://stackoverflow.com/questions/47404817/glm-model-h2o-predict-gives-very-different-results-depending-on-number-of-rows – Patrick Ng Nov 21 '17 at 03:44
