
I am getting different predictions for the same test data set from h2o.predict and h2o.mojo_predict_df. When compared, roughly 50% of records have the same probabilities, but the other 50% differ, some drastically, e.g. 0.88 vs. 0.55 for the same class.

The model is built with h2o.gbm and exported with h2o.download_mojo(gbm_model,get_genmodel_jar = T).

I have been researching this and found a few more posts with similar questions, but no solution:

Reproduce predictions with MOJO file of a H2O GBM model

GLM model: h2o.predict gives very different results depending on number of rows used in the validation data

Why do I get different predictions with MOJO?

The code used so far is below:

# load packages and start the h2o cluster

library(h2o)
library(dplyr)       # mutate_if, %>%
library(data.table)  # fwrite

h2o.init(nthreads=10,min_mem_size = '80g')

# variables 

predictors=c(1:76,78:681)
response=77

# getting datasets ready 

model_ready_df = model_ready_df %>% mutate_if(is.character,as.factor)
train.h2o = as.h2o(model_ready_df)
poc_test = poc_test %>% mutate_if(is.character,as.factor)
test.h2o <- as.h2o(poc_test)


# build model 

gbm_model <- h2o.gbm(x = predictors, y =response, training_frame = train.h2o , seed = 0xDECAF,ntrees = 1000, max_depth = 4,
                     learn_rate = 0.1,stopping_rounds=50,min_rows = 50,distribution = "bernoulli",ignore_const_col=F,
                     histogram_type='QuantilesGlobal',sample_rate=0.7,col_sample_rate=0.7,keep_cross_validation_models = T)


# save model object

h2o.download_mojo(gbm_model,get_genmodel_jar = T)

# predict 

preds=as.data.frame(h2o.predict(gbm_model,test.h2o))
preds2=h2o.mojo_predict_df(poc_test, 'GBM_model_R_1576045840818_1.zip',genmodel_jar_path = 'h2o-genmodel.jar',verbose = F)

# save 

fwrite(preds,"pred_usual.csv")
fwrite(preds2,"pred_mojo.csv")
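To quantify how many rows actually disagree, the two prediction frames can be compared with a small tolerance. The following is only a minimal sketch on made-up data: the helper `diff_rows` is hypothetical, and the probability column names `p0` and `p1` are an assumption (the real names depend on the class labels of the response).

```r
# Toy stand-ins for the two prediction frames (values are made up).
preds  <- data.frame(p0 = c(0.12, 0.45, 0.30), p1 = c(0.88, 0.55, 0.70))
preds2 <- data.frame(p0 = c(0.12, 0.12, 0.30), p1 = c(0.88, 0.88, 0.70))

# Hypothetical helper: indices of rows where any probability column
# differs by more than `tol` between the two frames.
diff_rows <- function(a, b, cols, tol = 1e-6) {
  stopifnot(nrow(a) == nrow(b))
  delta <- abs(as.matrix(a[, cols, drop = FALSE]) -
               as.matrix(b[, cols, drop = FALSE]))
  unname(which(apply(delta, 1, max) > tol))
}

mismatched <- diff_rows(preds, preds2, cols = c("p0", "p1"))
share_mismatched <- length(mismatched) / nrow(preds)
```

The indices in `mismatched` could then be used to pull the corresponding rows from poc_test and look for columns whose values stand out, for example very large or very small numbers.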


Learner_seeker
  • Given what little information there is here, I can say this is unexpected. Are you able to provide enough info for a reproduction? – TomKraljevic Dec 11 '19 at 14:52
  • I could try to provide an example, but it won't be representative, as my data set has 700+ columns and it is a complex GBM model. I have just used the above code to get the two sets of predictions, and it is very strange that some are exactly the same while some are different. I am not sure if a MOJO object works differently than h2o.predict – Learner_seeker Dec 11 '19 at 14:56
  • @Pb89 if you could provide your GBM model + few rows (few with the same prediction, few with different) of your dataset we would be able to figure out what is going on. How do you create the Pandas dataframe that goes into `mojo_predict_df`. Can you share your code please? – Michal Kurka Dec 11 '19 at 15:02
  • @MichalKurka - i am using R for running the h2o. My code for model building and saving mojo object + prediction i have added to the question above. Hope that helps give some clarity. – Learner_seeker Dec 12 '19 at 01:28
  • @Pb89, thank you for the details. I don't see anything suspicious, can you please try to upload the MOJO to H2O and try mojo scoring in-h2o? mojo_model <- h2o.import_mojo('GBM_model_R_1576045840818_1.zip') predictions <- h2o.predict(mojo_model, test.h2o) Documentation of h2o.import_mojo can be found on http://docs.h2o.ai/h2o/latest-stable/h2o-docs/save-and-load-model.html#importing-in-r-or-python – Michal Kurka Dec 17 '19 at 16:33
  • @MichalKurka - strange. I tried again after rebuilding the model, and there were still differences between `mojo_predict_df` and `h2o.predict`. However, when I try your approach (import the MOJO back and predict again), I get exactly the same results as `h2o.predict`. My concern is that for deployment we are using the MOJO object directly in a Java environment, so importing the MOJO isn't viable – Learner_seeker Dec 20 '19 at 05:55
  • @Pb89 this indicates there is no actual issue with the MOJO itself, there is likely a bug in the `mojo_predict_df` (my guess would be that the dataset is parsed in a different way) - as long as "import mojo" produces the same results as your trained model I wouldn't worry about putting the mojo to production – Michal Kurka Feb 14 '20 at 19:29

1 Answer


h2o.mojo_predict_df writes the data frame to a CSV file and then essentially runs h2o.mojo_predict_csv. During this write-and-parse round trip, some variables may be written to the CSV in a format that is not parsed back correctly, which leads to differences in the results. One example is scientific notation in R: if your numbers are displayed as e+10, the formats get mixed up when they are written to the CSV. Set options(scipen=999) before running the MOJO functions to correct for this. The results should then be the same.
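The effect of scipen can be seen in base R without H2O. This is only a sketch on a made-up data frame, and it assumes the CSV writer inside h2o.mojo_predict_df follows the same number formatting as base R's `write.csv`, which is not confirmed here.

```r
# A made-up frame with one large value that R shows in scientific notation.
df <- data.frame(id = 1:2, amount = c(1e10, 25))

old <- options(scipen = 0)   # R's default penalty; keep the previous setting
sci_text <- format(1e10)     # formatted with the default setting

options(scipen = 999)        # strongly prefer fixed notation
fixed_text <- format(1e10)   # formatted again after the change

# Write the frame while scipen = 999 is active and read the raw lines back.
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
csv_lines <- readLines(tmp)

options(old)                 # restore the previous scipen value
```

Here `sci_text` holds the exponent form, while `fixed_text` and the lines in `csv_lines` hold plain digits. Setting the option once at the start of the session, before calling the MOJO functions, is the usage suggested above.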

Learner_seeker