I am getting a strange message from H2O (h2o_3.26.0.2) when predicting with a MOJO file:

Detected 14 unused columns in the input data set: {X8,X9,X10,X12,X1,X11,X2,X14,X3,X13,X4,X5,X6,X7}

I know this is not a missing-variable issue, because in that case H2O outputs:

There were 1 missing columns found in the input data set: {X1}
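
One way to rule out a missing column before calling the MOJO is to compare the column names directly. This is only a sketch: `model_cols` and `input_cols` are hypothetical stand-ins for the predictor names used in training and the names of the data frame passed to the MOJO (X1 to X14 in the example below).

```r
# Hypothetical stand-ins: predictor names the model was trained on,
# and the column names of the data frame passed to the MOJO.
model_cols <- paste0("X", 1:14)
input_cols <- paste0("X", 1:14)

# Columns the model expects but the input lacks
missing_cols <- setdiff(model_cols, input_cols)

# Columns the input has but the model does not know about
extra_cols <- setdiff(input_cols, model_cols)

print(missing_cols)
print(extra_cols)
```

If both vectors are empty, the input and the model agree on column names, so any remaining message is not caused by a name mismatch.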

To reproduce the warning, I created a small example in which X15 is the target variable:

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(h2o))

# read in data ------------------------------------------------------------

data_set <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", col_names = FALSE) %>% 
  mutate_if(is.character, factor)
set.seed(3456)
trainIndex <- caret::createDataPartition(data_set$X15, p = .8, 
                                  list = FALSE, 
                                  times = 1)

train_strat <- data_set[ trainIndex,]
test_strat  <- data_set[-trainIndex,]

# start h2o ---------------------------------------------------------------

h2o.init(startH2O = TRUE, max_mem_size = '400G')
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         25 minutes 3 seconds 
#>     H2O cluster timezone:       Etc/UTC 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.26.0.2 
#>     H2O cluster version age:    6 months and 29 days !!! 
#>     H2O cluster name:           H2O_started_from_R_rstudio_ltd590 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   353.12 GB 
#>     H2O cluster total cores:    64 
#>     H2O cluster allowed cores:  64 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
#>     R Version:                  R version 3.5.1 (2018-07-02)
#> Warning in h2o.clusterInfo(): 
#> Your H2O cluster version is too old (6 months and 29 days)!
#> Please download and install the latest version from http://h2o.ai/download/
# h2o.shutdown()
train_strat_h2o <- as.h2o(train_strat)
test_strat_h2o <- as.h2o(test_strat)
Y <- 'X15'
X <- setdiff(names(train_strat), Y)

# train and predict --------------------------------------------------------------

rf_h2o <- h2o.randomForest(         
  training_frame = train_strat_h2o,
  x = X,                        
  y = Y,    
  nfolds = 0,
  model_id = "big_rf",   
  ntrees = 25,                 
  max_depth = 55,               
  stopping_rounds = 5,          
  stopping_tolerance = 1e-3,
  score_each_iteration = TRUE,
  seed = 123
)

After the model has been trained and is sitting in the environment, I run a prediction and get the following results:

predict(rf_h2o, test_strat_h2o)
#>   predict     <=50K      >50K
#> 1   <=50K 1.0000000 0.0000000
#> 2    >50K 0.4980000 0.5020000
#> 3    >50K 0.2465993 0.7534007
#> 4    >50K 0.1200000 0.8800000
#> 5   <=50K 0.8040000 0.1960000
#> 6   <=50K 0.8628571 0.1371429
#> 
#> [6512 rows x 3 columns]

Next, I prepare the model for production by downloading it as a MOJO:

h2o.download_mojo(rf_h2o, path = "output/", get_genmodel_jar = TRUE)

Finally, I use the MOJO file to predict. This is where I get the odd message, "Detected 14 unused columns in the input data set", even though the predictions appear to be the same:

head(
  h2o.mojo_predict_df(
    test_strat[, -15],
    mojo_zip_path = "output/big_rf.zip",
    genmodel_jar_path = "output/h2o-genmodel.jar"
  )
)
#>  Detected 14 unused columns in the input data set: {X8,X9,X10,X12,X1,X11,X2,X14,X3,X13,X4,X5,X6,X7}
#>  predict    X..50K     X.50K
#> 1   <=50K 1.0000000 0.0000000
#> 2    >50K 0.4980000 0.5020000
#> 3    >50K 0.2465993 0.7534007
#> 4    >50K 0.1200000 0.8800000
#> 5   <=50K 0.8040000 0.1960000
#> 6   <=50K 0.8628571 0.1371429

Created on 2020-02-25 by the reprex package (v0.2.1)

Is this something I should be worried about?
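
To check whether the two sets of predictions really agree, rather than comparing them by eye, something like the following could be used. It is a minimal sketch: `compare_predictions` is a hypothetical helper, not part of h2o, and the toy data frames just hold the six rows printed in the two outputs above. Columns are compared by position because the probability column names differ between the two outputs (`<=50K` versus `X..50K`).

```r
# First six rows copied from the printed outputs above (toy data).
cluster_pred <- data.frame(
  predict = c("<=50K", ">50K", ">50K", ">50K", "<=50K", "<=50K"),
  p_low   = c(1.0000000, 0.4980000, 0.2465993, 0.1200000, 0.8040000, 0.8628571),
  p_high  = c(0.0000000, 0.5020000, 0.7534007, 0.8800000, 0.1960000, 0.1371429),
  stringsAsFactors = FALSE
)
# The MOJO output printed the same values, so the toy copy is identical.
mojo_pred <- cluster_pred

# Hypothetical helper: compares labels (column 1) and probabilities
# (remaining columns) by position, ignoring column names.
compare_predictions <- function(a, b, tol = 1e-6) {
  stopifnot(nrow(a) == nrow(b), ncol(a) == ncol(b))
  labels_match <- all(as.character(a[[1]]) == as.character(b[[1]]))
  max_abs_diff <- max(abs(as.matrix(a[-1]) - as.matrix(b[-1])))
  list(
    labels_match = labels_match,
    max_abs_diff = max_abs_diff,
    same         = labels_match && max_abs_diff <= tol
  )
}

result <- compare_predictions(cluster_pred, mojo_pred)
print(result)
```

With the real objects, the same helper could be called on `as.data.frame(predict(rf_h2o, test_strat_h2o))` and the data frame returned by `h2o.mojo_predict_df()`, which would compare all 6512 rows instead of the first six.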

Hanjo Odendaal