
I am trying to find the best max_depth value for an H2O GBM using the following code:

library(h2o)
h2o.init()
# import the titanic dataset
df <- h2o.importFile(path = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
dim(df)
head(df)
tail(df)
summary(df, exact_quantiles = TRUE)

# pick a response for the supervised problem
response <- "survived"

# the response variable is an integer.
# we will turn it into a categorical/factor for binary classification
df[[response]] <- as.factor(df[[response]])

# use all other columns (except for the name) as predictors
predictors <- setdiff(names(df), c(response, "name"))

# split the data for machine learning
splits <- h2o.splitFrame(data = df,
                         ratios = c(0.6,0.2),
                         destination_frames = c("train.hex", "valid.hex", "test.hex"),
                         seed = 1234)
train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]

# Establish a baseline performance using a default GBM model trained on the 60% training split
# We only provide the required parameters, everything else is default
gbm <- h2o.gbm(x = predictors, y = response, training_frame = train)

# Get the AUC on the validation set
h2o.auc(h2o.performance(gbm, newdata = valid))
# The AUC is over 94%, so this baseline model is already highly predictive
# [1] 0.9480135
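# If h2o.auc() returns NULL here, the response was likely treated as numeric
# and the model was trained as a regression. A quick sanity check (a sketch,
# assuming df is the H2OFrame imported above):
is.factor(df[[response]])    # should be TRUE for binary classification
h2o.levels(df[[response]])   # should list the class labels, e.g. "0" "1"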

# Determine the best max_depth value to use during a hyper-parameter search.
# Depth 10 is usually plenty of depth for most datasets, but you never know
hyper_params = list( max_depth = seq(1,29,2) )
# or hyper_params = list( max_depth = c(4,6,8,12,16,20) ), which is faster for larger datasets

grid <- h2o.grid(
  hyper_params = hyper_params,

  # full Cartesian hyper-parameter search
  search_criteria = list(strategy = "Cartesian"),

  # which algorithm to run
  algorithm="gbm",

  # identifier for the grid, to later retrieve it
  grid_id="depth_grid",

  # standard model parameters
  x = predictors,
  y = response,
  training_frame = train,
  validation_frame = valid,

  # more trees is better if the learning rate is small enough
  # here, use "more than enough" trees - we have early stopping
  ntrees = 10000,

  # smaller learning rate is better, but because we have learning_rate_annealing,
  # we can afford to start with a bigger learning rate
  learn_rate = 0.05,

  # learning rate annealing: learning_rate shrinks by 1% after every tree
  # (use 1.00 to disable, but then lower the learning_rate)
  learn_rate_annealing = 0.99,

  # sample 80% of rows per tree
  sample_rate = 0.8,

  # sample 80% of columns per split
  col_sample_rate = 0.8,

  # fix a random number generator seed for reproducibility
  seed = 1234,

  # early stopping once the validation AUC doesn't improve by at least
  # 0.01% for 5 consecutive scoring events
  stopping_rounds = 5,
  stopping_tolerance = 1e-4,
  stopping_metric = "AUC",

  # score every 10 trees to make early stopping reproducible
  # (it depends on the scoring interval)
  score_tree_interval = 10)

# by default, display the grid search results sorted by increasing logloss
# (because this is a classification task)
grid

# sort the grid models by decreasing AUC
sortedGrid <- h2o.getGrid("depth_grid", sort_by="auc", decreasing = TRUE)
sortedGrid
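# retrieve the best model (highest validation AUC) from the sorted grid and
# confirm its AUC on the validation frame (a minimal sketch)
bestModel <- h2o.getModel(sortedGrid@model_ids[[1]])
h2o.auc(h2o.performance(bestModel, valid = TRUE))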

# find the range of max_depth for the top 5 models
topDepths <- sortedGrid@summary_table$max_depth[1:5]
minDepth  <- min(as.numeric(topDepths))
maxDepth  <- max(as.numeric(topDepths))
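# minDepth and maxDepth bound the depths of the top 5 models; following the
# H2O GBM tuning tutorial, they would constrain a narrower follow-up search
# (a sketch; hyper_params2 is an illustrative name, not part of the original)
hyper_params2 <- list(max_depth = seq(minDepth, maxDepth, 1))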


I am getting the following errors:

  1. 'NULL' returned by the line 'h2o.auc(h2o.performance(gbm, newdata = valid))'
  2. 'ERRR on field: _stopping_metric: Stopping metric cannot be AUC for regression.' when executing the function 'h2o.grid'

How to resolve the issues?

EDIT: The issue is resolved with the sample code above. The root cause was that I was using pre-encoded data, so H2O treated the response as numeric. After switching to the un-encoded data and importing it with "h2o.importFile" instead of "read.csv", both errors went away!
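For anyone hitting the same errors, here is a minimal sketch of the failure mode and the fix (the local file path is illustrative):

# failure mode: read.csv() yields a plain R data.frame whose 0/1 response
# stays numeric, so H2O fits a regression (hence the NULL AUC and the
# stopping-metric error); push the frame into H2O and force a factor
r_df <- read.csv("titanic.csv")                    # illustrative local path
df   <- as.h2o(r_df)                               # copy into the H2O cluster
df[["survived"]] <- as.factor(df[["survived"]])    # force binary classification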

  • It looks like something is wrong with your gbm model or your new data. Can you please test your steps using this code snippet and see if you hit similar issues: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/x.html#example. h2o.auc() should not return NULL if done correctly. Also notice that your grid search error occurs because the grid search thinks your response is numeric. – Lauren Sep 25 '18 at 18:12
  • Thanks for the suggestion. It works now. Can you please tell me (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/max_depth.html) why the values 9 to 27 are considered the best max_depth in that link's sample code? How can we interpret max_depth using the logloss or AUC, as mentioned in the link? Also, what is the use of the topDepths, minDepth, and maxDepth values in the code? – Kitooos Sep 26 '18 at 00:39
  • That is just example code showing how to use the parameters. Please see this tutorial for best practices: https://github.com/h2oai/h2o-3/blob/3.10.0.7/h2o-docs/src/product/tutorials/gbm/gbmTuning.Rmd. Can you please update your question, add EDIT in bold, and explain how you solved your issue, or post a solution to your question so it can help other people? Thanks! – Lauren Sep 26 '18 at 14:55
  • Thanks for the tutorial! I will surely read it to better understand the GBM logic! I have also edited the post! :) – Kitooos Sep 26 '18 at 22:40
  • Hi, I tried using the code you have given, but I still have lots of doubts regarding the interpretations. Do you have any tutorial that gives a better interpretation of GBM results? As a beginner, I find it difficult to interpret them. Thanks – Kitooos Sep 27 '18 at 19:47
  • Also, I tried applying the tuned parameters in RapidMiner, but I get a different accuracy in RapidMiner than in R. Why do R and RapidMiner give different accuracy with the same parameters and the same model? – Kitooos Sep 27 '18 at 19:51
  • H2O's models are not identical to those of other packages. I'm not sure if there is a better tutorial; maybe you can poke around in the documentation and the GitHub repo a bit more. – Lauren Sep 27 '18 at 20:15
