I'm new to R and ML but have a focused question that I am trying to answer.

I'm using my own data but following Matt Dancho's example here to predict attrition: http://www.business-science.io/business/2017/09/18/hr_employee_attrition.html

I have removed zero-variance variables and scaled the variables as per his update.

My issue is with the step that runs explain() on the explainer. I get variations of the two errors below: the first from the original code, the second from a variation of it. Everything up to that point runs.

explanation <- lime::explain(
  as.data.frame(test_h2o[1:10, -1]),
  explainer    = explainer,
  n_labels     = 1,
  n_features   = 4,
  kernel_width = 0.5
)

gives:

Error during wrapup: arguments imply differing number of rows: 50000, 0

While

explanation <- lime::explain(
  as.data.frame(test_h2o[1:500, -1]),
  explainer    = explainer,
  n_labels     = 1,
  n_features   = 5,
  kernel_width = 1
)

gives:

ERROR: Unexpected HTTP Status code: 500 Server Error (url = http://localhost:54321/3/PostFile?destination_frame=C%3A%2FUsers%2Fsim.s%2FAppData%2FLocal%2FTemp%2FRtmpykNkl1%2Ffileb203a8d4a58.csv_sid_afd3_26)
Error: lexical error: invalid char in json text.
<html> <head> <meta http-equiv=
                 (right here) ------^

Please let me know if you have any ideas or insights for this problem, or need additional info from me.

Darren Cook
Stacy S.
  • R has lazy evaluation, so the error might actually be on an earlier line. Can you show `nrow(test_h2o)` and `ncol(test_h2o)` just before you make that call? (My guess from the error message is that `test_h2o` is not what you think it is at that point.) – Darren Cook Jan 03 '18 at 08:37
  • I have 2179 rows and 48 columns which checks out with the 15% of the dataset that I expected. Do you see any issues with this? – Stacy S. Jan 04 '18 at 19:07
  • That seems fine. If they had been zero, or you had got the error running them, that would have suggested the client and server were out of sync, or some other data problem. – Darren Cook Jan 04 '18 at 21:21
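
For reference, the first error message is the one base R's data.frame() raises when one of its columns has zero length. The sketch below only reproduces that message with hypothetical stand-in vectors; it does not call h2o or lime. The row count is chosen on the assumption that 50000 corresponds to 10 explained rows times 5000 permutations (which I believe is lime's default for n_permutations), and the empty vector stands in for a prediction result that came back with nothing in it. That reading is a guess, not a confirmed diagnosis.

```r
# Hypothetical stand-ins, not lime internals:
# 10 explained rows x 5000 permutations = 50000 perturbed rows
n_cases        <- 10
n_permutations <- 5000   # assumed to match lime's default
perturbed_id   <- seq_len(n_cases * n_permutations)

# Stand-in for a prediction step that returned nothing
empty_predictions <- numeric(0)

# Combining a 50000-row column with a 0-row column fails in data.frame()
msg <- tryCatch(
  {
    data.frame(id = perturbed_id, prediction = empty_predictions)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(msg)
```

If that reading is right, the thing to inspect is what the model returns when asked to predict on the data frame passed to explain(), rather than the size of test_h2o itself.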

1 Answer

Try this and let me know what you get. Note that this assumes your Excel file is stored in a folder called "data" in your working directory. Use getwd() and setwd() to get or set the working directory (or use Projects in the RStudio IDE).
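
A quick way to confirm the file location before running the full script is sketched here. It uses only base R; the path is the one assumed in this answer, and the variable names are just for illustration.

```r
# Path assumed by this answer: a "data" folder inside the working directory
xlsx_path <- file.path("data", "WA_Fn-UseC_-HR-Employee-Attrition.xlsx")

# Show where R is currently looking
cat("Working directory:", getwd(), "\n")

# Check for the file before calling read_excel()
found <- file.exists(xlsx_path)
if (found) {
  message("Found: ", xlsx_path)
} else {
  message("Not found: ", normalizePath(xlsx_path, mustWork = FALSE))
}
```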

library(h2o)        # Professional grade ML pkg
library(tidyquant)  # Loads tidyverse and several other pkgs 
library(readxl)     # Super simple excel reader
library(lime)       # Explain complex black-box ML models
library(recipes)    # Preprocessing for machine learning

hr_data_raw_tbl <- read_excel(path = "data/WA_Fn-UseC_-HR-Employee-Attrition.xlsx")

hr_data_organized_tbl <- hr_data_raw_tbl %>%
  mutate_if(is.character, as.factor) %>%
  select(Attrition, everything())

recipe_obj <- hr_data_organized_tbl %>%
  recipe(formula = Attrition ~ .) %>%
  step_rm(EmployeeNumber) %>%
  step_zv(all_predictors()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep(data = hr_data_organized_tbl)

# Note: newer recipes releases rename this argument to new_data
hr_data_bake_tbl <- bake(recipe_obj, newdata = hr_data_organized_tbl)

h2o.init()

hr_data_bake_h2o <- as.h2o(hr_data_bake_tbl)

hr_data_split <- h2o.splitFrame(hr_data_bake_h2o, ratios = c(0.7, 0.15), seed = 1234)

train_h2o <- h2o.assign(hr_data_split[[1]], "train" ) # 70%
valid_h2o <- h2o.assign(hr_data_split[[2]], "valid" ) # 15%
test_h2o  <- h2o.assign(hr_data_split[[3]], "test" )  # 15%

y <- "Attrition"
x <- setdiff(names(train_h2o), y)

automl_models_h2o <- h2o.automl(
  x = x, 
  y = y,
  training_frame    = train_h2o,
  validation_frame  = valid_h2o,
  leaderboard_frame = test_h2o,
  max_runtime_secs  = 15
)

automl_leader <- automl_models_h2o@leader

explainer <- lime::lime(
  as.data.frame(train_h2o[,-1]), 
  model          = automl_leader, 
  bin_continuous = FALSE
)

explanation <- lime::explain(
  x              = as.data.frame(test_h2o[1:10,-1]), 
  explainer      = explainer, 
  n_labels       = 1, 
  n_features     = 4,
  n_permutations = 500,
  kernel_width   = 1
)

explanation
Matt Dancho
  • I am able to run your code for that data and get output...I'm just not sure what it's not liking about my dataset... – Stacy S. Jan 04 '18 at 13:30
  • The error message indicates an issue with the response column. Make sure that your response is a Factor with the same levels, meaning level 0 = No, level 1 = Yes, for all splits of your data. If the factor levels are inconsistent, that could cause this issue. – Matt Dancho Jan 04 '18 at 18:01
  • I am getting the same error at the explanation step: "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 5000, 0" Any thoughts on why this is? – Stacy S. Jan 04 '18 at 19:01
  • Yes, thank you, Matt! I changed the responses to a simple Yes/No like the dataset in your example, and it accepted that part. – Stacy S. Jan 04 '18 at 19:04
  • I still haven't been able to solve variations of this error at the explanation step. Any insights?: "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 500, 0 " – Stacy S. Jan 05 '18 at 22:24
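
The factor-level check suggested in the comments above can be sketched in base R as follows. The data frames and the same_levels() helper are hypothetical and only illustrate the comparison; with H2O frames you would first convert each split with as.data.frame(), or compare the output of h2o.levels() for the response column.

```r
# Hypothetical splits: the response should be a factor with identical levels
train_df <- data.frame(
  Attrition = factor(c("No", "Yes", "No"), levels = c("No", "Yes"))
)
test_df <- data.frame(
  Attrition = factor(c("Yes", "No"), levels = c("No", "Yes"))
)
# A split whose response uses different labels
other_df <- data.frame(
  Attrition = factor(c("Left", "Stayed"))
)

# Illustrative helper: TRUE only if both columns are factors with
# the same levels in the same order
same_levels <- function(reference, other, column) {
  is.factor(reference[[column]]) &&
    is.factor(other[[column]]) &&
    identical(levels(reference[[column]]), levels(other[[column]]))
}

print(same_levels(train_df, test_df, "Attrition"))
print(same_levels(train_df, other_df, "Attrition"))
```

Running the same comparison for every split (train, valid, test) shows quickly whether the response column is encoded consistently.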