
Sorry for all the purrr-related questions today; I'm still trying to figure out how to make efficient use of it.

With some help from SO I managed to get a random forest ranger model running based on input values coming from a data.frame, using purrr::pmap. However, I don't understand how the return values are generated from the called function. Consider this example:

library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris)  # let's assume these are different input tables

# the data.frame with the values for the function
hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  mtry = c(1,2),
  Classification = TRUE,
  Target = "Species")

> hyper_grid
  Input_table mtry Classification  Target
1       iris1    1           TRUE Species
2       iris2    1           TRUE Species
3       iris1    2           TRUE Species
4       iris2    2           TRUE Species

# the function to be called for each row of the `hyper_grid` data.frame
fit_and_extract_metrics <- function(Target, Input_table, Classification, mtry,...) {
  RF_train <- ranger(
    dependent.variable.name = Target, 
    mtry = mtry,
    data = Input_list[[Input_table]],  # referring to the named object in the list
    classification = Classification)  # otherwise regression is performed

  RF_train$confusion.matrix
}

# the pmap call using a row of hyper_grid and the function in parallel
purrr::pmap(hyper_grid, fit_and_extract_metrics)

It is supposed to return four 3×3 confusion matrices, since there are 3 levels in iris$Species; instead it returns giant confusion matrices. Can someone explain to me what is going on?

The first lines of the output:

> purrr::pmap(hyper_grid, fit_and_extract_metrics)
[[1]]
     predicted
true  4.4 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4
  4.3   1   0   0   0 0   0   0   0   0   0   0   0   0   0 0   0   0   0   0
  4.4   1   1   1   0 0   0   0   0   0   0   0   0   0   0 0   0   0   0   0
  4.5   1   0   0   0 0   0   0   0   0   0   0   0   0   0 0   0   0   0   0
  4.6   0   1   1   1 1   0   0   0   0   0   0   0   0   0 0   0   0   0   0
  4.7   1   0   1   0 0   0   0   0   0   0   0   0   0   0 0   0   0   0   0
  4.8   0   0   1   3 1   0   0   0   0   0   0   0   0   0 0   0   0   0   0
  4.9   0   0   1   2 2   0   0   0   0   0   0   0   0   0 1   0   0   0   0
  5     0   0   0   1 9   0   0   0   0   0   0   0   0   0 0   0   0   0   0
  5.1   0   0   0   0 0   8   0   0   0   1   0   0   0   0 0   0   0   0   0

crazysantaclaus
  • The reason this isn't working is that things are not being assigned to the correct argument in the pmap call. Side note: any reason you're not using [`caret`](http://topepo.github.io/caret/index.html), which does all this tuning heavy lifting for you? – csgroen Mar 31 '20 at 17:43
  • See this section on [model training and tuning](http://topepo.github.io/caret/model-training-and-tuning.html). – csgroen Mar 31 '20 at 17:44
  • as far as I understand, `caret` does not expose all the parameters (e.g. `num.trees`) that I'm working with across ranger, randomForest and keras – crazysantaclaus Mar 31 '20 at 17:51
  • @csgroen: So obviously I would be thankful if you could explain what exactly is wrong here – crazysantaclaus Mar 31 '20 at 19:00
  • Howdy, sorry, work interrupted haha. I just ran the code and it's not really a pmap problem. I'll post the answer below ;) – csgroen Mar 31 '20 at 19:09

1 Answer


The problem here was that the arguments passed to the function were factor levels, not character strings, which tripped up the ranger function: expand.grid converts character columns to factors by default. To solve this, all you need to do is set stringsAsFactors = FALSE in the expand.grid call:

hyper_grid <- expand.grid(
    Input_table = names(Input_list),
    mtry = c(1,2),
    Classification = TRUE,
    Target = "Species", stringsAsFactors = FALSE)

You'll get:

[[1]]
            predicted
true         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         46         4
  virginica       0          4        46

[[2]]
            predicted
true         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         46         4
  virginica       0          5        45

[[3]]
            predicted
true         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47

[[4]]
            predicted
true         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
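
As a side note, the coercion is easy to verify directly; this is a minimal sketch (the `class()` checks are not from the original code, just an illustration):

```r
# expand.grid() coerces character columns to factors unless told otherwise
g1 <- expand.grid(Target = "Species", mtry = c(1, 2))
class(g1$Target)  # "factor"

# with stringsAsFactors = FALSE the column stays character,
# so pmap() passes a plain string to ranger's dependent.variable.name
g2 <- expand.grid(Target = "Species", mtry = c(1, 2), stringsAsFactors = FALSE)
class(g2$Target)  # "character"
```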
csgroen
  • I did not want to pressure you; I was just not sure if you would leave it at "use caret" ;-). Since I have some more processing steps, it took me a while to realize that the response variable is turned into a factor again at a later stage. Now I know what to look for, thank you! – crazysantaclaus Mar 31 '20 at 19:44
  • No worries. ;) I took a quick look at SO and then had something else to do, so I didn't end up following up. But yes, this is probably an unfortunate bug in `ranger` itself during the model fit. – csgroen Apr 01 '20 at 12:39