I am fairly new to R and am teaching myself some machine learning techniques. Currently I am working on hyperparameter tuning, and to get a better understanding of it I am doing the steps more manually than strictly necessary. I am using a tibble with list columns in which each row contains, among other things, the training set of one cross-validation fold and a set of hyperparameter values for a random forest. The full grid contains all unique combinations of folds and hyperparameter values within a specified range. The models should be built by iterating the ranger function over all rows (i.e. fold/parameter combinations) and then be stored in a list column. For this I use the map family of functions from the purrr package.

The problem is that this approach only works when I map the data and a single parameter (mtry) to the ranger function using map2. I know that I need pmap to map more than two elements to a function. Unlike the two-element case, however, this does not work for me with the data and two parameters (mtry and min.node.size). pmap does not seem to pass the third element (min.node.size) to the ranger function, and I get the following error:

"Error in ranger(Species ~ ., data = .x, mtry = .y, min.node.size = .z) : object '.z' not found"

This is my code using the iris data set:

### used packages
library(tidyverse)
library(ranger)
library(rsample)

### data preparation
set.seed(123)

initial_split_data <- initial_split(iris, prop = 0.8)

training <- training(initial_split_data)
testing <- testing(initial_split_data)

cv_split <- vfold_cv(training, v = 3)

cv_data <- cv_split %>% 
  mutate(train = map(.x = splits, .f = ~training(.x)),
         validate = map(.x = splits, .f = ~testing(.x)),
         validate_species = map(.x = validate, .f = ~.x$Species))

### modeling
## two elements being mapped works:
random_forest_model_mtry <- cv_data %>% 
  crossing(mtry = seq(2,4,1)) %>% 
  mutate(model = map2(.x = train, .y = mtry,
                      .f = ~ranger(Species ~ ., data = .x, mtry = .y)))


## three elements being mapped does not work:
random_forest_model_mtry_minnode <- cv_data %>% 
  crossing(mtry = seq(2,4,1),
           min.node.size = seq(1,5,1)) %>% 
  mutate(model = pmap(list(.x = train, .y = mtry, .z = min.node.size),
                      .f = ~ranger(Species ~ ., data = .x, mtry = .y, min.node.size = .z)))

It would be really helpful if someone could show me how to correctly use pmap in this case so that the random forest models get executed.

Best regards

francesco

1 Answer

From the ?pmap help page:

 .f: A function, formula, or vector (not necessarily atomic).

     If a *function*, it is used as is.

     If a *formula*, e.g. ‘~ .x + 2’, it is converted to a
     function. There are three ways to refer to the arguments:

       • For a single argument function, use ‘.’

       • For a two argument function, use ‘.x’ and ‘.y’

       • For more arguments, use ‘..1’, ‘..2’, ‘..3’ etc

The formula shorthand has no .z, so with more than two arguments we need to replace .x, .y, etc. with ..1, ..2, ..3, etc.:

random_forest_model_mtry_minnode <- cv_data %>% 
    crossing(mtry = seq(2,4,1),min.node.size = seq(1,5,1)) %>% 
    mutate(model = pmap(list(train, mtry, min.node.size), 
                        .f = ~ranger(Species ~., data = ..1, 
                                     mtry = ..2, min.node.size = ..3)))

Note that elements of the argument list (list(train, mtry, min.node.size) in your case) can be unnamed. What matters is their order, as that is what gets referenced by ..1, ..2, etc.
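To see the behaviour in isolation from ranger, here is a minimal sketch using only purrr. The vectors first, second and third are hypothetical stand-ins for train, mtry and min.node.size. It shows that .z is not defined inside a formula, that positional placeholders work, and that a named list combined with an ordinary function is an alternative that matches by name.

```r
library(purrr)

# Hypothetical toy inputs standing in for train, mtry and min.node.size
first  <- list(1, 2, 3)
second <- list(10, 20, 30)
third  <- list(100, 200, 300)

# .z is not a placeholder in purrr formulas, so this call raises an error
z_fails <- tryCatch(
  {
    pmap(list(first, second, third), ~ .x + .y + .z)
    FALSE
  },
  error = function(e) TRUE
)

# Positional placeholders: ..1, ..2, ..3 refer to the list elements in order
by_position <- pmap(list(first, second, third), ~ ..1 + ..2 + ..3)

# Named list + ordinary function: list names are matched to argument names
by_name <- pmap(
  list(x = first, y = second, z = third),
  function(x, y, z) x + y + z
)
```

The named variant may be easier to read once the grid grows beyond a handful of hyperparameters, since the order of the list elements no longer matters. The same idea should carry over to the question's code by naming the list elements after the arguments used there (data, mtry, min.node.size) and wrapping the ranger call in a function(data, mtry, min.node.size), though I have not run that variant here.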

Artem Sokolov