
I've seen various posts on how to select the independent variables for a model by using expand.grid and then create a formula based on that selection. However, I prepare my input tables beforehand and store them in a list.

library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris)  # let's assume these are different input tables

I'm rather interested in trying all the possible hyperparameter combinations for a given algorithm (here: Random Forest using ranger) for my list of input tables. I do the following to set up the grid:

hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  Importance = c("none", "impurity"),
  Classification = TRUE,
  Repeats = 1:5,
  Target = "Species")

> head(hyper_grid)
  Input_table Trees Importance Classification Repeats  Target
1       iris1    10       none           TRUE       1 Species
2       iris2    10       none           TRUE       1 Species
3       iris1    20       none           TRUE       1 Species
4       iris2    20       none           TRUE       1 Species
5       iris1    10   impurity           TRUE       1 Species
6       iris2    10   impurity           TRUE       1 Species

My question is: what is the best way to pass these values to the model? Currently I'm using a for loop:

for (i in 1:nrow(hyper_grid)) {
  RF_train <- ranger(
    dependent.variable.name = hyper_grid[i, "Target"], 
    data = Input_list[[hyper_grid[i, "Input_table"]]],  # referring to the named object in the list
    num.trees = hyper_grid[i, "Trees"], 
    importance = hyper_grid[i, "Importance"], 
    classification = hyper_grid[i, "Classification"])  # otherwise regression is performed
  print(RF_train)
}

This iterates over each row of the grid. However, I now have to tell the model explicitly whether to perform classification or regression. I assume the factor Species is converted to its numeric factor levels, so regression is performed by default. Is there a way to prevent this, and could something like apply be used for this task instead? Iterating this way also results in messy function calls:

Call:
 ranger(dependent.variable.name = hyper_grid[i, "Target"], data = Input_list[[hyper_grid[i,      "Input_table"]]], num.trees = hyper_grid[i, "Trees"], importance = hyper_grid[i,      "Importance"], classification = hyper_grid[i, "Classification"])

Second, in reality the model output is of course not printed. Instead, I immediately extract the important results (mainly RF_train$confusion.matrix) and write them into an extended version of hyper_grid, on the same row as the input parameters. Is this too costly performance-wise? I ask because if I store the ranger objects instead, I run into memory issues at some point.

Thank you!

crazysantaclaus

1 Answer

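I think it is cleanest to wrap the training and the extraction of the values you need into a function. The dots (...) are needed for use with purrr::pmap below: pmap passes every column of the grid as a named argument, and the dots absorb the columns the function does not use (such as Repeats).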

fit_and_extract_metrics <- function(Target, Input_table, Trees, Importance, Classification, ...) {
  RF_train <- ranger(
    dependent.variable.name = Target, 
    data = Input_list[[Input_table]],  # referring to the named object in the list
    num.trees = Trees, 
    importance = Importance, 
    classification = Classification)  # otherwise regression is performed

  data.frame(Prediction_error = RF_train$prediction.error,
             True_positive = RF_train$confusion.matrix[1])
}

Then you can add the results as a column by mapping over the rows using for example purrr::pmap:

hyper_grid$res <- purrr::pmap(hyper_grid, fit_and_extract_metrics)

By mapping in this way, the function is applied row by row and only the small data frame it returns is kept, not the fitted ranger object, so you should not run into memory issues.
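One thing that may be worth checking before mapping over the grid: expand.grid converts character vectors to factors by default (stringsAsFactors = TRUE), so columns such as Target and Importance would reach the function as factors, not plain strings. The sketch below only illustrates the difference on a small toy grid; whether it matters for your ranger call is an assumption you would need to verify.

```r
# Default behaviour: character inputs become factor columns
grid_default <- expand.grid(
  Input_table = c("iris1", "iris2"),
  Target = "Species")

# With stringsAsFactors = FALSE they stay plain character columns
grid_chr <- expand.grid(
  Input_table = c("iris1", "iris2"),
  Target = "Species",
  stringsAsFactors = FALSE)

class(grid_default$Target)  # "factor"
class(grid_chr$Target)      # "character"
```

Keeping the columns as character means the values arrive in the function exactly as they were typed in the grid.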

The result of purrr::pmap is a list, which means that the column res is a list column holding one small data frame per row. It can be unnested using tidyr::unnest, which spreads the columns of those data frames across your data frame.

tidyr::unnest(hyper_grid, res)

I think this approach is very elegant, but it requires some tidyverse knowledge. I highly recommend this book if you want to know more about that. Chapter 25 (Many models) describes an approach similar to the one I'm taking here.
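If you would rather stay in base R, the same row-by-row pattern can be sketched with lapply and do.call. This is only a minimal sketch: toy_metrics is a hypothetical stand-in for the model-fitting function (it does not call ranger), and the column names Name_length and Double_trees are made up for illustration.

```r
# Toy grid; stringsAsFactors = FALSE keeps Input_table as plain strings
grid <- expand.grid(
  Input_table = c("iris1", "iris2"),
  Trees = c(10, 20),
  stringsAsFactors = FALSE)

# Hypothetical stand-in for the fit: returns a one-row data frame of "metrics"
toy_metrics <- function(Input_table, Trees, ...) {
  data.frame(Name_length = nchar(Input_table),
             Double_trees = 2 * Trees)
}

# Call the function once per grid row, passing the row's columns as named arguments
res_list <- lapply(seq_len(nrow(grid)), function(i) {
  do.call(toy_metrics, grid[i, , drop = FALSE])
})

# Stack the one-row results and attach them to the grid
results <- cbind(grid, do.call(rbind, res_list))
print(results)
```

As with pmap, only the small per-row data frames are kept, so nothing large accumulates between iterations.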

Bas
  • That is a good approach, thanks. As the `confusion.matrix` is not a single value but a 3x3 table, it does not fit into a single data.frame row; I should have been more precise there. Is it possible to return multiple values from the RF_train object into the same row of the `hyper_grid` data.frame? For example `hyper_grid$True_positive <- RF_train$confusion.matrix[1]; hyper_grid$Prediction_error <- RF_train$prediction.error`? – crazysantaclaus Mar 31 '20 at 10:30
  • Oh wow, this is fantastic, and pretty weird ;-). I did not know that you can add a list to a data.frame. I've been developing code for this kind of workflow for a couple of months now, and people recommended `purrr`, but I could not get it to work or get along with the outcome. This is really a great help – crazysantaclaus Mar 31 '20 at 11:42
  • Hm, there seems to be some issue with how the arguments are passed. Do you have any idea how to deal with this? https://stackoverflow.com/questions/60956516/r-ranger-confusion-matrix-is-larger-than-supposed-when-using-expand-grid-and-pur?noredirect=1#comment107844346_60956516 – crazysantaclaus Mar 31 '20 at 18:07