I've seen various posts on how to select the independent variables for a model by using expand.grid and then creating a formula based on that selection. However, I prepare my input tables beforehand and store them in a list:
library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris) # let's assume these are different input tables
I'm rather interested in trying all possible hyperparameter combinations for a given algorithm (here: Random Forest using ranger) on each of my input tables. I set up the grid as follows:
hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  Importance = c("none", "impurity"),
  Classification = TRUE,
  Repeats = 1:5,
  Target = "Species")
> head(hyper_grid)
  Input_table Trees Importance Classification Repeats  Target
1       iris1    10       none           TRUE       1 Species
2       iris2    10       none           TRUE       1 Species
3       iris1    20       none           TRUE       1 Species
4       iris2    20       none           TRUE       1 Species
5       iris1    10   impurity           TRUE       1 Species
6       iris2    10   impurity           TRUE       1 Species
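One detail that may matter for the grid above: by default, expand.grid turns character inputs into factors. The following is a minimal base-R sketch of that behaviour, not a diagnosis of the ranger result; the names grid_default and grid_chr are made up for illustration.

```r
Input_list <- list(iris1 = iris, iris2 = iris)

# Default behaviour: character inputs become factor columns
grid_default <- expand.grid(Input_table = names(Input_list),
                            Target = "Species")
class(grid_default$Target)            # "factor"

# With stringsAsFactors = FALSE the columns stay character
grid_chr <- expand.grid(Input_table = names(Input_list),
                        Target = "Species",
                        stringsAsFactors = FALSE)
class(grid_chr$Target)                # "character"

# A factor value is stored as an integer code plus a label;
# as.character() recovers the label
as.integer(grid_default$Target[1])    # 1
as.character(grid_default$Target[1])  # "Species"
```

If the grid values are later used as names (of a list element or a column), it is probably safer to work with the character version or to wrap the values in as.character().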
My question is: what is the best way to pass these values to the model? Currently I'm using a for loop:
for (i in 1:nrow(hyper_grid)) {
  RF_train <- ranger(
    dependent.variable.name = hyper_grid[i, "Target"],
    data = Input_list[[hyper_grid[i, "Input_table"]]], # referring to the named object in the list
    num.trees = hyper_grid[i, "Trees"],
    importance = hyper_grid[i, "Importance"],
    classification = hyper_grid[i, "Classification"]) # otherwise regression is performed
  print(RF_train)
}
This iterates over each row of the grid. However, I now have to tell the model explicitly whether to perform classification or regression. I assume the factor Species is converted to numeric factor levels, so regression occurs by default. Is there a way to prevent this, and could I use e.g. apply for this task instead? Iterating this way also results in messy function calls:
Call:
ranger(dependent.variable.name = hyper_grid[i, "Target"], data = Input_list[[hyper_grid[i, "Input_table"]]], num.trees = hyper_grid[i, "Trees"], importance = hyper_grid[i, "Importance"], classification = hyper_grid[i, "Classification"])
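For reference, one possible shape of an apply-style version is sketched below. It rests on several assumptions: the grid is built with stringsAsFactors = FALSE so that names stay character strings, ranger picks classification on its own when the response column is a factor (which is what its documentation suggests), and fit_one is a hypothetical helper, not part of ranger. The Repeats column is left out to keep the sketch short.

```r
library(ranger)

Input_list <- list(iris1 = iris, iris2 = iris)

# Assumption: keep names as character so look-ups happen by name
grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  Importance = c("none", "impurity"),
  Target = "Species",
  stringsAsFactors = FALSE
)

# Hypothetical helper: fit the model for grid row i and return only a
# small summary, so the full ranger object can be garbage-collected
fit_one <- function(i) {
  fit <- ranger(
    dependent.variable.name = grid$Target[i],
    data = Input_list[[grid$Input_table[i]]],
    num.trees = grid$Trees[i],
    importance = grid$Importance[i]
  )
  list(treetype = fit$treetype, error = fit$prediction.error)
}

results <- lapply(seq_len(nrow(grid)), fit_one)

# Write the summaries back onto the matching grid rows
grid$Treetype  <- vapply(results, function(r) r$treetype, character(1))
grid$OOB_error <- vapply(results, function(r) r$error, numeric(1))
```

The Treetype column is only there to check which kind of forest was actually grown for each row. The stored call will still show the indexing expressions, since ranger records the call as written.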
Second, in reality the model output is of course not printed. Instead, I immediately capture the important results (mainly RF_train$confusion.matrix) and write them into an extended version of hyper_grid, on the same row as the input parameters. Is this too costly performance-wise? I do it this way because storing the ranger objects themselves eventually causes memory issues.
Thank you!