I am running a simulation study where my results are stored in a nested list structure. The first level of the list represents the different hyperparameters generated by the model. The second level is the number of replications of the same model (changing the seed).
In the example below, I list the output of a model that is governed by two hyperparameters (hyperpar1
and hyperpar2
) where both can take 2 different values, leading to 4 different combinations of the resulting model. Additionally, each of the 4 possible combinations was run twice (different seeds), leading to eight possible combinations (as can be seen below with str(res, max = 2)
). Finally, two performance metrics were recovered from each possible iteration of the models (metric1
and metric2
).
My problem is that in my real data, the number of iterations (second level of the list) is huge (up to 10000), and the full factorial of the number of hyperparameters is up to 2000 in some cases. Therefore the unlisting process becomes rather slow.
Below I list my current procedure with my desired output, but again, it is relatively slow. In particular, there is one part when I unlist everything, and I put it together in one big data.frame that takes way too long, but I haven't solved this in a faster way.
res <-list(
list(list(modeltype = "tree", time_iter = structure(0.7099, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4847, metric2 = 0.2576 ),
list(modeltype = "tree", time_iter = structure(0.058 , class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 0.5, metric1 = 0.4013, metric2 = 0.2569 )),
list(list(modeltype = "tree", time_iter = structure(0.046 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.4755, metric2 = 0.2988 ),
list(modeltype = "tree", time_iter = structure(0.0474, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 0.5, metric1 = 0.2413, metric2 = 0.2147 )),
list(list(modeltype = "tree", time_iter = structure(0.0502, class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1 , metric1 = 0.7131, metric2 = 0.5024 ),
list(modeltype = "tree", time_iter = structure(2.9419, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.5, hyperpar2 = 1 , metric1 = 0.4254, metric2 = 0.2824 )),
list(list(modeltype = "tree", time_iter = structure(0.041 , class = "difftime", units = "secs"),seed = 1, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1 , metric1 = 0.6709, metric2 = 0.4092 ),
list(modeltype = "tree", time_iter = structure(0.0396, class = "difftime", units = "secs"),seed = 2, nobs = 75, hyperpar1 = 0.8, hyperpar2 = 1 , metric1 = 0.4585, metric2 = 0.4115 )))
hyperpar1 <- c(0.5 , 0.8 )
hyperpar2 <- c(0.5 , 1 )
expand.grid(hyperpar1 = hyperpar1, hyperpar2 = hyperpar2)
# hyperpar1 hyperpar2
# 1 0.5 0.5
# 2 0.8 0.5
# 3 0.5 1.0
# 4 0.8 1.0
#List structure:
#The 4 elements represent the 4 combinations of the hyperparams
#Inside each of the 4 combinations of the hyperparams, 2 lists represent the 2 simulations (with different seeds)
str(res, max = 1)
#Finally, inside each of the final level (level=3) there is a list of 10 objects that are the results of each simulation
str(res, max = 2)
# List of 4
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
# $ :List of 2
# ..$ :List of 8
# ..$ :List of 8
#e.g Fist iteration of first model
t(res[[1]][[1]])
# modeltype time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# [1,] "tree" 0.7099 1 75 0.5 0.5 0.4847 0.2576
Parsing the list
In the following code, I unnest the list and put everything into a data.frame.
#Unlist the nested structure of the list `res`
all_in_list <- lapply(1:length(res), function(i) {
unlisting <- unlist(res[i],recursive = FALSE)
to_df <- do.call(rbind, lapply(unlisting, as.data.frame))
return(to_df)})
#Here is where averything get really really slow when the list is huge
all_in_df <- do.call(rbind, lapply(all_in_list, as.data.frame))
# modeltype time_iter seed nobs hyperpar1 hyperpar2 metric1 metric2
# 1 tree 0.7099 secs 1 75 0.5 0.5 0.4847 0.2576
# 2 tree 0.0580 secs 2 75 0.5 0.5 0.4013 0.2569
# 3 tree 0.0460 secs 1 75 0.8 0.5 0.4755 0.2988
# 4 tree 0.0474 secs 2 75 0.8 0.5 0.2413 0.2147
# 5 tree 0.0502 secs 1 75 0.5 1.0 0.7131 0.5024
# 6 tree 2.9419 secs 2 75 0.5 1.0 0.4254 0.2824
# 7 tree 0.0410 secs 1 75 0.8 1.0 0.6709 0.4092
# 8 tree 0.0396 secs 2 75 0.8 1.0 0.4585 0.4115
Summary statistics of the output
In the following, I recovered the mean and the standard deviation of the performance metrics adding their subfix to each colname
to identify it later (graphing purposes).
#auxiliar function to compute the metrics at aggregate level.
foo_summary <- function(df ,
metrics =c("time_iter","metric1", "metric2") ,
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = mean)
{
#compute the aggregate metrics
out <- as.data.frame(aggregate(
x = df[metrics],
by = df[by],
FUN = summary_function,
na.rm = TRUE))
#rename conviniently the metric computed
oldnames <- colnames(out[metrics])
names(out)[match(oldnames,names(out))] <- paste(colnames(out[metrics]),
as.character(substitute(summary_function)),
sep = "_")
return(out)
}
df_mean <- foo_summary(df = all_in_df,
metrics =c("time_iter","metric1", "metric2"),
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = mean)
df_sd <- foo_summary(df = all_in_df,
metrics =c("time_iter","metric1", "metric2"),
by = c("nobs","hyperpar1", "hyperpar2", "modeltype"),
summary_function = sd)
final_df <- merge(df_mean,df_sd )
Finally, the desired output.
# nobs hyperpar1 hyperpar2 modeltype time_iter_mean metric1_mean metric2_mean time_iter_sd metric1_sd metric2_sd
# 1 75 0.5 0.5 tree 0.38395 secs 0.44300 0.25725 0.4609629107 0.05897271 0.0004949747
# 2 75 0.5 1.0 tree 1.49605 secs 0.56925 0.39240 2.0447406792 0.20343462 0.1555634919
# 3 75 0.8 0.5 tree 0.04670 secs 0.35840 0.25675 0.0009899495 0.16560441 0.0594676803
# 4 75 0.8 1.0 tree 0.04030 secs 0.56470 0.41035 0.0009899495 0.15018948 0.0016263456