How do I save output from a large simulation in R? (multiple nodes, safe access)

Question

I am doing a large simulation for a research project--simulating 1,000 football seasons and analyzing the results. As the seasons will be spread across multiple nodes, I need an easy way to save my output data into a file (or files) to access later. Since I can't control when the nodes will finish, I can't have them all trying to write to the same file at the same time, but if they all save to a different file, I would need a way to aggregate all the data easily afterward. Thoughts?

Good question. The supercomputer has many machines with 24 processors apiece. I'm not sure if I'm going to do the simulation on one machine or across many. — jntrcs, Dec 08 '16 at 04:21
@jntrcs Is there a common storage area that all the nodes can access? If so, you can determine an appropriate folder structure and save the results of each individual simulation into the corresponding folder on a single drive. The code I posted below would work in this scenario. — dataanalyst, Dec 08 '16 at 05:59
Do you use `R` parallel functions, or do you spread the work _manually_? — ClementWalter, Dec 08 '16 at 10:36
In any case, you can always generate a key with, for instance, the `digest` package to be sure of having a unique name for each task. Then you can use `save` and, once done, loop over your folder with `list.files`. — ClementWalter, Dec 08 '16 at 10:38

score 0 · Accepted Answer · answered Dec 08 '16 at 01:36

I do not know whether this question has already been asked, but here is what I do in my research. You can loop through the file names and aggregate them into one object like so:

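library(data.table)
# Load each chunk's dt.RData; every file is assumed to hold an object named `dt`
chunks <- lapply(1:100, function(i) {
  k <- paste0("C:/chunkruns/dat", i, "/dt.RData")
  env <- new.env()
  load(k, envir = env)  # load into a private environment, not the workspace
  env$dt
})

# Bind once at the end instead of growing the table on every iteration
agg.data <- rbindlist(chunks)
rm(chunks)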

The above code assumes that all your files are saved in different folders but with the same file name.
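For the saving side, here is a minimal sketch of how each simulation run might write its output into that folder layout. The helper name `save_chunk`, the temporary base directory, and the placeholder data are assumptions for illustration, not part of the original setup.

```r
# Hypothetical helper: save one chunk's results as <base_dir>/dat<i>/dt.RData
save_chunk <- function(dt, i, base_dir) {
  out_dir <- file.path(base_dir, paste0("dat", i))
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  out_file <- file.path(out_dir, "dt.RData")
  # The object must be called `dt` so that load() restores it under that name
  save(dt, file = out_file)
  invisible(out_file)
}

# Usage with placeholder data, written under a temporary directory
base_dir <- file.path(tempdir(), "chunkruns")
for (i in 1:3) {
  dt <- data.frame(season = i, wins = c(10, 6))
  save_chunk(dt, i, base_dir)
}
```

Because every run writes only inside its own folder, no two runs ever touch the same file.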

Alternatively, you can use the following to find the file paths matching a pattern and then combine them:

library(data.table)
# List every file whose name starts with "Output" (pattern is a regular expression, not a glob)
k <- list.files(path = "W:/chunkruns/dat", pattern = "^Output", all.files = FALSE, full.names = TRUE, recursive = TRUE)
# Read each CSV, skipping the first 11 lines of every file, then stack the results
m <- lapply(k, FUN = function(x) read.csv(x, skip = 11, header = TRUE))
agg.data <- rbindlist(m)
rm(m)
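
If all nodes share one storage area, a variation is to have each task write its own uniquely named file into a single folder and to combine the files afterward. The sketch below uses base R only. The helper names, the `season_` file naming scheme, and the use of a numeric task ID are assumptions for illustration. Writing to a temporary name and renaming afterward is meant to keep half-written files out of the aggregation step, although how safe the rename is depends on the file system.

```r
# Hypothetical helper: write one task's result to its own .rds file
write_result <- function(result, task_id, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  final <- file.path(out_dir, sprintf("season_%04d.rds", task_id))
  tmp <- paste0(final, ".tmp")
  saveRDS(result, tmp)      # write under a temporary name first
  file.rename(tmp, final)   # then move it to its final name
  invisible(final)
}

# Hypothetical helper: read every finished result file and stack the rows
read_results <- function(out_dir) {
  files <- list.files(out_dir, pattern = "^season_.*\\.rds$", full.names = TRUE)
  do.call(rbind, lapply(files, readRDS))
}

# Usage with placeholder data, written under a temporary directory
out_dir <- file.path(tempdir(), "sim_results")
for (id in 1:5) {
  write_result(data.frame(season = id, champion = paste0("team", id)), id, out_dir)
}
agg <- read_results(out_dir)
```

With `data.table` loaded, `rbindlist(lapply(files, readRDS))` could replace the `do.call(rbind, ...)` line for larger result sets.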
