I simulate reasonably sized datasets (10-20 MB) across a large number of parameter combinations (20-40k). Each dataset x parameter set is pushed through `mclapply`, and the result is a list where each item holds the output data (list item 1) and the parameters used to generate that result (list item 2, with one element per parameter).
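Roughly what each job looks like (a minimal sketch; `simulate_one()` and the parameter values are placeholders, not my real code):

```r
library(parallel)

# Placeholder parameter grid; the real one has 20-40k combinations.
param_grid <- list(
  list(n = 100, rate = 0.1),
  list(n = 100, rate = 0.5)
)

# Stand-in for the real simulation.
simulate_one <- function(params) {
  data.frame(x = rnorm(params$n, sd = params$rate))
}

# Each list item: [[1]] output data, [[2]] the parameters that produced it.
# mc.cores > 1 requires a Unix-alike (fork-based parallelism).
results <- mclapply(param_grid, function(params) {
  list(simulate_one(params), params)
}, mc.cores = 4)
```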
I just ran through an 81k list (but had to run it in 30k chunks), and the resulting lists are around 700 MB each. I've stored them as `.rdata` files but will probably resave them as `.Rda`. Each file takes forever to read back into R. Is there a best practice here, especially for long-term storage?
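For reference, the save/read pattern I'm using now, plus the `saveRDS()` variant I'm considering (the `"xz"` compression choice is just one option, not settled):

```r
# Current approach: save()/load() write the same RData format
# regardless of whether the extension is .rdata or .Rda.
save(results, file = "results1a.rdata")
load("results1a.rdata")   # restores `results` into the workspace

# Alternative for a single object: saveRDS()/readRDS(), which lets
# you pick the variable name at load time and choose the compressor.
saveRDS(results, "results1a.rds", compress = "xz")  # smaller file, slower write
results <- readRDS("results1a.rds")
```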
Ideally I would keep everything in one list, but `mclapply` throws an error about not being able to serialize vectors, and a job that large would take forever on the cluster (split three ways, it took 3 hours per job). But having several result files (`results1a.rdata`, `results2b.rdata`, `results3c.rdata`) also seems inefficient. A sketch of how I'd stitch the chunks back together is below.
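Assuming each chunk file contains a single results list, recombining them looks roughly like this:

```r
chunk_files <- c("results1a.rdata", "results2b.rdata", "results3c.rdata")

# Load each file into its own environment, pull out the one object it
# holds, then flatten one level so all chunks form a single list.
all_results <- unlist(lapply(chunk_files, function(f) {
  e <- new.env()
  load(f, envir = e)
  get(ls(e)[1], envir = e)
}), recursive = FALSE)
```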