3

I simulate reasonably sized datasets (10-20 MB) across a large number of parameter combinations (20-40k). Each dataset × parameter set is pushed through mclapply, and the result is a list where each item contains the output data (as list item 1) and the parameters used to generate that result (as list item 2, where each element of that list is a single parameter).
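For concreteness, the structure looks roughly like this (`simulate_data()`, `fit_model()`, and the parameter names are placeholders for the real code):

```r
library(parallel)

## Hypothetical sketch of the setup described above; simulate_data() and
## fit_model() stand in for the actual simulation and analysis functions.
param_grid <- expand.grid(alpha = seq(0, 1, by = 0.1), beta = 1:10)

results <- mclapply(seq_len(nrow(param_grid)), function(i) {
  params <- as.list(param_grid[i, ])
  out    <- fit_model(simulate_data(params), params)
  list(out, params)   # item 1 = output data, item 2 = parameters used
}, mc.cores = 4)
```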

I just ran through an 81k list (but had to run it in ~30k chunks), and the resulting lists are around 700 MB each. I've stored them as .Rdata files but will probably resave them to .Rda. But each file takes forever to read into R. Is there a best practice here, especially for long-term storage?

Ideally I would keep everything in one list, but mclapply throws an error about not being able to serialize vectors, and a job this large would take forever on the cluster (split three ways, it took 3 hours per job). But having several results files (results1a.rdata, results2b.rdata, results3c.rdata), produced roughly as in the sketch below, also seems inefficient.
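Roughly, the chunked runs look like this (`run_one()` and `param_grid` are placeholders for the per-parameter-set work and the full grid from the sketch above):

```r
## Split the full grid into ~30k-element chunks and save each chunk's
## results separately; run_one() wraps the simulation/analysis step.
chunk_ids <- split(seq_len(nrow(param_grid)),
                   ceiling(seq_len(nrow(param_grid)) / 30000))

for (k in seq_along(chunk_ids)) {
  res <- mclapply(chunk_ids[[k]], run_one, mc.cores = 4)
  save(res, file = sprintf("results%d.Rdata", k))
}
```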

Maiasaura

1 Answer

4

It sounds like you have a couple of different questions there -- I'd recommend asking about optimizing your list format in a separate question.

Regarding reading/writing R data to disk, though, I'm not sure there's a more efficient option than Rda files. However, I have found that the level of compression can have a real effect on how long it takes to read/write these files, depending on the computational setup. I've typically found that you get the best performance using no compression (`save(x, file="y.Rda", compress=FALSE)`).
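As a rough sketch of how you might benchmark this on one of your ~700 MB result lists (`res` here is just a stand-in for one of them):

```r
## Compare write/read times and file sizes with and without compression;
## res stands in for one of the large result lists.
system.time(save(res, file = "res_nocomp.Rda", compress = FALSE))
system.time(save(res, file = "res_gzip.Rda"))          # default gzip

system.time(load("res_nocomp.Rda"))
system.time(load("res_gzip.Rda"))

file.info(c("res_nocomp.Rda", "res_gzip.Rda"))$size
```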

As a backup plan, you can try leaving compression on but varying the compression level (and method) as well.
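For example, `save()` accepts different compression backends and levels, which are worth benchmarking on your own data (again, `res` is a placeholder):

```r
## Different compression methods/levels supported by save(); timings and
## file sizes will vary with the data and the machine.
save(res, file = "res_gz1.Rda", compress = "gzip",  compression_level = 1)
save(res, file = "res_gz9.Rda", compress = "gzip",  compression_level = 9)
save(res, file = "res_bz.Rda",  compress = "bzip2")
save(res, file = "res_xz.Rda",  compress = "xz")
```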

Jeff Allen
Another option is `saveRDS`, which will allow you to restore the object under a different name. – mnel Jun 14 '12 at 23:52
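A minimal sketch of that approach (object and file names are only illustrative):

```r
## saveRDS()/readRDS() store a single object without its original name,
## so it can be assigned to any name when read back in.
saveRDS(res, file = "results1.rds")
chunk1 <- readRDS("results1.rds")
```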