
I'm using the 'rhdf5' package to read a large HDF5 file (2 GB) containing about 5000 objects. I have to use this package since it appears to be the only one that supports 64-bit integers via bit64.

The problem is that reading all the objects this way is very time-consuming:

  library(rhdf5)
  library(bit64)
  library(parallel)

  # list the file contents and keep only the datasets
  groups = h5ls(h5_file)
  is_dataset = groups$otype == 'H5I_DATASET'
  obj_names = paste(groups$group[is_dataset], groups$name[is_dataset], sep='/')

  # 'obj_names' is now a character vector with the full path of every
  # dataset in 'h5_file'

  h5read_by_name <- function(x) {
    h5read(file=h5_file, name=x, bit64conversion='bit64')
  }

  # read every dataset in parallel and stack the results row-wise
  h5data = do.call(rbind, mclapply(obj_names, h5read_by_name, mc.cores=2))

Even using multicore to speed things up a little, the full read still takes a very long time (days). If I use more cores, the stack size explodes, and I'm already at the hard limit.
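
One variation I have been considering is opening the file once with H5Fopen and passing the open handle to h5read (which accepts an H5IdComponent in place of a filename), so the file is not reopened for each of the 5000 reads. A minimal serial sketch:

  # open the file once and reuse the handle for every read, instead of
  # letting h5read() reopen the file for each dataset; h5read() accepts
  # an H5IdComponent (open file identifier) in place of a filename
  fid = H5Fopen(h5_file)
  h5data_list = lapply(obj_names, function(x) {
    h5read(fid, name=x, bit64conversion='bit64')
  })
  H5Fclose(fid)
  h5data = do.call(rbind, h5data_list)

Whether the repeated open is actually a significant share of the cost is something I have not measured yet.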

Any ideas?

Joe
  • Do these 5000 objects make up the whole of the 2GB file, or are they a subset? If they are the whole file, then your memory usage will climb and your machine will grind to a halt. How long does it take to read one, two, five, ten, and a hundred? Instead of reading everything in a loop, can you read each one and save it to a separate .RData file? (Sketches of both the timing test and the per-file save follow after these comments.) Multicore is probably a waste because this is disk-I/O bound. Rent 100 Amazon servers for a couple of hours instead! – Spacedman Jun 12 '14 at 08:06
  • Thanks. Yes, they make up the whole 2GB, but memory is not really a problem; the processing machine has plenty. The problem was more about the stack. I was indeed considering saving each object to a separate .rda file. I was hoping I would not have to, but if it is the only option I can afford that. Actually, maybe it's better this way. Thanks for your answer. If anybody else has experience reading many objects like this, I'm open to any idea. – Joe Jun 12 '14 at 08:51
  • Like @Spacedman says, it would be good to get some timings to understand where your bottleneck is. You say it takes days to read 5000 objects, which implies >20s per object. Does it really take that long to read a single object? What and how large are the objects? It would also help to report the output of sessionInfo() (after running your minimal example), and better still to provide a helper that generates some realistic toy data for others to explore. It seems backward to have to create and manage RData files rather than use a high-performance system like hdf5. – Martin Morgan Jun 12 '14 at 13:08
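
To make the timing suggestion from the comments concrete, here is a sketch that measures how read time scales with the number of objects, reusing h5read_by_name from the question (the counts are illustrative):

  # time reads of 1, 2, 5, 10, and 100 objects to see whether the
  # per-object cost is roughly constant or grows with the total read
  for (n in c(1, 2, 5, 10, 100)) {
    t = system.time(lapply(obj_names[seq_len(n)], h5read_by_name))
    cat(n, "objects:", t[["elapsed"]], "seconds\n")
  }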
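
And a sketch of the read-once-and-cache approach from the comments, using saveRDS with one .rds file per dataset (the cache directory and the name mangling are illustrative choices):

  # read each dataset once and cache it as its own .rds file, so later
  # sessions can load individual objects without touching the HDF5 file
  dir.create("h5_cache", showWarnings = FALSE)
  for (x in obj_names) {
    obj = h5read(file=h5_file, name=x, bit64conversion='bit64')
    saveRDS(obj, file.path("h5_cache", paste0(gsub("/", "_", x), ".rds")))
  }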

0 Answers