
I have to work with large 3D cubes of data. I want to store them in HDF5 files (using h5py or maybe pytables). I often want to perform an analysis on just a section of these cubes. This section is too large to hold in memory. I would like to have a numpy-style view of my slice of interest, without copying the data to memory (similar to what you could do with a numpy memmap). Is this possible? As far as I know, when you slice a dataset with h5py, you get a numpy array in memory.

It has been asked why I would want to do this, since the data has to enter memory at some point anyway. My code, out of necessity, already runs piecemeal over data from these cubes, pulling small bits into memory at a time. These functions are simplest if they just iterate over the entirety of the datasets passed to them. If I could have a view of the data on disk, I could simply pass this view to these functions unchanged. If I cannot have a view, I need to rewrite all my functions to iterate only over the slice of interest. This will add complexity to the code and make human error during analysis more likely.

Is there any way to get a view of the data on disk, without copying it to memory?

NoDataDumpNoContribution
Caleb
  • Have you heard of [pandas](http://pandas.pydata.org/)? It can be very useful for [reading/writing an HDF5 store](http://pandas.pydata.org/pandas-docs/stable/io.html?highlight=hdf5#hdf5-pytables). – wflynny Jan 06 '15 at 16:51
  • This is a follow up to my earlier question: http://stackoverflow.com/q/27710245/1361752 – Caleb Jan 06 '15 at 16:51
  • Yes, I'm quite familiar with pandas DataFrames (although not so much with their 3D functionality). However, that mostly works in memory, correct? I know you can use pytables to copy the tables to hdf5 files. Is there a way to use this for the functionality I need? Also, pandas mostly provides high-level datatypes for tabular data, I think. Isn't it overkill for simple arrays? That said, if it does what I need, I'd happily use it. – Caleb Jan 06 '15 at 16:57
  • I reread your post more carefully, and to be clear: your desired slices of your datastore are still too large to hold in memory, correct? Or are you taking a rectangular slice of your "cube", then slicing it further? Let me do some research; however, I know that pandas natively supports chunked hdf5 reading, which could possibly simplify your workflow. – wflynny Jan 06 '15 at 17:19
  • Basically, I often target an entire analysis on a subset of the full datacube. However, the subset is also too large to fit in memory. One workflow that occurs to me is that I could copy the subset to a new temporary file and work from that. – Caleb Jan 07 '15 at 01:29
  • You probably need something like slices from a slice. – NoDataDumpNoContribution Dec 21 '15 at 10:44
  • If you need the entire subset for your processing to function and that subset doesn't fit in memory, then I don't see how you can tackle it without updating your processing to work on a subset of the subset. Apart from that, h5py supports numpy-like slicing, which should work through hyperslab selections, but I don't know enough about your data to say whether that's sufficient. – somada141 Jun 04 '18 at 21:17
  • For me this sounds like `dask` arrays would be what you're searching for (although I should say I've never used them for serious applications). They are designed, among other things, for exactly what you describe: data that is too large to fit into memory, and they integrate with well-known tools like `pandas` and `numpy`. See the main website http://dask.pydata.org/en/latest/docs.html and how to create dask arrays, also from hdf5: http://dask.pydata.org/en/latest/array-creation.html. At least in their docs they write "_changes the space limitation from “fits in memory” to “fits on disk”._" – SpghttCd Jun 21 '18 at 05:12
  • +1 for dask.array. It solves the OP's problem. I've used it for out-of-memory arrays with no change to my numpy code. – evamicur Aug 19 '18 at 01:42
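
Following up on the `dask` suggestions in the comments above, here is a minimal sketch of what that could look like. The file name `cube.h5`, the dataset name `data`, and the chunk and slice sizes are placeholders, not taken from the question:

```python
import h5py
import dask.array as da

# Open the HDF5 file and wrap the on-disk dataset in a lazy dask array.
f = h5py.File("cube.h5", "r")                  # placeholder file name
dset = f["data"]                               # placeholder dataset name
x = da.from_array(dset, chunks=(64, 64, 64))   # chunk size chosen arbitrarily

# Slicing a dask array is lazy: no data is read from disk yet.
sub = x[100:900, 100:900, 100:900]

# The computation runs chunk by chunk, so the full slice never sits in memory.
result = sub.mean().compute()
```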

2 Answers


One possibility is to create a generator that yields the elements of the slice one by one. Once you have such a generator, you can pass it to your existing code and iterate through the generator as normal. As an example, you can use a for loop on a generator, just as you might use it on a slice. Generators do not store all of their values at once, they 'generate' them as needed.

You might be able to create a slice of just the locations of the cube you want, but not the data itself, or you could generate the next location of your slice programmatically if you have too many locations to store in memory as well. A generator could then use those locations to yield the data at each location one by one.

Assuming your slices are the (possibly higher-dimensional) equivalent of cuboids, you might generate the coordinates using nested `for` loops over `range()`, or by applying `product()` from the `itertools` module to `range` objects.
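
As a rough illustration of the idea (the file name, dataset name, and bounds below are placeholders, not taken from the question), a generator that yields one plane of the sub-cube at a time might look like this:

```python
import h5py

def iter_slab(dset, zmin, zmax, ymin, ymax, xmin, xmax):
    """Yield one 2D plane of the requested sub-cube at a time,
    so only a single plane is ever resident in memory."""
    for z in range(zmin, zmax):
        # h5py reads just this plane from disk and returns it as a numpy array.
        yield dset[z, ymin:ymax, xmin:xmax]

with h5py.File("cube.h5", "r") as f:           # placeholder file/dataset names
    dset = f["data"]
    total = 0.0
    for plane in iter_slab(dset, 0, 100, 0, 512, 0, 512):
        total += plane.sum()                   # existing code iterates as usual
```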

IFcoltransG

It is unavoidable that the section of the dataset you request gets copied to memory. The reason is simply that you are asking for the entire section, not just a small part of it. Therefore, it must be copied completely.

So, since h5py already allows you to use HDF5 datasets in the same way as NumPy arrays, you will have to change your code to request only the values in the dataset that you currently need.
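
For example, a sketch of reading the region of interest block by block rather than all at once (the file name, dataset name, bounds, and block size are assumptions for illustration):

```python
import h5py

with h5py.File("cube.h5", "r") as f:            # assumed file/dataset names
    dset = f["data"]
    z0, z1 = 200, 1200                          # bounds of the sub-cube (example values)
    step = 64                                   # number of planes requested per read
    running_sum = 0.0
    for z in range(z0, z1, step):
        # Only this block is copied into memory on each iteration.
        block = dset[z:min(z + step, z1), 100:900, 100:900]
        running_sum += block.sum()
```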

1313e