
I'm trying to memory-map individual datasets in an HDF5 file:

```
import h5py
import numpy as np
import numpy.random as rdm

n = int(1E+8)
rdm.seed(70)
dset01 = rdm.rand(n)
dset02 = rdm.normal(0, 1, size=n).astype(np.float32)

# Write both arrays into a single HDF5 file.
with h5py.File('foo.h5', mode='w') as f0:
    f0.create_dataset('dset01', data=dset01)
    f0.create_dataset('dset02', data=dset02)

# Try to memory-map the first dataset straight from the file.
fp = np.memmap('foo.h5', mode='r', dtype='double')
print(dset01[:3])  # in-memory values
print(fp[:3])      # memory-mapped values
del fp
```
del fp

However, the output below shows that the values in `fp` don't match those in `dset01`:

```
[0.92748054 0.87242629 0.58463127]
[5.29239776e-260 1.11688278e-308 5.18067355e-318]
```

I'm guessing I should have passed an `offset` value to `np.memmap`. Is that the mistake in my code? If so, how do I find the correct offset of each dataset in an HDF5 file?
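
For reference, h5py's low-level API can report where a dataset's raw data sits in the file. A minimal sketch, assuming the default contiguous storage layout used above; `Dataset.id.get_offset()` returns `None` for chunked or compressed datasets, in which case a flat memmap is not possible:

```
import h5py
import numpy as np

# Inspect 'foo.h5' from the question above.
with h5py.File('foo.h5', mode='r') as f:
    dset = f['dset01']
    offset = dset.id.get_offset()  # byte offset of the raw data, or None
    shape, dtype = dset.shape, dset.dtype

if offset is not None:
    # Only valid because the dataset was written with the default
    # contiguous layout (no chunking, no compression).
    fp = np.memmap('foo.h5', mode='r', dtype=dtype, offset=offset, shape=shape)
    print(fp[:3])
```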

  • You can't use `np.memmap` on that kind of file. The HDF5 storage layout is not the simple one that `memmap` assumes. – hpaulj Feb 10 '20 at 15:37
  • You can load selective slices of `h5py` datasets. So there's no need for `memmap`. – hpaulj Feb 10 '20 at 16:51
  • @hpaulj, when you say there is no need for `memmap`, do you mean I can retrieve a subset of an HDF5 dataset without loading the whole dataset into memory, or do you mean `h5py` has features that can memory map a dataset in an HDF5 file? – Indominus Feb 10 '20 at 19:24
  • You can retrieve a subset. http://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data – hpaulj Feb 10 '20 at 19:47
  • If I do `f = h5py.File('foo.h5', mode='r')`, then `f['dset01'][:500]`, does it only read 500 records from the disk? It would be great if it's true, because my real use case is `n = ~100 million` and the file may be stored on a remote and very slow hard drive. – Indominus Feb 10 '20 at 20:23
  • Correct: `f['dset01'][:500]` only reads 500 records. You can use most NumPy indexing methods to read subsets of a dataset (not all fancy indexing is supported); see the sketch after these comments. – kcw78 Feb 11 '20 at 02:29
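
A minimal sketch of the subset-reading approach described in the comments, using the `foo.h5` file from the question; slicing an `h5py` dataset handle reads only the selected elements from disk:

```
import h5py

with h5py.File('foo.h5', mode='r') as f:
    dset = f['dset01']            # a lazy handle; no values read yet
    first = dset[:500]            # reads only 500 float64 values from disk
    chunk = dset[10_000:10_003]   # arbitrary slices are read the same way
    print(first[:3])
    print(chunk)
```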

0 Answers