
Is it possible to read a given set of rows from an hdf5 file without loading the whole file? I have fairly large hdf5 files containing many datasets. Here is an example of what I had in mind to reduce time and memory usage:

#! /usr/bin/env python

import numpy as np
import h5py

infile = 'field1.87.hdf5'
f = h5py.File(infile,'r')
group = f['Data']

mdisk = group['mdisk'].value

val = 2.*pow(10.,10.)
ind = np.where(mdisk>val)[0]

m = group['mcold'][ind]
print m

`ind` doesn't select consecutive rows; the matching rows are scattered throughout the dataset.

The above code fails, but it follows the standard way of slicing an hdf5 dataset. The error message I get is:

Traceback (most recent call last):
  File "./read_rows.py", line 17, in <module>
    m = group['mcold'][ind]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/dataset.py", line 425, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 71, in select
    sel[arg]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 209, in __getitem__
    raise TypeError("PointSelection __getitem__ only works with bool arrays")
TypeError: PointSelection __getitem__ only works with bool arrays
Seanny123
VGP
  • Saying it 'fails' but not showing the error message, or what is wrong, is a big no-no around here. – hpaulj Feb 09 '15 at 21:27
  • You are loading the whole `mdisk` array into memory. I'd have to dig into the documentation to determine how much of `mcold` is loaded. It may depend on whether `ind` is a compact slice or values scattered throughout the array. – hpaulj Feb 09 '15 at 21:32

1 Answer


I have a sample h5py file with:

data = f['data']
#  <HDF5 dataset "data": shape (3, 6), type "<i4">
# is arange(18).reshape(3,6)
ind=np.where(data[:]%2)[0]
# array([0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=int32)
data[ind]  # getitem only works with boolean arrays error
data[ind.tolist()] # can't read data (Dataset: Read failed) error

This last error is caused by repeated values in the list.

But indexing with a list of unique values works fine:

In [150]: data[[0,2]]
Out[150]: 
array([[ 0,  1,  2,  3,  4,  5],
       [12, 13, 14, 15, 16, 17]])

In [151]: data[:,[0,3,5]]
Out[151]: 
array([[ 0,  3,  5],
       [ 6,  9, 11],
       [12, 15, 17]])

So does an array with the proper dimension slicing:

In [157]: data[ind[[0,3,6]],:]
Out[157]: 
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])
In [165]: f['data'][:2,np.array([0,3,5])]
Out[165]: 
array([[ 0,  3,  5],
       [ 6,  9, 11]])
In [166]: f['data'][[0,1],np.array([0,3,5])]  
# error: only one indexing array allowed

So if the indexing is right (unique values, and an index whose shape matches the array dimensions), it should work.
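A minimal sketch of both working forms on a 1-D dataset, closer to the layout in the question. The file is a throwaway one created just for the demonstration: the dataset names mirror the question, but the values are invented. Using `np.unique` to get a sorted, duplicate-free list is my own addition, and `[()]` stands in for `.value`, which newer h5py releases no longer provide.

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway file: names follow the question, values are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(path, "w") as f:
    grp = f.create_group("Data")
    grp.create_dataset("mdisk", data=np.array([1e9, 5e10, 3e10, 1e8, 9e10]))
    grp.create_dataset("mcold", data=np.arange(5, dtype="f8"))

val = 2e10
with h5py.File(path, "r") as f:
    grp = f["Data"]
    mdisk = grp["mdisk"][()]        # reads all of mdisk into memory
    mask = mdisk > val              # boolean array with the dataset's shape

    # Form 1: boolean mask of the same shape as the dataset.
    by_mask = grp["mcold"][mask]

    # Form 2: plain list of indices, sorted and without repeats
    # (h5py expects index lists in increasing order).
    idx = np.unique(np.where(mask)[0]).tolist()
    by_index = grp["mcold"][idx]

print(by_mask)
print(by_index)
```

Both selections return the same three elements here. Note that `mdisk` is still read in full to build the mask; only the `mcold` read is selective.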

My simple example doesn't test how much of the array is loaded. The documentation sounds as though elements are selected from the file without loading the whole array into memory.
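If memory is the main concern, one option is to avoid ever holding a full dataset at once by walking the file in fixed-size slices. The following is only a sketch under assumptions: the function name `select_rows_blockwise` and the `block` parameter are hypothetical, the demo file and its values are invented, and I have not measured actual memory use. It relies only on plain slice reads, which fetch just the requested range.

```python
import os
import tempfile

import h5py
import numpy as np


def select_rows_blockwise(path, threshold, block=2):
    """Return mcold values where mdisk > threshold, reading `block` rows at a time.

    Hypothetical helper: at most `block` rows of each dataset are held
    in memory per iteration, plus the rows already selected.
    """
    selected = []
    with h5py.File(path, "r") as f:
        mdisk = f["Data/mdisk"]
        mcold = f["Data/mcold"]
        n = mdisk.shape[0]
        dtype = mcold.dtype
        for start in range(0, n, block):
            stop = min(start + block, n)
            keep = mdisk[start:stop] > threshold   # slice read, then in-memory test
            if keep.any():
                # Read the matching slice of mcold and filter it with NumPy.
                selected.append(mcold[start:stop][keep])
    if not selected:
        return np.empty(0, dtype=dtype)
    return np.concatenate(selected)


# Throwaway demo file with invented values.
path = os.path.join(tempfile.mkdtemp(), "demo_blocks.hdf5")
with h5py.File(path, "w") as f:
    grp = f.create_group("Data")
    grp.create_dataset("mdisk", data=np.array([1e9, 5e10, 3e10, 1e8, 9e10]))
    grp.create_dataset("mcold", data=np.arange(5, dtype="f8"))

rows = select_rows_blockwise(path, threshold=2e10, block=2)
print(rows)
```

A real file would want a much larger `block`, ideally aligned with the dataset's chunk size if it is chunked; the tiny value here just exercises the loop.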

hpaulj
  • Yes! Thanks. It was actually a problem of matching array dimensions. In the above example code it was enough to replace the where statement with: `ind = (mdisk>val)` – VGP Feb 10 '15 at 10:24
  • Of course, if you convert your h5 file into an array, it is easy to select rows. But can we remove rows without creating an array? In my case that would be really useful because I can't load the whole array into memory, so I want to extract rows directly from the h5 file. Thanks a lot – Tbertin Sep 04 '18 at 10:07
  • @Tbertin, my `data` is the dataset, not the loaded array. So I do show how to load selected rows. Slice indexing also works. – hpaulj Sep 04 '18 at 14:11
  • Even if `data` is a dataset, as soon as you write `data[index]` you're creating an array and loading all the selected rows into memory – Tbertin Sep 04 '18 at 14:29
  • @Tbertin, so by `remove` and `extract` you mean change the data on the file itself? If so you need to look at the underlying `HDF5` code, not the python interface. – hpaulj Sep 04 '18 at 14:35