
I am looking for a fast way to turn my collection of hdf files into a numpy array where each row is a flattened version of an image. What I mean exactly:

My hdf files store, besides other information, images per frame. Each file holds 51 frames of 512x424 images. Now I have 300+ hdf files and I want the image pixels to be stored as one single vector per frame, with all frames of all files stored in one numpy ndarray. The following picture should help to understand:

Visualized process of transforming many hdf files to one numpy array

What I have so far is a very slow method, and I have no idea how to make it faster. I think the problem is that my final array is rebuilt too often: I observe that the first files are loaded into the array very fast, but the speed drops quickly afterwards (observed by printing the number of the current hdf file).

My current code:

import os
import glob

import h5py
import numpy as np

os.chdir(os.getcwd()+"\\datasets")

# predefine first row to use vstack later
numpy_data = np.ndarray((1,217088))

# search for all .hdf files
for idx, file in enumerate(glob.glob("*.hdf5")):
  f = h5py.File(file, 'r')
  # load all img data to imgs (=ndarray, but not flattened)
  imgs = f['img']['data'][:]

  # iterate over all frames (51)
  for frame in range(0, imgs.shape[0]):
    print("processing {}/{} (file/frame)".format(idx+1, frame+1))
    data = np.array(imgs[frame].flatten())
    numpy_data = np.vstack((numpy_data, data))

    # delete the predefined first row once a real row is stored
    if idx == 0 and frame == 0:
        numpy_data = np.delete(numpy_data, 0, 0)

  f.close()

For further information: I need this for learning a decision tree. Since my hdf data is bigger than my RAM, I think converting it into a numpy array will save memory and is therefore better suited.

Thanks for every input.

  • Does your algorithm need more than one frame at a time? I'm guessing that the speed decrease comes from all the calls to vstack and you may not need to do anything like that. – Elliot Mar 29 '17 at 14:47
  • Also, I'm not sure what's going on with the `if idx == 0 and frame == 0:` condition. I think you're just getting a 0x217088 element array out of it. – Elliot Mar 29 '17 at 14:49
  • I am going to use random forests, which use the whole feature space, unfortunately. Maybe there is another option for how to feed them with scikit-learn, but I am not aware of one. – mrks Mar 30 '17 at 06:11
  • @Elliot the mentioned line is there to remove the first, randomly initialized row. – mrks Mar 30 '17 at 06:11

2 Answers


I don't think you need to iterate over

imgs = f['img']['data'][:]

and reshape each 2d array. Just reshape the whole thing. If I understand your description right, imgs is a 3d array: (51, 512, 424)

imgs.reshape(51, 512*424)

should be the 2d equivalent.

If you must loop, don't use vstack (or a variant) to build a bigger array incrementally. For one, it is slow, and for another it's a pain to clean up the initial 'dummy' entry. Use list appends instead, and do the stacking once, at the end:

alist = []
for frame in range(imgs.shape[0]):
    data = imgs[frame].flatten()
    alist.append(data)
data_array = np.vstack(alist)

vstack (and family) takes a list of arrays as input, so it can work with many at once. List append is much faster when done iteratively.
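
Putting the two ideas together, a minimal sketch of the whole load (assuming every file exposes the same ['img']['data'] layout as in your question) could look like this:

import glob
import h5py
import numpy as np

alist = []
for fname in glob.glob("*.hdf5"):
    with h5py.File(fname, 'r') as f:
        imgs = f['img']['data'][:]                 # shape (51, 512, 424)
    alist.append(imgs.reshape(imgs.shape[0], -1))  # one (51, 217088) block per file

data_array = np.vstack(alist)                      # single concatenation at the end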

I question whether putting everything into one array will help. I don't know exactly how the size of an hdf5 file relates to the size of the loaded array, but I expect they are in the same order of magnitude. So trying to load all 300 files into memory might not work. That's what, 3G of pixels?

For an individual file, h5py has provision for loading chunks of an array that is too large to fit in memory. That suggests the problem often goes the other way: the file holds more than fits in memory.

Is it possible to load large data directly into numpy int8 array using h5py?
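
As an illustration, slicing an h5py dataset only reads the requested part from disk; a sketch, with the filename as a placeholder and assuming the 'img'/'data' layout above:

import h5py

with h5py.File("some_file.hdf5", 'r') as f:
    dset = f['img']['data']     # just a handle, nothing loaded yet
    first_ten = dset[0:10]      # reads only frames 0..9 from disk
    rest = dset[10:]            # reads the remaining frames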

hpaulj

Do you really want to load all images into RAM and not use a single HDF5 file instead? Accessing an HDF5 file can be quite fast if you don't make any mistakes (unnecessary fancy indexing, improper chunk-cache-size). If you want the numpy way, this would be a possibility:

import os
import glob

import h5py
import numpy as np

os.chdir(os.getcwd()+"\\datasets")
img_per_file = 51

# get all HDF5 files
files = []
for idx, file in enumerate(glob.glob("*.hdf5")):
    files.append(file)

# allocate memory for your final array (change the datatype if your images have some other type)
numpy_data = np.empty((len(files)*img_per_file, 217088), dtype=np.uint8)

# now read all the data
ii = 0
for i in range(0, len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    numpy_data[ii:ii+img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file

Writing your data to a single HDF5 file would be quite similar:

f_out = h5py.File(File_Name_HDF5_out, 'w')
# create the dataset (change the datatype if your images have some other type)
dset_out = f_out.create_dataset(Dataset_Name_out, (len(files)*img_per_file, 217088),
                                chunks=(1, 217088), dtype='uint8')

# now read all the data
ii = 0
for i in range(0, len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    dset_out[ii:ii+img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file

f_out.close()

If you only want to access whole images afterwards, the chunk size should be okay. If not, you have to adjust it to your needs.
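
Reading single images back from that combined file then only loads the rows you slice; a sketch using the names from above, keeping both dimensions in the slice as recommended in the indexing note below:

with h5py.File(File_Name_HDF5_out, 'r') as f_out:
    dset = f_out[Dataset_Name_out]
    one_img = dset[1234:1235, :]          # shape (1, 217088): one flattened image
    block = dset[0:img_per_file, :]       # all frames of the first file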

What you should do when accessing a HDF5-File:

  • Use a chunk size which fits your needs.

  • Set a proper chunk-cache-size. This can be done with the h5py low-level API or h5py_cache (https://pypi.python.org/pypi/h5py-cache/1.0); a sketch follows after this list.

  • Avoid any type of fancy indexing. If your dataset has n dimensions, access it in a way that the returned array also has n dimensions, for example:

    # chunk size is (50, 50) and we iterate over the first dimension
    numpyArray = h5_dset[i, :]                  # slow
    numpyArray = np.squeeze(h5_dset[i:i+1, :])  # does the same but is much faster
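
For the chunk-cache point in the list above, a sketch using the h5py low-level API; the cache size and filename are illustrative only:

import h5py

# request a 1 GiB raw-data chunk cache through a file-access property list
fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
cache = list(fapl.get_cache())    # (mdc_nelmts, rdcc_nslots, rdcc_nbytes, rdcc_w0)
cache[2] = 1024**3                # rdcc_nbytes: chunk cache size in bytes
fapl.set_cache(*cache)

fid = h5py.h5f.open(b"Your_Data.hdf5", h5py.h5f.ACC_RDONLY, fapl)
f = h5py.File(fid)                # a regular h5py.File on top of the tuned file id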
    

EDIT This shows how to read your data into a memory-mapped numpy array. I think your method expects data in np.float32 format. https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html#numpy.memmap

numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='w+', shape=(len(files)*img_per_file, 217088))

Everything else could be kept the same. If it works, I would also recommend using an SSD instead of a hard disk.
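
If scikit-learn accepts the memory-mapped array directly (I have not verified this for data of your size), fitting could look roughly like the following hypothetical sketch, where y is only a placeholder for your targets:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

n_rows = len(files) * img_per_file
X = np.memmap('Your_Data.npy', dtype=np.float32, mode='r', shape=(n_rows, 217088))
y = np.zeros(n_rows, dtype=np.float32)   # placeholder targets, one per frame

reg = RandomForestRegressor(n_estimators=10)
reg.fit(X, y)                            # note: the forest may still copy data internally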

max9111
  • I am going to use random forests/decision trees with this data and I found out that these methods need the whole data at once. That is why I think I cannot go with the chunked version. Or do I misunderstand how chunking the hdf files works? – mrks Mar 30 '17 at 06:29
  • Ok, is my first suggestion (only reading the data into a numpy array) working for you? – max9111 Mar 30 '17 at 10:05
  • Works well for the purpose I asked for, but I do not know how I will feed my data to the learning algorithm (decision trees). It reduced my dataset from 26GB to ~3GB in numpy binary format; since this was just a subset of my actual dataset, which is ~20 times bigger, I do not know how to handle this without going out of core memory. – mrks Mar 30 '17 at 12:55
  • You are using this method? http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.apply It expects an array-like matrix. Maybe it accepts a memmapped numpy array or a dask array, and hopefully it will not make a copy of large parts of the data internally. – max9111 Mar 30 '17 at 13:22
  • Kind of, I use the regressor one (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). I will try your suggestions. – mrks Mar 30 '17 at 13:29
  • I will edit my answer to show how to create an np.float32 memmapped array. – max9111 Mar 30 '17 at 13:37