I need to share a large dataset from an HDF5 file between multiple processes, and for various reasons mmap is not an option.
So I read it into a NumPy array and then copy that array into shared memory, like this:
import numpy as np
import h5py
from multiprocessing import shared_memory

dataset = h5py.File(args.input, 'r')['data']

# Allocate a shared-memory block large enough for the whole dataset
shm = shared_memory.SharedMemory(
    name=memory_label,
    create=True,
    size=dataset.nbytes
)

# Wrap the shared buffer in an ndarray and copy the data into it
shared_tracemap = np.ndarray(dataset.shape, dtype=dataset.dtype, buffer=shm.buf)
shared_tracemap[:] = dataset[:]
But this approach temporarily doubles the required memory, because dataset[:] materializes a full in-memory copy of the data before it is written into the shared buffer. Is there a way to read the dataset directly into the SharedMemory block?
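The only workaround I can think of is copying in chunks so that the temporary never holds more than one slice at a time. Roughly like this, continuing from the snippet above (just a sketch; the chunk size along the first axis is arbitrary):

CHUNK = 1024  # rows copied per step, chosen arbitrarily
for start in range(0, dataset.shape[0], CHUNK):
    stop = min(start + CHUNK, dataset.shape[0])
    # each slice read creates only a small temporary before landing in shared memory
    shared_tracemap[start:stop] = dataset[start:stop]

But that still goes through intermediate buffers and adds bookkeeping, so I would prefer a way to have h5py write straight into shm.buf.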