I am currently working with .h5 files. Each file contains several tables, which I have to process (row filtering and other basic operations). Then, as one of the steps, I have to compute an integral for each row, which takes two of the columns as input. A simplified version of the code looks something like this:
# Inside an object method
# Function to apply; I know it is not vectorized yet
def compute_alpha_val(row):
    weight, degree = row["norm_weight"], row["degree"]
    if degree == 1:
        return 1
    func = lambda x: (1 - x) ** (degree - 2)
    alpha = 1 - (degree - 1) * scipy.integrate.quad(func, 0, weight)[0]
    return round_half_up(alpha, 4)

for chunk in table_chunks:  # Generator of pd.DataFrames from the table stored in the h5 file
    # Do some operations
    alphas = chunk.apply(compute_alpha_val, axis=1)          # Works
    alphas = chunk.swifter.apply(compute_alpha_val, axis=1)  # Does not work
    # Do stuff with alphas
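For reference, here is a self-contained version of the per-row function that can be timed and tested in isolation from the h5 file (the `round_half_up` helper below is a stand-in of my own, since the real one is defined elsewhere):

```python
import numpy as np
import pandas as pd
import scipy.integrate

def round_half_up(x, ndigits):
    # Stand-in for the user-defined helper: round half away from zero
    factor = 10 ** ndigits
    return np.floor(x * factor + 0.5) / factor

def compute_alpha_val(row):
    weight, degree = row["norm_weight"], row["degree"]
    if degree == 1:
        return 1
    func = lambda x: (1 - x) ** (degree - 2)
    alpha = 1 - (degree - 1) * scipy.integrate.quad(func, 0, weight)[0]
    return round_half_up(alpha, 4)

# In-memory chunk, completely decoupled from any HDF5 handle
chunk = pd.DataFrame({"norm_weight": [0.2, 0.5, 0.9], "degree": [1, 3, 4]})
print(chunk.apply(compute_alpha_val, axis=1).tolist())
```

If the swifter call succeeds on an in-memory DataFrame like this one but fails on chunks read from the file, that would point at the file handle rather than at swifter itself.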
The normal apply is fairly slow (about 50 seconds per million rows) but works; the swifter one does not. Since the function is not vectorized, I know the swifter version would probably be slower anyway, but instead it throws a completely unrelated error:
BlockingIOError: [Errno 11] Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
which looks as if the code is trying to do multiprocessing on the h5 file itself, causing a lock that raises an error. This should not be the case, since the apply does not involve anything inside the file. Moreover, some print statements that I placed to monitor progress show strange stuttering-like behavior:
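If the lock really does come from HDF5's own file locking (present since HDF5 1.10), one way to test that hypothesis is to disable locking through the environment before h5py/PyTables is first imported. This is a diagnostic workaround rather than a fix, and it is only safe when nothing else writes to the file concurrently:

```python
import os

# Must be set before the HDF5 library is loaded (i.e. before the first
# import of h5py or tables); HDF5 >= 1.10 honors this variable.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
```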
# Beginning of the script
STARTING table15
Num rows to process 2018915
Starting to run: compute alphas
Starting chunk
Starting bin1_id
# Here the apply should happen
STARTING table15 # As if the script started from the beginning
STARTING table15
Num rows to process 2018915
Num rows to process 2018915
I checked that an actual DataFrame is being passed and not something else. I also checked that all possible connections to the h5 file are closed, even though that should not matter.
My hypothesis is that h5py somehow sees an open connection to the file and blocks the multiprocessing that swifter uses under the hood, to avoid file corruption. Any idea how to solve this? I am open to anything, as long as in the end I can vectorize the function and speed up this code: it has to run on more than 10 billion rows, and at the current speed that is far too slow.
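For what it's worth, this particular integral has a closed form: ∫₀^w (1 − x)^(d−2) dx = (1 − (1 − w)^(d−1)) / (d − 1), so alpha reduces to (1 − w)^(d−1) and the whole computation vectorizes with NumPy, skipping both quad and swifter entirely. A minimal sketch, assuming the same column names and approximating round_half_up with floor-based rounding (an assumption, since the real helper is defined elsewhere):

```python
import numpy as np
import pandas as pd

def compute_alpha_vectorized(chunk):
    # Closed form: integral = (1 - (1 - w)**(d - 1)) / (d - 1),
    # hence alpha = 1 - (d - 1) * integral = (1 - w)**(d - 1).
    weight = chunk["norm_weight"].to_numpy()
    degree = chunk["degree"].to_numpy()
    alpha = (1.0 - weight) ** (degree - 1)  # equals 1 when degree == 1
    # Round half up to 4 decimals (np.round would round half to even)
    return np.floor(alpha * 1e4 + 0.5) / 1e4

chunk = pd.DataFrame({"norm_weight": [0.2, 0.5, 0.9], "degree": [1, 3, 4]})
print(compute_alpha_vectorized(chunk))
```

On billions of rows this replaces a per-row quad call with a handful of array operations per chunk, which is where the real speedup would come from.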