I am currently working with .h5 files. Each file contains several tables, which I have to process (row filtering and other basic operations). Then, as one of the steps, I have to compute an integral for each row, which takes two of the columns as input. A simplified version of the code looks something like this:
# Inside an object method
# Function to apply; I know it is not vectorized yet
def compute_alpha_val(row):
    weight, degree = row["norm_weight"], row["degree"]
    if degree == 1:
        return 1
    func = lambda x: (1 - x) ** (degree - 2)
    alpha = 1 - (degree - 1) * scipy.integrate.quad(func, 0, weight)[0]
    return round_half_up(alpha, 4)

for chunk in table_chunks:  # Generator of pd.DataFrames from the table stored in the h5 file
    # Do some operations
    alphas = chunk.apply(compute_alpha_val, axis=1)          # Works
    alphas = chunk.swifter.apply(compute_alpha_val, axis=1)  # Does not work
    # Do stuff with alphas
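For reference, here is a self-contained version of the per-row function that can be timed and tested in isolation from the h5 file (the `round_half_up` helper below is a stand-in of my own, since the real one is defined elsewhere):

```python
import numpy as np
import pandas as pd
import scipy.integrate

def round_half_up(x, ndigits):
    # Stand-in for the user-defined helper: round half away from zero
    factor = 10 ** ndigits
    return np.floor(x * factor + 0.5) / factor

def compute_alpha_val(row):
    weight, degree = row["norm_weight"], row["degree"]
    if degree == 1:
        return 1
    func = lambda x: (1 - x) ** (degree - 2)
    alpha = 1 - (degree - 1) * scipy.integrate.quad(func, 0, weight)[0]
    return round_half_up(alpha, 4)

# In-memory chunk, completely decoupled from any HDF5 handle
chunk = pd.DataFrame({"norm_weight": [0.2, 0.5, 0.9], "degree": [1, 3, 4]})
print(chunk.apply(compute_alpha_val, axis=1).tolist())
```

If the swifter call succeeds on an in-memory DataFrame like this one but fails on chunks read from the file, that would point at the file handle rather than at swifter itself.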
The normal apply is fairly slow (about 50 seconds per million rows) but works; the swifter one does not. Since the function is not vectorized, I know the swifter version would probably be slower anyway, but instead it throws a completely unrelated error:
BlockingIOError: [Errno 11] Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
which looks as if the code is trying to do multiprocessing on the h5 file itself, causing a lock that raises an error. This should not be the case, since the apply does not involve anything inside the file. Moreover, some print statements that I placed to monitor progress show strange stuttering-like behavior:
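If the lock really does come from HDF5's own file locking (present since HDF5 1.10), one way to test that hypothesis is to disable locking through the environment before h5py/PyTables is first imported. This is a diagnostic workaround rather than a fix, and it is only safe when nothing else writes to the file concurrently:

```python
import os

# Must be set before the HDF5 library is loaded (i.e. before the first
# import of h5py or tables); HDF5 >= 1.10 honors this variable.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
```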
# Beginning of the script
STARTING table15
Num rows to process 2018915
Starting to run: compute alphas
Starting chunk
Starting bin1_id
# Here the apply should happen
STARTING table15 # As if the script started from the beginning
STARTING table15
Num rows to process 2018915
Num rows to process 2018915
I checked that an actual DataFrame is being passed and not something else. I also checked that all possible connections to the h5 file are closed, even though that should not matter.
My hypothesis is that h5py somehow sees an open connection to the file and blocks the multiprocessing that swifter uses under the hood, to avoid file corruption. Any idea how to solve this? I am open to anything, as long as in the end I can vectorize the function and speed up this code: it has to run on more than 10 billion rows, and at the current speed that is far too slow.
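For what it's worth, this particular integral has a closed form: ∫₀^w (1 − x)^(d−2) dx = (1 − (1 − w)^(d−1)) / (d − 1), so alpha reduces to (1 − w)^(d−1) and the whole computation vectorizes with NumPy, skipping both quad and swifter entirely. A minimal sketch, assuming the same column names and approximating round_half_up with floor-based rounding (an assumption, since the real helper is defined elsewhere):

```python
import numpy as np
import pandas as pd

def compute_alpha_vectorized(chunk):
    # Closed form: integral = (1 - (1 - w)**(d - 1)) / (d - 1),
    # hence alpha = 1 - (d - 1) * integral = (1 - w)**(d - 1).
    weight = chunk["norm_weight"].to_numpy()
    degree = chunk["degree"].to_numpy()
    alpha = (1.0 - weight) ** (degree - 1)  # equals 1 when degree == 1
    # Round half up to 4 decimals (np.round would round half to even)
    return np.floor(alpha * 1e4 + 0.5) / 1e4

chunk = pd.DataFrame({"norm_weight": [0.2, 0.5, 0.9], "degree": [1, 3, 4]})
print(compute_alpha_vectorized(chunk))
```

On billions of rows this replaces a per-row quad call with a handful of array operations per chunk, which is where the real speedup would come from.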