I have a huge custom text file (I can't load the entire data into one pandas dataframe), which I want to read into a Dask dataframe. I wrote a generator that reads and parses the data in chunks and yields pandas dataframes. I want to load these pandas dataframes into a Dask dataframe and perform operations on the resulting dataframe (things like creating calculated columns, extracting parts of the dataframe, plotting, etc.). I tried using a Dask bag but couldn't succeed. So I decided to write the resulting dataframes into an HDFStore and then use Dask to read from the HDF5 file. This worked well when I was doing it on my own computer. Code below.
cc = read_custom("demo.xyz", chunks=1000)  # generator of pandas dataframes

from pandas import HDFStore

s = HDFStore("demo.h5")
for c in cc:
    s.append("data", c, format='t', append=True)
s.close()

import dask.dataframe as dd

ddf = dd.read_hdf("demo.h5", "data", chunksize=100000)
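
# von Mises equivalent stress from the six stress components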
seqv = (
(
(ddf.sxx - ddf.syy) ** 2
+ (ddf.syy - ddf.szz) ** 2
+ (ddf.szz - ddf.sxx) ** 2
+ 6 * (ddf.sxy ** 2 + ddf.syz ** 2 + ddf.sxz ** 2)
)
/ 2
) ** 0.5
seqv.compute()
Since the last compute was slow, I decided to distribute it over a few systems on my LAN: I started a scheduler on my machine and a couple of workers on other systems, and fired up a Client as below.
from dask.distributed import Client
client = Client('mysystemip:8786') #Establishing connection with the scheduler all fine.
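For reference, the scheduler and workers were started with the standard Dask CLI commands, roughly like this (the scheduler address matches the Client call above):
# on my machine
dask-scheduler              # serves on port 8786 by default

# on each worker system
dask-worker mysystemip:8786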
I then read in the Dask dataframe the same way. However, I got the error below when I executed seqv.compute():
HDF5ExtError: HDF5 error back trace
File "H5F.c", line 509, in H5Fopen
unable to open file
File "H5Fint.c", line 1400, in H5F__open
unable to open file
File "H5Fint.c", line 1615, in H5F_open
unable to lock the file
File "H5FD.c", line 1640, in H5FD_lock
driver lock request failed
File "H5FDsec2.c", line 941, in H5FD_sec2_lock
unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'
End of HDF5 error back trace
Unable to open/create file 'demo.h5'
I have made sure that all workers have access to the demo.h5 file. I tried passing lock=False to read_hdf and got the same error.
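For completeness, that attempt looked like:
ddf = dd.read_hdf("demo.h5", "data", chunksize=100000, lock=False)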
Isn't this possible to do? Maybe I should try another file format? I guess writing each pandas dataframe to a separate file may work, but I'm trying to avoid that (I don't even want the intermediate HDF5 file). But before I go down that route, I'd like to know if there is a better approach to the problem.
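Ideally I'd build the Dask dataframe straight from the generator, e.g. with dask.dataframe.from_delayed, but as far as I can tell naively wrapping the yielded chunks keeps them all in memory, which is the original problem. A rough sketch of what I mean:
import dask
import dask.dataframe as dd

# Sketch only: each chunk yielded by the generator is a concrete pandas
# dataframe, so dask.delayed just wraps an in-memory object. Iterating
# the whole generator here loads everything into RAM at once.
parts = [dask.delayed(c) for c in read_custom("demo.xyz", chunks=1000)]
ddf = dd.from_delayed(parts)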
Thanks for any suggestions!