
I have a huge custom text file (I can't load the entire data into one pandas DataFrame), which I want to read into a Dask dataframe. I wrote a generator that reads and parses the data in chunks and yields pandas DataFrames. I want to load these pandas DataFrames into a Dask dataframe and perform operations on the result (things like creating calculated columns, extracting parts of the dataframe, plotting, etc.). I tried using a Dask bag but couldn't get it to work, so I decided to write the data into an HDFStore and then use Dask to read from the HDF5 file. This worked well when I was doing everything on my own computer. Code below.

from pandas import HDFStore

cc = read_custom("demo.xyz", chunks=1000)  # generator of pandas DataFrames

s = HDFStore("demo.h5")
for c in cc:
    s.append("data", c, format='t', append=True)
s.close()

import dask.dataframe as dd
ddf = dd.read_hdf("demo.h5", "data", chunksize=100000)
# von Mises equivalent stress from the six stress components
seqv = (
    (
        (ddf.sxx - ddf.syy) ** 2
        + (ddf.syy - ddf.szz) ** 2
        + (ddf.szz - ddf.sxx) ** 2
        + 6 * (ddf.sxy ** 2 + ddf.syz ** 2 + ddf.sxz ** 2)
    )
    / 2
) ** 0.5
seqv.compute()

Since the last compute was slow, I decided to distribute it over a few systems on my LAN. I started a scheduler on my machine and a couple of workers on other systems, and fired up a Client as below.

from dask.distributed import Client
client = Client('mysystemip:8786')  # connection to the scheduler is established fine
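
As a quick sanity check, the workers do show up on the scheduler (a minimal sketch using Client.scheduler_info()):

# List the workers currently registered with the scheduler
for addr in client.scheduler_info()["workers"]:
    print(addr)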

I then read in the Dask dataframe as before. However, when I executed seqv.compute(), I got the error below.

HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 509, in H5Fopen
    unable to open file
  File "H5Fint.c", line 1400, in H5F__open
    unable to open file
  File "H5Fint.c", line 1615, in H5F_open
    unable to lock the file
  File "H5FD.c", line 1640, in H5FD_lock
    driver lock request failed
  File "H5FDsec2.c", line 941, in H5FD_sec2_lock
    unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'

End of HDF5 error back trace

Unable to open/create file 'demo.h5'

I have made sure that all workers have access to the demo.h5 file. I also tried passing lock=False to read_hdf and got the same error.
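
For reference, this is the call I tried (it produced the same traceback as above):

# Same HDF5 lock error on the remote workers even with locking disabled
ddf = dd.read_hdf("demo.h5", "data", chunksize=100000, lock=False)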

Is this not possible to do? Should I maybe try another file format? I guess writing each pandas DataFrame to a separate file may work, but I'm trying to avoid that (I don't even want the intermediate HDF5 file). Before I go down that route, though, I'd like to know whether there is a better approach to the problem.
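
For what it's worth, the separate-files route I'm trying to avoid would look roughly like this (a sketch using Parquet; the paths and the pyarrow/fastparquet dependency are my assumptions):

import os
import dask.dataframe as dd

# Write each pandas chunk to its own Parquet file; workers can then read
# the parts in parallel without HDF5 file locks.
os.makedirs("parts", exist_ok=True)
for i, c in enumerate(read_custom("demo.xyz", chunks=1000)):
    c.to_parquet(f"parts/part-{i:05d}.parquet")

ddf = dd.read_parquet("parts/part-*.parquet")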

Thanks for any suggestions!

najeem

1 Answer


If you want to read data in a custom format from a text file, I recommend using the dask.bytes.read_bytes function. It returns a sample of the data together with a list of delayed objects, each of which points to a block of bytes from your file. Those blocks are cleanly separated on a line delimiter by default.

Something like this might work:

import dask
import dask.bytes
import dask.dataframe
import pandas

def parse_bytes(b: bytes) -> pandas.DataFrame:
    ...

# read_bytes returns a sample plus one list of delayed blocks per file
sample, blocks = dask.bytes.read_bytes("my-file.txt", delimiter=b"\n")
dataframes = [dask.delayed(parse_bytes)(block) for block in blocks[0]]
df = dask.dataframe.from_delayed(dataframes)
MRocklin
  • Thanks for the suggestion. The file format that I'm working with has a few lines of header data. I search for a particular text block to identify where the actual useful data starts. Can I use this method to read such a file? I see that there is a `not_zero` parameter in the `read_bytes` method to discard the header. Can I use it to discard multiple lines? If so, I can do an advance pass on the file to identify how many lines/bytes I need to skip before I get to the actual data. – najeem Jan 26 '20 at 20:33
  • You can probably use that block of text as a delimiter if you want. You will be guaranteed that Dask always splits on that text. I recommend reading the read_bytes docstring. – MRocklin Jan 30 '20 at 00:23
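
A minimal sketch of that suggestion, assuming the data section starts after a marker line (the marker text BEGIN_DATA is hypothetical):

import dask.bytes

# Split on the (hypothetical) marker that precedes the data section; the
# first delayed block then contains only the header and can be skipped.
sample, blocks = dask.bytes.read_bytes("my-file.txt", delimiter=b"BEGIN_DATA\n")
data_blocks = blocks[0][1:]  # blocks[0] holds the delayed blocks for the file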