
I currently read a custom binary file into a dask.bag using a generator and dask.delayed:

import struct

import dask
import dask.bag

@dask.delayed
def get_entry_from_binary(file_name, chunk_size=8+4+4):
    with open(file_name, "rb") as f:
        while (entry := f.read(chunk_size)):
            yield dict(zip(("col1","col2","col3"), struct.unpack("qfi", entry)))

entries_bag = dask.bag.from_delayed(get_entry_from_binary(file_name))

However, contrary to what I expected, the whole file is read by a single worker, even when several are available; the other workers just sit idle.

I noticed this by looking at the dashboard.
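
For reference, I trigger the read with something along these lines (the Client arguments and the count() call here are only an illustrative assumption, not my exact setup):

from dask.distributed import Client

client = Client(n_workers=4)           # several workers, dashboard at client.dashboard_link
print(entries_bag.count().compute())   # the dashboard shows only one worker busy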

How can I read the file in parallel using the available workers?

BBG

1 Answer


Even though your delayed function is a generator, it still becomes a single task producing a single output (the whole generator is consumed into one list). In general, you cannot parallelise access through a single open file handle anyway, because it carries internal state (the current position). What you should do instead is write a function that reads from an arbitrary offset in the file, and build delayed tasks out of a set of offsets.

Something like this:

@dask.delayed
def get_entry_from_binary(file_name, offset, count, chunk_size=8+4+4):
    # each task opens the file itself and reads only its own slice
    with open(file_name, "rb") as f:
        f.seek(offset)
        entries = []
        for _ in range(count):
            raw = f.read(chunk_size)
            if len(raw) < chunk_size:   # the last partition may be short
                break
            entries.append(dict(zip(("col1","col2","col3"), struct.unpack("qfi", raw))))
        return entries

import os

chunk_size = 8 + 4 + 4                     # bytes per "qfi" entry
entries_per_part = 500
size_of_file = os.path.getsize(file_name)

entries_bag = dask.bag.from_delayed(
    [get_entry_from_binary(file_name, o, entries_per_part, chunk_size)
     for o in range(0, size_of_file, entries_per_part*chunk_size)]
)
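
Since each partition is now its own delayed task, any downstream computation should spread across the workers. For example (just an illustrative follow-up, assuming a dask.distributed Client is already connected):

df = entries_bag.to_dataframe()        # one dataframe partition per delayed call
print(df["col2"].mean().compute())     # this work is distributed over the workers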
mdurant