I currently read a custom binary file into a dask.bag using a generator and dask.delayed:
import struct

import dask
import dask.bag

@dask.delayed
def get_entry_from_binary(file_name, chunk_size=8+4+4):
    with open(file_name, "rb") as f:
        while (entry := f.read(chunk_size)):
            yield dict(zip(("col1", "col2", "col3"), struct.unpack("qfi", entry)))

entries_bag = dask.bag.from_delayed(get_entry_from_binary(file_name))
However, contrary to what I expected, the whole file is read by a single worker while the others just sit idle. I can see this happening on the dashboard.
How can I read the file in parallel using the available workers?
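I imagine the answer involves splitting the file into byte ranges up front and giving each range to its own delayed task, so that `from_delayed` receives several partitions instead of one. A rough sketch of what I have in mind (the names `read_range` and `bag_from_binary` are my own; I'm assuming the file contains nothing but fixed-size records):

```python
import os
import struct

import dask
import dask.bag

RECORD = struct.Struct("qfi")  # 8 + 4 + 4 = 16 bytes per entry

@dask.delayed
def read_range(file_name, start, stop):
    # Each task opens the file itself and reads only its own byte range,
    # so the tasks can run on different workers independently.
    with open(file_name, "rb") as f:
        f.seek(start)
        data = f.read(stop - start)
    return [
        dict(zip(("col1", "col2", "col3"), RECORD.unpack_from(data, off)))
        for off in range(0, len(data), RECORD.size)
    ]

def bag_from_binary(file_name, n_partitions=8):
    # Compute record-aligned byte ranges covering the whole file.
    n_records = os.path.getsize(file_name) // RECORD.size
    per_part = -(-n_records // n_partitions)  # ceiling division
    ranges = [
        (i * per_part * RECORD.size,
         min((i + 1) * per_part, n_records) * RECORD.size)
        for i in range(n_partitions)
    ]
    parts = [read_range(file_name, a, b) for a, b in ranges if a < b]
    return dask.bag.from_delayed(parts)
```

Is this the right direction, or does dask have a built-in way to partition a binary file like this?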