I have a fairly long chained dask pipeline, and one of the last steps is a series of dask.dataframe.from_delayed
calls like the one below. That line is extremely slow: many minutes per call. It takes 1-2 hours just to set up the pipeline.
When I debug the problem, I pull out the relevant code and pass in arrays with the same shape, and it runs instantly.
Is this because my real-life pipeline has an upstream graph that the call is contending with? My plan is to split the pipeline in two and see if that solves it. Is there anything else that could be going on here?
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
image = da.zeros((100, 8192, 8192), chunks=(100,256,256))
labels = da.zeros((100, 8192, 8192), chunks=(100,256,256))
image_chunks = image.to_delayed().ravel()
labels_chunks = labels.to_delayed().ravel()
results = []
for image_chunk, labels_chunk in zip(image_chunks, labels_chunks):
    offsets = np.array(image_chunk.key[1:]) * np.array(image.chunksize)
    result = dask.delayed(lambda x,y,z: None)(image_chunk, labels_chunk, offsets)
    results.append(result)
df_meta = pd.DataFrame(columns=['a', 'b'], dtype=np.float64)
df_meta = df_meta.astype({'a': np.int64})
# This line takes forever in actual use, but is instantaneous in this example.
df = dd.from_delayed(results, meta=df_meta)