
I have a long-ish chained dask pipeline, and one of the last steps is a series of dask.dataframe.from_delayed calls like the one below. That line is extremely slow - many minutes per call - so it takes 1-2 hours just to set up the pipeline.

When I debug the problem by pulling out the relevant code and passing in arrays with the same shape, it runs instantly.

Is this because my real-life pipeline has a large upstream graph that it's contending with? My plan is to split the pipeline in two and see if that solves it (roughly what I mean is sketched after the example below). Is there anything else that could be going on here?

import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

image = da.zeros((100, 8192, 8192), chunks=(100, 256, 256))
labels = da.zeros((100, 8192, 8192), chunks=(100, 256, 256))

image_chunks = image.to_delayed().ravel()
labels_chunks = labels.to_delayed().ravel()

results = []
for image_chunk, labels_chunk in zip(image_chunks, labels_chunks):
    # Element offset of this chunk within the full array
    offsets = np.array(image_chunk.key[1:]) * np.array(image.chunksize)
    # Placeholder for the real per-chunk computation
    result = dask.delayed(lambda x, y, z: None)(image_chunk, labels_chunk, offsets)
    results.append(result)

# Empty frame describing the output columns and dtypes for from_delayed
df_meta = pd.DataFrame(columns=['a', 'b'], dtype=np.float64)
df_meta = df_meta.astype({'a': np.int64})

# This line takes forever in actual use, but is instantaneous in this example.
df = dd.from_delayed(results, meta=df_meta)  
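
For concreteness, this is roughly what I mean by splitting the pipeline in two: checkpoint the upstream arrays with persist() before calling to_delayed, so the delayed chunks reference materialized blocks instead of dragging the full upstream graph along. The Client below is an assumption on my part (I run on the distributed scheduler), and I haven't verified yet that this actually fixes the slowdown.

from dask.distributed import Client

client = Client()  # or connect to an existing cluster

# Checkpoint the upstream work; the persisted arrays need to fit in
# (cluster) memory.
image = image.persist()
labels = labels.persist()

# The rest of the setup (to_delayed, the loop, dd.from_delayed) would then
# run against the persisted arrays instead of the full upstream graph.
image_chunks = image.to_delayed().ravel()
labels_chunks = labels.to_delayed().ravel()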

HoosierDaddy

1 Answer


The code that you have posted works great for me (as you predicted). Without knowing more, I don't know how to help. In your situation I would slowly add parts of your actual pipeline back in and see when things get slow; that should help you isolate the problem.
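
If it helps, one rough way to do that bisection is to time each stage of graph construction and look at task counts as you add pieces back. The timed helper below is just an illustration (the name is made up, not anything from dask itself), run against the names from your example.

import time

def timed(label, fn, *args, **kwargs):
    # Report how long a single setup step takes
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.1f}s")
    return out

# The task count in the upstream graph is often a good proxy for why
# graph-construction steps like from_delayed slow down.
print("upstream tasks:", len(image.__dask_graph__()))
df = timed("from_delayed", dd.from_delayed, results, meta=df_meta)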

MRocklin