
I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found that a combination of da.overlap.overlap and to_delayed().ravel() gets the job done, as long as I pass in the relevant block key and chunk information.

Edit: Thanks to @AnnaM, who caught bugs in the original post and then generalized the code! Building off her comments, I'm including an updated version below. Also, in response to Anna's interest in memory usage, I verified that this does not seem to take up more memory than naively expected.
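For intuition about the block key mentioned above, here is a toy sketch (separate from the pipeline below; the exact task name in the key varies): each delayed block's key is a tuple whose trailing entries are the block indices, which is what the offset computation below relies on.

import numpy as np
import dask.array as da

x = da.from_array(np.zeros((4, 4)), chunks=2)
for block in x.to_delayed().ravel():
    print(block.key)   # e.g. ('array-<token>', 0, 0), ('array-<token>', 0, 1), ...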

import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

def extract_features_generalized(chunk, offsets, depth, columns):
    shape = np.asarray(chunk.shape)
    offsets = np.asarray(offsets)
    depth = np.asarray(depth)
    # Local coordinates of nonzero voxels in this (overlapped) chunk.
    coordinates = np.stack(np.nonzero(chunk)).T
    # Keep only points inside the chunk proper, not in its overlap margins.
    keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
    # Shift local coordinates into global coordinates.
    data = coordinates + offsets - depth
    df = pd.DataFrame(data=data, columns=columns)
    return df[keep]

def my_overlap_generalized(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)

    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        # block.key is (task-name, i, j, ...); the trailing block indices times
        # the chunk size give the global offset of this block.
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)

    return dd.from_delayed(dfs)

data = np.zeros((2,4,8,16,16))  # round, channel, z, y, x
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1

arr = da.from_array(data)
df = my_overlap_generalized(arr, 
                            chunksize=(-1,-1,-1,8,8), 
                            depth=(0,0,0,2,2), 
                            columns=['r', 'c', 'z', 'y', 'x'],
                            boundary=tuple(['reflect']*5))
df.compute().reset_index()
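Aside: if relying on the low-level block.key ever feels too fragile, here is an untested sketch (my_overlap_no_keys is just a placeholder name) of the same loop that takes block indices from np.ndindex over the delayed-block array and derives offsets from cumulative chunk sizes, so it would also handle non-uniform chunks:

def my_overlap_no_keys(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)

    # Start position of each block along each axis, from the original (non-overlapped) chunks.
    starts = [np.concatenate(([0], np.cumsum(c)[:-1])) for c in data.chunks]

    blocks = data_overlapping_chunks.to_delayed()  # numpy object array indexed by block index
    dfs = []
    for idx in np.ndindex(blocks.shape):
        offsets = np.array([starts[axis][i] for axis, i in enumerate(idx)])
        dfs.append(dask.delayed(extract_features_generalized)(
            blocks[idx], offsets=offsets, depth=depth, columns=columns))

    return dd.from_delayed(dfs)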

-- Remainder of original post, including original bugs --

My example only does xy overlaps, but it's easy to generalize. Is there anything below that is suboptimal or could be done better? Is anything likely to break because it's relying on low-level information that could change (e.g. block key)?

def my_overlap(data, chunk_xy, depth_xy):
    data = data.rechunk((-1,-1,-1, chunk_xy, chunk_xy))
    data_overlapping_chunks = da.overlap.overlap(data, 
                                                 depth=(0,0,0,depth_xy,depth_xy), 
                                                 boundary={3: 'reflect', 4: 'reflect'})

    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
        dfs.append(df_block)

    # All computation is delayed, so downstream computations need to know the format of the data.
    # If the meta information is not specified, a single computation will be done (which could be
    # expensive) at this point to infer the metadata.
    # This empty dataframe has the index, column, and type information we expect in the computation.
    columns = ['r', 'c', 'z', 'y', 'x']

    # The dtypes are float64, except for a small number of columns
    df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
    df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
    df_meta.index.name = 'feature'

    return dd.from_delayed(dfs, meta=df_meta)

def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk) 
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y+offsets[3]-depth_xy, 
                       'x': x+offsets[4]-depth_xy})
    df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
    return df

data = np.zeros((2,4,8,16,16))  # round, channel, z, y, x
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()

1 Answer


First of all, thanks for posting your code. I am working on a similar problem and this was really helpful for me.

When testing your code, I discovered a few mistakes in the extract_features function that prevent your code from returning correct indices.

Here is a corrected version:

def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
    # Filter in local (chunk) coordinates first ...
    df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
    # ... then shift the surviving rows into global coordinates.
    df['y'] = df['y'] + offsets[3] - depth_xy
    df['x'] = df['x'] + offsets[4] - depth_xy
    return df

The updated code now returns the indices that were set to 1:

   index  r  c  z  y  x
0      0  0  0  4  2  2
1      1  0  1  4  6  2
2      2  0  3  4  2  2
3      1  1  2  4  8  2

For comparison, this is the output of the original version:

   index  r  c  z  y  x
0      1  0  1  4  6  2
1      3  1  2  4  8  2
2      0  0  1  4  6  2
3      1  1  2  4  8  2

It returns only the second and fourth of the expected rows, twice each.

This happens because of three mistakes in the extract_features function:

  1. You first add the offset and subtract the depth, and only then filter out the overlapping parts: the order needs to be swapped (see the short 1-D sketch right after this list)
  2. df.y > depth_xy should be replaced with df.y >= depth_xy
  3. df.z should be replaced with df.x, since it is the x dimension that has an overlap
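
To make mistake 1 concrete, here is a tiny 1-D sketch with made-up numbers (a chunk of length 12 = 8 plus a depth-2 overlap on each side, whose data starts at global offset 8):

import numpy as np

depth, offset, chunk_len = 2, 8, 12
local = np.array([1, 2, 5, 10])                  # nonzero positions inside the overlapped chunk

# Correct order: filter in local coordinates, then shift to global coordinates.
keep = (local >= depth) & (local < chunk_len - depth)
print(local[keep] + offset - depth)              # [ 8 11]

# Buggy order: shift first, then apply the same local bounds.
shifted = local + offset - depth
print(shifted[(shifted > depth) & (shifted < chunk_len - depth)])  # [7 8]: keeps a duplicate (7), loses a real row (11)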

To take this a step further, here is a generalized version of the code that works for an arbitrary number of dimensions:

def extract_features_generalized(chunk, offsets, depth, columns):
    coordinates = np.nonzero(chunk)
    df = pd.DataFrame()
    # Accumulate a keep-mask: a point survives only if, along every axis, it lies
    # inside the chunk proper (outside the overlap margins).
    rows_to_keep = np.ones(len(coordinates[0]), dtype=int)
    for i in range(len(columns)):
        df[columns[i]] = coordinates[i]
        rows_to_keep = rows_to_keep * np.array((df[columns[i]] >= depth[i])) * \
                       np.array((df[columns[i]] < (chunk.shape[i] - depth[i])))
        # Shift local coordinates into global coordinates (after the mask is computed).
        df[columns[i]] = df[columns[i]] + offsets[i] - depth[i]
    del coordinates
    return df[rows_to_keep > 0]

def my_overlap_generalized(data, chunksize, depth, columns):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, 
                                                 boundary=tuple(['reflect']*len(columns)))

    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets, 
                                                              depth=depth, columns=columns)
        dfs.append(df_block)

    return dd.from_delayed(dfs)

data = np.zeros((2,4,8,16,16))  
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1

arr = da.from_array(data)
df = my_overlap_generalized(arr, chunksize=(-1,-1,-1,8,8), 
                            depth=(0,0,0,2,2), columns=['r', 'c', 'z', 'y', 'x'])
df.compute().reset_index()
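
One optional refinement, not part of the code above: dd.from_delayed is called without meta, so dask computes one block up front just to infer column names and dtypes. Since extract_features_generalized produces all-int64 coordinate columns (they come from np.nonzero), an explicit meta could be passed instead, e.g. by changing the last line of my_overlap_generalized to something like:

    meta = pd.DataFrame({c: pd.Series(dtype=np.int64) for c in columns})
    return dd.from_delayed(dfs, meta=meta)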
    I have also tried to run this for a larger array (100x1000x1000) to check the memory usage. It seems to work fine if the chunk size is chosen properly and there are not too many chunks. A too-small chunk size (e.g. 10x100x100 in my case) leads to the following warning: `distributed.utils_perf - WARNING - full garbage collections took 34% CPU time recently (threshold: 10%)`. For information on how to choose chunk size, see this link: [https://docs.dask.org/en/latest/array-chunks.html](https://docs.dask.org/en/latest/array-chunks.html) – Anna M Mar 27 '20 at 18:59
  • That generalization is nice -- thanks for the help! As far as memory usage, I don't know the test you did and the system you have. When I profile it on an all-zeros array that returns an empty dataframe, I see ~800MB of memory -- about what I expect from a 100 x 1000 x 1000 array of float64. But when I do a map_overlap(lambda x:x, depth), I see a ~1.7GB memory usage, about twice that. Is it possible your array is generating huge dataframes? – HoosierDaddy Mar 27 '20 at 20:33
  • I updated the post, building off your awesome answer. I think I made the code even cleaner, and hopefully didn't add any bugs. It runs about 2x faster (~50 sec vs 25 sec on a 2.6GHz Xeon w/36 threads) on an all-zeros array of size 100x1000x1000. I wonder if a dask aficionado might find a better solution. – HoosierDaddy Mar 27 '20 at 21:01