
Consider a 2D array X too large to fit in memory (in my case it's stored in the Zarr format, but that doesn't matter). I would like to map a function block-wise over the array and save the result without ever loading the entire array into memory. E.g.,

import dask.array as da
import numpy as np

X = da.arange(10000000,
    dtype=np.int32).reshape((10, 1000000)).rechunk((10, 1000))

def toy_function(chunk):
    return np.mean(chunk, axis=0)

lazy_result = X.map_blocks(toy_function)

lazy_result.to_zarr("some_path")

Is there a way to limit the number of blocks evaluated at a single time? In my use case, lazy_result[:,:1000].compute() fits into memory, but lazy_result.compute() is too big for memory. When I try to write to Zarr, memory usage climbs until the process maxes out and is killed. Can I do this without having to resort to something inconvenient like the chunk-by-chunk loop spelled out below?
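That is, computing and writing 1000-column slices one at a time, each to its own Zarr store:

for i in range(1000):
    lazy_result[:, (i * 1000):((i + 1) * 1000)].to_zarr('some_path' + str(i))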

Richard Border
  • Dask does not currently have the ability to manage memory in this way. I believe your problem is related to the following github issue: https://github.com/dask/distributed/issues/2602 – Ryan Jun 09 '20 at 01:13
  • Dask absolutely has the ability to manage memory to support streaming workflows like this. – MRocklin Jun 13 '20 at 15:17

1 Answer


I suspect your problem is actually in how you are constructing your original data here:

X = da.arange(10000000,
    dtype=np.int32).reshape((10, 1000000)).rechunk((10, 1000))

Operations that flip around the chunking of an array often require many chunks to be in memory at once. I suspect that you are using arange mostly to build a test dataset. I recommend trying with a function that supports chunking, like ones or zeros, and seeing if your problem persists.

da.ones((10, 1000000), chunks="128 MiB")
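
For instance, a minimal sketch of that experiment, keeping the question's dtype and (10, 1000) chunking, swapping in a shape-preserving stand-in for toy_function, and writing to a placeholder path, could look like this:

import dask.array as da
import numpy as np

# Test data that is chunked at creation time, so nothing needs to be rechunked.
X = da.ones((10, 1000000), dtype=np.int32, chunks=(10, 1000))

def doubled(chunk):
    # Shape-preserving stand-in for toy_function, applied to one block at a time.
    return chunk * 2

lazy_result = X.map_blocks(doubled)

# If the rechunking was the culprit, this write should proceed block by block
# rather than accumulating chunks in memory.
lazy_result.to_zarr("some_path")

If the write completes with flat memory usage here, that would point at the arange/reshape/rechunk construction rather than at map_blocks or to_zarr.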
MRocklin