
According to the dask documentation, it's possible to specify chunks in one of three ways:

  • a blocksize like 1000
  • a blockshape like (1000, 1000)
  • explicit sizes of all blocks along all dimensions, like ((1000, 1000, 500), (400, 400))

Your chunks input will be normalized and stored in the third and most explicit form.
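For instance, here's a minimal sketch (the array shape is my own choice for illustration) showing all three forms and what they normalize to:

```python
import numpy as np
import dask.array as da

x = np.zeros((2500, 800))

a = da.from_array(x, chunks=1000)                             # blocksize
b = da.from_array(x, chunks=(1000, 400))                      # blockshape
c = da.from_array(x, chunks=((1000, 1000, 500), (400, 400)))  # fully explicit

print(a.chunks)  # ((1000, 1000, 500), (800,))  -- blocksize caps at the dim size
print(b.chunks)  # ((1000, 1000, 500), (400, 400))
print(c.chunks)  # ((1000, 1000, 500), (400, 400))
```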

After trying to understand how chunks work using the visualize() function, there are still a few things I'm not sure about:

If the input is normalized, does it matter which input form I choose?

Blocksize means every chunk has size X, e.g. 1000. What does the blockshape input specify?

When giving a blockshape input, does the order of the parameters make a difference? How does it relate to the shape of the array/matrix?

istern

1 Answer


The forms lower in that list are more explicit and allow for greater asymmetry in your block shapes.

Examples

We'll discuss this through a sequence of examples of chunks on the following array:

1 2 3 4 5 6
7 8 9 0 1 2
3 4 5 6 7 8
9 0 1 2 3 4 
5 6 7 8 9 0 
1 2 3 4 5 6
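If you'd like to follow along, here is one way (my own construction, not part of the original answer) to build this array with numpy:

```python
import numpy as np

# Digits 1-9, 0 repeating, reshaped to 6 x 6 -- matches the diagram above
x = (np.arange(1, 37) % 10).reshape(6, 6)
```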

We show below how different chunks arguments split this array into different blocks:

chunks=3

Symmetric blocks of size 3

1 2 3  4 5 6
7 8 9  0 1 2
3 4 5  6 7 8

9 0 1  2 3 4 
5 6 7  8 9 0 
1 2 3  4 5 6

chunks=2

Symmetric blocks of size 2

1 2  3 4  5 6
7 8  9 0  1 2

3 4  5 6  7 8
9 0  1 2  3 4 

5 6  7 8  9 0 
1 2  3 4  5 6

chunks=(3, 2)

Asymmetric but repeated blocks of size (3, 2)

1 2  3 4  5 6
7 8  9 0  1 2
3 4  5 6  7 8

9 0  1 2  3 4 
5 6  7 8  9 0 
1 2  3 4  5 6

chunks=(1, 6)

Asymmetric but repeated blocks of size (1, 6)

1 2 3 4 5 6

7 8 9 0 1 2

3 4 5 6 7 8

9 0 1 2 3 4 

5 6 7 8 9 0 

1 2 3 4 5 6

chunks=((2, 4), (3, 3))

Asymmetric and non-repeated blocks

1 2 3  4 5 6
7 8 9  0 1 2

3 4 5  6 7 8
9 0 1  2 3 4 
5 6 7  8 9 0 
1 2 3  4 5 6

chunks=((2, 2, 1, 1), (3, 2, 1))

Asymmetric and non-repeated blocks

1 2 3  4 5  6
7 8 9  0 1  2

3 4 5  6 7  8
9 0 1  2 3  4 

5 6 7  8 9  0 

1 2 3  4 5  6
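To verify these splits, you can wrap the array from above in dask and inspect the normalized `.chunks` attribute (a sketch; expected output shown in the comments):

```python
import numpy as np
import dask.array as da

x = (np.arange(1, 37) % 10).reshape(6, 6)

for chunks in [3, 2, (3, 2), (1, 6),
               ((2, 4), (3, 3)),
               ((2, 2, 1, 1), (3, 2, 1))]:
    d = da.from_array(x, chunks=chunks)
    print(chunks, '->', d.chunks, '| numblocks:', d.numblocks)

# 3 -> ((3, 3), (3, 3)) | numblocks: (2, 2)
# 2 -> ((2, 2, 2), (2, 2, 2)) | numblocks: (3, 3)
# (3, 2) -> ((3, 3), (2, 2, 2)) | numblocks: (2, 3)
# (1, 6) -> ((1, 1, 1, 1, 1, 1), (6,)) | numblocks: (6, 1)
# ((2, 4), (3, 3)) -> ((2, 4), (3, 3)) | numblocks: (2, 2)
# ((2, 2, 1, 1), (3, 2, 1)) -> ((2, 2, 1, 1), (3, 2, 1)) | numblocks: (4, 3)
```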

Discussion

The latter examples are rarely specified by users on original data; they typically arise from complex slicing and broadcasting operations. Generally I use the simplest form until I need more complex forms. The choice of chunks should align with the computations you want to do.

For example, if you plan to take out thin slices along the first dimension then you might want to make that dimension skinnier than the others. If you plan to do linear algebra then you might want more symmetric blocks.
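As a concrete sketch of that last point (the array shape and chunk sizes here are hypothetical):

```python
import dask.array as da

# Skinny chunks along the first dimension: a row slice stays in one block
x = da.ones((10000, 10000), chunks=(100, 10000))
row = x[123, :]   # reads a single 100 x 10000 block

# Symmetric chunks: the same slice must read ten 1000 x 1000 blocks
y = da.ones((10000, 10000), chunks=(1000, 1000))
row2 = y[123, :]  # reads ten blocks along the second dimension
```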

MRocklin
  • On http://dask.pydata.org/en/latest/faq.html?highlight=chunks the author says they aim for chunks of `10MB-100MB`, but this answer mentions shape. Is there an easy way to see what size in MB a proposed chunk would consume? – mobcdi Aug 26 '16 at 15:18
  • The chunk shape gives you the number of elements, e.g. (1000, 1000) is a million elements. Then look at `np.dtype(your_dtype).itemsize` to see the bytes per element, like `np.dtype(float).itemsize == 8` (see the sketch after these comments). – MRocklin Aug 26 '16 at 15:37
  • If we're working from a Pandas dataframe, is there a simple relationship between its shape and the optimal chunksize? Like if we're using apply along axis 1 across five columns, should it be `(1, 5)`? – Jeff Jan 16 '17 at 12:40
  • Defining "optimal" is a hard problem. Additionally, a dask.dataframe generally doesn't chunk in the same way: it does not chunk column-wise, and the number of rows per partition is not known. In this setting one might think of the chunksize as fixed at something like `(unknown, n_columns)`. – MRocklin Jan 16 '17 at 16:24
  • Well, as a `dask` novice I'm so intimidated by how complex loading data into a `dask.dataframe` is that I can't even understand what this `blocksize` (or shape?) arg in `read_csv` is. My intuition is that if `a.csv` is one huge file, `read_csv` would read it chunk by chunk, with each chunk containing `blocksize` (parsed) lines; isn't this true? Then what's going on with this `blockshape` tuple? BTW, I'm talking about `dask.dataframe`, not array – avocado Dec 12 '17 at 09:31
  • This question is about dask array, not dask dataframe. You may want to open a separate question. – MRocklin Dec 12 '17 at 13:56
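Following up on the size question in the comments above, a quick back-of-the-envelope sketch (the chunk shape and dtype here are just examples):

```python
import numpy as np

shape = (1000, 1000)                           # a proposed chunk shape
n_elements = np.prod(shape)                    # 1,000,000 elements
bytes_per_element = np.dtype(float).itemsize   # 8 bytes for float64

size_mb = n_elements * bytes_per_element / 1e6
print(size_mb)  # 8.0 MB per chunk, just under the 10MB-100MB rule of thumb
```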