Adjust chunk sizes
The answer by @isternberg is correct that you should adjust chunk sizes. A good choice of chunk size follows these rules:
- A chunk should be small enough to fit comfortably in memory.
- A chunk must be large enough so that computations on that chunk take significantly more than the 1ms overhead per task that dask incurs (so 100ms-1s is a good number to shoot for).
- Chunks should align with the computation that you want to do. For example, if you plan to slice frequently along a particular dimension, then it's more efficient if your chunks are aligned so that you touch fewer chunks.
I generally shoot for chunks that are 1-100 megabytes in size. Anything smaller than that isn't helpful and usually creates so many tasks that scheduling overhead becomes the largest bottleneck.
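As a rough sketch of these rules (the array shape and rechunk target below are made-up values for illustration, not a recommendation for any particular workload), you can check how many bytes each chunk holds and rechunk toward that 1-100 MB range:

import dask.array as da

# 100,000 x 100,000 float64 array (~80 GB total), broken into 1000 x 1000 chunks
x = da.random.normal(10, 0.1, size=(100000, 100000), chunks=(1000, 1000))

# each chunk holds 1000 * 1000 * 8 bytes = 8 MB, comfortably inside the 1-100 MB range
print(1000 * 1000 * x.dtype.itemsize)   # 8000000

# if you mostly slice single rows, x[i, :] currently touches 100 chunks;
# rechunking so each chunk spans full rows reduces that to one 80 MB chunk per slice
y = x.rechunk((100, 100000))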
Comments about the original question
If your array is only of size (1000, 100) then there is no reason to use dask.array. Instead, use numpy and, if you really care about using multiple cores, make sure that your numpy library is linked against an efficient BLAS implementation like MKL or OpenBLAS.
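As a quick sanity check (the exact output depends on how your NumPy was built), you can see which BLAS implementation NumPy is linked against:

import numpy as np

# prints the BLAS/LAPACK libraries NumPy was compiled against
np.show_config()   # look for sections like openblas_info or mkl_info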
If you use a multi-threaded BLAS implementation you might actually want to turn dask threading off. The two systems will clobber each other and reduce performance. If this is the case then you can turn off dask threading with the following command.
dask.set_options(get=dask.async.get_sync)
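Note that the set_options/get API above reflects the dask version current when this was written; on more recent releases (roughly dask 0.18 and later, where the configuration system replaced set_options), a minimal sketch of the equivalent is:

import dask

# run every task in the calling thread, leaving parallelism to the BLAS library
dask.config.set(scheduler='synchronous')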
To actually time the execution of a dask.array computation you'll have to add a .compute() call to the end of the computation; otherwise you're just timing how long it takes to create the task graph, not to execute it.
Larger Example
In [1]: import dask.array as da
In [2]: x = da.random.normal(10, 0.1, size=(2000, 100000), chunks=(1000, 1000)) # larger example
In [3]: %time z = x.dot(x.T) # create task graph
CPU times: user 12 ms, sys: 3.57 ms, total: 15.6 ms
Wall time: 15.3 ms
In [4]: %time _ = z.compute() # actually do work
CPU times: user 2min 41s, sys: 841 ms, total: 2min 42s
Wall time: 21 s