Answering again with a totally different approach... your performance test is really unfair to dask.
A huge share of the time in your test is taken up by allocating a 16 GB array and filling it with normally-distributed random values. In the first line of your test, where you define the array `x`, you only *schedule* this operation; dask then has to execute the allocation and population during the `%time`d test. In the numpy test, you first compute `y`, giving numpy a pre-built array, so you're only timing the dot product.
Dask is evaluated lazily, meaning it waits until you need a result before it computes it. This is really powerful: it means you can, for example, open a really large file, do a bunch of math on it, and then subset the result, and dask will only read in and do math on the required subset. On the other hand, it means you have to be really careful when interpreting errors and timings, as the computation only occurs when you trigger it with a blocking call such as `compute` (write and plot operations are other examples).
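A minimal sketch of this laziness (the array size and operations here are just illustrative, not taken from your test):

```python
import time
import dask.array as da

# Defining the array and the reduction only builds a task graph;
# no random numbers are generated yet.
t0 = time.perf_counter()
x = da.random.normal(10, 0.1, size=(4000, 4000), chunks=(1000, 1000))
total = (x * 2).sum()
graph_time = time.perf_counter() - t0

# All of the allocation, filling, and arithmetic happens here,
# triggered by the blocking .compute() call.
t0 = time.perf_counter()
result = total.compute()
compute_time = time.perf_counter() - t0

print(f"graph construction: {graph_time:.5f} s")   # typically milliseconds
print(f"compute():          {compute_time:.5f} s")  # where the real work is
```

So if you `%time` only the first few lines, you're timing graph construction, not the math.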
With a fair setup, the two perform much more similarly:
In [1]: import dask.array as da, numpy as np
In [2]: %%time
...: x = da.random.normal(10, 0.1, size=(20000 * 100000), chunks=(20000 * 100000))
...: z = x.dot(x)
...: z.compute()
...:
...:
CPU times: user 48.4 s, sys: 15.9 s, total: 1min 4s
Wall time: 52.8 s
Out[2]: 200020152771.42023
In [3]: %%time
...: x = np.random.normal(10, 0.1, size=(20000 * 100000))
...: z = x.dot(x)
...:
...:
CPU times: user 48.3 s, sys: 14.8 s, total: 1min 3s
Wall time: 53 s
On the other hand, splitting the dask array job into 100 chunks dramatically cuts the total time:
In [4]: %%time
...: x = da.random.normal(10, 0.1, size=(20000 * 100000), chunks=(200 * 100000))
...: z = x.dot(x)
...: z.compute()
...:
...:
CPU times: user 54.4 s, sys: 1.61 s, total: 56 s
Wall time: 6.05 s
Out[4]: 200020035893.7987
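The speedup comes from parallelism: with 100 chunks, dask's threaded scheduler can generate and reduce many blocks at once, while a single chunk leaves it nothing to parallelize. You can inspect the chunking without triggering any computation (a small sketch using the same shapes as above; building these arrays is lazy and essentially free):

```python
import dask.array as da

# Same sizes as in the timings above; neither line allocates any data.
one_chunk = da.random.normal(10, 0.1, size=(20000 * 100000,),
                             chunks=(20000 * 100000,))
hundred_chunks = da.random.normal(10, 0.1, size=(20000 * 100000,),
                                  chunks=(200 * 100000,))

print(one_chunk.numblocks)       # (1,)   -> no parallelism possible
print(hundred_chunks.numblocks)  # (100,) -> blocks can be processed in parallel
```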