
I'm testing Dask and I can't understand how Dask can be slower than plain Python. I wrote two examples in Jupyter to measure the time for each, and I think I am doing something wrong.

The first, with Dask, takes 28.5 seconds; the second, in plain Python, takes 140 ms.

    %%time
    import dask
    import dask.array as da
    def inc(x):
        return x + 1

    def double(x):
        return 2 * x

    def add(x, y):
        return x + y

    N = 100000

    data = [0 for x in range(N)]
    x = da.from_array(data, chunks=1000)  # note: this dask array is never used; the loop below rebinds x

    output = []
    for x in data:
        a = dask.delayed(inc)(x)
        b = dask.delayed(double)(x)
        c = dask.delayed(add)(a, b)
        output.append(c)

    total = dask.delayed(sum)(output)
    total.compute()
**28.8 seconds**

Now with plain python

    %%time
    def inc(x):
        return x + 1

    def double(x):
        return 2 * x

    def add(x, y):
        return x + y

    N = 100000

    data = [0 for x in range(N)]

    output = []
    for x in data:
        a = inc(x)
        b = double(x)
        c = add(a, b)
        output.append(c)

    total = sum(output)
**140 milliseconds**

1 Answer


Your code, run on my machine, takes 38 s. This code:

    import dask.array as da

    x = da.from_array(data, chunks=1000)   # data as defined in the question
    %time ((x + 1) + (2 * x)).compute()

runs in 24ms.

    import numpy as np

    x = np.array(data)
    %time ((x + 1) + (2 * x))

runs in 350 µs.

Points:

  • if your data fits easily in memory (NumPy or pandas), you probably won't gain anything from Dask, since those libraries are already fast
  • Dask has collection APIs like array, so use them
  • don't iterate over arrays element by element!
  • if an individual function runs in much less than 1 ms, Dask is only adding overhead; that is certainly the case here. You'll notice that in the tutorial the functions include a sleep to simulate CPU work, so that you actually get some parallelism
  • don't call .compute() many times; try to combine everything you want into a single call to dask.compute(), which takes an arbitrary number of arguments (see the sketch after this list)
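
To illustrate the last two points, here is a minimal sketch (the function names and the 0.1 s sleeps are assumptions for illustration, not the questioner's workload): each task is made artificially slow so the default threaded scheduler can actually overlap work, and the whole graph is materialised with a single dask.compute() call.

    import time
    import dask

    def slow_inc(x):
        time.sleep(0.1)      # simulate real work; without this, Dask only adds overhead
        return x + 1

    def slow_double(x):
        time.sleep(0.1)
        return 2 * x

    def slow_add(x, y):
        time.sleep(0.1)
        return x + y

    data = range(20)         # small N: ~60 tasks of ~0.1 s each

    output = []
    for x in data:
        a = dask.delayed(slow_inc)(x)
        b = dask.delayed(slow_double)(x)
        c = dask.delayed(slow_add)(a, b)
        output.append(c)

    total = dask.delayed(sum)(output)

    # one compute call for the whole graph; dask.compute also accepts
    # several collections at once, e.g. dask.compute(total, output)
    (result,) = dask.compute(total)

Run serially these ~60 sleeps would take about 6 s; with the threaded scheduler they overlap, which is where Dask starts to pay off. The original inc/double/add functions are far too cheap for that to matter.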
mdurant