
When I run the parallel dask.bag code below, it is much slower than the equivalent sequential Python code. Any insights into why?

Shared setup:

import dask.bag as db

def is_even(x):
    return not x % 2

Dask code:

%%timeit
b = db.from_sequence(range(2000000))
c = b.filter(is_even).map(lambda x: x ** 2)
c.compute() 

>>> 12.8 s ± 1.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# With n = 8000000
>>> 50.7 s ± 2.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python code:

%%timeit
b = list(range(2000000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 2, b))

>>> 547 ms ± 8.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# With n = 8000000
>>> 2.25 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Comments:

  • To make this a fair comparison, unless this is Python 2, you should probably start with `b = list(range(2000000))`. Otherwise your pure-Python version is allocating only half the memory, because it doesn’t make a list until after the first filter. (I suspect that won’t make the problem go away and you’ll still have a good question—but it would be an even better question because you ruled out one possibility.) – abarnert Sep 11 '18 at 20:06
  • Good point, I edited the question to fix this. It adds a slight time increase (about 10 ms) but is overall negligible. – max Sep 11 '18 at 20:17
  • Maybe the task is just so small that the overhead overwhelms the parallelism? Can you repeat your tests with filter and map functions that take successively more CPU time to see if `bag` gradually gains on and beats `list` at some point? – abarnert Sep 11 '18 at 20:42
  • Just added an additional data point - doesn't seem that overhead is the issue. – max Sep 11 '18 at 21:12
  • I don't think that proves it. If the overhead is about copying the data around between processes, or setting up large hunks of shared mem, or something like that, bigger N probably means bigger overhead. Try replacing `lambda x: x ** 2` with something that takes more CPU time. – abarnert Sep 11 '18 at 21:15
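
For reference, here is a minimal sketch of the experiment abarnert suggests above: sweep the exponent so each element costs progressively more CPU time and compare the two approaches. This is not from the original thread; it uses the `timeit` module instead of the `%%timeit` magic so it runs as a plain script, and the function names (`python_time`, `bag_time`) are just illustrative:

import timeit
import dask.bag as db

def is_even(x):
    return not x % 2

def python_time(n, exp):
    # Sequential version: plain filter/map over a range.
    return timeit.timeit(
        lambda: list(map(lambda x: x ** exp, filter(is_even, range(n)))),
        number=1)

def bag_time(n, exp):
    # Dask version: the same pipeline expressed as a bag.
    b = db.from_sequence(range(n))
    c = b.filter(is_even).map(lambda x: x ** exp)
    return timeit.timeit(c.compute, number=1)

# As the per-element work grows, the bag version should start to catch up
# and eventually overtake the sequential one.
for exp in (2, 100, 1000, 10000):
    print(exp, python_time(50000, exp), bag_time(50000, exp))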

1 Answer


Thanks to @abarnert for the suggestion to check for overhead by making each task longer.

It seems each task was so short that Dask's per-task overhead swamped the actual computation. I changed the exponent from 2 to 10000 to make each task take longer, and this version produces the result I was expecting:

Python code:

%%timeit
b = list(range(50000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 10000, b))

>>> 34.8 s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Dask code:

%%timeit
b = db.from_sequence(range(50000))
c = b.filter(is_even).map(lambda x: x ** 10000)
c.compute()

>>> 26.4 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
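
Not part of the original answer, but if you want to see how much of the slowdown on the cheap (exponent 2) workload comes from moving data between worker processes rather than from scheduling itself, one thing to try is comparing schedulers and partition counts. This is only a sketch; it assumes a reasonably recent Dask where `compute` accepts a `scheduler=` keyword and `from_sequence` accepts `npartitions=`:

import dask.bag as db

def is_even(x):
    return not x % 2

# Fewer, larger partitions mean fewer tasks and less per-task overhead.
b = db.from_sequence(range(2000000), npartitions=8)
c = b.filter(is_even).map(lambda x: x ** 2)

# Threaded scheduler: no pickling or copying of the two million ints between
# processes, but pure-Python work stays serialized behind the GIL.
c.compute(scheduler="threads")

# Multiprocessing scheduler (the usual default for bags): real parallelism for
# CPU-bound Python code, at the cost of serializing data to and from workers.
c.compute(scheduler="processes")

If the threaded run lands close to the plain-Python timing while the process-based run stays slow, that points at inter-process serialization rather than the task graph itself as the main cost.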