
I tried:

df.groupby('name').agg('count').compute(num_workers=1)
df.groupby('name').agg('count').compute(num_workers=4)

They take the same amount of time. Why does num_workers have no effect?

Thanks

Robin1988
    Please give more information: what is the data like, and how is it being loaded? Are you using a dask distributed client? – mdurant Jul 01 '19 at 16:07
    More information is needed. It should be able to parallelize work like that, since the operation divides neatly into a process map depending on how many groups there are. – zero Jul 04 '19 at 08:53
    I'll put in the third "please add more information" comment, and raise you: how big is `df`? Dask has overhead in scheduling tasks which is still significant even at very large file sizes. If `df` is less than a few hundred MB, the overhead is probably costing you more time than the actual calculations. Thus, the time you're seeing isn't computation time, but rather the scheduler overhead, which could be roughly equal in these cases. – kingfischer Jul 05 '19 at 10:42

1 Answer


By default, Dask uses a multi-threaded scheduler, which means everything runs inside a single process on your computer, and for pure-Python work the GIL keeps that effectively on one processor. (Note that using Dask is still worthwhile in that case if your data doesn't fit in memory.)

If you want to use several processors to compute your operation, you have to use a different scheduler:

import time

from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

df = dd.read_csv("data.csv")

def group(num_workers):
    start = time.time()
    res = df.groupby("name").agg("count").compute(num_workers=num_workers)
    end = time.time()
    return res, end - start

# First run: default single-process, multi-threaded scheduler
print(group(4))

# Start a local multi-process cluster and register it as the default scheduler
clust = LocalCluster()
clt = Client(clust, set_as_default=True)

# Second run: same computation, now dispatched to the worker processes
print(group(4))

Here, I create a local cluster with 4 worker processes (because I have a quad-core machine) and then register a default scheduling client that routes Dask operations to that cluster. With a two-column CSV file of 1.5 GB, the standard groupby takes around 35 seconds on my laptop, whereas the multi-process version takes only around 22 seconds.
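For intuition, the speed-up comes from splitting the group-by into a per-partition "map" followed by a merge, with each partition handled in a separate process so the GIL is no longer a bottleneck. Here is a minimal pure-Python sketch of that pattern (an illustration only, not Dask's actual internals; all names below are hypothetical):

```python
# Hypothetical sketch of the map/reduce shape behind a parallel
# group-by count -- NOT Dask's real implementation.
from collections import Counter
from multiprocessing import Pool

def count_names(partition):
    # "Map" step: each worker process counts the names in its own partition.
    return Counter(partition)

def parallel_groupby_count(names, num_workers=4):
    # Split the column into one roughly equal chunk per worker.
    chunk = max(1, len(names) // num_workers)
    partitions = [names[i:i + chunk] for i in range(0, len(names), chunk)]
    with Pool(num_workers) as pool:
        partial = pool.map(count_names, partitions)
    # "Reduce" step: merge the per-partition counts.
    total = Counter()
    for c in partial:
        total += c
    return dict(total)

if __name__ == "__main__":
    print(parallel_groupby_count(["a", "b", "a", "c", "a", "b"], num_workers=2))
    # -> {'a': 3, 'b': 2, 'c': 1}
```

As a related aside, recent Dask versions also let you request the multiprocessing scheduler for a single call with compute(scheduler="processes"), without creating a LocalCluster at all.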

Olivier CAYROL