I tried:
df.groupby('name').agg('count').compute(num_workers=1)
df.groupby('name').agg('count').compute(num_workers=4)
They take the same amount of time. Why doesn't num_workers have any effect?
Thanks
By default, Dask runs its tasks on a multi-threaded scheduler. Because of Python's GIL, threads executing Python-level code are effectively limited to a single processor, so CPU-bound operations like this groupby don't speed up when you add more threads. (Note that using Dask is still worthwhile in that case, since it lets you work with data that can't fit in memory.)
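The quickest way to try a different scheduler is the scheduler keyword that compute() accepts; here is a minimal sketch, where "data.csv" stands in for your own file:

import dask.dataframe as dd

df = dd.read_csv("data.csv")  # placeholder path

# Default behaviour: threaded scheduler (limited by the GIL for pure-Python work)
res_threads = df.groupby("name").agg("count").compute(scheduler="threads")

# Multiprocessing scheduler: sidesteps the GIL by running tasks in separate processes
res_procs = df.groupby("name").agg("count").compute(scheduler="processes")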
A more flexible way to use several processors is to spin up a local distributed cluster and make it the default scheduler:
import time

from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

df = dd.read_csv("data.csv")

def group(num_workers):
    # Time the groupby/count aggregation
    start = time.time()
    res = df.groupby("name").agg("count").compute(num_workers=num_workers)
    end = time.time()
    return res, end - start

# First run: default (threaded) scheduler
print(group(4))

# Create a local cluster and register a client as the default scheduler
clust = LocalCluster()
clt = Client(clust, set_as_default=True)

# Second run: the same computation, now on the distributed scheduler
print(group(4))
Here, I create a local cluster of 4 parallel processes (because I have a quad-core machine) and then register a default client that routes Dask operations through this cluster. With a two-column CSV file of 1.5 GB, the standard groupby takes around 35 seconds on my laptop, whereas the multiprocess one takes only around 22 seconds.
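If you want explicit control over the parallelism instead of letting LocalCluster pick defaults, it accepts the number of worker processes and threads per worker directly. A minimal sketch; the counts below are just examples for a quad-core machine:

from dask.distributed import LocalCluster, Client

# Four single-threaded worker processes, one per core
# (illustrative values; tune n_workers/threads_per_worker for your hardware)
clust = LocalCluster(n_workers=4, threads_per_worker=1)
clt = Client(clust, set_as_default=True)

# All subsequent .compute() calls now run on this cluster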