
I created a Dask dataframe from a Pandas dataframe with ~50K rows and 5 columns:

import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=32)

I then add a bunch of columns (~30) to the dataframe and try to turn it back into a Pandas dataframe:

import dask.multiprocessing
DATA = ddf.compute(get=dask.multiprocessing.get)

I looked at the docs, and if I don't specify num_workers it defaults to using all of my cores. I'm on a 64-core EC2 instance, and the line above has already been running for minutes without finishing...

Any idea how to speed this up, or what I'm doing incorrectly?

Thanks!

anon_swe
  • Have you tried using the default threaded scheduler? How do you add your new columns? Any chance you can provide a more complete example? http://dask.pydata.org/en/latest/scheduler-choice.html – MRocklin Jul 27 '17 at 23:02
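(For reference, a minimal sketch of what the comment suggests: dropping the get= argument falls back to Dask's default threaded scheduler for dataframes, which keeps everything in one process and avoids copying data between workers. Newer Dask versions select the scheduler with the scheduler= keyword rather than get=.)

# Default threaded scheduler: all work stays in one process, no inter-process copying
DATA = ddf.compute()

# Or name it explicitly (scheduler= replaces get= in current Dask releases)
DATA = ddf.compute(scheduler="threads")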

1 Answer


I'd suggest lowering the number of threads and increasing the number of processes to speed things up.
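For example, on a single machine you can pick the scheduler and the size of the worker pool per compute() call (a minimal sketch; the worker counts below are illustrative, and recent Dask versions use the scheduler= keyword rather than get=):

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"x": range(50_000)})   # stand-in for the original ~50K-row frame
ddf = dd.from_pandas(df, npartitions=8)   # fewer, larger partitions mean less scheduling overhead

# Process-based scheduler with an explicit, smaller pool of worker processes
DATA = ddf.compute(scheduler="processes", num_workers=8)

# Threaded scheduler with a capped thread count
DATA = ddf.compute(scheduler="threads", num_workers=8)

With only ~50K rows, the per-task overhead of 32 partitions spread over 64 worker processes can easily dominate the actual work, so fewer partitions and fewer workers (or the default threaded scheduler) is often faster.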

msarafzadeh
  • Can you give an example/guide on how to do that? – sogu Feb 19 '21 at 11:53
  • I found this documentation very helpful: https://medium.com/analytics-vidhya/how-to-efficiently-parallelize-dask-dataframe-computation-on-a-single-machine-1f10b5b02177 – msarafzadeh Feb 21 '21 at 08:27