
I created a Dask dataframe from a Pandas dataframe with ~50K rows and 5 columns:

import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=32)

I then add a bunch of columns (~30) to the dataframe and try to turn it back into a Pandas dataframe:

import dask.multiprocessing
DATA = ddf.compute(get=dask.multiprocessing.get)

I looked at the docs, and if I don't specify num_workers it defaults to using all of my cores. I'm on a 64-core EC2 instance, and the line above has already been running for minutes without finishing...

Any idea how to speed this up, or what I'm doing incorrectly?

Thanks!

anon_swe
  • Have you tried using the default threaded scheduler? How do you add your new columns? Any chance you can provide a more complete example? http://dask.pydata.org/en/latest/scheduler-choice.html – MRocklin Jul 27 '17 at 23:02
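(For reference, a minimal sketch of what the comment suggests: dropping the get= argument falls back to Dask's default threaded scheduler for dataframes, which keeps everything in one process and avoids copying data between workers. Newer Dask versions select the scheduler with the scheduler= keyword rather than get=.)

# Default threaded scheduler: all work stays in one process, no inter-process copying
DATA = ddf.compute()

# Or name it explicitly (scheduler= replaces get= in current Dask releases)
DATA = ddf.compute(scheduler="threads")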

1 Answer


I'd suggest lowering the number of threads and increasing the number of processes to speed things up.
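For example, on a single machine you can pick the scheduler and the size of the worker pool per compute() call (a minimal sketch; the worker counts below are illustrative, and recent Dask versions use the scheduler= keyword rather than get=):

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"x": range(50_000)})   # stand-in for the original ~50K-row frame
ddf = dd.from_pandas(df, npartitions=8)   # fewer, larger partitions mean less scheduling overhead

# Process-based scheduler with an explicit, smaller pool of worker processes
DATA = ddf.compute(scheduler="processes", num_workers=8)

# Threaded scheduler with a capped thread count
DATA = ddf.compute(scheduler="threads", num_workers=8)

With only ~50K rows, the per-task overhead of 32 partitions spread over 64 worker processes can easily dominate the actual work, so fewer partitions and fewer workers (or the default threaded scheduler) is often faster.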

msarafzadeh
  • Can you give an example/guide on how to do that? – sogu Feb 19 '21 at 11:53
  • I found this documentation very helpful: https://medium.com/analytics-vidhya/how-to-efficiently-parallelize-dask-dataframe-computation-on-a-single-machine-1f10b5b02177 – msarafzadeh Feb 21 '21 at 08:27