I have installed Dask on OSX Mojave. Does it execute computations in parallel by default? Or do I need to change some settings?

I am using the DataFrame API. Does that make a difference to the answer?

I installed it with pip. Does that make a difference to the answer?

power
  • If you have a task to give Dask that will naturally parallelize, why don't you just run it, and then see what your CPU usage profile looks like? It sounds to me like you just don't want to read the Dask documentation. – CryptoFool Mar 13 '19 at 01:42

1 Answer

Yes, Dask is parallel by default.

Unless you specify otherwise or create a distributed Client, execution will happen with the "threaded" scheduler, using as many threads as you have cores. Note, however, that because of the Python GIL (only one thread can execute Python bytecode at a time), you may get less parallelism than your cores allow, depending on how well your specific tasks release the GIL. That is why you have a choice of schedulers.

Neither being on OSX nor installing with pip makes a difference. Using DataFrames does make a difference, in that it dictates the sorts of tasks you're likely running. Pandas is good at releasing the GIL for many operations.

mdurant
  • Thanks! I am using pandas, but working mostly with strings. I think that I am either hitting the GIL or it's not such a parallelisable task. I am using `DataFrameGroupBy.apply()` and the groups are "small" in relation to the size of the dataset. – power Mar 13 '19 at 23:11
  • Strings will indeed be a problem with threads, and small tasks will always be a problem for performance. If you use the distributed scheduler, you can choose the process/thread mix, and the dashboard gives you a lot of diagnostic information. – mdurant Mar 14 '19 at 00:04