
I have a question about using dask to parallelize my code. I have a pandas DataFrame and an 8-core CPU, and I want to apply a function to it row-wise. Here is an example:

import dask.dataframe as dd
from dask.multiprocessing import get
from geopy.distance import vincenty  # assuming vincenty comes from geopy

# o is a pandas DataFrame; center is a (latitude, longitude) tuple
o['dist_center_from'] = dd.from_pandas(o, npartitions=8).map_partitions(
    lambda df: df.apply(lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km, axis=1)
).compute(get=get)

That code runs on all 8 CPUs simultaneously. The problem is that each worker process uses a lot of memory, roughly as much as the main process. So I want to run the computation multi-threaded, with shared memory. I tried changing from dask.multiprocessing import get to from dask.threaded import get, but then it doesn't use all of my CPUs; I think it runs on a single core.
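
For reference, here is the threaded variant I tried (only the scheduler import changes; center is the same tuple as above):

import dask.dataframe as dd
from dask.threaded import get  # threaded scheduler: shared memory, single process
from geopy.distance import vincenty

o['dist_center_from'] = dd.from_pandas(o, npartitions=8).map_partitions(
    lambda df: df.apply(lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km, axis=1)
).compute(get=get)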

zhc
  • According to http://dask.pydata.org/en/latest/scheduling.html, the threaded scheduler only provides parallelism when your computation is dominated by non-Python code, as is the case when operating on numeric data in NumPy arrays or Pandas dataframes, or when using any of the other C/C++/Cython-based projects in the ecosystem. Since apply runs arbitrary Python code per row, I think that is why it doesn't work here. – zhc Jul 06 '18 at 06:43
  • I think you would be welcome to write this up as an answer, although I would also mention the distributed scheduler, which allows for a combination of threads and processes. – mdurant Jul 06 '18 at 12:58

1 Answer


Yes, this is the tradeoff between threads and processes (a sketch of the combined approach follows the list):

  • Threads: only parallelize well if your computation is dominated by non-Python code (most of the Pandas API on numeric data, with the notable exception of apply)
  • Processes: require copying data around between processes
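
As the comments note, the distributed scheduler allows a combination of the two. A minimal sketch of a local cluster (the worker and thread counts here are illustrative, not tuned, and vincenty/center are as in the question):

from dask.distributed import Client
import dask.dataframe as dd

# 2 worker processes with 4 threads each: fewer data copies than
# 8 separate processes, more parallelism than a single threaded process
client = Client(n_workers=2, threads_per_worker=4)

ddf = dd.from_pandas(o, npartitions=8)
o['dist_center_from'] = ddf.map_partitions(
    lambda df: df.apply(lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km, axis=1)
).compute()  # once a Client exists, compute() uses the distributed scheduler

Note that for a pure-Python function like this, the threads within each worker are still GIL-bound, so the process count is what drives the speedup; the threads mainly help with any non-Python work in the pipeline.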
MRocklin