
I have a question about using dask to parallelize my code. I have a pandas DataFrame and an 8-core CPU, and I want to apply a function to it row-wise. Here is an example:

import dask.dataframe as dd
from dask.multiprocessing import get
from geopy.distance import vincenty  # assuming vincenty comes from geopy

# o is a pandas DataFrame; center is a (latitude, longitude) tuple
o['dist_center_from'] = dd.from_pandas(o, npartitions=8).map_partitions(
    lambda df: df.apply(lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km, axis=1)
).compute(get=get)

That code runs on all 8 CPUs simultaneously. The problem is that each worker process uses a lot of memory, roughly as much as the main process. So I want to run the computation multi-threaded, with shared memory. I tried changing from dask.multiprocessing import get to from dask.threaded import get, but then it doesn't use all of my CPUs; I think it runs on a single core.
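
For reference, here is the threaded variant I tried (only the scheduler import changes; center is the same tuple as above):

import dask.dataframe as dd
from dask.threaded import get  # threaded scheduler: shared memory, single process
from geopy.distance import vincenty

o['dist_center_from'] = dd.from_pandas(o, npartitions=8).map_partitions(
    lambda df: df.apply(lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km, axis=1)
).compute(get=get)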

zhc
  • According to http://dask.pydata.org/en/latest/scheduling.html, the threaded scheduler only provides parallelism when your computation is dominated by non-Python code, as is the case when operating on numeric data in NumPy arrays or Pandas dataframes, or when using any of the other C/C++/Cython-based projects in the ecosystem. Since apply runs arbitrary Python code per row, I think that is why it doesn't work here. – zhc Jul 06 '18 at 06:43
  • I think you would be welcome to write this up as an answer, although I would also mention the distributed scheduler, which allows for a combination of threads and processes. – mdurant Jul 06 '18 at 12:58

1 Answer


Yes, this is the tradeoff between threads and processes (a sketch of the combined approach follows the list):

  • Threads: only parallelize well if your computation is dominated by non-Python code (most of the Pandas API on numeric data, with the notable exception of apply)
  • Processes: require copying data around between processes
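
As the comments note, the distributed scheduler allows a combination of the two. A minimal sketch of a local cluster (the worker and thread counts here are illustrative, not tuned, and vincenty/center are as in the question):

from dask.distributed import Client
import dask.dataframe as dd

# 2 worker processes with 4 threads each: fewer data copies than
# 8 separate processes, more parallelism than a single threaded process
client = Client(n_workers=2, threads_per_worker=4)

ddf = dd.from_pandas(o, npartitions=8)
o['dist_center_from'] = ddf.map_partitions(
    lambda df: df.apply(lambda x: vincenty((x.fromlatitude, x.fromlongitude), center).km, axis=1)
).compute()  # once a Client exists, compute() uses the distributed scheduler

Note that for a pure-Python function like this, the threads within each worker are still GIL-bound, so the process count is what drives the speedup; the threads mainly help with any non-Python work in the pipeline.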
MRocklin