
Recently I stumbled upon http://dask.pydata.org/en/latest/. As I have some pandas code which only runs on a single core, I wonder how to make use of my other CPU cores. Would Dask work well to use all (local) CPU cores? If yes, how compatible is it with pandas?

Could I use multiple CPUs with pandas? So far I have read about releasing the GIL, but that all seems rather complicated.

Georg Heiler

2 Answers


Would Dask work well to use all (local) CPU cores?

Yes.

how compatible is it with pandas?

Pretty compatible, though not 100%. You can mix pandas, NumPy, and even pure Python code with Dask if needed.
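
For illustration, here is a minimal sketch of what such usage can look like; the file pattern and column names are hypothetical:

```python
import dask.dataframe as dd

# Read a set of CSVs in parallel chunks; the call mirrors pandas.read_csv.
df = dd.read_csv("data-*.csv")

# Familiar pandas-style operations build a lazy task graph...
result = df.groupby("key")["value"].mean()

# ...which only runs, across the local cores, when .compute() is called.
print(result.compute())
```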

Could I use multiple CPUs with pandas?

You could. The easiest way would be to use multiprocessing and keep your data separate: have each job independently read from disk and write to disk, if you can do so efficiently. A significantly harder way is mpi4py, which is most useful if you have a multi-machine environment with a professional administrator.
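
A minimal sketch of that multiprocessing approach, assuming the work splits naturally into independent files (the paths and the transformation are placeholders):

```python
import glob
from multiprocessing import Pool

import pandas as pd

def process_one(path):
    # Each worker reads its own file, transforms it, and writes its own
    # output, so no data has to be shared between processes.
    df = pd.read_csv(path)
    df["value"] = df["value"] * 2              # stand-in for the real work
    out_path = path.replace(".csv", ".out.csv")
    df.to_csv(out_path, index=False)
    return out_path

if __name__ == "__main__":
    paths = glob.glob("data-*.csv")            # hypothetical input files
    with Pool() as pool:                       # one worker per CPU core by default
        print(pool.map(process_one, paths))
```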

John Zwinck

Dask implements a large fraction of the pandas API in its dataframes. These operations call the very same pandas functions on chunks of your overall dataframe, so for the operations that are implemented you can expect fully compatible behaviour.
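
As a rough sketch of that chunking (the data here is made up, and map_partitions is just one way to apply a pandas function per chunk):

```python
import pandas as pd
import dask.dataframe as dd

# A plain pandas frame, split into four partitions.
pdf = pd.DataFrame({"x": range(1000), "y": range(1000)})
ddf = dd.from_pandas(pdf, npartitions=4)

# Each partition is an ordinary pandas DataFrame, so pandas code can be
# applied per chunk, e.g. via map_partitions.
augmented = ddf.map_partitions(lambda part: part.assign(z=part.x + part.y))
print(augmented.head())
```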

The resulting computations can be run in any of the available schedulers, allowing you to choose between low-overhead threads and something more complex. The distributed scheduler gives you full control over the split between threads and processes, has more features, and can later be scaled out across a cluster, so it is increasingly becoming the preferred option, even for simple single-machine tasks.
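
A sketch of how choosing a scheduler looks in practice; the file pattern, column name, and worker counts are assumptions:

```python
import dask.dataframe as dd

ddf = dd.read_csv("data-*.csv")    # hypothetical input files

# Low-overhead threaded scheduler (works well when pandas releases the GIL):
total = ddf["value"].sum().compute(scheduler="threads")

# Process-based scheduler, useful for GIL-bound work:
total = ddf["value"].sum().compute(scheduler="processes")

# The distributed scheduler, even on one machine, gives explicit control over
# the worker/thread split and can later be pointed at a cluster:
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=2)   # assumed local setup
total = ddf["value"].sum().compute()                  # now runs on the Client
```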

Many pandas operations do release the GIL and so will work efficiently with threads. Also, many pandas operations can easily be broken down into parallel chunks, but some cannot and will either be slower (such as joins, which require shuffles) or not work at all (such as multi-indexing). The best way to find out is to give it a try!
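
As a rough illustration of that difference (file patterns and column names are hypothetical), element-wise arithmetic parallelises cleanly per partition, while a merge forces data to move between partitions:

```python
import dask.dataframe as dd

left = dd.read_csv("left-*.csv")
right = dd.read_csv("right-*.csv")

# Embarrassingly parallel: applied independently to each partition.
cheap = (left["value"] * 2).sum()
print(cheap.compute())

# Needs a shuffle: rows must be exchanged between partitions, so on small
# data this can be slower than the equivalent single-threaded pandas merge.
joined = dd.merge(left, right, on="key")
print(len(joined))
```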

mdurant
  • Will do that. Do you expect any speedup for parsing a datetime column? – Georg Heiler Mar 07 '17 at 16:36
  • You mean reading from CSV? The c-parser (default, and able to cope with most situations) does release the GIL, and you should get a speedup for parsing even with just threads. You never *quite* get a multiplier equal to the number of cores; there is always some overhead, especially for smaller data (see the sketch after these comments). – mdurant Mar 07 '17 at 16:48
  • I will look into that. Is there a way to get the speedup for the plain pandas version? – Georg Heiler Mar 07 '17 at 16:50
  • That parallelism speed-up is what dask was made for :) Oh, and to be able to handle larger-than-memory datasets. – mdurant Mar 07 '17 at 16:53
  • Fixed; it's been a while! – mdurant Jan 28 '20 at 17:51
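
Building on the comment thread above, a minimal sketch of parallel datetime parsing with the threaded scheduler; the file pattern and column name are assumptions:

```python
import dask.dataframe as dd

# read_csv uses the pandas C parser per chunk, which releases the GIL,
# so the threaded scheduler can parse several chunks concurrently.
ddf = dd.read_csv("events-*.csv", parse_dates=["timestamp"])
df = ddf.compute(scheduler="threads")   # back to one in-memory pandas DataFrame
```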