
I want to iterate over a data frame using itertuples(), which is the common way to do this:

for row in df.itertuples():
    my_function(row)  # do something with row

However, now I wish to run the loop in parallel using joblib, like this (which seems very straightforward to me):

import multiprocessing
from joblib import Parallel, delayed

num_cores = multiprocessing.cpu_count()
processed_list = Parallel(n_jobs=num_cores)(delayed(my_function(row) for row in df.itertuples()))

However, I got the following error:

File "/home/anaconda3/envs/pytorch/lib/python3.7/site-packages/joblib/parallel.py", line 885, in call iterator = iter(iterable) TypeError: 'function' object is not iterable

Any idea what the problem could be?
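For reference, the pattern documented by joblib is delayed(my_function)(row) rather than delayed(my_function(row)): delayed wraps the function itself, and Parallel receives a generator of delayed calls. A minimal sketch of that pattern, with a placeholder my_function and a placeholder frame:

import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

def my_function(row):
    return row.a * 2  # placeholder: do something with the row

df = pd.DataFrame({'a': [1, 2, 3]})
num_cores = multiprocessing.cpu_count()
processed_list = Parallel(n_jobs=num_cores)(
    delayed(my_function)(row) for row in df.itertuples()
)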

  • As Peter mentions in his answer, use the "pandas" way of processing. One of the primary benefits of pandas is that it uses numpy under the hood to allow vectorized operations (essentially running the operations in parallel); see the sketch below and https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7 – monkut Apr 03 '20 at 00:44
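A minimal sketch of the difference the comment describes, with hypothetical columns and a hypothetical operation:

import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})

# Row-by-row: one Python-level function call per row.
result_loop = [row.x + row.y for row in df.itertuples()]

# Vectorized: a single numpy operation over whole columns.
result_vec = df['x'] + df['y']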

2 Answers


I think that dask.org satisfies my needs related to this post (following @monkut's suggestion). This is an example:

import dask.dataframe as dd
import pandas as pd

sd = dd.from_pandas(some_df, npartitions=40)
# a meta sample of the output to help dask infer the output shape
sr = pd.Series([1, 1.8, 2.8, 3.8, 4.8, 5.8],
               index=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
df_out = sd.apply(my_function, axis=1, meta=sr).compute(scheduler='processes')

This solution applies my_function to every row of the whole dataframe in 31 seconds, as measured by timeit. I was able to see multiple ZMQbg Jupyter processes (up to 16) running during the execution. I guess this means it is executing in parallel.

The alternative solution:

df_out = df.apply(my_function, axis=1, result_type="expand")

produces the same result, but in 325 seconds, roughly 10 times slower. With this solution I don't see multiple running processes in top.
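A minimal sketch of how such a timeit comparison might be set up; the frame, my_function, and the meta dtype here are placeholder assumptions:

import timeit

import dask.dataframe as dd
import pandas as pd

def my_function(row):
    return row['col1'] + row['col2']  # placeholder for the real per-row work

if __name__ == '__main__':  # the processes scheduler re-imports this module
    some_df = pd.DataFrame({'col1': range(10_000), 'col2': range(10_000)})
    sd = dd.from_pandas(some_df, npartitions=4)

    dask_s = timeit.timeit(
        lambda: sd.apply(my_function, axis=1, meta=('out', 'int64'))
                  .compute(scheduler='processes'),
        number=1,
    )
    pandas_s = timeit.timeit(lambda: some_df.apply(my_function, axis=1), number=1)
    print(f'dask: {dask_s:.2f}s  pandas: {pandas_s:.2f}s')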


Iterating over dataframes is NOT the common way. Don't use itertuples(); use simple vectorization with

df.apply(my_function)

Pandas will do the "multiprocessing" for you.
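A minimal sketch of this approach, assuming my_function expects one row; note that apply passes columns by default, so row-wise work needs axis=1:

import pandas as pd

def my_function(row):
    return row['a'] + row['b']  # placeholder per-row work

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# axis=1 applies my_function to each row; the default axis=0
# would pass each column instead.
out = df.apply(my_function, axis=1)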

Peter
  • I implemented your solution, but how can I check that apply is really doing the multiprocessing? I am working on a server with 24 cores. When I check the CPU utilization with top, I don't see all the processes created by apply. Actually, I am not sure how to check whether apply is using all cores. Can you help me with that? Thanks!! – German Farinas Apr 02 '20 at 21:43
  • Pandas is not multi-core, but the operations are performed at once. If you're looking for multi-core pandas operations (and have a large data set that doesn't fit in memory) you may want to look at https://dask.org/ – monkut Apr 03 '20 at 00:46
  • If your data fits in memory then you're probably fine with just using straight pandas. – monkut Apr 03 '20 at 00:46
  • @GermanFarinas: Yes, Pandas doesn't do "multiprocessing" (hence the apostrophes), but vectorization is way faster than iteration. The only thing that counts in this context is calculation speed, which you can measure. Using iteration + multiprocessing is simply bullshit ... – Peter Apr 03 '20 at 18:11