
I have a function that I want to parallelize so that it returns a dataframe with multiple columns based on an array. How can I use multiprocessing to do this? Here is an example of what my code looks like:

import multiprocessing

def f(df, x):
    # note: each child process works on its own copy of df, so this write
    # never reaches the dataframe in the parent process
    df['x'] = somefunc(x)

def run_parallel():
    df = ...  # existing dataframe
    values = ['a', 'b', 'c', 'd', 'e']
    jobs = []
    for i, s in enumerate(values):
        j = multiprocessing.Process(target=f, args=(df, s))
        jobs.append(j)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    return df

Here, somefunc(x) returns a list of values based on what x is, and df is the dataframe I want to return. I'm not sure how to get the dataframe back with these columns when I run it through multiprocessing.
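
Each multiprocessing.Process gets its own copy of df, so the assignment inside f never reaches the dataframe in the parent process. One way around this is to have the workers return the computed lists and assign the columns in the parent, for example with multiprocessing.Pool. The following is a minimal sketch rather than a drop-in fix: it assumes somefunc is defined at module level (so it can be pickled), that it returns a list with one entry per row of df, and that one column per value (named after the value) is what is wanted rather than a single 'x' column.

import multiprocessing

def somefunc(x):
    ...  # placeholder for the real function from the question

def run_parallel(df):
    values = ['a', 'b', 'c', 'd', 'e']
    # each worker computes one column and returns it to the parent
    with multiprocessing.Pool() as pool:
        columns = pool.map(somefunc, values)
    # assign the returned lists as columns in the parent process
    for value, column in zip(values, columns):
        df[value] = column
    return df

On Windows (and on macOS with the default spawn start method), the call to run_parallel also needs to sit under an if __name__ == '__main__': guard.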

Roxanne
  • Scary - check this out: https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe – jch Jul 29 '22 at 16:17
  • @jch is there a different way to write to a df safely with parallel processing? Without it my code runs really slowly, so I'd like to find a way to speed this up. – Roxanne Jul 29 '22 at 16:36
  • Would it work for your use case to partition the main DF into separate DFs? Then put it all back together at the end? (a sketch of this approach follows these comments) – jch Jul 29 '22 at 16:47
  • @jch yes, how could I do that? – Roxanne Jul 29 '22 at 17:39
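
A minimal sketch of the split-and-reassemble idea from the comment above, assuming the expensive per-row work can be wrapped in a function (process_chunk here is a hypothetical name) that takes a slice of the dataframe and returns the processed slice:

import multiprocessing

import numpy as np
import pandas as pd

def process_chunk(chunk):
    # hypothetical worker: do the expensive work on a row-wise slice
    # of the dataframe and return the modified slice
    return chunk

def run_partitioned(df, n_workers=4):
    # split the dataframe into one row-wise chunk per worker
    chunks = np.array_split(df, n_workers)
    with multiprocessing.Pool(n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    # stitch the processed chunks back together in their original order
    return pd.concat(results)

Splitting by rows only helps if the work is per-row; for the per-value columns described in the question, splitting by value (as in the Pool sketch above) is the more natural partition.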

1 Answer


See pandarallel~

from pandarallel import pandarallel
pandarallel.initialize()

df['x'] = df['x'].parallel_apply(somefunc, args=(x,))
BeRT2me
  • This probably isn't set up quite right for your use case, but with some more details on `somefunc()`, I bet it could be adapted. – BeRT2me Jul 29 '22 at 17:28
  • Is there a way to loop this so I can get back multiple columns for df? The way you did it seems like only the 'x' column would come back; is there a way I can get back a column for each value in the x array? (see the sketch after these comments) – Roxanne Jul 29 '22 at 21:01
  • It only modifies the x column, but the whole dataframe still exists... are you modifying more than the x column? – BeRT2me Jul 29 '22 at 21:22
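
For the follow-up about getting one column per value: if somefunc really works row-wise, taking a cell value plus an extra argument as in the answer's snippet (an assumption here, since the question's somefunc(x) only takes the value itself), one way to build a column per value is to loop over the values and let pandarallel spread each apply across the CPU cores. This continues with the df and somefunc from the answer above.

from pandarallel import pandarallel

pandarallel.initialize()

values = ['a', 'b', 'c', 'd', 'e']
for v in values:
    # one new column per value; each apply is parallelized across rows
    df[v] = df['x'].parallel_apply(somefunc, args=(v,))

If somefunc instead computes the whole column from v alone, as described in the question, there is nothing row-wise to parallelize, and the Pool-over-values sketch earlier on this page is a better fit.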