I have a DataFrame with 343,500 records and a predefined get_zipcode function.

In order to speed up the apply, I split the data in four and created the following threaded process using the threading module:

df['subsections'] = np.resize([1,2,3,4], len(df))

if __name__ == '__main__':
    t1 = threading.Thread(target=df.loc[(df['EMPTY'] == True) & (df['subsections'] == 1)].apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1))
    t2 = threading.Thread(target=df.loc[(df['EMPTY'] == True) & (df['subsections'] == 2)].apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1))
    t3 = threading.Thread(target=df.loc[(df['EMPTY'] == True) & (df['subsections'] == 3)].apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1))
    t4 = threading.Thread(target=df.loc[(df['EMPTY'] == True) & (df['subsections'] == 4)].apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1))

    t1.start()
    t2.start()
    t3.start()
    t4.start()
    
    t1.join()
    t2.join()
    t3.join()
    t4.join()

This seems to work reasonably well. But I've since found the modin module, which (from what I understand of the documentation) utilizes multithreading as well.

In a case like this, where I am essentially applying a function across the entire dataframe, is there an advantage to using threading versus modin?

And in a broader sense, based on the documentation, is there ever an advantage to not using modin?

Yehuda
  • That code doesn't actually multithread anything. `df.loc[(df['EMPTY'] == True) & (df['subsections'] == 1)].apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1)` is executed first, and then its result is assigned as the target of the thread. You'll only get so much parallelization from `apply` because it needs to make everything into Python objects and acquire the GIL. – tdelaney Jan 05 '21 at 00:55
  • What is `get_zipcode`? Is it a blocking call... maybe getting something from the internet? – tdelaney Jan 05 '21 at 00:57
  • I think you can make your `get_zipcode()` function much more efficient and avoid `axis=1`. Passing `axis=1` will absolutely destroy your performance here. For example, rewriting your current function to something like this: `t4 = threading.Thread(target=get_zipcode(df.loc[(df['EMPTY'] == True) & (df['subsections'] == 4), ['LATITUDE', 'LONGITUDE']]))` could greatly improve performance if you change everything within your function to be vectorized. – David Erickson Jan 05 '21 at 01:18
  • If modin works as claimed, it could be quite an improvement on pandas. Pandas frequently has to hold the GIL, especially when you are using `apply`. Assuming your lats and longs are `float`, Python objects have to be created for them, and `get_zipcode` is Python code, so the GIL is held there too. That would be the same for modin as for pandas. – tdelaney Jan 05 '21 at 01:39
  • @tdelaney Thanks for the advice. `get_zipcode` makes a call to an API that only accepts single-item requests. – Yehuda Jan 05 '21 at 02:45
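
To illustrate the points raised in the comments: `threading.Thread(target=...)` expects a callable, so passing the result of `.apply(...)` runs all the work up front in the main thread. Since `get_zipcode` is described as a single-item API call (I/O-bound), one way to actually overlap the requests is a thread pool. The following is only a minimal sketch: the `get_zipcode` body here is a hypothetical stand-in that sleeps instead of calling a real API, and `lookup_all` and the sample coordinates are illustrative names, not part of the original code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_zipcode(lat, lon):
    """Hypothetical stand-in for the real single-item API call."""
    time.sleep(0.01)  # simulates network latency; the GIL is released while waiting
    return f"{lat:.2f},{lon:.2f}"


def lookup_all(coords, max_workers=4):
    """Call get_zipcode for each (lat, lon) pair in a thread pool.

    Executor.map hands the callable itself to the worker threads and
    returns results in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: get_zipcode(lat=c[0], lon=c[1]), coords))


# Illustrative sample data (not from the original DataFrame).
coords = [(40.71, -74.01), (34.05, -118.24), (41.88, -87.63), (29.76, -95.37)]
results = lookup_all(coords)
```

With a DataFrame, the same idea would presumably be to select the rows where `EMPTY` is true, pass their `LATITUDE` and `LONGITUDE` values to the pool, and assign the returned list back to those rows. Because the time is spent waiting on the network rather than in Python code, threads can help here even with the GIL. The number of workers is a tuning choice and may be limited by the API's rate limits.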
