I have a DataFrame with 343,500 records and a predefined `get_zipcode` function.
In order to speed up the `apply`, I split the data in four and created the following threaded process using the `threading` module:
df['subsections'] = np.resize([1,2,3,4], len(df))
if __name__ == '__main__':
    t1 = threading.Thread(target=df.loc[(df['EMPTY'] == True) & (df['subsections'] == 1)].apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1))
    t2 = threading.Thread(target=df.loc[(df['EMPTY'] == True) & (df['subsections'] == 2)].apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1))
    t3 = threading.Thread(target=df.loc[(df['EMPTY'] == True) & (df['subsections'] == 3)].apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1))
    t4 = threading.Thread(target=df.loc[(df['EMPTY'] == True) & (df['subsections'] == 4)].apply(lambda x: get_zipcode(lat=x['LATITUDE'], lon=x['LONGITUDE']), axis=1))
    t1.start()
    t2.start()
    t3.start()
    t4.start()
    t1.join()
    t2.join()
    t3.join()
    t4.join()
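One thing worth checking in the snippet above: `threading.Thread(target=...)` expects a callable, but each `.apply(...)` expression is evaluated right away, before the thread object exists, so the lookups may actually be running one after another in the main thread. Below is a minimal sketch of handing each thread a callable plus its own chunk instead. To stay dependency-free it uses plain `(lat, lon)` tuples rather than a DataFrame, and `get_zipcode` here is a hypothetical stand-in, as are the `worker` helper and the sample coordinates.

```python
import threading


def get_zipcode(lat, lon):
    # Hypothetical stand-in for the real lookup; returns a fake label.
    return f"zip@{lat:.2f},{lon:.2f}"


def worker(rows, results, slot):
    # Runs inside the thread: look up every (lat, lon) pair in this chunk
    # and store the list of results in this thread's own slot.
    results[slot] = [get_zipcode(lat=lat, lon=lon) for lat, lon in rows]


# Made-up coordinates standing in for the LATITUDE/LONGITUDE columns.
coords = [
    (40.71, -74.00), (34.05, -118.24), (41.88, -87.63),
    (29.76, -95.37), (47.61, -122.33), (39.74, -104.99),
]

n_chunks = 4
# Round-robin split, similar in spirit to np.resize([1, 2, 3, 4], len(df)).
chunks = [coords[i::n_chunks] for i in range(n_chunks)]
results = [None] * n_chunks

# target is the function itself; its arguments go in args, so nothing
# is computed until start() is called.
threads = [
    threading.Thread(target=worker, args=(chunk, results, i))
    for i, chunk in enumerate(chunks)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread writes only to its own slot in `results`, so no lock is needed in this sketch. Because of CPython's global interpreter lock, threads like these mainly help when the lookup spends its time waiting (for example on network or disk I/O) rather than on pure-Python computation.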
This seems to work reasonably well. But I've since found the `modin` module, which (from what I understand of the documentation) utilizes multithreading as well.
In a case like this, where I am essentially `apply`ing a function across the entire dataframe, is there an advantage to using `threading` versus `modin`?
And in a broader sense, based on the documentation, is there ever an advantage to not using `modin`?