Below, I've gathered four ways to run some code that sorts and updates pandas DataFrames.
I'd like to speed up execution as much as possible. Am I using the best available practices?
Would someone please share their thoughts on the following questions?
I'm looping over the DataFrame because the process for solving my problem appears to call for it. Would switching to Dask DataFrames give a big increase in speed?
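Since the parallelism here is over the criteria list rather than over one large frame, I wasn't even sure a Dask DataFrame fits; this untested dask.bag sketch is the kind of rewrite I was picturing (the npartitions value is an arbitrary guess):

import dask.bag as db

# Untested sketch: distribute the criteria combinations as a bag and
# map both processing steps over it.
bag = db.from_sequence(criteria_list, npartitions=8)
bag_result = bag.map(one_filter_sorter).map(two_go_downrows).compute()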
Can the Dask Distributed version benefit from setting a particular number of workers, processes, or threads per worker? People point out that favoring processes over threads (or vice versa) is best in some cases.
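For context, this is roughly how I imagine configuring the client; the numbers are placeholders I've experimented with, not recommendations:

from dask.distributed import Client

# Hypothetical worker layout: process-based workers with one thread each,
# on the theory that CPU-bound pandas work contends for the GIL.
client = Client(n_workers=6, threads_per_worker=1, processes=True)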
What would be the most effective hardware infrastructure for this kind of code? The multiprocessing version runs even faster on an AWS instance with more physical CPU cores.
- Would a Kubernetes/AWS setup with Dask Distributed be much faster?
- Could this be easily adapted to run on a GPU, either locally or on a multi-GPU AWS instance? (The sketch after this list shows the kind of port I had in mind.)
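On the GPU question, this untested RAPIDS cuDF sketch of the filter/sort step is what I was imagining; I realize the row-by-row iterrows part would need a vectorized reformulation to make sense on a GPU:

import cudf  # RAPIDS; sketch only, I haven't run this

# Same filter + sort as the pandas version, executed on the GPU.
gdf = cudf.DataFrame.from_pandas(set_table)

def one_filter_sorter_gpu(criteria):
    out = gdf[gdf['A'].isin(criteria)]
    return out.sort_values(['B', 'C'], ascending=True)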
These are the completion times for reference:
- Regular for loop: 34 seconds
- Dask Delayed: 21 seconds
- Dask Distributed (local machine): 21 seconds
- Multiprocessing: 10 seconds
import random
from multiprocessing import Pool

import numpy as np
import pandas as pd

import dask
from dask import delayed
from dask.distributed import Client

client = Client()
# Original input data that will be used in the functions
alist = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
set_table = pd.DataFrame({"A": alist,
                          "B": [i for i in range(1, 10)],
                          "C": [i for i in range(11, 20)],
                          "D": [0] * 9})
# Assembled random list of criteria combinations
criteria_list = []
for i in range(10000):
    criteria_list.append(random.sample(alist, 6))
# Sorts and filters the original df
def one_filter_sorter(criteria):
    # .copy() avoids SettingWithCopyWarning when 'D' is assigned later
    sorted_table = set_table[set_table['A'].isin(criteria)].copy()
    sorted_table = sorted_table.sort_values(['B', 'C'], ascending=True)
    return sorted_table
# Exists to help the function below. Simplified for this example
def helper_function(sorted_table, idx):
    return alist.index(sorted_table.loc[idx, 'A']) > 5
# Last function, which returns the gathered result
def two_go_downrows(sorted_table):
    for idx, row in sorted_table.iterrows():
        if helper_function(sorted_table, idx):
            sorted_table.loc[idx, 'D'] = 100 - sorted_table.loc[idx, 'C']
    res = sorted_table.loc[:, ['A', 'D']].to_dict()
    return res
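# Untested idea I'm also weighing: replace the iterrows loop with a
# vectorized mask ('two_go_downrows_vec' is my name, not part of the benchmark).
def two_go_downrows_vec(sorted_table):
    mask = sorted_table['A'].map(alist.index) > 5
    sorted_table.loc[mask, 'D'] = 100 - sorted_table.loc[mask, 'C']
    return sorted_table.loc[:, ['A', 'D']].to_dict()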
# -- Plain for-loop version
result = []
for criteria in criteria_list:
    A = one_filter_sorter(criteria)
    B = two_go_downrows(A)
    result.append(B)
# -- Multiprocessing version
if __name__ == '__main__':
    with Pool(processes=6) as pool:
        # map over criteria_list (not 'criteria', a leftover loop variable)
        A = pool.map(one_filter_sorter, criteria_list)
        result = pool.map(two_go_downrows, A)
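# Side question: would passing a chunksize to pool.map help here, e.g.
# pool.map(one_filter_sorter, criteria_list, chunksize=500)?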
# -- Dask Delayed version
result = []
for criteria in criteria_list:
    A = delayed(one_filter_sorter)(criteria)
    B = delayed(two_go_downrows)(A)
    result.append(B)
dask.compute(result)
# -- Dask Distributed version (local machine)
A = client.map(one_filter_sorter, criteria_list)
B = client.map(two_go_downrows, A)
result = client.gather(B)
Thank you