I'm using delayed to read many large CSV files:
import pandas as pd

def function_1(x1, x2):
    df_d1 = pd.read_csv(x1)
    # Some calculations on df_d1 using x2.
    return df_d1

def function_2(x3):
    df_d2 = pd.read_csv(x3)
    return df_d2

def function_3(df_d1, df_d2):
    # Some calculations and merging of the data sets (output is "merged_ds").
    merged_ds = df_d1.merge(df_d2)  # placeholder merge; my actual logic is omitted here
    return merged_ds
function_1: imports data set 1 and does some calculations.
function_2: imports data set 2.
function_3: merges the two data sets and does some calculations.
Next, I use a loop to call these functions with delayed; a sketch of the loop is below. I have many CSV files, and each file is more than 500 MB. Is this a suitable way to do these tasks with Dask (delayed)?
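Here is a minimal sketch of that loop (the file names, the loop range, and the x2 value are placeholders for my real inputs):

import dask
from dask import delayed

x2 = 42  # placeholder for the parameter passed to function_1

results = []
for i in range(10):  # placeholder: one iteration per pair of CSV files
    df_d1 = delayed(function_1)(f"data_1_{i}.csv", x2)
    df_d2 = delayed(function_2)(f"data_2_{i}.csv")
    results.append(delayed(function_3)(df_d1, df_d2))

# Nothing has been read yet; compute() triggers the whole graph at once.
merged_all = dask.compute(*results)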