I have three sources and a Dask DataFrame for each of them. I need to apply a function that performs an operation combining data from the three sources. The operation requires an internal state to be computed (I can't change that).
The three sources are in Parquet format and I read the data using read_parquet.
Dask DataFrame loading function:

import dask
import dask.dataframe as dd

@dask.delayed
def load_data(data_path):
    # each source is read lazily as its own Dask DataFrame
    ddf = dd.read_parquet(data_path, engine="pyarrow")
    return ddf
results = []
sources_path = ["/source1", "/source2", "/source3"]
for source_path in sources_path:
    data = load_data(source_path)
    results.append(data)
I create another delayed function that executes the operation:
@dask.delayed
def process(sources):
    return operation(sources[0][<list of columns>],
                     sources[1][<list of columns>],
                     sources[2][<list of columns>])
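For completeness, this is roughly how I assemble and trigger the whole computation (the variable names below are just illustrative):

final_result = process(results)   # results is the list built in the loop above
output = final_result.compute()   # executes the delayed graph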
The operation function comes from a custom library. It cannot actually be parallelized because it keeps an internal state.
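To give an idea of why, here is a purely hypothetical toy stand-in for operation (the real one lives in the library and is much more complex): every step depends on state accumulated from the previous steps, so the work cannot be split into independent pieces.

import pandas as pd

# Hypothetical toy stand-in for the library's `operation`, for illustration only
def operation(df1, df2, df3):
    state = 0.0
    out = []
    for a, b, c in zip(df1.iloc[:, 0], df2.iloc[:, 0], df3.iloc[:, 0]):
        state = state + a           # running state couples each row to all previous rows
        out.append(state * b - c)
    return pd.Series(out)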
Reading the Dask documentation, I understand that calling delayed functions on Dask collections like this is not a best practice.
Is there a way to apply a custom function to multiple Dask DataFrames without using delayed functions?
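For example, I was imagining something along these lines, where the frames are built directly with read_parquet (no delayed wrapper) and the stateful operation is applied to them; I just don't know whether Dask offers a supported way to do this (column selection omitted here for brevity):

import dask.dataframe as dd

ddf1 = dd.read_parquet("/source1", engine="pyarrow")
ddf2 = dd.read_parquet("/source2", engine="pyarrow")
ddf3 = dd.read_parquet("/source3", engine="pyarrow")

# Is there a supported way to feed these frames (or their aligned
# partitions) to the stateful `operation` without wrapping it in delayed?
result = operation(ddf1, ddf2, ddf3)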