I have a large collection of entries E and a function f: E -> pd.DataFrame. The execution time of f can vary drastically between inputs. In the end, all of the resulting DataFrames should be concatenated into a single DataFrame.
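To make the setup concrete, the plain sequential version (with f and collection as placeholders, and ignoring any extra arguments f might take) would be roughly:

import pandas as pd

partial_dfs = [f(e) for e in collection]   # one DataFrame per entry
result_df = pd.concat(partial_dfs)         # single concatenated DataFrame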
The situation I'd like to avoid is a partitioning (two partitions for the sake of the example) where, by bad luck, all the fast executions of f end up in partition 1 and all the slow ones in partition 2, so the workers are not used optimally:
partition 1:
[==][==][==]
partition 2:
[============][=============][===============]
--------------------time--------------------->
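Such an imbalance could arise if I grouped the entries into a few static partitions up front and made each partition a single task. A rough, purely illustrative sketch of that formulation (again ignoring the extra arguments to f):

import pandas as pd
from dask import delayed

def process_partition(entries):
    # every entry of a partition runs sequentially inside one task
    return pd.concat([f(e) for e in entries])

partitions = [collection[0::2], collection[1::2]]   # 2 static partitions
partial_dfs = [delayed(process_partition)(p) for p in partitions]

If one partition happens to collect all the slow entries, that single task becomes the straggler shown above, no matter how many workers are idle.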
My current solution is to iterate over the collection of entries, build a Dask graph using delayed, and aggregate the delayed partial-DataFrame results into the final result DataFrame with dd.from_delayed:
import dask.dataframe as dd
from dask import delayed
from dask.dataframe.utils import make_meta

delayed_dfs = []
for e in collection:
    # one delayed task per entry, however fast or slow f is for it
    delayed_partial_df = delayed(f)(e, arg2, ...)
    delayed_dfs.append(delayed_partial_df)
result_df = dd.from_delayed(delayed_dfs, meta=make_meta({..}))
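Execution would then look roughly like this (a sketch assuming a local dask.distributed cluster; the concrete cluster setup is not the point):

from dask.distributed import Client

client = Client(n_workers=4)      # local cluster with 4 worker processes
final_df = result_df.compute()    # runs all delayed f(...) tasks, returns a pandas DataFrame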
I reasoned that the Dask scheduler would take care of optimally assigning work to the available workers.
- Is this a correct assumption?
- Would you consider the overall approach reasonable?