
I have a large collection of entries E and a function f: E --> pd.DataFrame. The execution time of f can vary drastically for different inputs. Finally, all resulting DataFrames should be concatenated into a single DataFrame.

The situation I'd like to avoid is a partitioning (using 2 partitions for the sake of the example) where, by chance, all fast function executions end up on partition 1 and all slow executions on partition 2, so the workers are not used optimally.

partition 1:
[==][==][==]

partition 2:
[============][=============][===============]

--------------------time--------------------->
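
For concreteness, a toy stand-in for f might look like the following (purely illustrative; the real f, its arguments, and the returned columns are placeholders):

import time
import pandas as pd

def f(e, arg2):
    # Illustrative only: runtime varies per entry, mimicking the mix
    # of fast and slow executions sketched above.
    time.sleep(e % 5)
    return pd.DataFrame({"entry": [e], "value": [e * 2]})

collection = range(20)  # placeholder for the real collection of entries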

My current solution is to iterate over the collection of entries and build a Dask graph using delayed, aggregating the delayed partial DataFrames into a final result DataFrame with dd.from_delayed.

from dask import delayed
from dask.dataframe import from_delayed
from dask.dataframe.utils import make_meta

delayed_dfs = []

for e in collection:
    # Build one lazy task per entry; nothing executes yet.
    delayed_partial_df = delayed(f)(e, arg2, ...)
    delayed_dfs.append(delayed_partial_df)

# Combine the lazy partial results into a single (still lazy) Dask DataFrame.
result_df = from_delayed(delayed_dfs, meta=make_meta({..}))
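
For completeness, a minimal sketch of materializing the lazy result (the name final_pandas_df is illustrative):

# Nothing has executed yet; compute() runs every f(e, ...) task and
# concatenates the partial DataFrames into one pandas DataFrame.
final_pandas_df = result_df.compute()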

I reasoned that the Dask scheduler would take care of optimally assigning work to the available workers.

  1. is this a correct assumption?
  2. would you consider the overall approach reasonable?
Thomas Moerman
  • Are the fast and slow executions randomly spread across the collection, or do they follow some known distribution? Have you seen the Bokeh-based diagnostic dashboard, which would help you visualize this (albeit after execution)? As for your questions, I would say yes and yes, though I honestly have not dealt with very large differences in execution time. – Rookie Nov 11 '17 at 22:08
  • It is not clear what you mean by "partition" here. Generally, Dask takes good care to schedule tasks to workers as they become available, and you shouldn't have to worry about what happens where. – mdurant Nov 12 '17 at 00:41
  • To clarify: by 'partition' I meant a chunk of a dataset that is assigned to a particular worker in a distributed computation, cf. what Apache Spark does with RDDs. In a previous implementation of my use case (using Spark), I observed that towards the end of the computation more and more workers became idle, thus not optimally using the available workers. I now understand that Dask is able to dynamically steal tasks from workers, which solves this problem. – Thomas Moerman Nov 13 '17 at 10:07

1 Answer


As mentioned in the comments above, yes, what you are doing is sensible.

The tasks will be assigned to workers initially, but if some workers finish their allotted tasks before others then they will dynamically steal tasks from those workers with excess work.

Also as mentioned in the comments, you might consider using the diagnostic dashboard to get a good sense of what the scheduler is doing. All of the information about worker load, work stealing, etc. is easily viewable.

http://distributed.readthedocs.io/en/latest/web.html
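
As a rough sketch of how this fits together with the distributed scheduler (assuming dask.distributed is installed; the port and the dashboard_link attribute are the defaults of recent versions):

from dask.distributed import Client

# Start a local cluster; the distributed scheduler assigns tasks to
# workers and lets idle workers steal queued tasks from busy ones.
client = Client()

# The diagnostic dashboard is served alongside the scheduler,
# by default at http://localhost:8787/status
print(client.dashboard_link)

# result_df is the lazy Dask DataFrame built from the delayed calls
# in the question; compute() executes the tasks on the cluster.
final_df = result_df.compute()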

MRocklin