
I have a function that returns a dataframe. I am trying to call this function in parallel using dask: I append the delayed dataframe objects to a list, then merge them with the reduce function from functools along with pd.merge. However, the run-time of my code is the same with and without dask.delayed.
Any suggestions on how to improve the run-time?
The code and visualized task graph are below.

from functools import reduce

from dask import delayed

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    # Build the transition matrix for this lot lazily.
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)
# Merge all per-lot matrices; df is itself still a delayed object
# and does no work until .compute() is called on it.
df = delayed(reduce)(lambda x, y: x.merge(y, how="outer", on=["from", "to"]), d)

[Image: visualized task graph of the operations]

1 Answer


General rule: if your data comfortably fits into memory (including the base size times a small factor for possible intermediates), then there is a good chance that plain Pandas is fast and efficient for your use case.
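For instance, the plain-Pandas version of your loop, with no dask at all, is often the right baseline to beat (a minimal sketch; LOT, lots and data are your own objects from the question, assumed already defined):

from functools import reduce

# Same computation with no dask overhead -- often the fastest option
# when everything fits in memory.
d = [LOT(lot, data[data["LOTID"] == lot]).transition_matrix(lot) for lot in lots]
df = reduce(lambda x, y: x.merge(y, how="outer", on=["from", "to"]), d)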

Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case, although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can use any mix of threads and processes; however, using processes comes at a cost for moving data between the client and the processes, and you may find that this extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client. A sketch of that setup follows.
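A minimal sketch of the process-based setup, assuming the per-lot rows can be loaded inside each worker from a file they can all see ("data.parquet" is a hypothetical path, and LOT / lots are your own objects, assumed importable):

from functools import reduce

import pandas as pd
from dask import delayed
from dask.distributed import Client

if __name__ == "__main__":
    # Process-based workers sidestep the GIL for pure-Python work,
    # at the cost of serialising results between processes.
    client = Client(processes=True, n_workers=4, threads_per_worker=1)

    @delayed
    def matrix_for_lot(lot):
        # Load only this lot's rows inside the worker instead of
        # slicing `data` on the client and shipping the slice over.
        lot_data = pd.read_parquet("data.parquet", filters=[("LOTID", "==", lot)])
        return LOT(lot, lot_data).transition_matrix(lot)

    parts = [matrix_for_lot(lot) for lot in lots]
    merged = delayed(reduce)(
        lambda x, y: x.merge(y, how="outer", on=["from", "to"]), parts
    )
    df = merged.compute()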

Short story: you should do some experimentation, measure well, and read the dataframe and distributed-scheduler documentation carefully.
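For example, a simple wall-clock comparison of the two variants (a sketch; build_pandas and build_dask are hypothetical zero-argument wrappers around the code shown above):

import time

def timed(label, fn):
    # Report wall-clock time for one variant.
    t0 = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - t0:.2f} s")
    return result

df_pandas = timed("pandas", build_pandas)
df_dask = timed("dask", build_dask)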

mdurant