
I want to merge a large pandas DataFrame (df1.shape = (80000, 18)) with a small one (df2.shape = (1, 18)) on a column called "key". Here is the timing using dd.merge:

import time
import dask.dataframe as dd

ddf1 = dd.from_pandas(df1, npartitions=20)
ddf2 = dd.from_pandas(df2, npartitions=1)
start = time.time()
pred_mldf = dd.merge(ddf1, ddf2, on=['key'])
print(pred_mldf)
t0 = time.time()
print("deltat = ", t0 - start)

The result is deltat = 0.04.

Then I implemented the same merge using dask.delayed:

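import time
import dask
import pandas as pd

def mymerge(df1, df2, key):
    # plain pandas merge, wrapped so it can be scheduled as one delayed task
    pred_mldf = pd.merge(df1, df2, on=key)
    return pred_mldf

start = time.time()
pred_mldf = dask.delayed(mymerge)(df1, df2, ['key'])
pred_mldf.compute()
t0 = time.time()
print("deltat = ", t0 - start)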

The result is deltat = 3.48.

I expected the two approaches to take about the same time. What am I doing wrong here?

M_x
Neuronix
  • You are not actually executing the merge in the first code block. Dask generally uses a programming model called lazy execution or lazy evaluation. You may want to explore the docs https://tutorial.dask.org/01x_lazy.html – Nick Becker Jan 19 '21 at 03:53
  • @NickBecker Thanks. You made a good point! I edited my question based on your comment, but the problem still exists. – Neuronix Jan 19 '21 at 04:27

1 Answer

As @Nick Becker pointed out in the comments, your first code block only defines the merge but does not execute it, whereas the second code block does. Adding .compute() makes the first block actually run the merge, so the measured time should change:

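import time
import dask.dataframe as dd

ddf1 = dd.from_pandas(df1, npartitions=20)
ddf2 = dd.from_pandas(df2, npartitions=1)
start = time.time()
# .compute() triggers the actual merge; without it only the task graph is built
pred_mldf = dd.merge(ddf1, ddf2, on=['key']).compute()
print(pred_mldf)
t0 = time.time()
print("deltat = ", t0 - start)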

Another reason for the difference in execution speed is that in the second code block you pass the complete df1 to the delayed function, so the whole merge runs as a single task. If df1 is large, a fairer comparison would split it into 20 chunks (as in the first code block) and pass each chunk to the delayed function separately.
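A minimal sketch of that chunked variant is below. It reuses the mymerge function from the question; the stand-in data (the columns "x" and "y" and the key values) and the way the chunk boundaries are computed are assumptions made for illustration, not part of the original question. No timing is claimed here, since the result depends on the data and the scheduler.

```python
import dask
import numpy as np
import pandas as pd

def mymerge(chunk, small, key):
    # plain pandas merge on one chunk of the large frame
    return pd.merge(chunk, small, on=key)

# Hypothetical stand-in data: 80000 rows on the left, 1 row on the right
df1 = pd.DataFrame({"key": np.arange(80_000) % 4, "x": np.arange(80_000)})
df2 = pd.DataFrame({"key": [1], "y": [10.0]})

# Split df1 into 20 contiguous row blocks (mirrors npartitions=20)
n_chunks = 20
bounds = np.linspace(0, len(df1), n_chunks + 1, dtype=int)
chunks = [df1.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

# One delayed task per chunk; nothing runs until dask.compute is called
tasks = [dask.delayed(mymerge)(chunk, df2, ["key"]) for chunk in chunks]
parts = dask.compute(*tasks)

# Stitch the per-chunk results back into a single DataFrame
result = pd.concat(list(parts), ignore_index=True)
```

Wrapping the same lines in time.time() calls, as in the question, would give a timing that is more comparable to the dd.merge version, because both then operate on 20 pieces of df1.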

SultanOrazbayev