I want to merge a large pandas DataFrame (df1.shape = (80000, 18)) with a small one (df2.shape = (1, 18)) on a column called "key". Here is the time performance using dd.merge:
import time
import dask.dataframe as dd
from dask.dataframe import from_pandas

ddf1 = from_pandas(df1, npartitions=20)
ddf2 = from_pandas(df2, npartitions=1)

start = time.time()
pred_mldf = dd.merge(ddf1, ddf2, on=['key'])
print(pred_mldf)
t0 = time.time()
print("deltat = ", t0 - start)
And the result is deltat = 0.04.
Then I started implementing this using dask delayed in this manner:
import pandas as pd
import dask

def mymerge(df1, df2, key):
    pred_mldf = pd.merge(df1, df2, on=key)
    return pred_mldf
start = time.time()
pred_mldf = dask.delayed(mymerge)(df1, df2, ['key'])
pred_mldf.compute()
t0 = time.time()
print("deltat = ", t0 - start)
And the result is deltat = 3.48.
My hypothesis is that the two approaches should have comparable time performance. What am I doing wrong here?