I couldn't figure out how to compute the delayed objects that come out of a df.groupby().apply() operation. I would really appreciate it if someone could help. Here is the sample code I wrote:

import pandas as pd
import dask
df = pd.DataFrame(columns=['id','id2','val1'])
df['id'] = ['A','A','A','B','C','C','D','D']
df['id2']=['a','a','b','a','a','b','b','b']
df['val1']= [1,2,3,4,5,6,7,8]
@dask.delayed
def dask_test(group,val_col):
    for idx,row in group.iterrows():
        group.loc[idx,'test']=2*group.loc[idx,val_col]
    return group

tmp_grp = df.groupby(['id','id2']).apply(dask_test,'val1')

The output of tmp_grp is

id  id2
A   a      Delayed('copy-f0e26845-fc3a-4bb7-8609-47b923c0...
    b      Delayed('copy-9b6cecf5-9fa4-4301-ba2d-dec5478d...
B   a      Delayed('copy-7b538f4b-ac3f-4c83-b37b-e620d0ba...
C   a      Delayed('copy-c722fa78-c46e-422a-88a5-b9e48cac...
    b      Delayed('copy-01454a03-fd28-4fa5-b487-563ccc66...
D   b      Delayed('copy-f6cf94bd-d457-4495-bb2e-1db0152c...
dtype: object

I don't know how to call delayed objects from this and compute them.

Thank you so much in advance.

Sinem

2 Answers

When working with delayed, it's better to construct the list of delayed values explicitly. In your context this would be:

delayeds=[dask_test(group, 'val1') for _, group in df.groupby(['id', 'id2'])]

Then, the delayed values can be computed using dask.compute(*delayeds).
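
Putting the pieces together, here is a minimal sketch of the whole round trip: build the delayed values, compute them, and combine the results with `pd.concat`. It assumes each delayed call returns a DataFrame. The body of `dask_test` is rewritten here as a vectorized column assignment on a copy of the group, which is my substitution for the row loop in the question and not part of the original code.

```python
import dask
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "C", "C", "D", "D"],
    "id2": ["a", "a", "b", "a", "a", "b", "b", "b"],
    "val1": [1, 2, 3, 4, 5, 6, 7, 8],
})

@dask.delayed
def dask_test(group, val_col):
    # Work on a copy so the caller's group is not modified (assumed preference).
    group = group.copy()
    # Vectorized replacement for the iterrows loop in the question.
    group["test"] = 2 * group[val_col]
    return group

# One delayed task per (id, id2) group.
delayeds = [dask_test(group, "val1") for _, group in df.groupby(["id", "id2"])]

# dask.compute returns a tuple with one computed result per delayed value.
results = dask.compute(*delayeds)

# The results are DataFrames here, so they can be stacked back together.
combined = pd.concat(list(results)).sort_index()
print(combined)
```

If the delayed functions return something other than DataFrames, `results` can simply be kept as the tuple that `dask.compute` returns.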

SultanOrazbayev
  • Thank you so much @SultanOrazbayev. That solves the issue. I'm curious why the delayed values have to be constructed explicitly, because it didn't work when I tried `dask.compute(*list(tmp_grp))` before. Also, do you have any suggestions for unpacking the output of `dask.compute(*delayeds)`? The first thing that came to mind was looping over the output and appending each item, but is there a better way? – Sinem Dec 23 '21 at 15:02
  • For unpacking, it depends on what is inside the delayeds, but if they are dataframes, then `pd.concat` will work. Otherwise, just storing the output in a variable gives you a collection that contains the computed results... – SultanOrazbayev Dec 23 '21 at 15:06
  • Got it! Thanks @SultanOrazbayev. But dask doesn't improve the running time. It's actually slower than the original version. Do you know what might cause that? – Sinem Dec 23 '21 at 19:53
  • This will depend on your specific use case, see https://docs.dask.org/en/stable/dataframe-best-practices.html – SultanOrazbayev Dec 24 '21 at 02:41

Since you're working with pandas, consider using Dask DataFrame instead of Delayed here; it's better optimized. :)

Something like:

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=4)

def dask_test(group,val_col):
    for idx,row in group.iterrows():
        group.loc[idx,'test']=2*group.loc[idx,val_col]
    return group

tmp_grp = ddf.groupby(['id','id2']).apply(dask_test,'val1')
tmp_grp.compute().sort_index()

Note that when you use Dask DataFrame you wouldn't need to have a for-loop in the dask_test function. Also, see https://docs.dask.org/en/latest/dataframe-groupby.html#difficult-cases for optimization tips.

pavithraes