I need to create a column based on a condition in a dask dataframe. In pandas it is fairly straightforward:

ddf['TEST_VAR'] = ['THIS' if x == 200607 else  
              'NOT THIS' if x == 200608 else 
              'THAT' if x == 200609 else 'NONE'  
              for x in ddf['shop_week'] ]

In dask, I have to do the same thing like this:

def f(x):
    if x == 200607:
        y = 'THIS'
    elif x == 200608:
        y = 'THAT'
    else:
        y = 1
    return y

ddf1 = ddf.assign(col1 = list(ddf.shop_week.apply(f).compute()))
ddf1.compute()

Questions:

  1. Is there a better or more straightforward way to achieve this?
  2. I can't modify the first dataframe ddf; I need to create ddf1 to see the changes. Is a dask dataframe an immutable object?
Simon Bosley
Puneet Tripathi

3 Answers

Answers:

  1. What you're doing now is almost OK. You don't need to call compute until you're ready for your final answer, so the apply can stay lazy:

    # ddf1 = ddf.assign(col1 = list(ddf.shop_week.apply(f).compute()))
    ddf1 = ddf.assign(col1 = ddf.shop_week.apply(f))
    

    In some cases, dd.Series.where might be a good fit:

    ddf1 = ddf.assign(col1 = ddf.shop_week.where(cond=ddf.balance > 0, other=0))
    
  2. As of version 0.10.2, you can insert columns directly into dask dataframes:

    ddf['col'] = ddf.shop_week.apply(f)
    
MRocklin

You could just use:

f = lambda x: 'THIS' if x == 200607 else 'NOT THIS' if x == 200608 else 'THAT' if x == 200609 else 'NONE'

And then:

ddf1 = ddf.assign(col1 = list(ddf.shop_week.apply(f).compute()))

Unfortunately I don't have an answer to the second question or I don't understand it...

Ohumeronen

A better approach might be to pull out the column as a dask array and then perform some nested where operations before adding it back to the dataframe:

import dask.array as da

x = ddf['shop_week'].to_dask_array()

ddf['TEST_VAR'] = \
    da.where(x == 200607, 'THIS',
    da.where(x == 200608, 'NOT THIS',
    da.where(x == 200609, 'THAT', 'NONE')))

ddf['TEST_VAR'].compute()
Zelazny7