In pandas, if I want to create a column of conditional dummies (say, 1 if a variable equals a given string and 0 if it does not), my go-to is:

data["ebt_dummy"] = np.where((data["paymenttypeid"]=='ebt'), 1, 0)

Naively trying this on a Dask DataFrame throws an error. Following the directions in the documentation for map_partitions also throws an error:

data = data.map_partitions(lambda df: df.assign(ebt_dummy = np.where((df["paymenttypeid"]=='ebt'), 1, 0)),  meta={'paymenttypeid': 'str', 'ebt_dummy': 'i8'})

What is a good way, or the most Dask-thonic way, of doing this?

– sfortney

3 Answers


Here's some sample data to play with:

In [1]:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(np.transpose([np.random.choice(['ebt','other'], (10)),
              np.random.rand(10)]), columns=['paymenttypeid','other'])

df

Out[1]:

  paymenttypeid                 other
0         other    0.3130770966143612
1         other    0.5167434068096931
2           ebt    0.7606898392115471
3           ebt    0.9424572692382547
4           ebt     0.624282017575857
5           ebt    0.8584841824784487
6         other    0.5017083765654611
7         other  0.025994123211164233
8           ebt   0.07045354449612984
9           ebt   0.11976351556850084

Let's convert this to a Dask DataFrame

In [2]: data = dd.from_pandas(df, npartitions=2)

and use apply (on a Series) to assign:

In [3]:
data['ebt_dummy'] = data.paymenttypeid.apply(lambda x: 1 if x == 'ebt' else 0, meta=('paymenttypeid', 'i8'))
data.compute()

Out [3]:
  paymenttypeid                 other  ebt_dummy
0         other    0.3130770966143612          0
1         other    0.5167434068096931          0
2           ebt    0.7606898392115471          1
3           ebt    0.9424572692382547          1
4           ebt     0.624282017575857          1
5           ebt    0.8584841824784487          1
6         other    0.5017083765654611          0
7         other  0.025994123211164233          0
8           ebt   0.07045354449612984          1
9           ebt   0.11976351556850084          1

Update:

It seems that the meta you pass is the problem, since this works:

data = data.map_partitions(lambda df: df.assign(
                                    ebt_dummy = np.where((df["paymenttypeid"]=='ebt'), 1, 0)))

data.compute()

In my example, if I wanted to specify the meta, I would have to pass the dtypes of the current data, not the ones I expect once I assign:

data.map_partitions(lambda df: df.assign(
                                    ebt_dummy = np.where((df["paymenttypeid"]=='ebt'), 1, 0)), 
               meta={'paymenttypeid': 'str', 'other': 'float64'})
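
Note that this reflects the behavior of the Dask version used at the time. In more recent Dask releases, meta is checked against the actual output of the function, so you would list every output column, including the newly assigned one. A sketch, assuming such a version:

data = data.map_partitions(
    # assign the dummy column exactly as above
    lambda df: df.assign(ebt_dummy=np.where(df["paymenttypeid"] == 'ebt', 1, 0)),
    # meta describes the full output frame: the original columns plus ebt_dummy
    meta={'paymenttypeid': 'str', 'other': 'float64', 'ebt_dummy': 'i8'})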
– Julien Marrec
  • Thanks for the answer! Is there a vectorized way to do this? I had thought of this, but doing a row-by-row apply on anything other than a small dataset is quite prohibitive. – sfortney Nov 22 '16 at 23:08
  • I updated my answer to do it with `map_partitions` (never used Dask until now...). I suggest you revise your `meta` keyword. The above works fine. – Julien Marrec Nov 22 '16 at 23:21
  • Yeah, that's what it was. I was misunderstanding the meta. Thanks for the help! – sfortney Nov 22 '16 at 23:51
  • Just a final addendum for anyone who stumbles upon this in the future: oddly enough, and in direct contradiction to my first critique, the row-by-row apply seems to be faster and to use less RAM than the vectorized method, at least in tests with my data. It must be an odd quirk of Dask (or it could have to do with making fewer total copies of the data). – sfortney Nov 23 '16 at 00:39
  • I think that's going to depend on a few key factors, including how big your data is, what hardware you have (with dozens and dozens of cores and large data it will likely be more cost-effective), and how you've configured Dask. Also, if you perform multiple operations before calling compute, that's obviously better too. – Julien Marrec Nov 23 '16 at 00:42
  • Good points all. I just found it interesting in light of the typical vectorized > row-wise rule of thumb. – sfortney Nov 23 '16 at 01:01
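
For anyone who wants to reproduce that apply-versus-map_partitions comparison on their own data, a minimal timing sketch (assuming the `data` frame built above; IPython's %timeit works just as well):

import time

candidates = {
    'apply': lambda: data.paymenttypeid.apply(
        lambda x: 1 if x == 'ebt' else 0, meta=('paymenttypeid', 'i8')),
    'map_partitions': lambda: data.map_partitions(lambda df: df.assign(
        ebt_dummy=np.where(df["paymenttypeid"] == 'ebt', 1, 0))),
}

for label, build in candidates.items():
    start = time.time()
    build().compute()  # forces the actual computation
    print(label, time.time() - start)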

This also worked for me (note that np.where here runs eagerly on the in-memory pandas df, so it only helps while the data still fits in memory):

data['ebt_dummy'] = dd.from_array(np.where((df["paymenttypeid"]=='ebt'), 1, 0))
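
A fully lazy alternative sketch: cast the boolean comparison directly, since comparisons and astype on a Dask Series mirror the pandas API:

data['ebt_dummy'] = (data['paymenttypeid'] == 'ebt').astype('i8')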
– tmsss

I believe what you're looking for is a ternary operation. For numerics, something like this should work.

import dask.dataframe as dd
import typing as t

def ternary(conditional: dd.Series, option_true: t.Union[float, int], option_false: t.Union[float, int]) -> dd.Series:
    # True/False behave as 1/0 in arithmetic, so this lazily picks
    # option_true where the condition holds and option_false elsewhere.
    return conditional * option_true + (~conditional) * option_false

data["ebt_dummy"] = ternary(data["paymenttypeid"]=='ebt', 1, 0)