I'm trying to use get_dummies via dask but it does not transform my variable, nor does it error out:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_d.head()
   uid gender
0    1      M
1    2    NaN
2    3    NaN
3    4      F
4    5    NaN
>>> daskDataCategorical = df_d[['gender']]
>>> daskDataDummies = dd.get_dummies(daskDataCategorical) 
>>> daskDataDummies.head()
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
>>> daskDataDummies.compute() 
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
5      F
6      M
7      F
8      M
9      F
>>>

The pandas equivalent (run in a new terminal just in case) is:

>>> import pandas as pd
>>> df_p = pd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_p.head()
   uid gender
0    1      M
1    2    NaN
2    3    NaN
3    4      F
4    5    NaN
>>> pandasDataCategorical = df_p[['gender']]
>>> pandasDataDummies = pd.get_dummies(pandasDataCategorical)
>>> pandasDataDummies.head()
   gender_F  gender_M
0       0.0       1.0
1       0.0       0.0
2       0.0       0.0
3       1.0       0.0
4       0.0       0.0
>>> 

My understanding of this resolved issue is that it should work, but does the data have to be pulled into pandas first? If so, that defeats the purpose of using dask, since my datasets (~500GB) won't fit into a pandas dataframe. Am I misreading this? TIA.

Frank B.

1 Answer

You'll want to convert your column of strings to a Categorical before trying to use get_dummies. This pull request added a dask.dataframe.get_dummies which, unlike pd.get_dummies, raises an error if you pass it object (string) columns.

To get a Categorical, you can either call .categorize() before dd.get_dummies or, with pandas >= 0.19, read in your CSV with the dtype keyword, like so:

df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv', dtype={"gender": "category"})

Here's a small example:

In [2]: import dask.dataframe as dd

In [3]: bad = dd.from_pandas(pd.DataFrame({"A": ['a', 'b', 'a', 'b', 'c']}), npartitions=2)

In [4]: bad.head()
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/core.py:3699: UserWarning: Insufficient elements for `head`. 5 elements requested, only 3 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(msg.format(n, len(r)))
Out[4]:
   A
0  a
1  b
2  a

In [5]: dd.get_dummies(bad)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-5-651de6dd308c> in <module>()
----> 1 dd.get_dummies(bad)

/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
     68         if columns is None:
     69             if (data.dtypes == 'object').any():
---> 70                 raise NotImplementedError(not_cat_msg)
     71             columns = data._meta.select_dtypes(include=['category']).columns
     72         else:

NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.

In [7]: dd.get_dummies(bad.categorize()).compute()
Out[7]:
   A_a  A_b  A_c
0    1    0    0
1    0    1    0
2    1    0    0
3    0    1    0
4    0    0    1

Dask requires categoricals for get_dummies because it needs to know all of the new dummy-variables it needs to create. pandas doesn't have to worry about this since all of your data is already in memory.
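To see why the categories have to be known up front, here is a rough pandas-only sketch that treats two small frames as if they were two partitions. `part1`, `part2`, and `cat_dtype` are illustrative names, and this is only an analogy for what happens per partition, not dask's actual implementation.

```python
import pandas as pd

# Two "partitions" of the same logical column, each seeing different values.
part1 = pd.DataFrame({"A": ["a", "b", "a"]})
part2 = pd.DataFrame({"A": ["b", "c"]})

# With plain strings, each chunk only creates dummies for the values it sees,
# so the resulting column sets disagree.
cols1 = list(pd.get_dummies(part1).columns)
cols2 = list(pd.get_dummies(part2).columns)

# With a categorical dtype that lists every category, each chunk produces
# the same columns, including ones for values it never saw.
cat_dtype = pd.CategoricalDtype(categories=["a", "b", "c"])
fixed1 = list(pd.get_dummies(part1.astype({"A": cat_dtype})).columns)
fixed2 = list(pd.get_dummies(part2.astype({"A": cat_dtype})).columns)

print(cols1, cols2)
print(fixed1, fixed2)
```

The first pair of column lists differ, while the second pair match, which is the property dask needs in order to stitch the partitions back together.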

TomAugspurger
  • Hi Tom, thanks for the reply; that makes sense. While your example works, adding .categorize() to my example gives me: Traceback (most recent call last): File "", line 1, in AttributeError: 'Series' object has no attribute 'categorize' – Frank B. Jan 25 '17 at 23:16
  • You should be able to `dd.get_dummies(data.to_frame().categorize())` – TomAugspurger Jan 26 '17 at 12:09
  • Sorry, that gives me this error: `raise AttributeError("'DataFrame' object has no attribute %r" % key) AttributeError: 'DataFrame' object has no attribute 'to_frame'` – Frank B. Jan 26 '17 at 14:19
  • Are you sure you're running the same code in both places? Your first comment had an issue because you had a Series instead of a DataFrame; your second comment has an issue because you already have a DataFrame. – TomAugspurger Jan 26 '17 at 17:32