I'm trying to use pivot_table in Dask with the following dataframe:

         date  store_nbr  item_nbr  unit_sales  year  month
0  2013-01-01         25    103665         7.0  2013      1
1  2013-01-01         25    105574         1.0  2013      1
2  2013-01-01         25    105575         2.0  2013      1
3  2013-01-01         25    108079         1.0  2013      1
4  2013-01-01         25    108701         1.0  2013      1

When I call pivot_table as follows:

ddf.pivot_table(values='unit_sales', index={'store_nbr','item_nbr'}, 
                                  columns={'year','month'}, aggfunc={'mean','sum'})

I get this error:

ValueError: 'index' must be the name of an existing column

And if I pass only a single value for the index and columns parameters, as follows:

df.pivot_table(values='unit_sales', index='store_nbr', 
                                  columns='year', aggfunc={'sum'})

I get this error:

ValueError: 'columns' must be category dtype
ambigus9

1 Answer

That error is telling you that Dask dataframe expects the column passed to the columns keyword to have a categorical dtype. Dask needs this so that it can define the output columns correctly up front, even during lazy operation. You can accomplish this as follows:

df = df.categorize(columns=['year'])
MRocklin
  • Thanks! Can I use multiple columns and also multiple index to pivot? – ambigus9 Mar 25 '18 at 21:44
  • When I do that, it returns me this error: `/home/miguel/anaconda3/lib/python3.6/site-packages/dask/dataframe/categorical.py:24: RuntimeWarning: None of the categories were found in values. Did you mean to use 'Categorical.from_codes(codes, categories)'? df[col] = pd.Categorical(df[col], categories=vals, ordered=False)` – ambigus9 Mar 25 '18 at 21:48
  • @Ambigus9 Seems like I can't use multiple indexes either but you can group/concat your desired columns together to be an index... – Apichart Thanomkiet Jan 10 '20 at 16:28
  • Yeah, no, it is not working. Even with `categorize` I get the same error. – Soerendip May 04 '21 at 01:19
  • Do I have to convert the dask `ddf` first to a pandas `df`? That seems inefficient. – Soerendip May 04 '21 at 01:23