I have a Dask DataFrame in the following format:

date       hour device  param     value
20190701    21  dev_01  att_1   0.000000
20190718    22  dev_01  att_2   20.000000
20190718    22  dev_01  att_3   18.611111
20190701    21  dev_01  att_4   18.706083
20190718    22  dev_01  att_5   23.333333

I am trying to pivot the DataFrame using Dask's DataFrame.pivot_table() API. However, I want to use 'date', 'hour' and 'device' as the index (i.e., in the pivoted table, each row would be uniquely identified by the date, hour and device identifier):

ddf.pivot_table(index = ['date', 'hour', 'device'], columns='param', values='value')

However, it's failing with the following error:

'index' must be the name of an existing column

As I understand from the API documentation (here), the 'index' parameter accepts the name of a single column (not a list), hence this error.

Is there an alternative way to pivot a Dask DataFrame using multiple columns as the index?

Arnab Biswas

1 Answer

As mentioned in the docstring, the column on which you pivot must be a single column, and it must be of categorical dtype. So, to accomplish what you want, you would have to convert your three columns into a single categorical column.

This is doable using normal Pandas syntax, but will likely require a full pass through the data to get the categories.

MRocklin