
I am losing the index column after I use pivot_table on a Dask DataFrame and save the data to a Parquet file.

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame()
df["Index"] = [1, 2, 3, 1, 2, 3]
df["Field"] = ["A", "A", "A", "B", "B", "B"]
df["Value"] = [10, 20, 30, 100, 120, 130]
df

My dataframe:

   Index Field  Value
0      1     A     10
1      2     A     20
2      3     A     30
3      1     B    100
4      2     B    120
5      3     B    130

Dask code:

ddf = dd.from_pandas(df, npartitions=2)
ddf = ddf.categorize("Field")  # pivot_table requires the "columns" column to be categorical
ddf = ddf.pivot_table(values="Value", index="Index", columns="Field")
dd.to_parquet("1.parq", ddf)
dd.read_parquet("1.parq").compute()

This gives an error:

ValueError: Multiple possible indexes exist: ['A', 'B']. Please select one with index='index-name'

I can select A or B as index, but I am missing the Index column.

I tried dd.to_parquet("1.parq", ddf, write_index=True), but it gives me the following error:

TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

Can someone help me save the table with the column "Index" into the Parquet file?

PS:

ddf.pivot_table(values="Value", index="Index", columns="Field").compute() gives the expected result:

Field     A      B
Index             
1      10.0  100.0
2      20.0  120.0
3      30.0  130.0

And using Pandas is not a solution, because my data is 20 GB.

EDIT:

I tried

ddf.columns = list(ddf.columns)
dd.to_parquet("1.parq", ddf, write_index=True)

And it gives me a new error:

dask.async.TypeError: expected list of bytes

Google suggests that this kind of error comes from the Tornado asynchronous library.

keiv.fly

1 Answer


There are two problems here:

  1. pivot_table produces a categorical column index, because you made the original column "Field" categorical. Writing the index to Parquet calls reset_index on the dataframe, and pandas cannot insert a new label into a categorical columns index. You can avoid this with ddf.columns = list(ddf.columns) (see the minimal sketch after this list).

  2. The index column has object dtype but actually contains integers. The Parquet writer expects an object column to hold strings (bytes), which is why you see "expected list of bytes"; convert the column to an integer dtype.

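To see problem 1 in isolation, here is a minimal pandas sketch (the hand-built pivoted frame stands in for the pivot_table output; the raising behaviour matches pandas of that era, while recent pandas versions cast the column index to object instead of raising):

import pandas as pd

# Columns form a CategoricalIndex, as pivot_table produces after categorize("Field")
pivoted = pd.DataFrame(
    [[10.0, 100.0], [20.0, 120.0], [30.0, 130.0]],
    index=pd.Index([1, 2, 3], name="Index"),
    columns=pd.CategoricalIndex(["A", "B"], name="Field"),
)

try:
    pivoted.reset_index()  # tries to insert "Index" into the categorical columns
except TypeError as e:
    print(e)  # cannot insert an item into a CategoricalIndex that is not already an existing category

pivoted.columns = list(pivoted.columns)  # plain Index instead of CategoricalIndex
pivoted.reset_index()  # now succeeds
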
The whole block now looks like:

ddf = dd.from_pandas(df, npartitions=2)
ddf = ddf.categorize("Field")
ddf = ddf.pivot_table(values="Value", index="Index", columns="Field")
ddf.columns = list(ddf.columns)              # problem 1: replace the categorical column index
ddf = ddf.reset_index()                      # "Index" becomes a regular column
ddf["Index"] = ddf["Index"].astype("int64")  # problem 2: object -> int64
dd.to_parquet("1.parq", ddf)
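
Reading the file back should now keep the "Index" column (output sketched from the example data; exact row order may differ):

result = dd.read_parquet("1.parq").compute()
print(result)

   Index     A      B
0      1  10.0  100.0
1      2  20.0  120.0
2      3  30.0  130.0
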
mdurant
  • 1. Without categorize I have the following error for pivot_table: ValueError: 'columns' must be category dtype. 2. ddf.columns = list(ddf.columns) and write_index gave another error: dask.async.TypeError: expected list of bytes. So neither helped. – keiv.fly Mar 07 '17 at 17:10
  • I also tried ddf.columns=pd.Index(list(ddf.columns)). Now the class is the same as before, but it still throws an error: dask.async.TypeError: expected list of bytes. Do you have any other ideas? – keiv.fly Mar 07 '17 at 17:26
  • Edited. Obviously this could be more friendly. – mdurant Mar 07 '17 at 17:45
  • That worked, thanks! And it is great that there is someone who can answer questions about dask! As I understand it, you are involved in the development of dask. I think it would be a good idea to reduce the lines to ddf.pivot_table(); dd.to_parquet() – keiv.fly Mar 07 '17 at 21:09
  • I agree completely - please feel free to raise an [issue](https://github.com/dask/dask/issues). I must be honest and say I don't exactly understand what happens within dataframe.pivot_table. – mdurant Mar 07 '17 at 21:43
  • ddf['index'] = ddf.index.astype('int64') should be ddf['Index'] = ddf.Index.astype('int64') or else I get an error. "index" should be "Index". Not sure if you have a typo or not. – user798719 Aug 07 '17 at 06:06