How do I save the dataframe shown at the end to parquet?
It was constructed this way:

import numpy as np
import pandas as pd

df_test = pd.DataFrame(np.random.rand(6,4))
df_test.columns = pd.MultiIndex.from_arrays([('A', 'A', 'B', 'B'), 
      ('c1', 'c2', 'c3', 'c4')], names=['lev_0', 'lev_1'])
df_test.to_parquet("c:/users/some_folder/test.parquet")

The last line of that code returns:

ValueError: parquet must have string column names

Should I assume I can't save a dataframe with column headers created by multi-indexes (of strings)? Thanks.

The dataframe looks like this:

lev_0         A                   B          
lev_1        c1        c2        c3        c4
0      0.713922  0.551404  0.289861  0.178739
1      0.693925  0.425073  0.660924  0.695474
2      0.280258  0.827231  0.282844  0.523069
3      0.424731  0.380963  0.462356  0.491140
4      0.786677  0.102935  0.382453  0.199056
5      0.783115  0.295409  0.236880  0.388399
techvslife
Comment: From pandas 1.2 this issue will be resolved. See [GH34777](https://github.com/pandas-dev/pandas/issues/34777). – cs95 Dec 19 '20 at 09:20

2 Answers

pyarrow can write a pandas DataFrame with MultiIndex columns to a parquet file directly:

import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa

df_test = pd.DataFrame(np.random.rand(6,4))
df_test.columns = pd.MultiIndex.from_arrays([('A', 'A', 'B', 'B'), 
      ('c1', 'c2', 'c3', 'c4')], names=['lev_0', 'lev_1'])
table = pa.Table.from_pandas(df_test)
pq.write_table(table, 'test.parquet')

df_test_read = pd.read_parquet('test.parquet')
cheekybastard
pandas >= 1.2

This issue is fixed as of pandas 1.2; see GH34777 (https://github.com/pandas-dev/pandas/issues/34777).

pd.__version__
# '1.2.0'

# Writing.
df_test

lev_0         A                   B          
lev_1        c1        c2        c3        c4
0      0.208907  0.875918  0.610843  0.155938
1      0.325854  0.271798  0.916347  0.368343
2      0.650087  0.238840  0.415166  0.218156
3      0.684763  0.075124  0.761239  0.567883
4      0.633933  0.362682  0.214050  0.955370
5      0.561144  0.017972  0.197339  0.251407

# Writes successfully
df_test.to_parquet('test.parquet')
# Reading.
pd.read_parquet('test.parquet')
 
lev_0         A                   B          
lev_1        c1        c2        c3        c4
0      0.208907  0.875918  0.610843  0.155938
1      0.325854  0.271798  0.916347  0.368343
2      0.650087  0.238840  0.415166  0.218156
3      0.684763  0.075124  0.761239  0.567883
4      0.633933  0.362682  0.214050  0.955370
5      0.561144  0.017972  0.197339  0.251407

To run this code, you need a parquet backend engine installed (namely pyarrow).
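
If upgrading to pandas 1.2 is not an option, one possible workaround (a different technique from the one above, sketched here under assumptions) is to flatten the column MultiIndex into plain strings before writing and rebuild it after reading. The helpers `flatten_columns` and `restore_columns` and the `"|"` separator are hypothetical names chosen for illustration, and the sketch assumes the separator never appears in a column label.

```python
import numpy as np
import pandas as pd


def flatten_columns(df, sep="|"):
    """Return a copy whose MultiIndex columns are joined into plain strings."""
    flat = df.copy()
    flat.columns = [sep.join(map(str, col)) for col in df.columns]
    return flat


def restore_columns(flat, names, sep="|"):
    """Rebuild MultiIndex columns by splitting the joined strings on sep."""
    restored = flat.copy()
    restored.columns = pd.MultiIndex.from_tuples(
        [tuple(col.split(sep)) for col in flat.columns], names=names
    )
    return restored


df_test = pd.DataFrame(np.random.rand(6, 4))
df_test.columns = pd.MultiIndex.from_arrays(
    [("A", "A", "B", "B"), ("c1", "c2", "c3", "c4")], names=["lev_0", "lev_1"]
)

flat = flatten_columns(df_test)
# With string column names, the write and read would go here, e.g.:
# flat.to_parquet("test.parquet")
# flat = pd.read_parquet("test.parquet")
restored = restore_columns(flat, names=list(df_test.columns.names))
```

Note that splitting turns every label back into a string, so non-string labels (for example integers) would not come back with their original type.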

cs95