How do I save multi-indexed pandas dataframes to parquet?

Question

How do I save the dataframe shown at the end to parquet?
It was constructed this way:

import numpy as np
import pandas as pd

df_test = pd.DataFrame(np.random.rand(6,4))
df_test.columns = pd.MultiIndex.from_arrays([('A', 'A', 'B', 'B'),
      ('c1', 'c2', 'c3', 'c4')], names=['lev_0', 'lev_1'])
df_test.to_parquet("c:/users/some_folder/test.parquet")

The last line of that code raises:

ValueError: parquet must have string column names

Should I assume that a dataframe whose column headers are a MultiIndex (of strings) can't be saved to parquet? Thanks.

The dataframe looks like this:

lev_0         A                   B          
lev_1        c1        c2        c3        c4
0      0.713922  0.551404  0.289861  0.178739
1      0.693925  0.425073  0.660924  0.695474
2      0.280258  0.827231  0.282844  0.523069
3      0.424731  0.380963  0.462356  0.491140
4      0.786677  0.102935  0.382453  0.199056
5      0.783115  0.295409  0.236880  0.388399

From pandas 1.2 this issue will be resolved. See [GH34777](https://github.com/pandas-dev/pandas/issues/34777). — cs95, Dec 19 '20 at 09:20

cheekybastard · Answer 1 · 2020-07-22T00:54:47.777

pyarrow can write a pandas dataframe with MultiIndex columns to a parquet file.

import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa

df_test = pd.DataFrame(np.random.rand(6,4))
df_test.columns = pd.MultiIndex.from_arrays([('A', 'A', 'B', 'B'), 
      ('c1', 'c2', 'c3', 'c4')], names=['lev_0', 'lev_1'])
table = pa.Table.from_pandas(df_test)
pq.write_table(table, 'test.parquet')

df_test_read = pd.read_parquet('test.parquet')

edited Jul 22 '20 at 00:54

answered May 06 '19 at 04:35

Can multi-index handling be achieved using the built-in `pandas.DataFrame.to_parquet`? – Nyxynyx Oct 29 '19 at 16:06
@cheekybastard what is pa in `pa.Table.from_pandas(df_test)`? – AleB May 06 '20 at 14:37
@AleB pa is pyarrow: "import pyarrow as pa". Thanks for spotting that missing import – cheekybastard Jul 22 '20 at 00:54

cs95 · Answer 2 · 2021-01-02T07:20:38.210

pandas >= 1.2

As of pandas 1.2 this issue is fixed; see [GH34777](https://github.com/pandas-dev/pandas/issues/34777).

pd.__version__
# '1.2.0'

# Writing.
df_test

lev_0         A                   B          
lev_1        c1        c2        c3        c4
0      0.208907  0.875918  0.610843  0.155938
1      0.325854  0.271798  0.916347  0.368343
2      0.650087  0.238840  0.415166  0.218156
3      0.684763  0.075124  0.761239  0.567883
4      0.633933  0.362682  0.214050  0.955370
5      0.561144  0.017972  0.197339  0.251407

# Writes successfully
df_test.to_parquet('test.parquet')

# Reading.
pd.read_parquet('test.parquet')
 
lev_0         A                   B          
lev_1        c1        c2        c3        c4
0      0.208907  0.875918  0.610843  0.155938
1      0.325854  0.271798  0.916347  0.368343
2      0.650087  0.238840  0.415166  0.218156
3      0.684763  0.075124  0.761239  0.567883
4      0.633933  0.362682  0.214050  0.955370
5      0.561144  0.017972  0.197339  0.251407

To run this code, you'll need a backend engine for parquet (namely pyarrow).
