
I'm trying to write a pandas DataFrame to a partitioned Parquet file:

df.to_parquet('output.parquet', engine='pyarrow', partition_cols=['partone', 'partwo'])

TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'

From the documentation I expected that partition_cols would be passed through as a keyword argument to the pyarrow library. How can a partitioned file be written to local disk using pandas?

Ivan

3 Answers


First, make sure that you have reasonably recent versions of pandas and pyarrow:

pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0

Then you can use partition_cols to produce the partitioned parquet files:

import pandas as pd

# example dataframe with 3 rows and columns year,month,day,value
df = pd.DataFrame(data={'year':  [2020, 2020, 2021],
                        'month': [1,12,2], 
                        'day':   [1,31,28], 
                        'value': [1000,2000,3000]})

df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])

This produces:

mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
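
If it helps, here is a minimal sketch of reading the partitioned dataset back with pandas (the `filters` argument is forwarded to the engine and may require a recent pandas/pyarrow):

import pandas as pd

# Read the whole partitioned dataset back; the partition columns
# (year, month, day) are reconstructed from the directory names.
df2 = pd.read_parquet('./mydf')

# Optionally read only matching partitions; `filters` is passed
# through to pyarrow, so behaviour depends on your installed versions.
df_2020 = pd.read_parquet('./mydf', filters=[('year', '=', 2020)])
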
RubenLaguna

Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) and pq.write_table(table, ...) (see pandas.parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(yourData)  # yourData is a placeholder for your own data
table = pa.Table.from_pandas(df)

# root_path is created as a directory tree with one subdirectory
# per partition value, e.g. output.parquet/partone=.../parttwo=...
pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'parttwo'],
)

For more info, see the pyarrow documentation.

In general, I would always use the PyArrow API directly when reading / writing parquet files, since the Pandas wrapper is rather limited in what it can do.
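
As an example, here is a minimal read round-trip through the PyArrow API (the path matches the dataset written above; adjust it to your own layout):

import pyarrow.parquet as pq

# Read the partitioned dataset back; directories such as
# partone=a/parttwo=b are turned back into columns.
table = pq.read_table('output.parquet')
df = table.to_pandas()
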

ostrokach
  • I believe that I did with the `engine=pyarrow` option, and it seems that the default engine is `pyarrow` and not `fastparquet`: "engine : {‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’ Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable." https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_parquet.html – Ivan Oct 22 '18 at 18:49
  • Yes, you are right. They must have changed it in one of the recent versions. – ostrokach Oct 22 '18 at 18:56
  • 1
    Recent `pandas` has incorporated `partitioned_cols` and starts using `write_to_dataset` as well. – stucash Mar 31 '23 at 16:44

You need to update to pandas version 0.24 or above; support for partition_cols was added in that version.
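
A quick sanity check of the installed versions (a minimal sketch; upgrade with pip if pandas is older than 0.24):

import pandas as pd
import pyarrow

# partition_cols in DataFrame.to_parquet requires pandas >= 0.24
print(pd.__version__, pyarrow.__version__)
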

sharadlahoti