
I am converting large CSV files into Parquet files for further analysis. I read the CSV data into Pandas, specifying the column dtypes as follows:

_dtype = {"column_1": "float64",
          "column_2": "category",
          "column_3": "int64",
          "column_4": "int64"}

df = pd.read_csv("data.csv", dtype=_dtype)

I then do some more data cleaning and write the data out to Parquet for downstream use:

_parquet_kwargs = {"engine": "pyarrow",
                   "compression": "snappy",
                   "index": False}

df.to_parquet("data.parquet", **_parquet_kwargs)

But when I read the data back into Pandas for further analysis using pd.read_parquet, I cannot seem to recover the category dtypes. The following

df = pd.read_parquet("data.parquet")

results in a DataFrame with object dtypes in place of the desired category.
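For reference, the category dtype can be restored manually after reading, though that defeats the purpose of storing it in the file (a minimal sketch using the column names above):

import pandas as pd

df = pd.read_parquet("data.parquet")
df["column_2"] = df["column_2"].astype("category")  # cast back by hand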

The following seems to work as expected:

import pyarrow.parquet as pq

_table = (pq.ParquetFile("data.parquet")
            .read(use_pandas_metadata=True))

df = _table.to_pandas(strings_to_categorical=True)

However, I would like to know how this can be done using pd.read_parquet.

davidrpugh

2 Answers

This is fixed in Arrow 0.15; the following code now keeps the columns as categories (and the performance is significantly faster):

import pandas

df = pandas.DataFrame({'foo': list('aabbcc'),
                       'bar': list('xxxyyy')}).astype('category')

df.to_parquet('my_file.parquet')
df = pandas.read_parquet('my_file.parquet')
df.dtypes
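With pyarrow >= 0.15 installed, this should report both columns as category, roughly:

foo    category
bar    category
dtype: object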
Marc Garcia
  • Does this depend on which engine is used? (`pyarrow` vs `fastparquet`) – BallpointBen Mar 08 '20 at 15:31
  • Yes, that's for pyarrow, not sure about fastparquet – Marc Garcia Mar 18 '20 at 13:06
  • For a 100 MB CSV file, do you know if it's worth switching to Parquet? – Topde Sep 03 '20 at 12:50
  • @Topde sure, switching to Parquet not only brings speed but also preserves column types and the structure of the data. Parquet files can store a multi-index and column types (especially categorical and datetime), which makes it sensible to use Parquet over CSV for the correctness of the data (and also speed). – Furkan Tektas Nov 11 '21 at 11:24
  • >"preserves types of columns". Unfortunately, with Pandas it doesn't preserve the types, as shown by this question and some others. Try `pandas.DataFrame({'foo': [1, 2, 3, 3, 1]}).astype("category")`, write it out, and read it back: `df.dtypes` shows `category` before, but after reloading from Parquet it's back to `int64`. Tested with pyarrow==9.0.0. – Ark-kun Aug 19 '22 at 23:37
  • Parquet (at least the pyarrow-powered implementation) doesn't preserve the types of columns in a DataFrame. That can easily be tested. – Kiryl A. Feb 28 '23 at 20:59

We are having a similar problem. When working with a multi-file Parquet dataset, our workaround is as follows. Per the Table.to_pandas() documentation, the following code may be relevant:

import pyarrow.parquet as pq
dft = pq.read_table('path/to/data_parquet/', use_pandas_metadata=True)
df = dft.to_pandas(categories=['column_2'])

Note that use_pandas_metadata also works for the datetime64[ns] dtype.
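For example (a minimal sketch; the file name and column are hypothetical):

import pandas as pd
import pyarrow.parquet as pq

# hypothetical frame with a datetime column
df = pd.DataFrame({"ts": pd.date_range("2020-01-01", periods=3)})
df.to_parquet("ts.parquet")

table = pq.read_table("ts.parquet", use_pandas_metadata=True)
print(table.to_pandas().dtypes)  # ts should come back as datetime64[ns]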

foglerit
skibee