I am trying to use Pandas and PyArrow to write data to Parquet. I have hundreds of Parquet files that don't all need to share the same schema, but whenever a column name appears in more than one file, it must have the same data type everywhere.
I'm running into situations where the resulting Parquet data types are not what I want them to be. For example, I may write an int64 column and the resulting Parquet file stores it as double. This is causing a lot of trouble on the processing side, where 99% of the data is typed correctly but the remaining 1% has the wrong type.
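For context, this is how I'm checking the resulting types ('example.parquet' is just a placeholder path):

import pandas
import pyarrow.parquet as pq

# Write a small all-int column and inspect the schema PyArrow reports
pandas.DataFrame({'a': [5100, 5200, 5300]}).to_parquet('example.parquet')
print(pq.read_schema('example.parquet'))  # expected: a: int64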
I've tried importing NumPy and wrapping the values like this:
import numpy as np
import pandas

# Wrap every value in np.int64 in the hope of pinning the column dtype
pandas.DataFrame({
    'a': [np.int64(5100), np.int64(5200), np.int64(5300)]
})
But I'm still getting the occasional double, so this must be the wrong approach. How can I ensure that data types stay consistent for matching columns across Parquet files?
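To make "consistent" concrete, this is the kind of check I want every file to pass; the helper name and file paths below are made up for illustration:

import pyarrow.parquet as pq

# Hypothetical helper: flag any column whose Parquet type differs between files
def check_consistent(paths):
    seen = {}  # column name -> (type, file where first seen)
    for path in paths:
        for field in pq.read_schema(path):
            if field.name in seen:
                first_type, first_path = seen[field.name]
                if field.type != first_type:
                    print(f'{field.name}: {first_type} in {first_path} '
                          f'vs {field.type} in {path}')
            else:
                seen[field.name] = (field.type, path)

check_consistent(['part-000.parquet', 'part-001.parquet'])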
Update:
I found this only happens when the column contains one or more Nones.
data_frame = pandas.DataFrame({
    'a': [None, np.int64(5200), np.int64(5200)]
})
# data_frame['a'].dtype comes out as float64, and the Parquet column as double
Can Parquet not handle columns that mix None with int64 values?
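For completeness, here's a minimal reproduction of the promotion, plus pandas' nullable Int64 extension dtype, which I'm guessing may be the answer (I haven't verified the Parquet round trip):

import numpy as np
import pandas

# With a plain list, pandas promotes the column to float64 because NumPy
# integer arrays can't represent missing values
df = pandas.DataFrame({'a': [None, np.int64(5200), np.int64(5200)]})
print(df['a'].dtype)  # float64 -> written to Parquet as double

# pandas' nullable Int64 extension dtype keeps integers alongside nulls;
# my assumption is that PyArrow writes it as int64 rather than double
df2 = pandas.DataFrame({'a': [None, 5200, 5200]}, dtype='Int64')
print(df2['a'].dtype)  # Int64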