I am trying to use Pandas and PyArrow to write data to Parquet. I have hundreds of Parquet files that don't all need to share the same schema, but whenever a column name appears in more than one file, it must have the same data type everywhere.
I'm running into situations where the resulting Parquet data types are not what I want them to be. For example, I may write an int64 column and the resulting Parquet file stores it as double. This is causing a lot of trouble on the processing side, where 99% of the data is typed correctly but the remaining 1% has the wrong type.
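For context, this is how I'm checking the resulting types ('example.parquet' is just a placeholder path):

import pandas
import pyarrow.parquet as pq

# Write a small all-int column and inspect the schema PyArrow reports
pandas.DataFrame({'a': [5100, 5200, 5300]}).to_parquet('example.parquet')
print(pq.read_schema('example.parquet'))  # expected: a: int64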
I've tried importing NumPy and wrapping the values like this:
import numpy as np
import pandas

# Wrap every value in np.int64 in the hope of pinning the column dtype
pandas.DataFrame({
    'a': [np.int64(5100), np.int64(5200), np.int64(5300)]
})
But I'm still getting the occasional double, so this must be the wrong approach. How can I ensure that data types stay consistent for matching columns across Parquet files?
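To make "consistent" concrete, this is the kind of check I want every file to pass; the helper name and file paths below are made up for illustration:

import pyarrow.parquet as pq

# Hypothetical helper: flag any column whose Parquet type differs between files
def check_consistent(paths):
    seen = {}  # column name -> (type, file where first seen)
    for path in paths:
        for field in pq.read_schema(path):
            if field.name in seen:
                first_type, first_path = seen[field.name]
                if field.type != first_type:
                    print(f'{field.name}: {first_type} in {first_path} '
                          f'vs {field.type} in {path}')
            else:
                seen[field.name] = (field.type, path)

check_consistent(['part-000.parquet', 'part-001.parquet'])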
Update:
I found this only happens when the column contains one or more Nones.
data_frame = pandas.DataFrame({
    'a': [None, np.int64(5200), np.int64(5200)]
})
# data_frame['a'].dtype comes out as float64, and the Parquet column as double
Can Parquet not handle columns that mix None with int64 values?
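For completeness, here's a minimal reproduction of the promotion, plus pandas' nullable Int64 extension dtype, which I'm guessing may be the answer (I haven't verified the Parquet round trip):

import numpy as np
import pandas

# With a plain list, pandas promotes the column to float64 because NumPy
# integer arrays can't represent missing values
df = pandas.DataFrame({'a': [None, np.int64(5200), np.int64(5200)]})
print(df['a'].dtype)  # float64 -> written to Parquet as double

# pandas' nullable Int64 extension dtype keeps integers alongside nulls;
# my assumption is that PyArrow writes it as int64 rather than double
df2 = pandas.DataFrame({'a': [None, 5200, 5200]}, dtype='Int64')
print(df2['a'].dtype)  # Int64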