I need to read nullable date values stored in integer-style 'YYYYMMDD' format into pandas and then save the DataFrame to Parquet with the column typed as Date32[Day], so that the AWS Glue crawler classifier (used with Athena) recognizes that column as a date. The code below fails when I try to save the column to Parquet from pandas:
import pandas as pd
dates = [None, "20200710", "20200711", "20200712"]
data_df = pd.DataFrame(dates, columns=['date'])
data_df['date'] = pd.to_datetime(data_df['date']).dt.date
data_df.to_parquet(r'my_path', engine='pyarrow')
I receive the following error:
Traceback (most recent call last):
File "", line 123, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow\array.pxi", line 265, in pyarrow.lib.array
File "pyarrow\array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type datetime.date)
If I move the None value to the end of the date list, this works without any issue and pyarrow infers the date column as Date32[Day]. My guess is that because the pandas column type for dt.date is object and the first value in the column is NaT (not a time), pyarrow cannot infer Date32[Day] from the DataFrame or from a sample value, and it infers the column as Integer instead. What is a good way to save this DataFrame column to Parquet as a Date32[Day] column without sorting the column values? Thanks.