I'm trying to save a DataFrame with a date-typed column to Parquet format for later use in Athena. As far as I understand, Parquet has a native DATE type, but the only type I can really use with the pyarrow engine is datetime64[ns] (the same issue is discussed at https://github.com/pandas-dev/pandas/issues/20089). I'd like the Athena schema to have a date type rather than datetime. Any suggestions?
-
Change the column type of dataframe first and then dump it to parquet – Shrey Nov 14 '19 at 10:38
-
If I keep the type as date, parquet schema saves it as null – kismsu Nov 14 '19 at 10:43
-
In my project I have kept it as a string in MM/DD/YYYY format. – Shrey Nov 14 '19 at 10:48
-
I know I can do that, but it would be nice to avoid type casting down the line – kismsu Nov 14 '19 at 10:51
-
Have you tried the latest version of Arrow? Looking at [Arrow's Pandas integration documentation](https://arrow.apache.org/docs/python/pandas.html), it seems that datetime.date can now be round-tripped. And it [appears](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L268) there is support for storing date columns in Parquet. – Micah Kornfield Nov 26 '19 at 07:21
-
You are right, @MicahKornfield. Thanks for pointing this out – kismsu Nov 27 '19 at 09:45
1 Answer
As mentioned in the comment, I believe Apache Arrow 0.15.1 now supports round-tripping dates between pandas and Parquet.

Micah Kornfield