I'm trying to save a DataFrame with a date-type column to Parquet format, to be used later in Athena. As far as I understand, Parquet has a native DATE type, but the only type I can really use with the pyarrow engine is datetime64[ns] (the same issue is discussed at https://github.com/pandas-dev/pandas/issues/20089). The problem is that I'd like to have a date type rather than a datetime in the Athena schema. Any suggestions?

kismsu
  • Change the column type of the DataFrame first and then dump it to Parquet – Shrey Nov 14 '19 at 10:38
  • If I keep the type as date, the Parquet schema saves it as null – kismsu Nov 14 '19 at 10:43
  • In my project I have kept it as a string in MM/DD/YYYY format. – Shrey Nov 14 '19 at 10:48
  • I know I can do that, but it would be nice to avoid type casting down the line – kismsu Nov 14 '19 at 10:51
  • Have you tried the latest version of Arrow? Looking at [Arrow's Pandas integration documentation](https://arrow.apache.org/docs/python/pandas.html), it seems like datetime.date can now be round-tripped. And it [appears](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L268) there is support for storing date columns in Parquet. – Micah Kornfield Nov 26 '19 at 07:21
  • You are right, @MicahKornfield. Thanks for pointing this out – kismsu Nov 27 '19 at 09:45

1 Answer

As mentioned in the comment, I believe Apache Arrow 0.15.1 now supports round-tripping dates between Pandas and Parquet.

Micah Kornfield