
I am using PySpark 3.0.1 to generate parquet files.

When I execute the following command:

sparkDF.write.mode("overwrite").parquet(file_name)

The 9999-12-31 00:00:00.0000000 datetime is written as 1816-03-29 11:56:08.066277376 in the parquet file.

The 0001-01-01 00:00:00.0000000 datetime is written as 1754-08-29 04:43:41.128654848 in the parquet file.

In contrast, sparkDF.write.mode("overwrite").csv(file_name) outputs the correct datetime value in CSV format.
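
For reference, here is a minimal sketch of the kind of DataFrame involved (the column name ts and the output path are placeholders):

import datetime
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Two edge-of-range timestamps that come back shifted after the parquet round trip
sparkDF = spark.createDataFrame([
    Row(ts=datetime.datetime(9999, 12, 31)),
    Row(ts=datetime.datetime(1, 1, 1)),
])

sparkDF.write.mode("overwrite").parquet("/tmp/ts_repro.parquet")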

Does anybody know what is going on? Thanks.

Marco Rosas

  • I cannot reproduce this. Are you sure you're reading correct parquet files? – pltc Oct 27 '21 at 22:11
  • How are you reading the data back? This looks very similar to the issues discussed in https://stackoverflow.com/questions/69458623/pyarrow-parquet-saving-large-timestamp-incorrectly – Micah Kornfield Oct 29 '21 at 00:37

1 Answer


I believe the issue is that whatever system you are reading the files back with is misinterpreting, or overflowing while decoding, the INT96 timestamp format that Spark writes by default. You can make Spark write a more standard timestamp representation instead:

# Write timestamps as standard INT64 microseconds rather than the legacy INT96 format
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
sparkDF.write.mode("overwrite").parquet(file_name)

(Credit: How to save spark dataframe to parquet without using INT96 format for timestamp columns?)
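
For what it's worth, the specific dates in the question are consistent with this explanation. Here is a rough sketch of the arithmetic (plain Python, not Spark): treat each timestamp as nanoseconds since the Unix epoch and wrap it to the signed 64-bit range, the way an int64-nanosecond reader would overflow:

import datetime

def wrap_int64(ns):
    # Simulate two's-complement overflow of a 64-bit nanosecond counter
    return (ns + 2**63) % 2**64 - 2**63

epoch = datetime.datetime(1970, 1, 1)
for ts in (datetime.datetime(9999, 12, 31), datetime.datetime(1, 1, 1)):
    ns = (ts - epoch) // datetime.timedelta(microseconds=1) * 1000
    wrapped = wrap_int64(ns)
    print(ts, "->", epoch + datetime.timedelta(microseconds=wrapped // 1000))
# 9999-12-31 00:00:00 -> 1816-03-29 05:56:08.066277
# 0001-01-01 00:00:00 -> 1754-08-30 22:43:41.128654

The sub-second digits line up exactly with the values in the question; the remaining hour/day offsets would come from the writer's session time zone and from the Julian/Gregorian calendar rebasing Spark applies to such extreme dates when writing INT96.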

Micah Kornfield