
I have a spark dataframe that I want to save as parquet then load it using the parquet-avro library.

There is a timestamp column in my dataframe that is converted to an INT96 timestamp column in Parquet. However, parquet-avro does not support the INT96 format and throws an exception.

Is there a way to avoid this? Is it possible to change the format Spark uses when writing timestamps to Parquet to something supported by parquet-avro?

I currently use:

data_frame.write.parquet("path")

1 Answer


Reading the Spark source code, I found the spark.sql.parquet.outputTimestampType property:

spark.sql.parquet.outputTimestampType:
Sets which Parquet timestamp type to use when Spark writes data to Parquet files.
INT96 is a non-standard but commonly used timestamp type in Parquet.
TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number of microseconds from the Unix epoch.
TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.

So I can do the following:

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
data_frame.write.parquet("path")
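If you'd rather not change application code, the same property can also be supplied at submit time with spark-submit's --conf flag (a sketch; my_job.py is a hypothetical script name):

```shell
# Set the Parquet timestamp output type for the whole job,
# so every write.parquet() call uses TIMESTAMP_MICROS instead of INT96.
spark-submit \
  --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS \
  my_job.py
```

Note that this only affects files written after the setting is applied; existing Parquet files keep their INT96 columns and would need to be rewritten.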
  • Aren't you going to accept your own answer? :P Also thanks, this was helpful. – Sal Aug 17 '20 at 20:17
  • It's a shame that there isn't an option to just write as String. It's also strange that Spark 3.0 is outputting timestamps in INT96 without asking... Thanks for the answer, it did exactly what I wanted. – Ben Watson Sep 22 '20 at 14:01
  • Similar issue running spark 2.4 -- `Parquet does not support date. See HIVE-6384`. Unfortunately this answer changes nothing in my case. – Wassadamo Sep 23 '20 at 08:02