
I am reading a CSV file and writing it into a Parquet file partitioned by a column. After reading from the CSV file, this is what I get:

>>> df.printSchema()
root
 |-- col1: string (nullable = true)
 |-- col5: double (nullable = true)
 |-- col6: timestamp (nullable = true)
 |-- col7: string (nullable = true)

>>> df.show()
+----+-----+-------------------+-------------+                                  
|col1| col5|               col6|         col7|
+----+-----+-------------------+-------------+
|   f| 3.34|1970-01-01 00:00:00|this is test3|
|   f| 2.13|1980-02-05 00:00:00|this is test3|
|   f|12.13|1981-02-05 00:00:00|this is test3|
|   e|  2.3|1982-03-05 00:00:00|this is test3|
|   e|  2.3|1983-04-12 00:00:00|this is test3|
|   e|212.0|1984-05-04 00:00:00|this is test3|
|   e| 2.13|1985-01-10 00:00:00|this is test3|
+----+-----+-------------------+-------------+
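
For reference, this is roughly how the DataFrame above was read in the first place; the file name is a placeholder, not from the original:

>>> # placeholder path; with inferSchema, col6 gets picked up as a timestamp
>>> df = spark.read.csv("<some_location>/testData.csv", header=True, inferSchema=True)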

When I use this DataFrame to write a partitioned Parquet file, it gets written successfully, but when I read it back and display the values, the timestamp column comes out as NULL, even though its datatype is still timestamp:

>>> df.write.partitionBy("col1").mode("append").parquet("<some_location>/testparquetData/")
>>> df1 = spark.read.parquet("<some_location>/testparquetData/")   

>>> df1.show()
+-----+----+-------------+----+                                                 
| col5|col6|         col7|col1|
+-----+----+-------------+----+
|  2.3|null|this is test3|   e|
|  2.3|null|this is test3|   e|
|212.0|null|this is test3|   e|
| 2.13|null|this is test3|   e|
| 3.34|null|this is test3|   f|
| 2.13|null|this is test3|   f|
|12.13|null|this is test3|   f|
+-----+----+-------------+----+

>>> df1.printSchema()
root
 |-- col5: double (nullable = true)
 |-- col6: timestamp (nullable = true)
 |-- col7: string (nullable = true)
 |-- col1: string (nullable = true)

I am not sure what exactly is happening here.

Initially I was reading the CSV file with inferSchema=true and thought that might be the reason, so I explicitly passed the schema and read the file again, but after writing to the Parquet file and reading it back, the result still comes out as null.
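
This is roughly what the explicit-schema read looked like (the path is again a placeholder, and only the columns shown above are listed):

>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
>>> schema = StructType([
...     StructField("col1", StringType(), True),
...     StructField("col5", DoubleType(), True),
...     StructField("col6", TimestampType(), True),
...     StructField("col7", StringType(), True),
... ])
>>> df = spark.read.csv("<some_location>/testData.csv", header=True, schema=schema)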

Can somebody help me figure out what I am missing here?

1 Answer


Hi, I could solve the issue. The problem seems to be that the default timestamp type Spark uses when writing Parquet is INT96, which didn't support the format I was passing in any of the cases I tried. So we need to set the config below, and after executing the code again we get proper values:

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

This is probably something we have to do each time we are dealing with timestamps in Parquet.
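
For completeness, the full round trip after setting the config; the fresh output location is a placeholder, used so the earlier INT96 files don't get mixed in with the new ones:

>>> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
>>> df.write.partitionBy("col1").mode("append").parquet("<some_location>/testparquetDataFixed/")
>>> spark.read.parquet("<some_location>/testparquetDataFixed/").show()

TIMESTAMP_MILLIS is also accepted for this config if millisecond precision is enough.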

I referred to this link as well:

How to save spark dataframe to parquet without using INT96 format for timestamp columns?