
I am trying to read a pretty-printed JSON file that has time fields in it. I want to interpret the timestamp columns as timestamp fields while reading the JSON itself. However, they are still read as strings when I call printSchema.

E.g. Input json file -

[{
    "time_field" : "2017-09-30 04:53:39.412496Z"
}]

Code -

df = spark.read.option("multiLine", "true").option("timestampFormat","yyyy-MM-dd HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')

Output of df.printSchema() -

root
 |-- time_field: string (nullable = true)

What am I missing here?

van_d39

2 Answers


My own experience with the timestampFormat option is that it doesn't quite work as advertised. I would simply read the time fields as strings and use to_timestamp to do the conversion, as shown below (with a slightly generalized sample input):

# /path/to/jsonfile
[{
    "id": 101, "time_field": "2017-09-30 04:53:39.412496Z"
},
{
    "id": 102, "time_field": "2017-10-01 01:23:45.123456Z"
}]

In Python:

from pyspark.sql.functions import to_timestamp

df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")

df = df.withColumn("timestamp", to_timestamp("time_field"))

df.show(2, False)
+---+---------------------------+-------------------+
|id |time_field                 |timestamp          |
+---+---------------------------+-------------------+
|101|2017-09-30 04:53:39.412496Z|2017-09-30 04:53:39|
|102|2017-10-01 01:23:45.123456Z|2017-10-01 01:23:45|
+---+---------------------------+-------------------+

df.printSchema()
root
 |-- id: long (nullable = true)
 |-- time_field: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
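Note that the displayed timestamp column has lost the fractional seconds. For reference, the sample value itself carries microsecond precision, which can be seen by parsing it with Python's standard library (a standalone sketch, not Spark code; the literal "Z" suffix is accepted by %z as of Python 3.7):

```python
from datetime import datetime, timezone

# Parse the sample value from the JSON file; %f captures the
# microseconds and %z accepts the literal "Z" (UTC) suffix.
ts = datetime.strptime("2017-09-30 04:53:39.412496Z",
                       "%Y-%m-%d %H:%M:%S.%f%z")

print(ts.microsecond)   # the fractional part of the sample value
print(ts.tzinfo)        # UTC, from the "Z" suffix
```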

In Scala:

val df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")

df.withColumn("timestamp", to_timestamp($"time_field"))
Leo C

It's a bug in Spark version 2.4.0; see issue SPARK-26325.

For Spark version 2.4.4:

import org.apache.spark.sql.types.TimestampType
import spark.implicits._  // for toDF and the $"..." column syntax

// Strings to timestamps
val df = Seq(
  "2019-07-01 12:01:19.000",
  "2019-06-24 12:01:19.000",
  "2019-11-16 16:44:55.406",
  "2019-11-16 16:50:59.406").toDF("input_timestamp")

// Cast the string column to TimestampType
val df_mod = df.select($"input_timestamp".cast(TimestampType))

df_mod.printSchema

Output

root
 |-- input_timestamp: timestamp (nullable = true)
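The same string-to-timestamp conversion can be illustrated outside Spark with Python's standard library (a sketch only, using the same sample strings; the Spark cast above is what actually fixes the schema):

```python
from datetime import datetime

# The same sample strings as in the Scala snippet above.
inputs = [
    "2019-07-01 12:01:19.000",
    "2019-06-24 12:01:19.000",
    "2019-11-16 16:44:55.406",
    "2019-11-16 16:50:59.406",
]

# %f accepts the 3-digit millisecond fraction in these strings.
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in inputs]

for ts in parsed:
    print(ts.isoformat())
```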
athivvat