I have an Excel reader that puts its results into Spark DataFrames, and I am having problems parsing the timestamps. They arrive as strings like Wed Dec 08 10:49:59 CET 2021. I was using spark-sql version 2.4.5 and everything worked fine until I recently updated to version 3.1.2.
Please find some minimal code below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._ // required for .toDF on a local Seq

val ts: String = "Wed Dec 08 20:49:59 CET 2021"
val oldfmt: String = "E MMM dd HH:mm:ss z yyyy"
val ttdf = Seq(ts)
  .toDF("theTimestampColumn")
  .withColumn("parsedTime", to_timestamp(col("theTimestampColumn"), fmt = oldfmt))
ttdf.show()
Running this code with Spark version 2.4.5 works as expected and produces the following output:
+--------------------+-------------------+
| theTimestampColumn| parsedTime|
+--------------------+-------------------+
|Wed Dec 08 20:49:...|2021-12-08 20:49:59|
+--------------------+-------------------+
Now, executing the same code, just with Spark version 3.1.2, results in the following error:
Exception in thread "main" org.apache.spark.SparkUpgradeException:
You may get a different result due to the upgrading of Spark 3.0:
Fail to recognize 'E MMM dd HH:mm:ss z yyyy' pattern in the DateTimeFormatter.
1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0.
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
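For what it's worth, if I understand option 1 from the error message correctly, restoring the old behaviour would look something like the sketch below (assuming a SparkSession named spark; the flag could also go into spark-defaults.conf or be passed via --conf). But I would rather understand what is wrong with my pattern than fall back to the legacy parser:

// Workaround sketch per option 1 of the error message: fall back to the
// pre-3.0 (SimpleDateFormat-based) time parser for the whole session.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")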
This website doesn't help me further; I can't find any mistake in my format string. According to that page, the symbol E represents the day-of-week as text (like Tue; Tuesday), the symbol M represents the month-of-year (like 7; 07; Jul; July), the symbols H, m, s, y stand for hours, minutes, seconds, and years, respectively, and the symbol z denotes the time-zone name (like Pacific Standard Time; PST).
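As a sanity check, I tried the very same pattern and input with plain java.time, which (as far as I understand) is what Spark 3 uses under the hood, and there it parses without complaint. A minimal sketch, assuming an English locale:

import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

// The exact pattern and input from above, parsed outside of Spark:
val javaFmt = DateTimeFormatter.ofPattern("E MMM dd HH:mm:ss z yyyy", Locale.ENGLISH)
println(ZonedDateTime.parse("Wed Dec 08 20:49:59 CET 2021", javaFmt))
// parses successfully, printing something like 2021-12-08T20:49:59+01:00[CET]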
Am I missing something obvious here? Any help is really appreciated. Thank you in advance.