While I try to cast a string field to a TimestampType in Spark DataFrame, the output value is coming with microsecond precision( yyyy-MM-dd HH:mm:ss.S
). But I need the format to be yyyy-MM-dd HH:mm:ss
ie., excluding the microsecond precision. Also, I want to save this as a time stamp field while writing into a parquet file.
So the datatype of my field should be a timestamp of format yyyy-MM-dd HH:mm:ss
I tried using TimestampType as
col("column_A").cast(TimestampType)
or
col("column_A").cast("timestamp")
to cast the field to timestamp. These are able to cast the field to timestamp but with the microsecond precision.
Can anyone help in saving the timestamp datatype to parquet file with the required format specification.
EDIT
Input:
val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")
scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).show(false)
+----+---------------------+-------------------+
|cola|colb |datetime |
+----+---------------------+-------------------+
|a |2017-01-01 12:02:00.0|2017-01-01 12:02:00|
|b |2017-02-01 11:22:30 |2017-02-01 11:22:30|
+----+---------------------+-------------------+
scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).printSchema
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
|-- datetime: string (nullable = true)
In the above, we are getting the right timestamp format, but when we print the Schema, the datetime field is of type String, but I need a timestamp type here.
Now,if I attempt to cast the field to timestamp, the format is set to microsecond precision, which is not intended.
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")
a: org.apache.spark.sql.DataFrame = [cola: string, colb: string]
scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).show(false)
+----+---------------------+---------------------+
|cola|colb |datetime |
+----+---------------------+---------------------+
|a |2017-01-01 12:02:00.0|2017-01-01 12:02:00.0|
|b |2017-02-01 11:22:30 |2017-02-01 11:22:30.0|
+----+---------------------+---------------------+
scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).printSchema
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
|-- datetime: timestamp (nullable = true)
What I am expecting is for the format to be in yyyy-MM-dd HH:mm:ss
and also the datatype of the field to be of timestamp
Thanks in advance