0

currently based on spark SQL in scala to calculate the diff between two columns

writingTime,time
 2020-06-25T13:29:33.415Z,2020-06-25T13:29:33.190Z

I am not sure if 415Z corresponds to the micro seconds I used the following code:

import org.apache.spark.sql.functions._

 val resultDf = df.withColumn("date_diff_seconds",
      $"writingTime".cast("timestamp").cast("long") -  $"time".cast("timestamp").cast("long"))

but this code doesn't go beyond seconds to calculate the difference in timestamp

any idea how to fix this, please?

scalacode
  • 1,096
  • 1
  • 16
  • 38

1 Answers1

0

Alternative to get difference of timestamp in milliseconds -

Load the test data

 val data =
      """
        |writingTime,time
        |  2020-06-25T13:29:33.415Z,2020-06-25T13:29:33.190Z
      """.stripMargin
    val stringDS1 = data.split(System.lineSeparator())
      .map(_.split("\\,").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
      .toSeq.toDS()
    val df1 = spark.read
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("nullValue", "null")
      .csv(stringDS1)
    df1.show(false)
    df1.printSchema()
    /**
      * +-----------------------+----------------------+
      * |writingTime            |time                  |
      * +-----------------------+----------------------+
      * |2020-06-25 18:59:33.415|2020-06-25 18:59:33.19|
      * +-----------------------+----------------------+
      *
      * root
      * |-- writingTime: timestamp (nullable = true)
      * |-- time: timestamp (nullable = true)
      */

find difference in milliseconds

please note that the columns are already of type timestamp, so no need of casting to timestamp


    val millis = udf((start: java.sql.Timestamp, end: java.sql.Timestamp) => end.getTime - start.getTime)
    df1.withColumn("date_diff_millis", millis($"time",  $"writingTime"))
      .show(false)

    /**
      * +-----------------------+----------------------+----------------+
      * |writingTime            |time                  |date_diff_millis|
      * +-----------------------+----------------------+----------------+
      * |2020-06-25 18:59:33.415|2020-06-25 18:59:33.19|225             |
      * +-----------------------+----------------------+----------------+
      */

when you cast the timestamp to long, it converts it into seconds (unix epoch)

Observe that the difference is zero

df1.withColumn("date_diff_millis", $"time".cast("long") -  $"writingTime".cast("long"))
      .show(false)

    /**
      * +-----------------------+----------------------+----------------+
      * |writingTime            |time                  |date_diff_millis|
      * +-----------------------+----------------------+----------------+
      * |2020-06-25 18:59:33.415|2020-06-25 18:59:33.19|0               |
      * +-----------------------+----------------------+----------------+
      */
Som
  • 6,193
  • 1
  • 11
  • 22