
I have two fields of `java.sql.Timestamp` type in my DataFrame and I want to find the number of days between these two columns.

Below is the format of my data: `2016-12-23 23:56:02.0` (`yyyy-MM-dd HH:mm:ss.S`)

I have tried lots of methods but could not find a solution. Can anyone help here?

Divas Nikhra
    Your format looks more like `yyyy-MM-dd HH:mm:ss.S`. – Elliott Frisch Jul 04 '17 at 18:07
  • It is not a duplicate question: here I am asking about the `java.sql.Timestamp` format, and the question which you marked is talking about dates. @ElliottFrisch – Divas Nikhra Jul 05 '17 at 04:06
  • Yes, you are right, my format is `yyyy-MM-dd HH:mm:ss.S`. I have made the changes in my question. – Divas Nikhra Jul 05 '17 at 04:07
  • The parent class of [`java.sql.Timestamp`](https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html) (per the linked Javadoc), **is** `java.util.Date` - therefore, a `Timestamp` **is-a** `Date`. – Elliott Frisch Jul 05 '17 at 04:07
  • But the answer shared in that question says to use `Days.daysBetween(firstdate, seconddate).getDays()`, and here I am getting an error that `Days` is not found. – Divas Nikhra Jul 05 '17 at 04:09
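As a side note on the last comment: `Days.daysBetween` comes from the third-party joda-time library, which is presumably not on the classpath here (hence "`Days` not found"). A plain-JDK alternative is `java.time` (Java 8+), which `java.sql.Timestamp` interoperates with via `toInstant`. A minimal sketch using the timestamp format from the question:

```scala
import java.sql.Timestamp
import java.time.temporal.ChronoUnit

object DaysBetweenSketch extends App {
  val first  = Timestamp.valueOf("2016-12-23 23:56:02.0")
  val second = Timestamp.valueOf("2017-12-23 23:56:02.0")

  // ChronoUnit.DAYS counts whole 24-hour days between the two instants
  val days = ChronoUnit.DAYS.between(first.toInstant, second.toInstant)
  println(days) // 365
}
```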

1 Answer


`org.apache.spark.sql.functions` is a treasure trove. For example, there is a `datediff` function that does exactly what you want: here is the ScalaDoc.

An example:

val spark: SparkSession = ??? // your spark session
val sc: SparkContext = ??? // your spark context

import spark.implicits._ // to better work with spark sql

import java.sql.Timestamp

final case class Data(id: Int, from: Timestamp, to: Timestamp)

val ds =
  spark.createDataset(sc.parallelize(Seq(
    Data(1, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-11 00:00:00")),
    Data(2, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-21 00:00:00")),
    Data(3, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-23 00:00:00")),
    Data(4, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-07 00:00:00"))
  )))

import org.apache.spark.sql.functions._

// datediff(end, start) returns end - start in days; with ($"from", $"to") the results are negative
ds.select($"id", datediff($"from", $"to")).show()

Running this snippet produces the following output:

+---+------------------+
| id|datediff(from, to)|
+---+------------------+
|  1|               -10|
|  2|               -20|
|  3|               -22|
|  4|                -6|
+---+------------------+
stefanobaghino
  • The `datediff` method requires columns, but I have fields, so in my case it is not working. Below are the details: I have two fields and want the difference in days between them: `val firstDate: java.sql.Timestamp = 2016-12-23 23:56:02.0` and `val secondDate: java.sql.Timestamp = 2017-12-23 23:56:02.0`. I want the number of days between these two dates. – Divas Nikhra Jul 06 '17 at 13:17
  • When I pass the variable names I get the following error for `val diffOfDays = datediff(firstDate, secDate)`: `291: error: type mismatch; found: java.sql.Timestamp, required: org.apache.spark.sql.Column` – Divas Nikhra Jul 06 '17 at 13:53
  • Is there no way for you to restructure the code so that you can work with columns instead of rewriting `datediff`? – stefanobaghino Jul 06 '17 at 14:12
  • Actually I am getting these two fields inside a map function and finally creating the Cassandra row for storing the data into a Cassandra table, so it is not possible for me to work with the full column :-( – Divas Nikhra Jul 06 '17 at 14:32
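Given the constraint in the last comment (two `Timestamp` values inside a `map`, not columns), the day difference can be computed directly on the values rather than with `datediff`. A minimal sketch, with `daysBetween` as a hypothetical helper name, using the epoch-millisecond difference:

```scala
import java.sql.Timestamp
import java.util.concurrent.TimeUnit

object MapSideDaysBetween extends App {
  // Whole days between two timestamps, computed from the epoch-millisecond
  // difference; usable inside a map function where only the values are available.
  def daysBetween(from: Timestamp, to: Timestamp): Long =
    TimeUnit.MILLISECONDS.toDays(to.getTime - from.getTime)

  val firstDate  = Timestamp.valueOf("2016-12-23 23:56:02.0")
  val secondDate = Timestamp.valueOf("2017-12-23 23:56:02.0")

  println(daysBetween(firstDate, secondDate)) // 365
}
```

Because it truncates toward zero at millisecond granularity, this counts full 24-hour periods; both example timestamps share the same time of day, so the result is an exact 365.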