
I have the following sample dataframe. The date_1 and date_2 columns are of timestamp type.

ID  date_1                      date_2                      date_diff
A   2019-01-09T01:25:00.000Z    2019-01-10T14:00:00.000Z    -1
B   2019-01-12T02:18:00.000Z    2019-01-12T17:00:00.000Z    0

I want to find the difference between date_1 and date_2 in minutes.

When I use the code below, it gives me the date_diff column as whole days (integer values):

df = df.withColumn("date_diff", F.datediff(F.col('date_1'), F.col('date_2')))  

But what I want is for date_diff to take the time component into account and give me the difference in minutes.

How do I do this?

    Possible duplicate of [Spark Scala: DateDiff of two columns by hour or minute](https://stackoverflow.com/questions/37058016/spark-scala-datediff-of-two-columns-by-hour-or-minute) – pault Jan 28 '19 at 23:20

1 Answer


Just convert the timestamps to unix timestamps (seconds since epoch), compute the difference, and divide by 60.

For example:

import pyspark.sql.functions as F

df.withColumn(
    "date_diff_min",
    # casting a timestamp to long yields seconds since the epoch
    (F.col("date_1").cast("long") - F.col("date_2").cast("long")) / 60.
).show(truncate=False)
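
With the sample data above, the output would look roughly like this (date_diff_min is negative because date_1 is earlier than date_2; the exact timestamp formatting depends on your session timezone):

+---+-------------------+-------------------+---------+-------------+
|ID |date_1             |date_2             |date_diff|date_diff_min|
+---+-------------------+-------------------+---------+-------------+
|A  |2019-01-09 01:25:00|2019-01-10 14:00:00|-1       |-2195.0      |
|B  |2019-01-12 02:18:00|2019-01-12 17:00:00|0        |-882.0       |
+---+-------------------+-------------------+---------+-------------+

F.unix_timestamp(F.col("date_1")) would perform the same seconds-since-epoch conversion as the cast("long") used here.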
  • How can I get the difference in hours when the start and end dates are different times on the same day? With the above answer I get 0 if the difference in hours is < 24 hrs – Arun Mohan Jul 27 '22 at 17:24
  • It is not necessary to cast `date_1` and `date_2` to long. It is enough to cast once after the subtraction: `(F.col("date_1") - F.col("date_2")).cast("long")/60.` – Péter Szilvási Feb 04 '23 at 11:45
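
If you need the difference in hours instead (per the first comment above), the same approach works; divide the seconds by 3600 rather than 60. A minimal sketch, using the same columns as above:

import pyspark.sql.functions as F

df.withColumn(
    "date_diff_hrs",
    # same seconds-level difference, divided by 3600 to get hours
    (F.col("date_1").cast("long") - F.col("date_2").cast("long")) / 3600.
).show(truncate=False)

Wrap the expression in F.abs(...) if you want the magnitude regardless of which timestamp comes first.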