1

I have DF comprise of two columns that have time and I want to compute the delta time between them.

the following DF is a sample of the original DF :

+-------------------+-------------------+
|               time|              time2|
+-------------------+-------------------+
|2017-01-13 00:17:21|2017-01-13 14:08:03|
|2017-01-13 14:08:08|2017-01-13 14:08:03|
|2017-01-13 14:08:59|2017-01-13 14:08:03|
|2017-01-13 04:21:42|2017-01-13 14:08:03|
+-------------------+-------------------+

the schema of the Df as follows:

root
 |-- time: string (nullable = true)
 |-- time2: string (nullable = true)

I used the following method:

import pyspark.sql.types as typ
import pyspark.sql.functions as fn
from pyspark.sql.functions import udf
import datetime
from time import  mktime, strptime

def diffdates(t1, t2):
    #Date format: %Y-%m-%d %H:%M:%S
    delta= ((mktime(strptime(t1,"%Y-%m-%d %H:%M:%S")) - mktime(strptime(t2, "%Y-%m-%d %H:%M:%S"))))
    return (delta)



dt = udf(diffdates, typ.IntegerType())
Time_Diff = df.withColumn('Diff',(dt(df.time,df.time2)))

The resulting new column has null value as follows:

+-------------------+-------------------+----+
|               time|              time2|Diff|
+-------------------+-------------------+----+
|2017-01-13 00:17:21|2017-01-13 14:08:03|null|
|2017-01-13 14:08:08|2017-01-13 14:08:03|null|
|2017-01-13 14:08:59|2017-01-13 14:08:03|null|
|2017-01-13 04:21:42|2017-01-13 14:08:03|null|
+-------------------+-------------------+----+

what shall I do?

zero323
  • 322,348
  • 103
  • 959
  • 935
Ahmad Senousi
  • 613
  • 2
  • 12
  • 24
  • check this , https://stackoverflow.com/questions/30283415/calculating-duration-by-subtracting-two-datetime-columns-in-string-format/30315921 – Suresh Jan 21 '18 at 07:37
  • I already used unix_timestamp method and got also null value. Furthermore, I used the method that described in the second answer it already gives correct value but when I filtered the resulting dataframe to obtain the rows that have minimum value existing in delta time (Diff) column for each group in time2 column. I got this error "TypeError: strptime() argument 1 must be string, not None" – Ahmad Senousi Jan 21 '18 at 10:04
  • In your dataframe, the columns are time and time1 , whereas here Time_Diff = df.withColumn('Diff',(dt(T_GPS_On_fi.time, T_GPS_On_fi.uptime))) you are using uptime. Are we missing something ? – Suresh Jan 21 '18 at 10:35

1 Answers1

3

This is what I tired and it's working for me. Let me know if I miss anything,

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> l = [('2017-01-13 00:17:21','2017-01-13 14:08:03'),('2017-01-13 14:08:08','2017-01-13 14:08:03'),('2017-01-13 14:08:59','2017-01-13 14:08:03'),('2017-01-13 04:21:42','2017-01-13 14:08:03')]
>>> df = spark.createDataFrame(l,['time1','time2'])
>>> df1 = df.select(df.time1.cast('timestamp'),df.time2.cast('timestamp'))
>>> df1.show()
+--------------------+--------------------+
|               time1|               time2|
+--------------------+--------------------+
|2017-01-13 00:17:...|2017-01-13 14:08:...|
|2017-01-13 14:08:...|2017-01-13 14:08:...|
|2017-01-13 14:08:...|2017-01-13 14:08:...|
|2017-01-13 04:21:...|2017-01-13 14:08:...|
+--------------------+--------------------+

>>> from pyspark.sql import functions as F
>>> timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
>>> timeDiff = (F.unix_timestamp('time1', format=timeFmt) - F.unix_timestamp('time2', format=timeFmt))
>>> df1 = df1.withColumn("delta",timeDiff) ## delta is in unit of seconds
>>> df1.show(truncate=False)
+---------------------+---------------------+------+
|time1                |time2                |delta |
+---------------------+---------------------+------+
|2017-01-13 00:17:21.0|2017-01-13 14:08:03.0|-49842|
|2017-01-13 14:08:08.0|2017-01-13 14:08:03.0|5     |
|2017-01-13 14:08:59.0|2017-01-13 14:08:03.0|56    |
|2017-01-13 04:21:42.0|2017-01-13 14:08:03.0|-35181|
+---------------------+---------------------+------+
>>> df1.groupby('time2').agg(F.min('delta')).show()
+--------------------+----------+
|               time2|min(delta)|
+--------------------+----------+
|2017-01-13 14:08:...|    -49842|
+--------------------+----------+
Suresh
  • 5,678
  • 2
  • 24
  • 40