I have DF comprise of two columns that have time and I want to compute the delta time between them.
the following DF is a sample of the original DF :
+-------------------+-------------------+
| time| time2|
+-------------------+-------------------+
|2017-01-13 00:17:21|2017-01-13 14:08:03|
|2017-01-13 14:08:08|2017-01-13 14:08:03|
|2017-01-13 14:08:59|2017-01-13 14:08:03|
|2017-01-13 04:21:42|2017-01-13 14:08:03|
+-------------------+-------------------+
the schema of the Df as follows:
root
|-- time: string (nullable = true)
|-- time2: string (nullable = true)
I used the following method:
import pyspark.sql.types as typ
import pyspark.sql.functions as fn
from pyspark.sql.functions import udf
import datetime
from time import mktime, strptime
def diffdates(t1, t2):
#Date format: %Y-%m-%d %H:%M:%S
delta= ((mktime(strptime(t1,"%Y-%m-%d %H:%M:%S")) - mktime(strptime(t2, "%Y-%m-%d %H:%M:%S"))))
return (delta)
dt = udf(diffdates, typ.IntegerType())
Time_Diff = df.withColumn('Diff',(dt(df.time,df.time2)))
The resulting new column has null value as follows:
+-------------------+-------------------+----+
| time| time2|Diff|
+-------------------+-------------------+----+
|2017-01-13 00:17:21|2017-01-13 14:08:03|null|
|2017-01-13 14:08:08|2017-01-13 14:08:03|null|
|2017-01-13 14:08:59|2017-01-13 14:08:03|null|
|2017-01-13 04:21:42|2017-01-13 14:08:03|null|
+-------------------+-------------------+----+
what shall I do?