0

I have two timestamp columns('tpep_pickup_datetime' and 'tpep_dropoff_datetime') and when I calculate the difference between them, I get an interval variable.

yellowcab = yellowcab \
.withColumn('tpep_pickup_datetime', to_timestamp('tpep_pickup_datetime','yyyy-MM-dd HH:mm:ss'))\
.withColumn('tpep_dropoff_datetime', to_timestamp('tpep_dropoff_datetime','yyyy-MM-dd HH:mm:ss'))
yellowcab = yellowcab \
.withColumn('total_time', col('tpep_dropoff_datetime')-col('tpep_pickup_datetime'))

The result looks like that:

enter image description here

I want to transform 'total_time' column to an 'int' variable with the time converted to seconds.

I have tried to extract the hours and the minutes from the interval variable and then multiply them in order to convert to seconds, but I have not been able to do it

Lamanus
  • 12,898
  • 4
  • 21
  • 47
Migue
  • 1
  • 1

1 Answers1

0

Cast the interval into int.


data = [['2020-08-01 00:02:53', '2020-08-01 00:28:54']]

df = spark.createDataFrame(data, ['t1', 't2']) \
  .withColumn('t1', f.to_timestamp('t1','yyyy-MM-dd HH:mm:ss')) \
  .withColumn('t2', f.to_timestamp('t2','yyyy-MM-dd HH:mm:ss')) \
  .withColumn('interval', (f.col('t2') - f.col('t1')).cast('int')) \
  .show()

+-------------------+-------------------+--------+
|                 t1|                 t2|interval|
+-------------------+-------------------+--------+
|2020-08-01 00:02:53|2020-08-01 00:28:54|    1561|
+-------------------+-------------------+--------+
Lamanus
  • 12,898
  • 4
  • 21
  • 47