I have a table of incidents, each with a timestamp. I am struggling to calculate the number of days elapsed using the PySpark 2.0 API. I managed to do this when the timestamp followed a different format (yyyy-MM-dd).

    +-------------------+------------------------+----------+--------------+
    | first_booking_date|first_booking_date_clean|     today|customer_since|
    +-------------------+------------------------+----------+--------------+
    |02-06-2011 20:52:04|              02-06-2011|02-06-2011|          null|
    |03-06-2004 18:15:10|              03-06-2004|02-06-2011|          null|

I tried the following (nothing worked):

  • extracting the date with string manipulation and using datediff
  • casting to timestamp and then extracting dd:MM:yy (result: null)

I would prefer to use PySpark functions over any additional transformation with SQL.

Help is highly appreciated, Best and thanks a lot!!!

EDIT: Here is an example that did not work:

    from pyspark.sql import functions as F

    # Both columns hold "dd-MM-yyyy" strings, which datediff does not
    # recognise as dates, so customer_since comes out null.
    today = "02-06-2011"
    first_bookings = first_bookings.withColumn("today", F.lit(today))
    first_bookings = first_bookings.withColumn(
        "first_booking_date_clean",
        F.substring(first_bookings.first_booking_date, 1, 10))
    first_bookings = first_bookings.withColumn(
        "customer_since",
        F.datediff(first_bookings.today, first_bookings.first_booking_date_clean))
Niklas Brauer
  • Happy to add further information if needed! – Niklas Brauer Jun 30 '17 at 17:21
  • https://stackoverflow.com/a/36985244/4219202 Check out this answer to a similar question. In your case the timeFmt would be "dd-MM-yyyy" for the columns first_booking_date_clean and today – Rick Jul 01 '17 at 09:16
  • Possible duplicate of [Calculating duration by subtracting two datetime columns in string format](https://stackoverflow.com/questions/30283415/calculating-duration-by-subtracting-two-datetime-columns-in-string-format) – sdikby Feb 20 '18 at 10:27

1 Answer

This answer is basically a copy of https://stackoverflow.com/a/36985244/4219202. In your case, the timeFmt would be "dd-MM-yyyy" for the columns first_booking_date_clean and today.

As of Spark 1.5 you can use unix_timestamp:

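    from pyspark.sql import functions as F

    # Pattern of the source strings (taken from the linked answer).
    timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
    # unix_timestamp returns epoch seconds, so Duration is in seconds.
    timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
                - F.unix_timestamp('StartDateTime', format=timeFmt))
    df = df.withColumn("Duration", timeDiff)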
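
Adapted to the columns in the question, the same idea might look like the sketch below. It is not code from the linked answer: the helper names add_customer_since and expected_days are made up for illustration, and the column names and the "dd-MM-yyyy" pattern are assumed from the table in the question. Instead of dividing a difference in seconds by 86400, it converts each string to a date first and lets datediff count the days, which avoids off-by-one results around daylight-saving changes.

```python
from datetime import datetime


def add_customer_since(df, start_col="first_booking_date_clean",
                       end_col="today", fmt="dd-MM-yyyy"):
    """Hypothetical helper: add customer_since as days between two string columns."""
    # pyspark is imported lazily so expected_days() below also runs without Spark.
    from pyspark.sql import functions as F

    def as_date(col_name):
        # string -> epoch seconds -> timestamp -> date
        return F.to_date(F.unix_timestamp(col_name, format=fmt).cast("timestamp"))

    return df.withColumn("customer_since",
                         F.datediff(as_date(end_col), as_date(start_col)))


def expected_days(start, end, fmt="%d-%m-%Y"):
    """Hypothetical plain-Python reference value for spot-checking single rows."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days
```

Usage would be first_bookings = add_customer_since(first_bookings). As a spot check, expected_days("03-06-2004", "02-06-2011") gives 2555, which is the value the second sample row should end up with.
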
Rick