I have a table of incidents with a timestamp column, and I am struggling to calculate the number of days that have passed using the PySpark 2.0 API. I managed to do the same thing when the timestamp followed a different format (yyyy-MM-dd; see the snippet right after the table below).
+-------------------+------------------------+----------+--------------+
| first_booking_date|first_booking_date_clean|     today|customer_since|
+-------------------+------------------------+----------+--------------+
|02-06-2011 20:52:04|              02-06-2011|02-06-2011|          null|
|03-06-2004 18:15:10|              03-06-2004|02-06-2011|          null|
+-------------------+------------------------+----------+--------------+
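For comparison, this is the kind of call that worked for me when the strings were already in yyyy-MM-dd format (the column name and sample values here are just an illustration):

from pyspark.sql import functions as F

# datediff understands yyyy-MM-dd strings directly, so this produced real numbers
df_iso = df_iso.withColumn(
    "customer_since",
    F.datediff(F.lit("2011-06-02"), df_iso.first_booking_date),
)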
I tried the following (nothing worked):

- extracting the date with string manipulation and then using datediff
- casting to timestamp and then extracting dd:MM:yy (result: null)

If possible I would prefer plain PySpark commands over any additional transformation with SQL.
Help is highly appreciated. Best, and thanks a lot!
EDIT: Here is an example that did not work:
import datetime

from pyspark.sql import functions as F

today = datetime.date(2011, 2, 1)  # immediately overwritten by the string below
today = "02-06-2011"

first_bookings = first_bookings.withColumn("today", F.lit(today))
# keep only the dd-MM-yyyy part of the timestamp string
first_bookings = first_bookings.withColumn("first_booking_date_clean", F.substring(first_bookings.first_booking_date, 1, 10))
# both columns are still dd-MM-yyyy strings, so datediff returns null
first_bookings = first_bookings.withColumn("customer_since", F.datediff(first_bookings.today, first_bookings.first_booking_date_clean))