
I have a data frame that looks like this: user ID and dates of activity. I need to calculate the average difference between consecutive dates using RDD functions (such as reduce and map), not SQL.

The dates for each ID need to be sorted before calculating the differences, as I need the difference between each pair of consecutive dates.

ID Date
1 2020-09-03
1 2020-09-03
2 2020-09-02
1 2020-09-04
2 2020-09-06
2 2020-09-16

The expected outcome for this example would be:

ID average difference
1 0.5
2 7

Thanks for helping!

kri
  • Does this help? https://stackoverflow.com/questions/38156367/date-difference-between-consecutive-rows-pyspark-dataframe – Emma Apr 18 '22 at 17:32
  • Thank you, but that's SQL syntax, and I need to use PySpark as I have an RDD. – kri Apr 18 '22 at 18:18
  • Please check the answers other than the accepted one. They show how to do it with dataframes. If you need to keep it as an RDD without converting to dataframes, I think you need to write a custom function. – Emma Apr 18 '22 at 18:21

1 Answer


You can use datediff with a window function to calculate the difference, then take the average.

lag is one of the window functions; it takes the value from the previous row within the window.

from pyspark.sql import functions as F, Window

# define the window
w = Window.partitionBy('ID').orderBy('Date')

# datediff takes the date difference from the first arg to the second arg (first - second).
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
  .groupby('ID')    # aggregate over ID
  .agg(F.avg(F.col('diff')).alias('average difference'))
)
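With the sample data in the question, this should give 0.5 for ID 1 and 7.0 for ID 2; the first row in each partition has a null diff, which avg ignores.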
Emma
  • Thank you very much for the effort, but as I said I have to do this using the RDD functions, not Spark SQL. – kri Apr 18 '22 at 20:52
  • Do you really have to do it in RDD as opposed to a dataframe? Or are you open to converting? As I mentioned, if you need to do it in RDD, you might need a custom function. – Emma Apr 18 '22 at 21:44
  • @kri You never mentioned RDD in your question. This answer has "pyspark syntax", which is what you were asking for. Are you sure that you're looking for RDD [`reduce`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduce.html) functions instead of the PySpark dataframe approach? – pltc Apr 19 '22 at 04:56
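For reference, if the RDD-only constraint stands, a custom function is indeed needed, as the comments suggest. A minimal sketch (not from the original thread; it assumes the data is available as an RDD of (ID, 'YYYY-MM-DD' string) pairs, e.g. via df.rdd.map(lambda r: (r.ID, r.Date))) could look like this:

from datetime import date

def avg_consecutive_diff(dates):
    # sort the dates for one ID, then average the gaps (in days) between consecutive dates
    ds = sorted(date.fromisoformat(d) for d in dates)
    gaps = [(b - a).days for a, b in zip(ds, ds[1:])]
    return sum(gaps) / len(gaps) if gaps else None

# rdd is assumed to hold (ID, 'YYYY-MM-DD') pairs, e.g. df.rdd.map(lambda r: (r.ID, r.Date))
result = (rdd.groupByKey()                      # (ID, iterable of date strings)
             .mapValues(avg_consecutive_diff))  # (ID, average difference)

result.collect()   # with the sample data: [(1, 0.5), (2, 7.0)]

groupByKey and mapValues are plain RDD transformations, so this stays within the constraint in the question; the sorting and averaging happen in ordinary Python inside the custom function.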