I have a data frame that looks like this- user ID and dates of activity. I need to calculate the average difference between dates using RDD functions (such as reduce and map) and not SQL.
The dates for each ID needs to be sorted by order before calculating the difference, as I need the difference between each consecutive dates.
ID | Date |
---|---|
1 | 2020-09-03 |
1 | 2020-09-03 |
2 | 2020-09-02 |
1 | 2020-09-04 |
2 | 2020-09-06 |
2 | 2020-09-16 |
the needed outcome for this example will be:
ID | average difference |
---|---|
1 | 0.5 |
2 | 7 |
thanks for helping!