I try to get the time difference "time_d" in seconds of a timestamp within "name" in Pyspark.
+-------------------+----+
| timestamplast|name|
+-------------------+----+
|2019-08-01 00:00:00| 1|
|2019-08-01 00:01:00| 1|
|2019-08-01 00:01:15| 1|
|2019-08-01 03:00:00| 2|
|2019-08-01 04:00:00| 2|
|2019-08-01 00:15:00| 3|
+-------------------+----+
Output should look like:
+-------------------+----+--------+
| timestamplast|name| time_d |
+-------------------+----+------- +
|2019-08-01 00:00:00| 1| 0 |
|2019-08-01 00:01:00| 1| 60 |
|2019-08-01 00:01:15| 1| 15 |
|2019-08-01 03:00:00| 2| 0 |
|2019-08-01 04:00:00| 2| 3600 |
|2019-08-01 00:15:00| 3| 0 |
+-------------------+----+--------+
In Pandas this would be:
df['time_d'] = df.groupby("name")['timestamplast'].diff().fillna(pd.Timedelta(0)).dt.total_seconds()
How would this be done in Pyspark?