I have time series data in CSV from vehicle with following information:
- trip-id
- timestamp
- speed
The data looks like this:
trip-id | timestamp | speed
001 | 1538204192 | 44.55
001 | 1538204193 | 47.20 <-- start of brake
001 | 1538204194 | 42.14
001 | 1538204195 | 39.20
001 | 1538204196 | 35.30
001 | 1538204197 | 32.22 <-- end of brake
001 | 1538204198 | 34.80
001 | 1538204199 | 37.10
...
001 | 1538204221 | 55.30
001 | 1538204222 | 57.20 <-- start of brake
001 | 1538204223 | 54.60
001 | 1538204224 | 52.15
001 | 1538204225 | 49.27
001 | 1538204226 | 47.89 <-- end of brake
001 | 1538204227 | 50.57
001 | 1538204228 | 53.72
...
A braking event occurs when there's a decrease in speed
in 2 consecutive records based on timestamp
.
I want to extract the braking events from the data in terms of event start timestamp
, end timestamp
, start speed
& end speed
.
+-------------+---------------+-------------+-----------+---------+
| breakID|start timestamp|end timestamp|start speed|end speed|
+-------------+---------------+-------------+-----------+---------+
|0011538204193| 1538204193| 1538204196| 47.2| 35.3|
|0011538204222| 1538204222| 1538204225| 57.2| 49.27|
+-------------+---------------+-------------+-----------+---------+
Here's my take:
- Defined a window spec with partition according to
trip-id
, ordered bytimestamp
. - Applied window
lag
to move over consecutive rows and calculate speed difference. - Filter out records which have positive speed difference, as i am interested in braking events only.
- Now that I only have records belonging to braking events, I want to group records belonging to same event. I guess i can do this based on the timestamp difference. If the difference between 2 records is 1 second, those 2 records belong to same braking event.
I am stuck here as i do not have a key
belonging to same group so i can apply key based aggregation.
My question is:
How can I map to add a
key
column based on the difference in timestamp? So if 2 records have a difference of 1 seconds, they should have a common key. That way, I can reduce a group based on the newly added key.Is there any better & more optimized way to achieve this? My approach could be very inefficient as it relies on row by row comparisons. What are the other possible ways to detect these kind of "sub-events" (e.g braking events) in a data-stream belonging to a specific event (data from single vehicle trip)?
Thanks in advance!
Appendix:
- Example data file for a trip: https://www.dropbox.com/s/44a0ilogxp60w...