I have a set of data similar to the following that I'm trying to find a way to reduce using Spark DataFrames in Python.
uuid  if_id  start_time           end_time             ip_addr
1     03     2018/07/01 13:00:00  2018/07/01 13:00:01  1.1.1.1
1     03     2018/07/01 13:01:05  2018/07/01 13:02:00  1.1.1.1
1     03     2018/07/01 15:00:00  2018/07/01 15:00:30  1.1.1.1
1     03     2018/07/02 01:00:00  2018/07/02 01:00:07  1.2.3.4
1     03     2018/07/02 08:30:00  2018/07/02 08:32:04  1.2.3.4
1     03     2018/07/02 12:00:00  2018/07/02 12:01:00  1.1.1.1
1     05     2018/07/01 15:00:02  2018/07/01 15:00:35  2.2.2.2
1     05     2018/07/01 13:45:23  2018/07/01 13:45:40  2.2.2.2
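For reproducibility, here's a minimal sketch that builds this sample as a DataFrame (the column names come from the table above; loading it from literals and the timestamp parsing are just for illustration, not how I actually ingest the data):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows; if_id is kept as a string to preserve the leading zero.
rows = [
    (1, "03", "2018/07/01 13:00:00", "2018/07/01 13:00:01", "1.1.1.1"),
    (1, "03", "2018/07/01 13:01:05", "2018/07/01 13:02:00", "1.1.1.1"),
    (1, "03", "2018/07/01 15:00:00", "2018/07/01 15:00:30", "1.1.1.1"),
    (1, "03", "2018/07/02 01:00:00", "2018/07/02 01:00:07", "1.2.3.4"),
    (1, "03", "2018/07/02 08:30:00", "2018/07/02 08:32:04", "1.2.3.4"),
    (1, "03", "2018/07/02 12:00:00", "2018/07/02 12:01:00", "1.1.1.1"),
    (1, "05", "2018/07/01 15:00:02", "2018/07/01 15:00:35", "2.2.2.2"),
    (1, "05", "2018/07/01 13:45:23", "2018/07/01 13:45:40", "2.2.2.2"),
]
df = spark.createDataFrame(rows, ["uuid", "if_id", "start_time", "end_time", "ip_addr"])

# Parse the timestamp strings so min()/max() aggregate real timestamps.
df = (df
      .withColumn("start_time", F.to_timestamp("start_time", "yyyy/MM/dd HH:mm:ss"))
      .withColumn("end_time", F.to_timestamp("end_time", "yyyy/MM/dd HH:mm:ss")))
```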
I need to reduce the above data down to the following:
uuid  if_id  start_time           end_time             ip_addr
1     03     2018/07/01 13:00:00  2018/07/01 15:00:30  1.1.1.1
1     03     2018/07/02 01:00:00  2018/07/02 08:32:04  1.2.3.4
1     03     2018/07/02 12:00:00  2018/07/02 12:01:00  1.1.1.1
1     05     2018/07/01 13:45:23  2018/07/01 15:00:35  2.2.2.2
The final dataset should be a table showing, for each time period (start_time to end_time), which IP address was assigned to an interface (if_id) of a particular host (identified by uuid).
If a given interface could never change IP addresses over time, this could be handled with groupBy and a window spec to extract the minimum start_time and maximum end_time, as in the sketch below. However, an interface's address can change, as it does for uuid=1 and if_id=03, and I'm not sure how to handle that case without making multiple passes over the data.
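For the simple case, this is roughly what I have in mind (a sketch assuming the parsed DataFrame above is named df):

```python
from pyspark.sql import functions as F

# Collapse each interface/address combination to one interval -- only
# correct if an interface never returns to a previously used address.
collapsed = (df
             .groupBy("uuid", "if_id", "ip_addr")
             .agg(F.min("start_time").alias("start_time"),
                  F.max("end_time").alias("end_time")))
```

On the data above, though, this merges the two separate 1.1.1.1 periods of if_id=03 into a single row spanning 2018/07/01 13:00:00 to 2018/07/02 12:01:00, swallowing the 1.2.3.4 period in between, which is wrong.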
Any suggested approaches would be appreciated.