
I have a set of data similar to the following that I'm trying to find a way to reduce using Spark DataFrames in Python.

uuid  if_id start_time           end_time             ip_addr
1     03    2018/07/01 13:00:00  2018/07/01 13:00:01  1.1.1.1
1     03    2018/07/01 13:01:05  2018/07/01 13:02:00  1.1.1.1
1     03    2018/07/01 15:00:00  2018/07/01 15:00:30  1.1.1.1
1     03    2018/07/02 01:00:00  2018/07/02 01:00:07  1.2.3.4
1     03    2018/07/02 08:30:00  2018/07/02 08:32:04  1.2.3.4
1     03    2018/07/02 12:00:00  2018/07/02 12:01:00  1.1.1.1
1     05    2018/07/01 15:00:02  2018/07/01 15:00:35  2.2.2.2
1     05    2018/07/01 13:45:23  2018/07/01 13:45:40  2.2.2.2

I need to reduce the above data down to the following:

uuid  if_id start_time           end_time             ip_addr
1     03    2018/07/01 13:00:00  2018/07/01 15:00:30  1.1.1.1
1     03    2018/07/02 01:00:00  2018/07/02 08:32:04  1.2.3.4
1     03    2018/07/02 12:00:00  2018/07/02 12:01:00  1.1.1.1
1     05    2018/07/01 13:45:23  2018/07/01 15:00:35  2.2.2.2

The final dataset should represent a table that shows, for a given time period (start_time to end_time), which IP address was assigned to an interface (if_id) of a particular host identified by uuid.

If a given interface never changed IP addresses, this could be handled with groupBy and a window spec to extract the minimum start_time and maximum end_time. However, the addresses can change over time, as with uuid=1 and if_id=03, and I'm not sure how to handle that without making multiple passes over the data.
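
For example, a naive aggregation (a minimal sketch, assuming the data above is loaded in a DataFrame named df) collapses the two separate 1.1.1.1 periods of if_id=03 into one row spanning 2018/07/01 13:00:00 to 2018/07/02 12:01:00, hiding the intervening 1.2.3.4 assignment:

import pyspark.sql.functions as func

# Naive reduction: one row per (uuid, if_id, ip_addr) with the
# overall min/max times. This merges the two distinct 1.1.1.1
# periods of if_id=03 and hides the 1.2.3.4 assignment between them.
naive = df.groupBy('uuid', 'if_id', 'ip_addr') \
          .agg(func.min('start_time').alias('start_time'),
               func.max('end_time').alias('end_time'))
naive.show(truncate=False)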

Any suggested approaches would be appreciated.


1 Answer


Using the link suggested by user8371915, I was able to come up with the following solution.

import pyspark.sql.functions as func
from pyspark.sql import Row
from pyspark.sql.functions import col, lit
from pyspark.sql.window import Window

df = spark.createDataFrame([Row(uuid=1, int_id='03', event_start=701130000, event_end=701130001, ip='1.1.1.1'),
                            Row(uuid=1, int_id='03', event_start=701130105, event_end=701130200, ip='1.1.1.1'),
                            Row(uuid=1, int_id='05', event_start=701134523, event_end=701134540, ip='2.2.2.2'),
                            Row(uuid=1, int_id='03', event_start=701150000, event_end=701150030, ip='1.1.1.1'),
                            Row(uuid=1, int_id='05', event_start=701150002, event_end=701150035, ip='2.2.2.2'),
                            Row(uuid=1, int_id='03', event_start=702010000, event_end=702010007, ip='1.2.3.4'),
                            Row(uuid=1, int_id='03', event_start=702083000, event_end=702083204, ip='1.2.3.4'),
                            Row(uuid=1, int_id='03', event_start=702120000, event_end=702120100, ip='1.1.1.1')])

window1 = Window.partitionBy('uuid', 'int_id').orderBy('event_start', 'event_end')

window2 = Window.partitionBy('uuid', 'int_id', 'time_group') \
                .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# Get the previous row's IP address within each (uuid, int_id) partition
prev_ip = func.lag('ip', 1).over(window1)

# Flag rows where the IP address changed from the previous row; the
# first row in each partition has no previous row (lag returns null),
# so coalesce treats it as a change
indicator = func.coalesce((col('ip') != prev_ip).cast('integer'), lit(1))

# Cumulative sum of the change flags assigns a group id to each
# contiguous run of rows with the same IP address
time_group = func.sum(indicator).over(window1).alias('time_group')

# Add the time_group column to the DataFrame
df = df.select('*', time_group)

# Collapse each time group to its overall begin and end times
df = df.select('uuid', 'int_id', 'ip', 
               func.min('event_start').over(window2).alias('period_begin'),
               func.max('event_end').over(window2).alias('period_end')) \
       .dropDuplicates() \
       .orderBy('uuid', 'int_id', 'period_begin', 'ip')

df.show(truncate=False)

The above produces the following results:

+----+------+-------+------------+----------+
|uuid|int_id|ip     |period_begin|period_end|
+----+------+-------+------------+----------+
|1   |03    |1.1.1.1|701130000   |701150030 |
|1   |03    |1.2.3.4|702010000   |702083204 |
|1   |03    |1.1.1.1|702120000   |702120100 |
|1   |05    |2.2.2.2|701134523   |701150035 |
+----+------+-------+------------+----------+
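
Note that the example uses integer stand-ins for the timestamps. To apply the same windows to the original data, where start_time and end_time are strings like 2018/07/01 13:00:00, the columns can first be parsed to real timestamps (a sketch assuming Spark 2.2+, where func.to_timestamp accepts a format string):

# Parse the question's timestamp strings so the windows order and
# aggregate on real timestamps rather than the integer stand-ins
df = df.withColumn('start_time', func.to_timestamp('start_time', 'yyyy/MM/dd HH:mm:ss')) \
       .withColumn('end_time', func.to_timestamp('end_time', 'yyyy/MM/dd HH:mm:ss'))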
