If I understand correctly, this is a follow-up to this question, and it can be accomplished naturally by adding the GroupByKey step, as I proposed in my solution there.
Referring to my previous explanation and focusing only on the changes, suppose we have a pipeline like this:
events = (p
  | 'Create Events' >> beam.Create(user1_data + user2_data)
  | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
  | 'keyed_on_user_id' >> beam.Map(lambda x: (x['user_id'], x))
  | 'user_session_window' >> beam.WindowInto(window.Sessions(session_gap),
                                             timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW)
  | 'Group' >> beam.GroupByKey()
  | 'analyze_session' >> beam.ParDo(AnalyzeSession()))
Now the elements are arranged as you describe in the question, so we can simply log them in AnalyzeSession:
class AnalyzeSession(beam.DoFn):
  """Prints per session information"""
  def process(self, element, window=beam.DoFn.WindowParam):
    logging.info(element)
    yield element
to obtain the desired results:
INFO:root:('Groot', [{'timestamp': 1554203778.904401, 'user_id': 'Groot', 'value': 'event_0'}, {'timestamp': 1554203780.904401, 'user_id': 'Groot', 'value': 'event_1'}])
INFO:root:('Groot', [{'timestamp': 1554203786.904402, 'user_id': 'Groot', 'value': 'event_2'}])
INFO:root:('Thanos', [{'timestamp': 1554203792.904399, 'user_id': 'Thanos', 'value': 'event_4'}])
INFO:root:('Thanos', [{'timestamp': 1554203784.904398, 'user_id': 'Thanos', 'value': 'event_3'}, {'timestamp': 1554203777.904395, 'user_id': 'Thanos', 'value': 'event_0'}, {'timestamp': 1554203778.904397, 'user_id': 'Thanos', 'value': 'event_1'}, {'timestamp': 1554203780.904398, 'user_id': 'Thanos', 'value': 'event_2'}])
If you want to avoid redundant information, such as having the user_id and timestamp as part of the values, they can be removed in the Map step.
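As a minimal sketch of that idea, the key-extraction function could drop the two fields before emitting the value. The helper name `to_keyed_value` is hypothetical and not part of the original code; it is plain Python, so it can be tried without running a pipeline:

```python
def to_keyed_value(event):
    """Return (user_id, slimmed_event) without the user_id and timestamp fields."""
    # Hypothetical helper: keep every field except the two that become redundant
    # once the element is keyed by user and carries its own timestamp.
    slim = {k: v for k, v in event.items() if k not in ('user_id', 'timestamp')}
    return event['user_id'], slim


event = {'timestamp': 1554203778.904401, 'user_id': 'Groot', 'value': 'event_0'}
print(to_keyed_value(event))  # ('Groot', {'value': 'event_0'})
```

In the pipeline this would presumably replace the lambda in the 'keyed_on_user_id' step, i.e. `beam.Map(to_keyed_value)`. Since the 'Add Timestamps' step has already attached the timestamp to the element, dropping it from the dictionary should not affect windowing.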
As for the complete use case (i.e. reducing the aggregated events on a per-session level), we can compute things such as the number of events or the session duration with something like this:
class AnalyzeSession(beam.DoFn):
  """Prints per session information"""
  def process(self, element, window=beam.DoFn.WindowParam):
    # After GroupByKey, element is (user_id, events in this session)
    user = element[0]
    num_events = str(len(element[1]))
    # The session window ends one session gap after its last event
    window_end = window.end.to_utc_datetime()
    window_start = window.start.to_utc_datetime()
    session_duration = window_end - window_start
    logging.info(">>> User %s had %s event(s) in %s session", user, num_events, session_duration)
    yield element
which, for my example, will output the following:
INFO:root:>>> User Groot had 2 event(s) in 0:00:07 session
INFO:root:>>> User Groot had 1 event(s) in 0:00:05 session
INFO:root:>>> User Thanos had 4 event(s) in 0:00:12 session
INFO:root:>>> User Thanos had 1 event(s) in 0:00:05 session
Full code here
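Note that the logged durations include the session gap, which is why a single-event session still lasts 0:00:05. The following plain-Python sketch mimics how session windows merge, to reproduce that arithmetic. It is an illustration only, not Beam's implementation; the function name `session_windows` is hypothetical, and the 5-second gap is an assumption inferred from the output above (the value of session_gap is not shown in this answer):

```python
def session_windows(timestamps, gap):
    """Group timestamps into sessions; return (start, end, num_events) per session."""
    sessions = []
    for ts in sorted(timestamps):
        # Each event opens a proto-window [ts, ts + gap); overlapping ones merge.
        if sessions and ts < sessions[-1][1]:
            start, end, count = sessions[-1]
            sessions[-1] = (start, max(end, ts + gap), count + 1)
        else:
            sessions.append((ts, ts + gap, 1))
    return sessions


# Groot's events, as rounded offsets in seconds from the first logged timestamp.
groot = [0, 2, 8]
for start, end, count in session_windows(groot, gap=5):
    print(count, "event(s),", end - start, "seconds")
# 2 event(s), 7 seconds
# 1 event(s), 5 seconds
```

The same function applied to Thanos's rounded offsets (0, 1, 3, 7, 15) gives sessions of 12 and 5 seconds, matching the log lines above.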