
I have a stream of user events. I've mapped them into KV{ userId, event }, and assigned timestamps.

This is to run in streaming mode. I would like to be able to produce the following input-output behaviour:

session window gap=1

  • input: user=1, timestamp=1, event=a
  • input: user=2, timestamp=2, event=a
  • input: user=2, timestamp=3, event=a
  • input: user=1, timestamp=2, event=b
  • time: lwm=3
  • output: user=1, [ { event=a, timestamp=1 }, { event=b, timestamp=2 } ]
  • time: lwm=4
  • output: user=2, [ { event=a, timestamp=2 }, { event=a, timestamp=3 } ]

So that I can write my function to reduce the list of events in the session window for the user, as well as get the start and end time of the session window.

How do I write this? (If your answer is "look at the examples", that's not a valid answer, because they never feed the list of events into the reducer with the window as a parameter.)

Henrik

1 Answer


If I understand this correctly, this would be a follow-up to this question and is naturally accomplished by adding a Group By Key step, as I proposed in my solution there.

So, referring to my previous explanation and focusing here on the changes only, if we have a pipeline like this:

events = (p
  | 'Create Events' >> beam.Create(user1_data + user2_data)
  | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
  | 'keyed_on_user_id' >> beam.Map(lambda x: (x['user_id'], x))
  | 'user_session_window' >> beam.WindowInto(window.Sessions(session_gap),
                                             timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW)
  | 'Group' >> beam.GroupByKey()
  | 'analyze_session' >> beam.ParDo(AnalyzeSession()))

Now the elements are arranged as described in the question, so we can simply log them in AnalyzeSession:

class AnalyzeSession(beam.DoFn):
  """Logs each session's grouped elements."""
  def process(self, element, window=beam.DoFn.WindowParam):
    logging.info(element)
    yield element

to obtain the desired results:

INFO:root:('Groot', [{'timestamp': 1554203778.904401, 'user_id': 'Groot', 'value': 'event_0'}, {'timestamp': 1554203780.904401, 'user_id': 'Groot', 'value': 'event_1'}])
INFO:root:('Groot', [{'timestamp': 1554203786.904402, 'user_id': 'Groot', 'value': 'event_2'}])
INFO:root:('Thanos', [{'timestamp': 1554203792.904399, 'user_id': 'Thanos', 'value': 'event_4'}])
INFO:root:('Thanos', [{'timestamp': 1554203784.904398, 'user_id': 'Thanos', 'value': 'event_3'}, {'timestamp': 1554203777.904395, 'user_id': 'Thanos', 'value': 'event_0'}, {'timestamp': 1554203778.904397, 'user_id': 'Thanos', 'value': 'event_1'}, {'timestamp': 1554203780.904398, 'user_id': 'Thanos', 'value': 'event_2'}])

If you want to avoid redundant information, such as having the user_id and timestamp as part of the values, they can be removed in the Map step. As for the complete use case (i.e. reducing the aggregated events on a per-session level), we can do things like counting the number of events or computing the session duration:

class AnalyzeSession(beam.DoFn):
  """Logs a per-session summary."""
  def process(self, element, window=beam.DoFn.WindowParam):
    user = element[0]
    num_events = len(element[1])
    window_end = window.end.to_utc_datetime()
    window_start = window.start.to_utc_datetime()
    session_duration = window_end - window_start

    logging.info(">>> User %s had %s event(s) in %s session", user, num_events, session_duration)

    yield element
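The "remove it in the Map step" part can be sketched like this. `key_on_user` is a hypothetical helper (not part of the code above); the field names follow the answer's sample data:

```python
# Hypothetical helper: drop the redundant user_id from the value when keying,
# since GroupByKey already carries it in the key. The timestamp stays in the
# value here, but could be dropped too once it is attached to the element.
def key_on_user(event):
    value = {k: v for k, v in event.items() if k != 'user_id'}
    return (event['user_id'], value)
```

In the pipeline this would replace the `keyed_on_user_id` step, i.e. `'keyed_on_user_id' >> beam.Map(key_on_user)`.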

which, for my example, will output the following:

INFO:root:>>> User Groot had 2 event(s) in 0:00:07 session
INFO:root:>>> User Groot had 1 event(s) in 0:00:05 session
INFO:root:>>> User Thanos had 4 event(s) in 0:00:12 session
INFO:root:>>> User Thanos had 1 event(s) in 0:00:05 session

Full code here

Guillem Xercavins
  Thanks @Guillem — I've accepted this as the answer, but due to the bugginess I experienced and the lack of community around Beam, I've moved to Flink. Stats for me/someone who's never worked with either before: 16h with Beam, failed to do the above; 6h with Flink, succeeded with session windows on the first try and could ALSO flush/purge the window and control the window 'lookahead' based on data. – Henrik Apr 06 '19 at 11:54