
I've seen some similar questions around this issue suggesting that low throughput from PubSub can cause problems; however, I have more than enough data coming through to push things along...

This is a Python streaming pipeline, reading data from PubSub with the ultimate goal of writing records to Redis (Memorystore) to use as a cache.

with beam.Pipeline(options=pipeline_options) as p:
    windowed_history_events = (
        p
        | "Read input from PubSub" >> beam.io.ReadFromPubSub(
            subscription=known_args.subscription)
        | "Parse string message to dict" >> beam.ParDo(ParseMessageToDict())
        | "Filter non-page views" >> beam.Filter(is_page_view)
        | "Create timestamp KV" >> beam.ParDo(CreateTimestampKV())
        | "Window for writes" >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterCount(10)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | "Get user and content ID" >> beam.ParDo(ParseMessageToKV())
        | "Group by user ID" >> beam.GroupByKey()
        | "Create timestamp KV2" >> beam.ParDo(TmpDOFN())
        | "Push content history to Memorystore" >> beam.ParDo(
            ConnectToRedis(known_args.host, known_args.port)))

The TmpDOFN DoFn after the GroupByKey step is just there as a debug step right now; it simply prints out messages to make sure something is going through it:

class TmpDOFN(beam.DoFn):
    def process(self, message):
        print(message)
        yield message
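
As an aside (an assumption about where your logs end up, not something from the question): on the Dataflow runner, print output from workers can be easy to miss, while the standard logging module reliably surfaces in the worker logs in Cloud Logging. A sketch of the same debug DoFn using logging:

import logging

import apache_beam as beam


class TmpDOFN(beam.DoFn):
    def process(self, message):
        # logging.info shows up in Dataflow worker logs (Cloud Logging),
        # where print output can be harder to find.
        logging.info("Grouped element: %s", message)
        yield message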

However, this never gets called and nothing is printed (and PyCharm's breakpoint is never triggered). As I understand it, the window/trigger I have set up at the moment should simply emit a pane every 10 messages, which is then grouped and passed to the next step.

If I remove the GroupByKey step, messages are printed out as expected and the pipeline continues.

I tried this with FixedWindows previously and ran into the same problem.

Any ideas?

Thanks

  • Is it possible that you don't have 10 elements for a given key, which would cause your trigger not to fire once you add the GroupByKey? Since there are more than 10 elements across all keys, the trigger fires without it. Please consider logging the keys as well, reducing AfterCount to a lower number, etc., to debug and collect more information. You may also consider creating a composite trigger using AfterProcessingTime and AfterAny to allow your pipeline to emit elements for keys with fewer than 10 elements (a sketch follows these comments). https://beam.apache.org/documentation/programming-guide/#composite-triggers – Alex Amato Mar 18 '20 at 22:22
  • @AlexAmato thanks for the suggestion! I thought AfterCount would trigger on the total items in the window, not the items per key in the GroupBy, so that definitely helped. I've tried a composite of AfterCount(1) with AfterProcessingTime(30) and still have the issue of nothing leaving the GroupByKey step, so back to square one unfortunately – O Bishop Mar 19 '20 at 12:09
  • Unfortunately, nothing else obvious stands out as an issue to me. Have you tried running it on the direct runner instead of the Dataflow runner, just to see if that changes anything? Also, consider using a Create transform to produce events from memory (on the direct runner) instead of the PubSub IO. Finally, the last thing you could change to help debug is switching from the global window to a fixed window. Note: GlobalWindows can block streaming pipelines, but I believe your usage is safe; that is, you aren't trying to aggregate a never-ending window (which is not possible). – Alex Amato Mar 19 '20 at 17:54
  • Can you share the code snippet for `beam.ParDo(ParseMessageToKV())` ? – Jayadeep Jayaraman Mar 28 '20 at 14:48
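
For what it's worth, here is a minimal sketch combining the two suggestions above, with made-up sample data: a composite trigger (so keys with fewer than 10 elements still fire) plus an in-memory source via beam.Create on the direct runner, taking PubSub IO out of the equation. Note that with a bounded source the trigger will typically just fire when the input is exhausted, which is still enough to check whether GroupByKey emits anything at all:

import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:  # no runner specified -> direct runner
    (p
     | "Create test events" >> beam.Create(
         [("user1", "contentA"), ("user1", "contentB"), ("user2", "contentC")])
     | "Window for writes" >> beam.WindowInto(
         window.GlobalWindows(),
         # Fire on whichever comes first: 10 elements for a key, or 30s of
         # processing time after the first element of the pane.
         trigger=trigger.Repeatedly(
             trigger.AfterAny(trigger.AfterCount(10),
                              trigger.AfterProcessingTime(30))),
         accumulation_mode=trigger.AccumulationMode.DISCARDING)
     | "Group by user ID" >> beam.GroupByKey()
     | "Print groups" >> beam.Map(print))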

1 Answer


I experienced a similar issue, with GroupByKey producing no output and Latest.PerKey throwing a runtime error, when I was windowing the data and then trying to aggregate the output.

My debug approach was to do the GroupByKey on the global window before introducing any windowing, which worked just fine. That narrowed it down to a windowing problem: GroupByKey worked in the global window but not in my fixed or sliding windows.

  • Issue: a rookie error on my part, with event times far beyond the allowed lateness

    • I was building my pipeline on historical data that was several weeks old, using event time. My windows discarded the records because their event times were so far behind the watermark / beyond the allowed lateness, hence the empty results / error.
  • Workaround: temporarily use processing time (allowing for later arrival would also have worked)

    • For the purpose of building my pipeline on historical data, I switched to processing time for the event timestamps and the aggregation worked just fine; a sketch follows below. In the real world I wouldn't expect this as a sunny-day scenario, but it was a good lesson learnt: cater for very, very late-arriving records.
    • I could also have simply allowed for this longer late arrival, although for a streaming pipeline with a few weeks' delay I'd rather handle the error in Python as a try/except scenario.
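
For reference, a minimal sketch of that workaround, with a hypothetical DoFn name of my own (AddProcessingTimestamp is not from the original pipeline): re-stamping each element with the current wall-clock time, so the watermark no longer considers the historical records droppably late. Raising allowed_lateness on the WindowInto would be the alternative mentioned above:

import time

import apache_beam as beam
from apache_beam.transforms import window


class AddProcessingTimestamp(beam.DoFn):
    """Hypothetical DoFn: overrides each element's event timestamp
    with the current processing time."""
    def process(self, element):
        yield window.TimestampedValue(element, time.time())


# Usage sketch: stamp first, then window as usual, e.g.
#   ... | beam.ParDo(AddProcessingTimestamp())
#       | beam.WindowInto(window.FixedWindows(60))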