My question is very similar to another post: Apache Beam Cloud Dataflow Streaming Stuck Side Input.
However, I tried the resolution suggested there (applying GlobalWindows() to the side input), and it did not seem to fix my problem.
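For reference, this is roughly what I tried, based on that answer. This is a sketch: the trigger settings are my reading of what that answer intended, and raw_side stands in for my actual side input source.

import apache_beam
from apache_beam import window
from apache_beam.transforms import trigger

side_input = (
    raw_side  # hypothetical upstream PCollection
    | apache_beam.WindowInto(
        window.GlobalWindows(),
        # Re-fire the side input on every new element rather than waiting
        # for the (never-closing) global window to end.
        trigger=trigger.Repeatedly(trigger.AfterCount(1)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING)
)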
I have a Dataflow pipeline (though I'm using DirectRunner for debugging) built with the Python SDK, where the main input is logs from PubSub and the side input is associated data from a mostly unchanging database. I would like to join the two such that each log is paired with side input data from approximately the same time. Excess side input elements without an associated log can be dropped.
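To make the intended pairing concrete, this is the kind of windowed join I am aiming for. This is a hypothetical sketch: the 5-second fixed windows are an arbitrary choice, and raw_main/raw_side stand in for the real sources.

import apache_beam
from apache_beam import window

windowed_side = raw_side | 'WindowSide' >> apache_beam.WindowInto(
    window.FixedWindows(5))
windowed_main = raw_main | 'WindowMain' >> apache_beam.WindowInto(
    window.FixedWindows(5))
# Each main element should only see side elements from its own window,
# i.e. from (approximately) the same time.
joined = windowed_main | apache_beam.FlatMap(
    lambda left, rights: [(left, r) for r in rights],
    rights=apache_beam.pvalue.AsIter(windowed_side))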
The behavior I see is that the pipeline appears to operate as a single thread: it processes all of the side input elements first, then starts processing the main input elements. If the side input is bounded (non-streaming), this is fine, and the pipeline can merge the inputs and run to completion. If the side input is unbounded (streaming), however, the main input is blocked indefinitely, apparently waiting for the side input processing to finish.
To illustrate the behavior, I made the simplified test case below.
import logging

import apache_beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils import timestamp


class Logger(apache_beam.DoFn):
    """Logs every element it sees, tagged with a name for readability."""

    def __init__(self, name):
        self._name = name

    def process(self, element, w=apache_beam.DoFn.WindowParam,
                ts=apache_beam.DoFn.TimestampParam):
        logging.error('%s: %s', self._name, element)
        yield element


def cross_join(left, rights):
    for right in rights:
        yield (left, right)


def main():
    # DirectRunner for debugging; streaming=True so unbounded sources work.
    pipeline = apache_beam.Pipeline(options=PipelineOptions(streaming=True))

    start = timestamp.Timestamp.now()
    # Bounded side inputs work OK.
    stop = start + 20
    # Unbounded side inputs appear to block execution of main input
    # processing.
    # stop = timestamp.MAX_TIMESTAMP
    side_interval = 5
    main_interval = 1

    # Unique labels on each step, since both branches reuse the same
    # transform types.
    side_input = (
        pipeline
        | 'SideImpulse' >> PeriodicImpulse(
            start_timestamp=start,
            stop_timestamp=stop,
            fire_interval=side_interval,
            apply_windowing=True)
        | 'SideTag' >> apache_beam.Map(lambda x: ('side', x))
        | 'SideLogger' >> apache_beam.ParDo(Logger('side_input'))
    )

    main_input = (
        pipeline
        | 'MainImpulse' >> PeriodicImpulse(
            start_timestamp=start, stop_timestamp=stop,
            fire_interval=main_interval, apply_windowing=True)
        | 'MainTag' >> apache_beam.Map(lambda x: ('main', x))
        | 'MainLogger' >> apache_beam.ParDo(Logger('main_input'))
        # Pair every main element with the currently available side elements.
        | 'CrossJoin' >> apache_beam.FlatMap(
            cross_join, rights=apache_beam.pvalue.AsIter(side_input))
        | 'CrossJoinLogger' >> apache_beam.ParDo(Logger('cross_join_output'))
    )

    pipeline.run()


if __name__ == '__main__':
    main()
Am I missing something that is preventing the main inputs from being processed in parallel with the side inputs?