
I am trying to build a Dataflow pipeline in Python. The main input stream comes from Pub/Sub, and the main processing function takes a side input that is updated, fairly irregularly, from another Pub/Sub stream. I have written the following code to test my design:

from apache_beam import Pipeline, Map, WindowInto, io, pvalue
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

def print_and_return(x):
    print('---DEBUG---: ' + str(x))
    return x

def comb(x, y):
    return f'Main input: {x}, side input: {y}'

def load_side_input(pubsub_message):
    import json
    message = pubsub_message.decode("utf8")
    side_input = json.loads(message)
    return ('side', side_input)


def run(input_subscription, side_input_sub, pipeline_args=None):
    pipeline_options = PipelineOptions(
        pipeline_args, streaming=True, save_main_session=True
    )

    with Pipeline(options=pipeline_options) as pipeline:
        side_input = (
            pipeline
            | "Side impulse" >> io.ReadFromPubSub(subscription=side_input_sub)
            | "Window side" >> WindowInto(window.GlobalWindows(), trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                                          accumulation_mode=trigger.AccumulationMode.DISCARDING)
            | "Parse side input" >> Map(load_side_input)
        )
        (
            pipeline
            | "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription, with_attributes=True)
            | "Window" >> WindowInto(window.FixedWindows(10))
            | "Add sideinput" >> Map(comb, y=pvalue.AsDict(side_input))
            | "Print" >> Map(print_and_return)
        )

When I run it locally in debug mode to test, the load_side_input function triggers (I know because a breakpoint placed inside it gets hit), but the rest (comb and print_and_return) don't. My understanding is that the FixedWindows should trigger every 10 seconds on the main input, and Beam would match that window with the last firing of the trigger on the side input, since the side input is in a global window; but in fact nothing happens.

What am I missing, why is there no output?

EDIT:

After days of trying things out and even asking on the Beam users mailing list, I had the idea that it might just be the local runner acting up, and sure enough, after deploying to Dataflow the pipeline works as expected. It's annoying to test this way, though, so it would still be nice to know what the problem is and how to solve it.
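For reference, this is roughly the entry point I use to pass runner flags through to run() when deploying (a sketch; the flag names --input_subscription and --side_input_sub are my own choice, and the call to run() is left commented out so the snippet stands alone):

```python
import argparse

def parse_pipeline_args(argv=None):
    # Split this pipeline's own flags from the remaining Beam options
    # (e.g. --runner=DataflowRunner, --project=..., --region=...).
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_subscription', required=True)
    parser.add_argument('--side_input_sub', required=True)
    return parser.parse_known_args(argv)

if __name__ == '__main__':
    known_args, pipeline_args = parse_pipeline_args()
    # run() is the function defined above:
    # run(known_args.input_subscription, known_args.side_input_sub, pipeline_args)
```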

mani

1 Answer


It's not triggering because the windows of the main and side inputs don't match (fixed vs. global), so the runner cannot fetch the side input data.

Since you are not using any aggregation, you can simply remove the windowing from the main input. If you plan to aggregate, re-window after joining with the side input.
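Following that suggestion, the main branch would be restructured roughly as below (a sketch based on your code; comb, input_subscription and side_input are the names from your question, and the Beam imports are deferred into the function so the snippet imports cleanly even without Beam installed):

```python
def comb(x, y):
    return f'Main input: {x}, side input: {y}'

def build_main_branch(pipeline, input_subscription, side_input):
    import apache_beam as beam
    from apache_beam import io, pvalue
    from apache_beam.transforms import window

    return (
        pipeline
        # The main stream stays in the global window, matching the
        # side input's window, so the side input can be fetched.
        | "Read from Pub/Sub" >> io.ReadFromPubSub(
            subscription=input_subscription, with_attributes=True)
        | "Add side input" >> beam.Map(comb, y=pvalue.AsDict(side_input))
        # Re-window only if a later step needs to aggregate:
        | "Window for aggregation" >> beam.WindowInto(window.FixedWindows(10))
    )
```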

You have an example here:

Apache Beam Cloud Dataflow Streaming Stuck Side Input

Iñigo
  • Hi, thanks for your answer. Unfortunately that didn't work either. My approach is based on the first pattern in the following documentation: https://beam.apache.org/documentation/patterns/side-inputs/ There they explicitly mention: "You can retrieve side inputs from global windows to use them in a pipeline job with non-global windows, like a FixedWindow." – mani Mar 28 '22 at 14:47