
What I am trying to do:

  1. Consume JSON messages from a Pub/Sub subscription using an Apache Beam streaming pipeline with the Dataflow runner

  2. Unmarshal payload strings into objects.

    • Assume 'messageId' is the unique ID of an incoming message, e.g. msgid1, msgid2, etc.
  3. Retrieve child records from a database for each object produced in #2. The same child record can apply to multiple messages.

    • Assume 'childId' is the unique ID of a child record, e.g. cid1234, cid1235, etc.
  4. Group child records by their unique ID, as shown in the example below

    • KV.of(cid1234,Map.of(msgid1, msgid2)) and KV.of(cid1235,Map.of(msgid1, msgid2))
  5. Write the grouped result, at the childId level, to the database

Questions:

  1. Where should the windowing be introduced? We currently have 30-minute fixed windows after step #1.

  2. How does Beam define the start and end time of a 30-minute window? Is it right after we start the pipeline, or after the first message of a batch arrives?

  3. What if steps 2 to 5 take more than 1 hour for a window and the next window's batch is ready? Would both windows' batches get processed in parallel?

  4. How can I make the next window's messages wait until the previous window's batch is completed?

    • If we don't do this, the result at the childId level will be overwritten by the next batches.

Code snippet:

         PCollection<PubsubMessage> messages = pipeline.apply("ReadPubSubSubscription",
             PubsubIO.readMessagesWithAttributes()
                 .fromSubscription("projects/project1/subscriptions/subscription1"));

         PCollection<PubsubMessage> windowedMessages = messages.apply(Window.into(FixedWindows
             .of(Duration.standardMinutes(30))));
             
         PCollectionTuple unmarshalResultTuple = windowedMessages.apply("UnmarshalJsonStrings",
             ParDo.of(new JsonUnmarshallFn())
                 .withOutputTags(JsonUnmarshallFn.mainOutputTag,
                     TupleTagList.of(JsonUnmarshallFn.deadLetterTag)));

         PCollectionTuple childRecordsTuple = unmarshalResultTuple
             .get(JsonUnmarshallFn.mainOutputTag)
             .apply("FetchChildsFromDBAndProcess",
                 ParDo.of(new ChildsReadFn())
                     .withOutputTags(ChildsReadFn.mainOutputTag,
                         TupleTagList.of(ChildsReadFn.deadLetterTag)));

         // input is KV of (childId, msgids), output is mutations to write to BT
         PCollectionTuple postProcessTuple = childRecordsTuple
             .get(ChildsReadFn.mainOutputTag)
             .apply(GroupByKey.create())
             .apply("UpdateChildAssociations",
                 ParDo.of(new ChildsProcessorFn())
                     .withOutputTags(ChildsProcessorFn.mutations,
                         TupleTagList.of(ChildsProcessorFn.deadLetterTag)));

         postProcessTuple.get(ChildsProcessorFn.mutations)
             .apply("WriteToBigtable", CloudBigtableIO.writeToTable(...));

1 Answer


Addressing each of your questions.

Regarding questions 1 and 2: when you use windowing within Apache Beam, you need to understand that the windows exist "before the job". What I mean is that fixed windows are aligned to the Unix epoch (timestamp = 0), not to the pipeline start time or to the first message of a batch. In other words, each element is allocated to whichever fixed time range its event timestamp falls into. For example, with fixed 60-second windows:

  PCollection<String> items = ...;
  PCollection<String> fixedWindowedItems = items.apply(
      Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))));

First window: [0s, 60s); second: [60s, 120s); and so on. Please refer to the Beam documentation on windowing.
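To see this alignment concretely, here is a minimal sketch (the timestamp is invented for illustration) that asks a FixedWindows fn directly which window a given event time falls into:

  FixedWindows thirtyMin = FixedWindows.of(Duration.standardMinutes(30));
  // An element timestamped 10:47 UTC lands in [10:30, 11:00), regardless of
  // when the pipeline started or when the first message arrived.
  IntervalWindow w = thirtyMin.assignWindow(Instant.parse("2022-01-01T10:47:00Z"));
  System.out.println(w.start() + " .. " + w.end());
  // prints: 2022-01-01T10:30:00.000Z .. 2022-01-01T11:00:00.000Z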

About question 3: by default, windowing and triggering in Apache Beam drop late data. However, it is possible to configure the handling of late data using withAllowedLateness. To do so, it is necessary to understand the concept of watermarks first. The watermark is the system's estimate of how far behind in event time the input currently is. For example, with 3 seconds of allowed lateness, data arriving up to 3 seconds behind the watermark is still assigned to the right window. For data arriving past the allowed lateness, you define what happens to it: you can reprocess the window or ignore the data using triggers.

withAllowedLateness

  PCollection<String> items = ...;
  PCollection<String> fixedWindowedItems = items.apply(
      Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .withAllowedLateness(Duration.standardDays(2)));

Note that this sets how long after the end of the window late data is still accepted.

Triggering

  PCollection<String> pc = ...;
  pc.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
      .triggering(AfterProcessingTime.pastFirstElementInPane()
          .plusDelayOf(Duration.standardMinutes(1)))
      .withAllowedLateness(Duration.standardMinutes(30))
      // an accumulation mode is required whenever a non-default trigger is set
      .discardingFiredPanes());

Notice that the window is re-processed and re-computed every time late data arrives. This trigger gives you the opportunity to react to the late data.
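Whether a re-fired pane replaces or only supplements the earlier result depends on the accumulation mode that Beam requires you to set alongside a trigger. A minimal sketch contrasting the two options (the window and trigger values are just examples):

  PCollection<String> pc = ...;
  pc.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
      .triggering(AfterWatermark.pastEndOfWindow()
          .withLateFirings(AfterPane.elementCountAtLeast(1)))
      .withAllowedLateness(Duration.standardMinutes(30))
      // accumulatingFiredPanes(): every firing re-emits the whole window so far,
      // so a late pane carries the complete, updated result.
      // discardingFiredPanes() would emit only the newly arrived elements.
      .accumulatingFiredPanes());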

Finally, about question 4, which is partially explained by the concepts described above: the computations occur within each fixed window and are recomputed/reprocessed every time a trigger fires. This logic guarantees that your data ends up in the right window.
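Applied to your pipeline, a hedged sketch (reusing the names from your snippet; the late-firing trigger and the lateness value are assumptions to tune): with accumulating panes, every firing of a 30-minute window re-emits the complete childId grouping, so a late firing overwrites the earlier write with a full result rather than a partial one.

  PCollection<PubsubMessage> windowedMessages = messages.apply("Window30Min",
      Window.<PubsubMessage>into(FixedWindows.of(Duration.standardMinutes(30)))
          .triggering(AfterWatermark.pastEndOfWindow()
              .withLateFirings(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.standardMinutes(30)) // assumption: tune to your data
          .accumulatingFiredPanes());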

Alexandre Moraes