I have a streaming Dataflow pipeline that uses an Apache Beam side input built from a bounded data source, and that source may be updated. How do I trigger a periodic refresh of this side input? For example, the side input should be refreshed once every 12 hours.

With reference to https://beam.apache.org/documentation/patterns/side-inputs/, this is how I implemented the pipeline with side input:

PCollectionView<Map<Integer, Map<String, Double>>> sideInput = pipeline
        // We can think of it as generating a "fake" event every WINDOW_SIZE minutes
        .apply("Use GenerateSequence source transform to periodically emit a value",
            GenerateSequence.from(0).withRate(1, Duration.standardMinutes(WINDOW_SIZE)))
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(WINDOW_SIZE))))
        .apply(Sum.longsGlobally().withoutDefaults()) // what does this do?
        .apply("DoFn periodically pulls data from a bounded source", ParDo.of(new FetchData()))
        .apply("Build new Window whenever side input is called",
            Window.<Map<Integer, Map<String, Double>>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
        .apply(View.asSingleton());


pipeline
 .apply(...)
 .apply("Add location to Event",
            ParDo.of(new DoFn<>).withSideInputs(sideInput))
 .apply(...)

Is this the correct way to implement it?

Mazlum Tosun
yeong

1 Answer

You can follow the "Slowly updating side input using windowing" section of the page you linked. It suggests PeriodicImpulse, which produces a sequence of elements at fixed runtime intervals, so you do not need the GenerateSequence and Sum steps to drive the refresh.
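
As a rough illustration of that pattern, here is a minimal sketch, not a definitive implementation. The class name `SlowlyUpdatingSideInput`, the helper `buildSideInput`, and the body of `FetchData` are hypothetical placeholders (the question does not show what `FetchData` does); `PeriodicImpulse.create()`, `withInterval`, and `applyWindowing` are from the Beam Java SDK. It assumes a 12-hour refresh, as in the question.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.PeriodicImpulse;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class SlowlyUpdatingSideInput {

  /**
   * Hypothetical stand-in for the question's FetchData: it is invoked once per
   * impulse and should reload the lookup data from the bounded source.
   */
  static class FetchData extends DoFn<Instant, Map<Integer, Map<String, Double>>> {
    @ProcessElement
    public void processElement(OutputReceiver<Map<Integer, Map<String, Double>>> out) {
      Map<Integer, Map<String, Double>> data = new HashMap<>();
      // Placeholder (assumption): query the bounded source here and fill `data`.
      out.output(data);
    }
  }

  /** Builds a side input view that is recomputed once per refresh interval. */
  static PCollectionView<Map<Integer, Map<String, Double>>> buildSideInput(
      Pipeline pipeline, Duration refreshInterval) {
    return pipeline
        // Emits one Instant per interval; applyWindowing() places each impulse
        // in its own fixed window whose size equals the interval.
        .apply(
            "Periodic impulse",
            PeriodicImpulse.create().withInterval(refreshInterval).applyWindowing())
        // One fetch per impulse, so each window holds exactly one map.
        .apply("Fetch data from bounded source", ParDo.of(new FetchData()))
        .apply(View.asSingleton());
  }
}
```

You would then call `buildSideInput(pipeline, Duration.standardHours(12))` and pass the result to `withSideInputs(...)` as in your current code. Because the side input is now in fixed windows rather than the global window, the main input most likely needs to be windowed as well (for example into fixed windows), so that each main-input window can be matched to a side-input window.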

Sakshi Gatyan
Bruno Volpato