I am streaming messages from a Kafka topic using the KafkaIO API: https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/kafka/KafkaIO.html

The pipeline flow is as follows:

KafkaStream --> Decode message using transformer --> Save to BigQuery

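In code, this is roughly what the pipeline looks like (a simplified sketch; the broker address, topic name, table spec, and decode logic below are placeholders for my actual setup):

import com.google.api.services.bigquery.model.TableRow
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO
import org.apache.beam.sdk.io.kafka.KafkaIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.{DoFn, ParDo}
import org.apache.beam.sdk.values.KV
import org.apache.kafka.common.serialization.ByteArrayDeserializer

// Decodes the raw Kafka value bytes into a TableRow (the decode logic is a placeholder).
class DecodeFn extends DoFn[KV[Array[Byte], Array[Byte]], TableRow] {
  @ProcessElement
  def processElement(c: DoFn[KV[Array[Byte], Array[Byte]], TableRow]#ProcessContext): Unit = {
    val payload = new String(c.element().getValue, "UTF-8")
    c.output(new TableRow().set("payload", payload))
  }
}

val pipeline = Pipeline.create(PipelineOptionsFactory.create())
pipeline
  .apply(KafkaIO.read[Array[Byte], Array[Byte]]()
    .withBootstrapServers("kafka:9092")   // placeholder broker address
    .withTopic("my-topic")                // placeholder topic
    .withKeyDeserializer(classOf[ByteArrayDeserializer])
    .withValueDeserializer(classOf[ByteArrayDeserializer])
    .withoutMetadata())                   // yields KV[Array[Byte], Array[Byte]]
  .apply(ParDo.of(new DecodeFn))
  .apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table")          // placeholder table spec
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)) // table assumed to exist
pipeline.run()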
I decode the messages and save them to BigQuery using BigQueryIO. Currently I apply the following windowing:

Window.into[Array[Byte]](FixedWindows.of(Duration.standardSeconds(10)))
  .triggering(
    Repeatedly.forever(
      AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(10))
    )
  )
  .withAllowedLateness(Duration.standardSeconds(0))
  .discardingFiredPanes()

As per the documentation, a window is required when doing a computation such as GroupByKey, etc. Since I am just decoding the Array[Byte] messages and storing them in BigQuery, it may not be required.

Please let me know: do I need to use a window or not?

ASe

1 Answer


There is an answer already posted to a similar question, where the data is being streamed from PubSub. The main idea is that it is impossible to collect all of the elements of an unbounded PCollection, since new elements are constantly being added, and therefore one of two strategies must be implemented:

  • Windowing: you should first set a non-global windowing function.
  • Triggers: you can set up a trigger for an unbounded PCollection so that it provides periodic updates on an unbounded dataset, even if the data in the subscription is still flowing (see the sketch after this list).

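For example, here is a minimal sketch combining the two points, applying a fixed window with a processing-time trigger before the BigQuery write. It reuses the window and trigger settings from the question; `decoded` is assumed to be the unbounded PCollection[TableRow] produced by the question's decode step:

import com.google.api.services.bigquery.model.TableRow
import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, FixedWindows, Repeatedly, Window}
import org.joda.time.Duration

// Non-global window plus a trigger that fires repeatedly on processing time,
// so the unbounded collection is emitted in periodic panes.
val windowed = decoded.apply(
  Window.into[TableRow](FixedWindows.of(Duration.standardSeconds(10)))
    .triggering(Repeatedly.forever(
      AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(10))))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes())

The windowed collection can then be passed to BigQueryIO as before.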
It might also be necessary to enable streaming in the pipeline by setting the corresponding pipeline option, using the following command:

pipeline_options.view_as(StandardOptions).streaming = True
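That snippet is for the Python SDK; with the Java/Scala SDK used in the question, the equivalent would be along these lines (a sketch, assuming `options` is your existing PipelineOptions):

import org.apache.beam.sdk.options.StreamingOptions

// Switch the pipeline into streaming mode before Pipeline.create(options).
options.as(classOf[StreamingOptions]).setStreaming(true)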
Philipp Sh