0

I'm working on my first Apache Beam pipeline to process data streams from AWS Kinesis. I'm familiar with the concepts of Kafka, such as how it handles consumer offsets/state, and have experience implementing Apache Storm/Spark processing.

After going through the documentation, I succeeded in creating a working Beam pipeline using the KinesisIO Java SDK that listens to an AWS Kinesis data stream, transforms the messages, and prints them (a rough sketch follows the list below). However, I would like a reference implementation or pointers on how the areas below are handled in Apache Beam with respect to KinesisIO:

  1. How is a consumer application uniquely identified in Kinesis streams (similar to the consumer group ID in Kafka)? Am I right to say that it's based on the Apache Beam application name, and that any consumer that uses the KCL tracks its state in DynamoDB? Is that always true, including with Apache Beam KinesisIO?

  2. How to force a consumer to resume processing a stream's shards from where it left off earlier, e.g. when the consumer is restarted or an exception occurs during processing (similar to offset management per consumer group ID in Kafka). InitialPositionInStream.TRIM_HORIZON always starts from the earliest available data, even if I restart the pipeline after processing a handful of records from the Kinesis stream.

  3. How do acks work in Kinesis data streams, i.e. how does the consumer ack/update the checkpoint to mark that records pulled via getRecords() have been processed, before advancing its position in the shard? Is there any way to control this behaviour in the consumer application, i.e. to decide when to ack a message so that the application state is saved and the consumer resumes from that position whenever it is restarted?

  4. What is the impact of a business exception (in any stage of the pipeline) while processing the stream on subsequent data pulls from Kinesis, i.e. does the application continue to pull data or halt?
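For reference, here is a rough sketch along the lines of my pipeline; the stream name, credentials, and region are placeholders, and it assumes the org.apache.beam.sdk.io.kinesis module:

    import java.nio.charset.StandardCharsets;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kinesis.KinesisIO;
    import org.apache.beam.sdk.io.kinesis.KinesisRecord;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;

    public class KinesisPrintPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromKinesis", KinesisIO.read()
                .withStreamName("my-stream")  // placeholder
                .withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON)
                .withAWSClientsProvider("ACCESS_KEY", "SECRET_KEY", Regions.US_EAST_1))
         .apply("PrintRecords", ParDo.of(new DoFn<KinesisRecord, Void>() {
            @ProcessElement
            public void processElement(@Element KinesisRecord record) {
              // Decode the payload and print it with its shard position metadata
              String payload = new String(record.getDataAsBytes(), StandardCharsets.UTF_8);
              System.out.printf("shard=%s seq=%s payload=%s%n",
                  record.getShardId(), record.getSequenceNumber(), payload);
            }
          }));

        p.run().waitUntilFinish();
      }
    }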

Neel
  • 1

1 Answer

0
  1. KinesisIO.Read uses the AWS SDK under the hood to read from Kinesis, and it periodically refreshes the shard iterator to fetch records from each Kinesis shard.

  2. Did you try ShardIteratorType#LATEST for that? (See the configuration sketch after this list.)

  3. See my answer here: https://stackoverflow.com/a/62349838/10687325

  4. If it's an unknown exception, then the pipeline will be stopped.
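Regarding #2, here is a minimal configuration sketch (the stream name, credentials, and region are placeholders). InitialPositionInStream.LATEST maps to the LATEST shard iterator type; as far as I know, the initial position is only consulted when the runner has no checkpoint to restore, otherwise the restored checkpoint wins:

    import org.apache.beam.sdk.io.kinesis.KinesisIO;

    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;

    // Start at the tip of each shard instead of TRIM_HORIZON's earliest
    // retained record. This setting only takes effect when there is no
    // checkpoint to restore from; a restored checkpoint takes precedence.
    KinesisIO.Read read = KinesisIO.read()
        .withStreamName("my-stream")  // placeholder
        .withInitialPositionInStream(InitialPositionInStream.LATEST)
        .withAWSClientsProvider("ACCESS_KEY", "SECRET_KEY", Regions.US_EAST_1);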

Alexey Romanenko
  • 1,353
  • 5
  • 11
  • Thank you! For #1, how does it maintain the updates to the shard iterator, i.e. the last read record/sequence number in the shard? Is it in DynamoDB or in memory (tried with the Direct runner and the GCP Dataflow runner)? For #2, yes, I've tried that, but my requirement is to make sure the consumer application always reads (resumes) from the last read position in case the consumer app is down for some time and restarted. – Neel Aug 25 '20 at 05:34
  • In the case of the Direct runner, there is no durable storage for checkpoints anywhere. Moreover, that runner is not intended to be used in production environments; its main purpose is to run test pipelines. – PMvn Apr 13 '23 at 11:25
  • How does it maintain the updates to the shard iterator? It changes the internal checkpoints accordingly: new checkpoint parts are added when new shards appear, and old parts are discarded when a shard is closed. I am not aware of the GCP runner internals, but snapshots should keep Kinesis consumer checkpoints (https://cloud.google.com/dataflow/docs/guides/using-snapshots); in that case, the application will be restarted from the point where it left off. – PMvn Apr 13 '23 at 11:28