
I am trying to implement a Spring Boot AWS Kinesis consumer that can be auto-scaled in order to share the load (split the processing of shards) with the original instance.

What I have been able to do: using the well-defined README and the examples available in the Kinesis binder docs, I have been able to start up multiple consumers that actually divide the shards for processing by supplying these properties.

On the producer, I supply partitionCount: 2 via an application property, and on the consumers I supply both the instanceIndex and the instanceCount.

On consumer 1 I have instanceIndex=0 and instanceCount=2; on consumer 2 I have instanceIndex=1 and instanceCount=2.
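For reference, a minimal sketch of those per-instance properties, assuming two deployments of the same consumer app (the stream name is a placeholder; instanceCount and instanceIndex are the standard Spring Cloud Stream properties):

    # consumer 1 (application.yml); consumer 2 is identical except instanceIndex: 1
    spring:
      cloud:
        stream:
          instanceCount: 2
          instanceIndex: 0
          bindings:
            input:
              destination: <stream-name>
              group: mygroup

With these static settings, each instance only ever claims the shards assigned to its own index.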

This works fine, and I have two Spring Boot applications dealing with their specific shards. But in this case I have to have a pre-configured properties file per Boot application, available at startup, for them to split the load. And if I only start up the first consumer (non-auto-scaled), it only processes the shards specific to index 0, leaving the other shards unprocessed.

What I would like to do, but am not sure is possible, is to have a single consumer deployed that processes all shards. If I deploy another instance, I would like it to relieve the first consumer of some of the load. In other words, if I have 2 shards and one consumer, it processes both; if I then deploy another app, the first consumer should process from only a single shard, leaving the second shard to the second consumer.

I have tried to do this by not specifying instanceIndex or instanceCount on the consumers and only supplying the group name, but that leaves the second consumer idle until the first one is shut down. FYI, I have also created my own metadata and locking tables, preventing the binder from creating the default ones.

Configurations: Producer -----------------

originator: KinesisProducer
server:
  port: 8090

spring:
  cloud:
    stream:
      bindings:
        output:
          destination: <stream-name>
          content-type: application/json
          producer:
            headerMode: none
            partitionKeyExpression: headers.type
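Note that the partitionCount mentioned above does not appear in this snippet; assuming it is set as a regular binding property, it would sit alongside the other producer settings, roughly like this:

    spring:
      cloud:
        stream:
          bindings:
            output:
              producer:
                # one partition per shard that the consumers should split
                partitionCount: 2

Also, with partitionKeyExpression: headers.type, every outgoing message must carry a type header, which is what the binder uses to select the target partition.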

Consumers -------------------------------------

originator: KinesisSink
server:
  port: 8091

spring:
  cloud:
    stream:
      kinesis:
        bindings:
          input:
            consumer:
              listenerMode: batch
              recordsLimit: 10
              shardIteratorType: TRIM_HORIZON
        binder:
          checkpoint:
            table: <checkpoint-table>
          locks:
            table: <locking-table>
      bindings:
        input:
          destination: <stream-name>
          content-type: application/json
          consumer:
            concurrency: 1
            listenerMode: batch
            useNativeDecoding: true
            recordsLimit: 10
            idleBetweenPolls: 250
            partitioned: true
          group: mygroup
czgaljic

1 Answer


That’s correct; that’s how it works for now: if only one consumer is there, it takes all the shards for processing. The second one will take action only if the first one is somehow broken, for at least one shard.

Proper Kafka-like rebalancing is on our roadmap. We don’t have a solid vision for it yet, so an issue on the matter and a subsequent contribution are welcome!

Artem Bilan
  • Thanks for the quick response. Sorry, but one last question: if another consumer (a backup consumer) were created, would the second instance have a chance to claim a lock on any new shards? I ask because, in the time I have spent testing this, I have seen situations where starting up two consumers in a race condition would allow both to claim individual shards. – czgaljic Sep 19 '18 at 17:52
  • When resharding happens on AWS Kinesis, all the consumers are involved in the distribution process, but only one can obtain a distributed lock for one shard. – Artem Bilan Sep 19 '18 at 17:55
  • The issue on the matter for traceability: https://github.com/spring-projects/spring-integration-aws/issues/99 – Artem Bilan Sep 20 '18 at 18:42
  • I am also trying to get a consumer configuration like what @czgaljic describes in the original post. With binder version 1.2.0, does this work differently now? – Keith Bennett Aug 30 '19 at 19:51
  • It has to. You may also consider using the Kinesis Client Library as an option for a different consumption approach: https://github.com/spring-cloud/spring-cloud-stream-binder-aws-kinesis/blob/master/spring-cloud-stream-binder-kinesis-docs/src/main/asciidoc/overview.adoc#kinesis-binder-properties. See the `kplKclEnabled` option. – Artem Bilan Aug 30 '19 at 20:01 (see the configuration sketch after these comments)
  • By using the KCL, are we able to configure a single consumer configuration where multiple shards can be processed concurrently without the instanceCount/instanceIndex configured per service? If so, is there anything additional we need to do beyond setting kplKclEnabled to true? – Keith Bennett Aug 30 '19 at 20:08
  • I don't think so. You should be good with the default KCL options to distribute shards between consumers using its DynamoDB capabilities. – Artem Bilan Aug 30 '19 at 20:13
  • OK. So, by using KCL, is the consumer group configuration still needed? As described at https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-kcl.html, "You can run a KCL application on any number of instances. Multiple instances of the same application coordinate on failures and load balance dynamically. You can also have multiple KCL applications working on the same stream, subject to throughput limits." Based on this description, it appears that the KCL takes care of what the consumer group would otherwise do. – Keith Bennett Aug 30 '19 at 20:19
  • Correct, but how would that KCL know how to manage your instances if you don't provide a consumer group? In terms of Spring Cloud Stream it is a `consumer group`, but it is fully mapped to the `applicationName` in KCL. That's the same as what we called a `cluster` in the past. – Artem Bilan Aug 30 '19 at 20:23
  • Just to make sure I understand, we still need to specify the consumer group, even when we use KCL, right? If so, are you saying that when we specify a consumer group, Spring Cloud Stream takes this value and maps it to KCL's applicationName property? – Keith Bennett Sep 03 '19 at 17:55
  • Your understanding is fully correct. We need this value to propagate down to the client and make sure that several instances of our KCL are managed in the same group. – Artem Bilan Sep 03 '19 at 18:02
  • Is there a suggested checkpointMode that you recommend when using the KCL? According to the AWS documentation, the KCL takes care of checkpointing processed records, but I am still concerned about the edge case where an exception occurs and we want to guarantee that the Kinesis record that encountered the exception is eventually processed successfully. Today we're using checkpointMode=manual to ensure this, checkpointing only as the last step of our methods, and only if no runtime exceptions were thrown. – Keith Bennett Sep 03 '19 at 18:29
  • That's one of the ways to go. Another is a retry advice downstream, around your handler, to ensure that you don't fail with an exception and lose the message. See the Spring Integration docs for more info: https://docs.spring.io/spring-integration/reference/html/#retry-advice – Artem Bilan Sep 03 '19 at 18:33
  • As more information: we're using @ServiceActivator with the built-in .errors channel from Spring Integration to shut down our microservice after retrying the failed message a number of times using Spring Retry. Our service orchestration software will then start a new instance for us, and the microservice will keep trying via this process-message/encounter-failure/retry-a-number-of-times/shutdown/restart cycle. – Keith Bennett Sep 03 '19 at 18:35
  • So, you have everything you need. Another way is to send the record back to the stream when you encounter an error. – Artem Bilan Sep 03 '19 at 18:37
  • One last question on this, so as to avoid an extended discussion here: do you see a downside to using checkpointMode=manual from a performance standpoint, given the solution we've got in place? It's not clear to me what criteria to use when deciding between batch, manual, or any other value for checkpointMode. – Keith Bennett Sep 03 '19 at 18:40
  • No difference if you use batch-based vs. manual checkpointing; also no difference if you use record-based vs. manual checkpointing. – Artem Bilan Sep 03 '19 at 18:43
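Pulling the KCL discussion together: a minimal sketch of what the consumer configuration might look like with the KCL route, assuming binder 1.2.0+ where `kplKclEnabled` is a binder-level property as described in the overview doc linked above (stream name is a placeholder; `checkpointMode: manual` reflects the error-handling approach discussed in the comments and is optional):

    spring:
      cloud:
        stream:
          kinesis:
            binder:
              # hand shard distribution over to the KCL and its DynamoDB lease table
              kplKclEnabled: true
            bindings:
              input:
                consumer:
                  # optional: keep manual checkpointing for the exception edge case
                  checkpointMode: manual
          bindings:
            input:
              destination: <stream-name>
              content-type: application/json
              # maps to the KCL applicationName, grouping all instances into one cluster
              group: mygroup

Note there is no instanceIndex/instanceCount here: the point of this approach is that shard leases are balanced dynamically as instances come and go.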