I am connecting to 12 Kafka brokers (6 in each data center):
- Kafka: Confluent 7.3.1
- Java: 11
- Camel: 3.16.0
- Spring Boot: 2.7.13

Relevant consumer configuration:
auto.commit.interval.ms = 2000
auto.offset.reset = latest
connections.max.idle.ms = 540000
enable.auto.commit = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
heartbeat.interval.ms = 3000
max.poll.interval.ms = 300000
max.poll.records = 10
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 40000
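For context, a sketch of how values like these could be supplied via `application.properties` with the camel-kafka Spring Boot starter. The `camel.component.kafka.*` keys below are my assumption of the starter's naming; our actual wiring may differ:

```properties
# Assumed camel-kafka-starter property keys; not copied from our real config
camel.component.kafka.auto-commit-enable=true
camel.component.kafka.auto-commit-interval-ms=2000
camel.component.kafka.auto-offset-reset=latest
camel.component.kafka.fetch-max-bytes=52428800
camel.component.kafka.fetch-wait-max-ms=500
camel.component.kafka.fetch-min-bytes=1
camel.component.kafka.heartbeat-interval-ms=3000
camel.component.kafka.max-poll-interval-ms=300000
camel.component.kafka.max-poll-records=10
camel.component.kafka.reconnect-backoff-max-ms=1000
camel.component.kafka.request-timeout-ms=40000
```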
Camel setup
@Autowired
DeadLetterChannelBuilder deadLetterHandler;

@Override
public void configure() {
    errorHandler(deadLetterHandler);
    from(ENDPOINT)
        .routeId(Constants.ROUTE_NAME)
        .process(entityMappingProcessor)
        .log("Persisted message into database");
}
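For reference, ENDPOINT is a camel-kafka URI; a hypothetical example of its shape (brokers, topic, and group here are placeholders, not our actual values):

```java
// Hypothetical endpoint URI; the real broker list, topic and group differ
private static final String ENDPOINT =
    "kafka:TOPIC-2?brokers=dc1-broker1:9092,dc2-broker1:9092"
    + "&groupId=my-consumer-group"
    + "&maxPollRecords=10"
    + "&autoOffsetReset=latest";
```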
@Bean
public RedeliveryPolicy redeliveryPolicy() {
    RedeliveryPolicy rp = new RedeliveryPolicy();
    rp.setMaximumRedeliveries(0);
    rp.setRedeliveryDelay(500);
    rp.setRetryAttemptedLogLevel(LoggingLevel.ERROR);
    return rp;
}
@Bean
public DeadLetterChannelBuilder bridgeDeadLetterHandler(RedeliveryPolicy redeliveryPolicy) {
    DeadLetterChannelBuilder bridgeDeadLetterHandler = new DeadLetterChannelBuilder();
    bridgeDeadLetterHandler.setDeadLetterUri(EXCEPTION_URI);
    bridgeDeadLetterHandler.setUseOriginalMessage(true);
    bridgeDeadLetterHandler.setRedeliveryPolicy(redeliveryPolicy);
    return bridgeDeadLetterHandler;
}
@Override
public void configure() {
    from(EXCEPTION_URI)
        .routeId("rt.generic.exception.route")
        .process(exceptionProcessor)
        .to(DEAD_LETTER_QUEUE);
}
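For completeness: camel-kafka also exposes a `pollOnError` option (DISCARD, ERROR_HANDLER, RECONNECT, RETRY, STOP; default ERROR_HANDLER) that governs what the consumer does when `poll()` throws. We are not currently setting it; a sketch of configuring it on the component, assuming the `setPollOnError` setter on `KafkaConfiguration` in camel-kafka 3.16.x:

```java
// Sketch only: ask camel-kafka to close and recreate the consumer on poll
// errors instead of deferring to the route's error handler.
@Bean
public KafkaComponent kafka(CamelContext context) {
    KafkaComponent kafka = new KafkaComponent();
    kafka.setCamelContext(context);
    kafka.getConfiguration().setPollOnError(PollOnError.RECONNECT);
    return kafka;
}
```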
When conducting a DR test, in which the brokers in one data center are shut down, we see the logs below. The consumer does not reconnect; either the partition gets assigned to the other consumer, or the same failure occurs on the other consumer and no messages are consumed from then on (until a service restart).
2023-08-08-15.18.48.348 [Camel (camel-1) thread #1 - KafkaConsumer[TOPIC-2]] ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - handle - AL0812120F - [Consumer clientId=consumer-2, groupId=my-consumer-group] Offset commit failed on partition TOPIC-2-1 at offset 588742: The coordinator is not aware of this member.
2023-08-08-15.18.48.348 [Camel (camel-1) thread #1 - KafkaConsumer[TOPIC-2]] WARN org.apache.camel.component.kafka.KafkaFetchRecords - startPolling - AL0812120230808T1… - Exception org.apache.kafka.clients.consumer.CommitFailedException caught while polling TOPIC-2-Thread 0 from kafka topic TOPIC-2 at offset {TOPIC-2/1=588741}: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
2023-08-08-15.18.48.348 [Camel (camel-1) thread #1 - KafkaConsumer[TOPIC-2]] WARN org.apache.camel.component.kafka.KafkaFetchRecords - handlePollErrorHandler - AL081210XF - Deferring processing to the exception handler based on polling exception strategy
2023-08-08-15.18.48.348 [Camel (camel-1) thread #1 - KafkaConsumer[TOPIC-2]] ERROR org.apache.camel.processor.errorhandler.DeadLetterChannel - log - AL0812120230808T15… - Failed delivery for (MessageId: 41555711D74864E-000000000002364E on ExchangeId: 41555711D74864E-000000000002364E). On delivery attempt: 0 caught: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
2023-08-08-15.18.48.349 [Camel (camel-1) thread #1 - KafkaConsumer[TOPIC-2]] WARN org.apache.camel.impl.engine.DefaultReactiveExecutor - schedule - AL0812120230808T151… - Error executing reactive work due to null. This exception is ignored.
java.lang.NullPointerException: null
at org.apache.camel.processor.errorhandler.RedeliveryErrorHandler$SimpleTask.handleException(RedeliveryErrorHandler.java:489) ~[camel-core-processor-3.16.0.jar!/:3.16.0]
at org.apache.camel.processor.errorhandler.RedeliveryErrorHandler$SimpleTask.run(RedeliveryErrorHandler.java:461) ~[camel-core-processor-3.16.0.jar!/:3.16.0]
at org.apache.camel.impl.engine.DefaultReactiveExecutor$Worker.schedule(DefaultReactiveExecutor.java:193) ~[camel-base-engine-3.16.0.jar!/:3.16.0]
at org.apache.camel.impl.engine.DefaultReactiveExecutor.scheduleMain(DefaultReactiveExecutor.java:64) ~[camel-base-engine-3.16.0.jar!/:3.16.0]
at org.apache.camel.processor.Pipeline.process(Pipeline.java:184) ~[camel-core-processor-3.16.0.jar!/:3.16.0]
at org.apache.camel.impl.engine.CamelInternalProcessor.process(CamelInternalProcessor.java:399) ~[camel-base-engine-3.16.0.jar!/:3.16.0]
at org.apache.camel.impl.engine.DefaultAsyncProcessorAwaitManager.process(DefaultAsyncProcessorAwaitManager.java:83) ~[camel-base-engine-3.16.0.jar!/:3.16.0]
at org.apache.camel.support.AsyncProcessorSupport.process(AsyncProcessorSupport.java:41) ~[camel-support-3.16.0.jar!/:3.16.0]
at org.apache.camel.support.BridgeExceptionHandlerToErrorHandler.handleException(BridgeExceptionHandlerToErrorHandler.java:79) ~[camel-support-3.16.0.jar!/:3.16.0]
at org.apache.camel.support.BridgeExceptionHandlerToErrorHandler.handleException(BridgeExceptionHandlerToErrorHandler.java:57) ~[camel-support-3.16.0.jar!/:3.16.0]
at org.apache.camel.support.BridgeExceptionHandlerToErrorHandler.handleException(BridgeExceptionHandlerToErrorHandler.java:52) ~[camel-support-3.16.0.jar!/:3.16.0]
at org.apache.camel.component.kafka.KafkaFetchRecords.handlePollErrorHandler(KafkaFetchRecords.java:399) ~[camel-kafka-3.16.0.jar!/:3.16.0]
at org.apache.camel.component.kafka.KafkaFetchRecords.handleAccordingToStrategy(KafkaFetchRecords.java:343) ~[camel-kafka-3.16.0.jar!/:3.16.0]
at org.apache.camel.component.kafka.KafkaFetchRecords.startPolling(KafkaFetchRecords.java:323) ~[camel-kafka-3.16.0.jar!/:3.16.0]
at org.apache.camel.component.kafka.KafkaFetchRecords.run(KafkaFetchRecords.java:181) ~[camel-kafka-3.16.0.jar!/:3.16.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:866) ~[?:?]
Is the errorHandler interfering with how Camel/Kafka handle reconnection? Do I need to remove the redelivery handler, as it may be confusing the flow? Any guidance would be helpful.