
I am deploying a Kafka Connect cluster consisting of 4 workers using Docker Swarm. In some cases at the initial deployment (when no Kafka Connect cluster has ever existed in the environment), and only then so far, the workers cannot communicate with each other and constant rebalancing takes place.

The logs that are produced repeatedly are the following:

[2023-07-19T10:53:34.399Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Rebalance started
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] (Re-)joining group
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Successfully joined group with generation Generation{generationId=11, memberId='connect-1-a0bd7da2-7235-4fcc-a0f9-83921b3e5a0c', protocol='sessioned'}
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Successfully synced group in generation Generation{generationId=11, memberId='connect-1-a0bd7da2-7235-4fcc-a0f9-83921b3e5a0c', protocol='sessioned'}
[2023-07-19T10:53:34.400Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Joined group at generation 11 with protocol version 2 and got assignment: Assignment{error=0, leader='connect-1-1c92ee2f-e894-475f-b330-ec2215e4611b', leaderUrl='http://10.0.50.95:8083/', offset=819, connectorIds=[], taskIds=[], revokedConnectorIds=[], revokedTaskIds=[], delay=0} with rebalance delay: 0
[2023-07-19T10:53:34.401Z] WARN [Worker clientId=connect-1, groupId=connect-cluster] Catching up to assignment's config offset.
[2023-07-19T10:53:34.401Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Current config state offset -1 is behind group assignment 819, reading to end of config log
[2023-07-19T10:53:34.402Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Finished reading to end of log and updated config snapshot, new config log offset: -1
[2023-07-19T10:53:34.402Z] INFO [Worker clientId=connect-1, groupId=connect-cluster] Current config state offset -1 does not match group assignment 819. Forcing rebalance.
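The key symptom in these logs: the worker reads the config topic to its end yet still lands on offset -1, while the group assignment expects offset 819. In other words, the worker sees an empty (or unreadable) config topic even though the group leader has already written to it. One thing worth verifying is the topic itself; a sketch using the Kafka CLI tools shipped in the Confluent image (the broker address and SSL client properties file are placeholders for your environment):

```shell
# Placeholders: adjust the broker address and the SSL client
# properties file to match your environment.
kafka-topics --bootstrap-server broker:9092 \
  --command-config /etc/kafka/client-ssl.properties \
  --describe --topic _connect-configs
# Expected: PartitionCount: 1 and cleanup.policy=compact
```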

I have seen the question here and several others. All of them mention that something is wrong with the connect configs topic (either it does not have the proper configuration or it consists of more than 1 partition). However, in my case this is not the issue: my connect configs topic has only 1 partition, and if I redeploy the cluster with a new group id (without deleting the Kafka Connect topics beforehand), it works. I know this shouldn't be a major issue, since it only happens at the initial deployment. However, I am trying to find the root cause, since I am afraid this might also happen on a later restart of the cluster. In that case, I could not create a new cluster with a new group id from scratch, since that could lose the offsets of my deployed connectors and jeopardise my data integrity.
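Since a fresh group id works while the original one loops, it may also help to confirm that the end offset of the config topic, as read over the same SSL listener the workers use, actually matches the assignment offset from the logs (819). A hedged sketch, assuming the kafka-get-offsets tool available in recent Kafka/Confluent images and placeholder connection details:

```shell
# Placeholders for broker address and SSL client properties.
# --time -1 asks for the latest (end) offset of each partition;
# for a healthy single-partition config topic this should report
# one line whose offset matches the group assignment.
kafka-get-offsets --bootstrap-server broker:9092 \
  --command-config /etc/kafka/client-ssl.properties \
  --topic _connect-configs --time -1
```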

Update: This is the configuration that we use. It is the part of our docker-compose.yml for one worker instance; the same applies to the rest of the workers, which are deployed as separate services.

kafka-connect-worker-1:
  networks:
    - monitoring
  image: <custom_kafka_connect_image>:<version>
  entrypoint: /etc/confluent/docker/entrypoint.sh
  hostname: "kafka-connect-worker-1"
  environment:
    CONNECT_BOOTSTRAP_SERVERS: <kafka_brokers_list>
    CONNECT_SECURITY_PROTOCOL: SSL
    CONNECT_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM:
    CONNECT_SSL_TRUSTSTORE_LOCATION: /usr/lib/jvm/jre/lib/security/cacerts
    CONNECT_PRODUCER_SECURITY_PROTOCOL: SSL
    CONNECT_PRODUCER_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM: 
    CONNECT_PRODUCER_SSL_TRUSTSTORE_LOCATION: /usr/lib/jvm/jre/lib/security/cacerts
    CONNECT_CONSUMER_SECURITY_PROTOCOL: SSL
    CONNECT_CONSUMER_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM:
    CONNECT_CONSUMER_SSL_TRUSTSTORE_LOCATION: /usr/lib/jvm/jre/lib/security/cacerts
    CONNECT_REST_PORT: 8083
    CONNECT_GROUP_ID: connect-cluster
    CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter
    CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
    CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: false
    CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: false
    CONNECT_OFFSET_STORAGE_TOPIC: _connect-offsets
    CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 2
    CONNECT_CONFIG_STORAGE_TOPIC: _connect-configs
    CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 2
    CONNECT_STATUS_STORAGE_TOPIC: _connect-status
    CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 2
    CONNECT_CONSUMER_AUTO_OFFSET_RESET: latest
    CONNECT_PLUGIN_PATH: /usr/local/share/kafka/plugins
    CONNECT_CONSUMER_MAX_POLL_RECORDS: 1000
    CONNECT_REST_ADVERTISED_HOST_NAME: "kafka-connect-worker-1"
  deploy:
    replicas: 1
    placement:
      constraints:
        - node.role==worker
    update_config:
      parallelism: 1
      order: start-first
    resources:
      limits:
        cpus: '1.5'
        memory: 2G
      reservations:
        cpus: '1'
        memory: 1G
  volumes:
    - ./connector-plugins:/usr/local/share/kafka/plugins

Note: <custom_kafka_connect_image> is our custom Docker image with confluentinc/cp-kafka-connect:7.4.0 as the base. The differences from the base image are that it enables JMX Prometheus metrics exposure, adds a custom logging configuration, changes some JVM arguments, and exports some secrets as env vars in the entrypoint.sh script.

  • Can you show your Docker configs? By default, the Connect workers do not advertise themselves to one another, therefore seems like you have a leadership conflict where they override one another – OneCricketeer Jul 19 '23 at 21:30
  • @OneCricketeer I updated the post by adding the configuration. If you need anything extra please let me know. – Alexandros Mavrommatis Jul 20 '23 at 07:47
  • I don't understand the issue if you only have `replicas: 1`. You said you have 4 workers? You need to use one file, not 4 separate ones, especially if they all say `worker-1` – OneCricketeer Jul 20 '23 at 22:04
  • There's also no reason to run one connect image per node/worker. I have an example that works on one machine, and should work across several - https://github.com/OneCricketeer/apache-kafka-connect-docker/blob/master/docker-compose.cluster.yml#L74-L99 But you need an overlay network https://docs.docker.com/engine/swarm/networking/#create-an-overlay-network – OneCricketeer Jul 20 '23 at 22:07
  • Regarding all your modifications: if you use Kubernetes instead of Swarm, then Strimzi already offers JMX exporter, log customization, and env-var injection. – OneCricketeer Jul 20 '23 at 22:59
  • @OneCricketeer Unfortunately, currently we have only swarm as an infrastructure and we cannot move to kubernetes at the moment. I have 4 workers as separate services and not replicas of the same one. The connect image is mutual for all worker services (not one separate per worker). I have one configuration for each separate service kafka-connect-worker-1 , kafka-connect-worker-2 etc – Alexandros Mavrommatis Jul 21 '23 at 14:32
  • And can the services communicate with each other? E.g. if you docker-exec into `worker-1`, can you `curl http://worker-2:8083`? In other words, have you checked that the `CONNECT_REST_ADVERTISED_HOST_NAME` is able to be used between each service? – OneCricketeer Jul 21 '23 at 16:31
  • @OneCricketeer sure they communicate with each other. CONNECT_REST_ADVERTISED_HOST_NAME for each worker shares the corresponding host name of the worker. As I mention in the description, once I create a new cluster with a new group id, the issue is resolved and the cluster is up and running. – Alexandros Mavrommatis Jul 24 '23 at 07:41
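For reference, the inter-worker connectivity check suggested in the comments can be scripted from inside any worker container. The service names below are taken from the compose file above; every advertised host name must resolve and accept connections on port 8083 from every other worker, otherwise followers cannot forward requests to the leader.

```shell
# Probe each worker's REST endpoint from inside a worker container.
# curl -sf fails silently on connection errors, so each worker is
# reported as reachable or unreachable.
for w in kafka-connect-worker-1 kafka-connect-worker-2 \
         kafka-connect-worker-3 kafka-connect-worker-4; do
  curl -sf "http://$w:8083/" > /dev/null \
    && echo "$w reachable" || echo "$w UNREACHABLE"
done
```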

0 Answers