We have a Kafka Connect cluster (Strimzi distribution) running on OpenShift (Kubernetes, for that matter) that is behaving erratically:
- The Kafka Connect REST API is randomly slow, and very slow for some endpoints, even when the cluster is not under heavy load:
  - Querying a connector
  - Deleting a connector
  - Creating a connector
  - Querying the list of connectors, however, always works perfectly
- When we query the list of connectors in the cluster, connectors and tasks appear as UNASSIGNED, although the logs show them running:
  - /connectors?expand=info&expand=status
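To put numbers on the slow endpoints listed above, something like the following could time the fast list call against the per-connector calls. This is only a sketch: the base URL (for example `http://localhost:8083`), the 120 second timeout, and the helper names `time_call`, `get`, and `compare_endpoints` are assumptions for illustration, not part of our setup.

```python
import time
from urllib.request import urlopen


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def get(url, timeout=120):
    """Plain HTTP GET; the generous timeout is an assumption, so slow calls still finish."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def compare_endpoints(base_url, connector_name, repeats=3):
    """Time the list endpoint against two per-connector endpoints."""
    endpoints = {
        "list": f"{base_url}/connectors",
        "single": f"{base_url}/connectors/{connector_name}",
        "status": f"{base_url}/connectors/{connector_name}/status",
    }
    return {
        label: time_call(lambda u=url: get(u), repeats)
        for label, url in endpoints.items()
    }
```

Calling `compare_endpoints("http://localhost:8083", "connector-0001")` from a pod that can reach the Connect service would return the raw durations per endpoint, which makes it easier to tell whether the slowness is constant or intermittent.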
We have checked the communication between workers.
There are 5 workers in the cluster, each one consuming about 12 GB of RAM and 1.5 CPU cores.
There are 1000 connectors running in the cluster, all of them of type CloudantSourceConnector
(Cloudant is a CouchDB implementation by IBM)
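For scale, here is a back-of-the-envelope calculation from the figures above. It assumes the connectors are spread evenly across the workers and treats 1 GB as 1000 MB, so the result is only a rough average; `per_connector_averages` is a hypothetical helper name.

```python
def per_connector_averages(workers, ram_per_worker_gb, cores_per_worker, connectors):
    """Average resource share per connector, assuming an even spread across workers."""
    connectors_per_worker = connectors / workers
    return {
        "connectors_per_worker": connectors_per_worker,
        # 1 GB is treated as 1000 MB here (an assumption, for a rough figure only)
        "ram_mb_per_connector": ram_per_worker_gb * 1000 / connectors_per_worker,
        "cores_per_connector": cores_per_worker / connectors_per_worker,
    }


# Figures reported for this cluster: 5 workers, ~12 GB and ~1.5 cores each, 1000 connectors
averages = per_connector_averages(
    workers=5, ram_per_worker_gb=12, cores_per_worker=1.5, connectors=1000
)
for key, value in averages.items():
    print(f"{key}: {value}")
```

That works out to about 200 connectors per worker and roughly 60 MB of RAM per connector on average.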
Is that amount of resource consumption normal?
What could be causing the REST API timeouts?
Thanks a lot.
Cluster configuration

    version: 2.6.0
    replicas: 5
    config:
      group.id: prod-cluster-group
      config.storage.replication.factor: 3
      config.storage.topic: prod-cluster-configs
      key.converter.schemas.enable: false
      key.converter: org.apache.kafka.connect.json.JsonConverter
      max.poll.interval.ms: 600000
      max.poll.records: 10
      offset.storage.replication.factor: 3
      offset.storage.topic: prod-cluster-offsets
      status.storage.replication.factor: 3
      status.storage.topic: prod-cluster-status
      value.converter.schemas.enable: false
      value.converter: org.apache.kafka.connect.json.JsonConverter

Connector configuration

Every connector uses the configuration below. All of them write to the same topic, and each one reads from a different Cloudant database.

    "config": {
        "connector.class": "com.ibm.cloudant.kafka.connect.CloudantSourceConnector",
        "cloudant.omit.design.docs": "true",
        "cloudant.db.username": "__REDACTED__",
        "topics": "prod-topic",
        "cloudant.db.password": "__REDACTED__",
        "connection.timeout.ms": "5000",
        "cloudant.value.schema.struct": "true",
        "name": "connector-0001", // (0000...1000)
        "read.timeout.ms": "5000",
        "cloudant.db.url": "__REDACTED__"
    }

UNASSIGNED connector running in the cluster

    "status": {
        "name": "connector-0001",
        "connector": {
            "state": "UNASSIGNED"
        },
        "tasks": [
            {
                "id": 0,
                "state": "UNASSIGNED"
            }
        ],
        "type": "source"
    }

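To see how widespread the UNASSIGNED state is, the response of `/connectors?expand=info&expand=status` can be tallied by state. This is a minimal sketch. It assumes the response is a JSON object keyed by connector name, with each entry holding a `status` object shaped like the one above. The helper names `count_states` and `fetch_expanded` and the base URL are assumptions for illustration.

```python
import json
from collections import Counter
from urllib.request import urlopen


def count_states(expanded):
    """Tally connector and task states from an expanded /connectors response."""
    connector_states = Counter()
    task_states = Counter()
    for entry in expanded.values():
        status = entry.get("status", {})
        # "MISSING" marks entries where the state field is absent altogether
        connector_states[status.get("connector", {}).get("state", "MISSING")] += 1
        for task in status.get("tasks", []):
            task_states[task.get("state", "MISSING")] += 1
    return connector_states, task_states


def fetch_expanded(base_url, timeout=120):
    """Fetch the expanded connector listing; base_url is an assumed address."""
    url = f"{base_url}/connectors?expand=info&expand=status"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `count_states(fetch_expanded("http://localhost:8083"))` returns two counters, one for connectors and one for tasks. Running it a few times in a row would show whether the UNASSIGNED count is stable or changes between calls.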
Cluster log showing the task getting records from Cloudant

    (com.ibm.cloudant.kafka.connect.CloudantSourceTask) [task-thread-connector-0001-0]
    2022-09-29 08:38:56,624 INFO Return 4 records with last offset ...