We are writing from around 1000 EC2 machines to AWS MSK using the fluentd Kafka plugin (fluent-plugin-kafka). They write continuously throughout the day.
Both the aws.kafka.connection_creation_rate and aws.kafka.connection_close_rate metric graphs are quite high (around 500 connections are created and closed every minute). Ideally, one would expect a connection to be opened once and never closed, since the servers are writing to Kafka continuously.
I tried to reproduce this on another, similar Kafka cluster using the "kafka-producer-perf-test.sh" script that ships with Kafka (basically I ran the script in a for-loop 1000 times to simulate 1000 servers). I made sure that "message_in_per_sec" and "bytes_in_per_sec" were, if not the same, at least higher than on my older (problematic) cluster. But the "connection_close_rate" there is only around 25, nowhere near as high as on the problematic cluster, and the CPU is also half of what is seen on the production cluster.
I am unable to understand why we see such a high connection_close_rate on the production cluster; there are no fluentd logs saying that new connections are being opened, nor any other warning statements.
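One thing I can still try is forwarding the underlying ruby-kafka client logs into the fluentd log, to rule out reconnects that simply are not logged at the default level. A minimal sketch of the relevant store settings, assuming the installed fluent-plugin-kafka version supports the get_kafka_client_log parameter:

<store>
  @type kafka2
  # assumption: get_kafka_client_log is available in this plugin version;
  # it routes ruby-kafka's own log lines (including connection handling)
  # into the fluentd log
  get_kafka_client_log true
  # raise the plugin's log level so those lines actually show up
  @log_level debug
</store>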
I am running Kafka 3.3.1 on AWS MSK with 3 m4.xlarge broker nodes, replication factor 3, multi-AZ. Here is my plugin config:
<label @default_kafka>
  <match *.**>
    @type copy
    <store>
      @type kafka2
      # list of seed brokers
      brokers {{ fluentd_env_vars[env]["xyz"] }}
      use_event_time true
      ssl_ca_cert "/path/CA_cert.pem"
      ssl_client_cert "/path/client_cert.pem"
      ssl_client_cert_key "/path/client_key.pem"
      # buffer settings
      <buffer topic>
        @type file
        path /var/log/td-agent/buffer/kafka
        flush_interval 1s
      </buffer>
      # data type settings
      <format>
        @type json
      </format>
      # topic settings
      topic_key topic
      default_topic log-messages
      # producer settings
      required_acks -1
      compression_codec gzip
    </store>
  </match>
</label>
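For completeness, with flush_interval 1s and chunking by topic, each node flushes frequently, and depending on the fluent-plugin-kafka version each flush may build its own producer (and therefore its own broker connections). A sketch of the knobs I plan to experiment with; flush_interval, chunk_limit_size and flush_thread_count are standard fluentd buffer parameters, while share_producer is an assumption about what the installed plugin version supports:

<store>
  @type kafka2
  # assumption: share_producer exists in this plugin version and reuses a
  # single producer (and its connections) across flushes instead of
  # creating a new one per write
  share_producer true
  <buffer topic>
    @type file
    path /var/log/td-agent/buffer/kafka
    # flush less often so producers/connections are created less frequently
    flush_interval 10s
    # cap chunk size so the slower flush interval does not build huge chunks
    chunk_limit_size 8m
    flush_thread_count 2
  </buffer>
</store>

If connection_creation_rate drops noticeably with these changes, that would point at the flush/producer lifecycle on the fluentd side rather than at the brokers.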