We are writing from around 1,000 EC2 machines to AWS MSK using the fluentd Kafka plugin. They write continuously throughout the day.

The graphs for the aws.kafka.connection_creation_rate and aws.kafka.connection_close_rate metrics are quite high: around 500 connections are created and closed every minute. Ideally, one would expect a connection to be made once and never closed, since the servers are writing to Kafka continuously.
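For reference, this is roughly how the per-broker numbers can be pulled from CloudWatch; the cluster name and broker ID below are placeholders, and I'm assuming the `AWS/Kafka` namespace with the `ConnectionCreationRate` metric at the PER_BROKER monitoring level:

```shell
# Sketch: fetch the per-broker connection-creation rate from CloudWatch.
# Cluster name and broker ID are placeholders; metric and dimension names
# are assumed from the AWS/Kafka (MSK) namespace.
fetch_connection_rate() {
  local cluster="$1" broker="$2"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Kafka \
    --metric-name ConnectionCreationRate \
    --dimensions Name="Cluster Name",Value="$cluster" Name="Broker ID",Value="$broker" \
    --start-time "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ')" \
    --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
    --period 60 \
    --statistics Average
}
```

Usage: `fetch_connection_rate my-msk-cluster 1`, once per broker ID.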

I tried to reproduce this on another, similar Kafka cluster using the "kafka-producer-perf-test.sh" script that ships with Kafka (basically, I ran the script in a for-loop 1000 times to simulate 1000 servers). I made sure that "message_in_per_sec" and "bytes_in_per_sec" were, if not the same, at least higher than on my older (problematic) cluster. But "connection_close_rate" is only about 25, nowhere near the problematic cluster, and CPU is also half of what we see on the production cluster.
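The reproduction loop looked roughly like this; the broker list, topic, and load parameters below are placeholders, not the exact values I used:

```shell
# Sketch: fan out N copies of kafka-producer-perf-test.sh in the background
# to simulate N producing hosts. Broker, topic, and load values are placeholders.
BROKERS="b-1.example.kafka.us-east-1.amazonaws.com:9094"
TOPIC="perf-test"
PERF_CMD="${PERF_CMD:-kafka-producer-perf-test.sh}"

launch_clients() {
  local n="$1"
  for i in $(seq 1 "$n"); do
    "$PERF_CMD" \
      --topic "$TOPIC" \
      --num-records 100000 \
      --record-size 512 \
      --throughput 100 \
      --producer-props bootstrap.servers="$BROKERS" &
  done
  wait  # block until every simulated producer finishes
}
```

Usage: `launch_clients 1000`. Each iteration is an independent producer process with its own connections, which is why this should approximate 1000 separate servers.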

I am unable to understand why we see such a high connection_close_rate on the production cluster. There are no fluentd logs saying that new connections are opened, nor any other warn statements.

I am running Kafka 3.3.1 on AWS MSK with 3 m4.xlarge nodes, replication factor 3, multi-AZ. Here is my plugin config:

<label @default_kafka>
  <match *.**>
    @type copy
    <store>
      @type kafka2

      # list of seed brokers
      brokers {{ fluentd_env_vars[env]["xyz"] }}
      use_event_time true

      ssl_ca_cert "/path/CA_cert.pem"
      ssl_client_cert "/path/client_cert.pem"
      ssl_client_cert_key "/path/client_key.pem"

      # buffer settings
      <buffer topic>
        @type file
        path /var/log/td-agent/buffer/kafka
        flush_interval 1s
      </buffer>

      # data type settings
      <format>
        @type json
      </format>

      # topic settings
      topic_key topic
      default_topic log-messages

      # producer settings
      required_acks -1
      compression_codec gzip
    </store>
  </match>
</label>
Gaurav Shah