2

We need to monitor disk space usage for Kafka brokers running in an AWS MSK cluster.

Kafka emits several metrics that can be used to monitor various aspects of a cluster, but I was unable to find one that specifically tracks disk usage for each broker.

Disk usage depends on the message and log retention policy and on the rate at which new events arrive on the various topics. Given that, how can we predict whether our brokers will run out of disk in the next day (or whatever duration we choose as a safe threshold)?

If we could monitor the average event payload size and the number of events per minute (or hour), that would help with this calculation. I looked through the Apache Kafka documentation for available metrics, but could not find these either.
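The calculation described above can be sketched roughly as follows. This is only a back-of-the-envelope estimate: the function names and the `replication_factor` and `expired_bytes_per_hour` parameters are hypothetical, and it ignores compression, index overhead, and uneven partition placement across brokers.

```python
def bytes_per_hour(avg_event_bytes: float, events_per_minute: float,
                   replication_factor: int = 1) -> float:
    """Approximate bytes written to disk per hour for the given event rate."""
    return avg_event_bytes * events_per_minute * 60 * replication_factor


def hours_until_full(free_bytes: float, ingest_bytes_per_hour: float,
                     expired_bytes_per_hour: float = 0.0) -> float:
    """Hours until free space is exhausted at the current net growth rate.

    expired_bytes_per_hour is an assumed estimate of how much data the
    retention policy deletes per hour. Returns infinity when the disk is
    not growing.
    """
    net_growth = ingest_bytes_per_hour - expired_bytes_per_hour
    if net_growth <= 0:
        return float("inf")
    return free_bytes / net_growth


if __name__ == "__main__":
    rate = bytes_per_hour(avg_event_bytes=1024, events_per_minute=1000)
    print(hours_until_full(free_bytes=50 * 1024**3, ingest_bytes_per_hour=rate))
```

Comparing the result against the chosen safety window (for example 24 hours) gives a simple alert condition.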

avg(rate(kafka_server_BrokerTopicMetrics_FifteenMinuteRate{ name="BytesInPerSec"}[1h]))/avg(rate(kafka_server_BrokerTopicMetrics_FifteenMinuteRate{ name="BytesOutPerSec"}[1h]))

I tried the PromQL query above. If anyone can suggest a healthy range for the BytesIn/BytesOut ratio, it could be used with more confidence.

All pointers are highly appreciated.

Himanshu Singh

3 Answers

1

There are proper, first-class CloudWatch metrics for disk usage in MSK now. Typically you'll want to use KafkaDataLogsDiskUsed and filter by cluster name and broker ID. See https://docs.aws.amazon.com/msk/latest/developerguide/metrics-details.html for more details.

If you're using Datadog, this metric is exposed as aws.kafka.kafka_data_logs_disk_used.
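As a minimal sketch of reading this metric programmatically, the snippet below builds the arguments for the CloudWatch `get_metric_statistics` call and fetches the latest value with boto3. The `AWS/Kafka` namespace and the `Cluster Name` / `Broker ID` dimension names are my understanding of the MSK documentation linked above and should be verified there; the helper names and the `demo-cluster` value are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def disk_used_query(cluster_name, broker_id, end=None, hours=1):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = end or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Kafka",  # assumed MSK namespace
        "MetricName": "KafkaDataLogsDiskUsed",
        "Dimensions": [  # assumed dimension names
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": str(broker_id)},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,
        "Statistics": ["Maximum"],
    }


def latest_disk_used_percent(cluster_name, broker_id):
    """Return the most recent datapoint, or None if there is no data."""
    import boto3  # third-party; needs AWS credentials and a region configured

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **disk_used_query(cluster_name, broker_id)
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Maximum"] if points else None


if __name__ == "__main__":
    print(disk_used_query("demo-cluster", 1)["Dimensions"])
```

The same query can back a CloudWatch alarm per broker instead of being polled from a script.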

Dharman
Alex Glover
0

The node filesystem metrics that are already available can be used directly; Kafka itself does not expose a specific metric for this purpose. I reused the following expression, which we already use for our EKS cluster:

node_filesystem_free_bytes / node_filesystem_size_bytes < 0.2

We use similar metrics to monitor node filesystems in our EKS cluster. The expression serves the same purpose here and gives an idea of the available disk space on any Kafka broker in the MSK cluster (just add label filters inside each metric selector).
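For anyone evaluating the same condition outside Prometheus, here is a small sketch of the threshold check in Python. The function name, the input shape, and the broker labels are hypothetical; the 0.2 default mirrors the expression above.

```python
def brokers_low_on_disk(filesystems, min_free_ratio=0.2):
    """Return brokers whose free/size ratio is below min_free_ratio.

    filesystems maps a broker label to a (free_bytes, size_bytes) pair,
    for example values read from node_filesystem_free_bytes and
    node_filesystem_size_bytes.
    """
    low = []
    for broker, (free_bytes, size_bytes) in filesystems.items():
        # Skip entries with no reported size to avoid dividing by zero.
        if size_bytes > 0 and free_bytes / size_bytes < min_free_ratio:
            low.append(broker)
    return sorted(low)


if __name__ == "__main__":
    sample = {"broker-1": (10, 100), "broker-2": (50, 100)}
    print(brokers_low_on_disk(sample))
```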

Himanshu Singh
0

MSK exposes metrics in two ways: through Amazon CloudWatch and through open monitoring with Prometheus.

Both allow for the metrics you are looking for.

floating_hammer
  • 1
    Prometheus-based monitoring for the node is done using the node_exporter. You will find more information about this on https://prometheus.io/docs/guides/node-exporter/#node-exporter-metrics – floating_hammer May 31 '21 at 14:34