
I collect metrics from systems with node exporter and from applications' own endpoints. Some of them are ever-increasing values such as sales volume (counters), and some fluctuate up and down such as CPU load (gauges). In total I collect metrics with about 400 different names from roughly 100 VMs.

There are several ways to keep metrics for a long time; currently I'm sending them to InfluxDB via Telegraf with remote_write. Of course, this also loads InfluxDB unnecessarily. My goal is to be able to see daily summaries when I look back 2 years later, so I don't need to keep the fine-grained 5-minute samples for that long. For example, one summary every 6 hours would be enough. For counter-type metrics I can simply take the last value, but gauge-type metrics need to be averaged. Do I have to create recording rules for 200 different metrics to do this? How do I design such a system?
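One idea I'm considering is a single generic recording rule group that matches metric names by regex instead of one rule per metric. Below is a rough sketch; the rule names (`aggregate:gauge_avg_6h`, `aggregate:counter_last_6h`), the 6h interval and the `job=~".+"` matcher are just placeholders, and I'm not sure this is the right approach:

```yaml
# rules.yml (sketch) – one generic rule group instead of one rule per metric
groups:
  - name: downsample_6h
    interval: 6h                      # evaluate once every 6 hours
    rules:
      # Average everything that does not look like a counter/histogram
      # (name not ending in _total, _count, _sum or _bucket) over the last
      # 6 hours. The original metric name is copied into the "orig_name"
      # label, because a recording rule always writes its result under the
      # single name given in "record".
      - record: aggregate:gauge_avg_6h
        expr: |
          avg_over_time(
            label_replace(
              {job=~".+", __name__!~".+_(total|count|sum|bucket)"},
              "orig_name", "$1", "__name__", "(.+)"
            )[6h:]
          )
      # For counters, keep the last raw value seen in each 6h window instead.
      - record: aggregate:counter_last_6h
        expr: |
          last_over_time(
            label_replace(
              {job=~".+", __name__=~".+_total"},
              "orig_name", "$1", "__name__", "(.+)"
            )[6h:]
          )
```

The pre-aggregated `aggregate:*` series could then be the only thing forwarded to long-term storage (e.g. via `write_relabel_configs` with a `keep` action on the remote_write section), while the raw 5-second data stays only in the local Prometheus with a short retention.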

Does the same difficulty apply if I use Cortex, Thanos or Mimir instead of InfluxDB? In the end, I think storing summaries would be enough rather than keeping the raw instantaneous data for a long time. However, simply summing the metrics and then summarizing them with the rate function is not a correct solution either.
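From what I've read, Thanos's compactor downsamples blocks to 5-minute and 1-hour resolution on its own, and the raw data can be given a shorter retention than the downsampled data, along these lines (the durations are only example values, and I'm not sure this covers the counter-vs-gauge distinction I need):

```sh
# Thanos compactor with per-resolution retention (example values)
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=180d \
  --retention.resolution-1h=2y \
  --wait
```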

OmerFaruk
  • I don't know about Telegraf and I'm not sure what exactly you need in terms of queries, but using pure PromQL you can get an average over time for multiple metrics with something like this: `label_replace(avg_over_time(label_replace({__name__!~".*_(total|bucket)"}, "old_name", "$1", "__name__", "(.*)") [5m:]), "__name__", "$1", "old_name", "(.*)")` – markalex Jul 04 '23 at 13:55
  • What I actually want to do is this: I want to see the metrics in detail during the day, so the scrape interval is 5s. But when I look back 2 years later, hourly or even daily averages are enough for me. So ideally the metrics would be kept both in verbose form for the short term and as coarse (e.g. monthly) summaries for the long term. When storing for a long time (InfluxDB, Mimir, etc.), there is no need for such detailed data. I'm looking for a solution to this. – OmerFaruk Jul 05 '23 at 11:55

0 Answers