NaN values when monitoring a Summary metric of a Java application via Prometheus and Grafana with a short maxAgeSeconds
Hello everyone, I am using Prometheus and Grafana to monitor my Java application. I record a Summary metric of received message length. All I want is to show in Grafana the maximum received message length within each 1-second interval (there can be multiple messages in 1 second, since my application is a real-time messaging gateway). So at first I set the scrape_interval of my Prometheus server to 1 second, so that it samples the data exposed by my HTTP exporter every second.
In Grafana, to get the max value I simply display the metric series whose quantile label is 1.
The first problem I faced was that the maximum message length from previous intervals did not change until I received a bigger value, so the maximum within a specific second is not shown if it is smaller than the maximum from earlier intervals. My solution was to change the maxAgeSeconds of the Summary to 1 (the default is 10 minutes). That way the historical values are cleared every second, and I succeeded in fetching the relevant maximum for every 1-second interval.
After that I got a new problem: because the data is cleared every second, when I send messages less frequently (for example, 1 message every second), the Prometheus server sometimes "catches" a NaN value (the NaN that appears right after the clear, once maxAgeSeconds has elapsed and no new message has been observed yet).
As a result, my Grafana visualization sometimes shows gaps where the message length appears as 0, corresponding to the NaN values that were scraped.
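To make both behaviours concrete, here is a minimal sketch in plain Java that mimics what I observe. It is not the Prometheus client's implementation (as far as I understand, the real Summary rotates age buckets rather than tracking individual samples); the class name MaxAgeWindow and its methods are hypothetical and exist only for this illustration. With a long max age the old maximum sticks, and with a 1-second max age a read that lands on an empty window returns NaN.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a max-age window; NOT the Prometheus client's code. */
public class MaxAgeWindow {
    /** One observed message length with the time it was recorded. */
    private static final class Sample {
        final long timeMillis;
        final double value;

        Sample(long timeMillis, double value) {
            this.timeMillis = timeMillis;
            this.value = value;
        }
    }

    private final long maxAgeMillis;
    private final Deque<Sample> samples = new ArrayDeque<>();

    public MaxAgeWindow(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Records a message length observed at the given time. */
    public void observe(long nowMillis, double value) {
        samples.addLast(new Sample(nowMillis, value));
    }

    /** Returns the maximum of the samples younger than maxAgeMillis, or NaN if none remain. */
    public double max(long nowMillis) {
        // Drop everything that has aged out of the window.
        while (!samples.isEmpty()
                && nowMillis - samples.peekFirst().timeMillis >= maxAgeMillis) {
            samples.removeFirst();
        }
        double max = Double.NaN;
        for (Sample s : samples) {
            if (Double.isNaN(max) || s.value > max) {
                max = s.value;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        MaxAgeWindow tenMinutes = new MaxAgeWindow(600_000); // like the default max age
        MaxAgeWindow oneSecond = new MaxAgeWindow(1_000);    // like maxAgeSeconds = 1

        // A 500-byte message at t=0 ms, then a 40-byte message at t=2000 ms.
        for (MaxAgeWindow w : new MaxAgeWindow[] {tenMinutes, oneSecond}) {
            w.observe(0, 500);
            w.observe(2_000, 40);
        }

        // Read at t=2500 ms: the long window still reports the old, larger value.
        System.out.println("10 min window @2500ms: " + tenMinutes.max(2_500));
        System.out.println("1 s window    @2500ms: " + oneSecond.max(2_500));
        // Read at t=3500 ms with no new message: the short window is empty.
        System.out.println("1 s window    @3500ms: " + oneSecond.max(3_500));
    }
}
```

The last read is the case that hurts me: nothing arrived during the final second, so there is no "maximum" to report and the scrape sees NaN.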
I have searched all over the internet for a way to handle these NaN values. I want my Grafana visualization to show real data: it is important for my real-time Java application to show, every second, the actual maximum message length it received.
I want to emphasize that there are many possible messaging frequencies: multiple messages within 1 second, 1 message every second, 1 message every minute, etc.
So I need a robust solution that handles these NaN values in all of those cases.