
I have an API that processes collections. The execution time of this API is related to the collection size (the larger the collection, the longer it takes).

I am researching how I can do this with Prometheus, but I am unsure whether I am doing things correctly (documentation is a bit lacking in this area).

The first thing I did was define a Summary metric to measure the execution time of the API. I am using the canonical rate(sum)/rate(count) as explained here.
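
For reference, the Summary is defined roughly like this (the metric name below is only a placeholder for this question):

import io.prometheus.client.Summary;

Summary summary = Summary.build()
        .name("bulk_request_duration_seconds")   // placeholder name
        .help("summary of bulk request durations, queried as rate(sum)/rate(count)")
        .labelNames("method", "entity")
        .register();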

Now, since I know that the latency may be affected by the size of the input, I also want to overlay the request size on the average execution time. Since I don't want to track each possible size separately, I figured I'd use a histogram. Like so:

Histogram histogram = Histogram.build().buckets(10, 30, 50)
        .name("BULK_REQUEST_SIZE")
        .help("histogram of bulk sizes to correlate with duration")
        .labelNames("method", "entity")
        .register();

Note: the term 'size' does not relate to the size in bytes but to the length of the collection that needs to be processed. 2 items, 5 items, 50 items...

and in the execution I do (simplified):

@PUT
void process(Collection<Entity> entitiesToProcess, String entityName) {
    // time the whole request with the Summary
    Summary.Timer t = summary.labels("PUT_BULK", entityName).startTimer();

    // process...

    t.observeDuration();
    // record how many items this bulk request contained
    histogram.labels("PUT_BULK", entityName).observe(entitiesToProcess.size());
}

Question:

  • Later when I am looking at the BULK_REQUEST_SIZE_bucket in Grafana, I see that all buckets have the same value, so clearly I am doing something wrong.
  • Is there a more canonical way to do it?
Vitaliy

1 Answer


Your code is correct (though bulk_request_size_bytes would be a better metric name).

The problem is likely that you have suboptimal buckets, as 10, 30 and 50 bytes are pretty small for most request sizes. I'd try larger bucket sizes that cover more typical values.
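
For example, something along these lines (the exact boundaries are only a guess; choose them based on the sizes you actually observe):

Histogram histogram = Histogram.build()
        // wider, roughly exponential boundaries instead of just 10, 30, 50
        .buckets(1, 2, 5, 10, 25, 50, 100, 250, 500, 1000)
        .name("BULK_REQUEST_SIZE")
        .help("histogram of bulk sizes to correlate with duration")
        .labelNames("method", "entity")
        .register();

The Java client also provides linearBuckets() and exponentialBuckets() helpers if you'd rather generate the boundaries than list them by hand.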

brian-brazil
  • Thanks Brian! But the term 'size' does not relate to the size in bytes but to the length of the collection that needs to be processed. 2 items, 5 items, 50 items... edited the question. – Vitaliy Sep 18 '17 at 11:38
  • The issue is still with your choice of buckets, try larger values. – brian-brazil Sep 18 '17 at 12:18
  • I will. What about the approach itself? Would it be better to have a summary where the label is the range of a bucket? For example: Summary.build().labels("size range", "method") and then simply measure durations of these dimensions? – Vitaliy Sep 18 '17 at 12:29
  • I think the approach is fine. – brian-brazil Sep 18 '17 at 14:51
  • I realized the problem - I misunderstood the histogram data. I *expected* the buckets to represent disjoint ranges (0-10), (10-30), (30-50), (50,+inf), whereas they represent <= ranges. So when I observe a size of, say, 6, it bumps the counter in all buckets, because 6<10 and 6<30 and 6<50 and 6<+inf. Is there a way to get the data on the disjoint ranges anyway? – Vitaliy Sep 19 '17 at 06:33
  • I think I got it - I can subtract the buckets like: BULK_REQUEST_SIZE_bucket{le="20"} - ignoring(le) BULK_REQUEST_SIZE_bucket{le="10"}. This should give me the increase in the range (10,20), if I understand correctly. – Vitaliy Sep 19 '17 at 07:17
  • Yes, that's the way it works. It's intended that you use histograms mainly with `histogram_quantile` which deals with this. – brian-brazil Sep 19 '17 at 10:37
  • @brian-brazil could you give an example of how Vitaliy's solution is converted to a query with `histogram_quantile`? – Pithikos Jun 05 '20 at 18:42
  • @Pithikos or someone else, did you ever find the query, please? – user1853984 May 18 '21 at 14:15