
I do not understand how the charts that show percentiles are calculated inside the Google Cloud Platform Monitoring UI.

Here is how I am creating the standard chart:

Example log events

(screenshot)

Creating a log-based metric for request durations

Here I have configured a histogram of 20 buckets, starting from 0, with each bucket 100 ms wide:

  • 0 - 100,
  • 100 - 200,
  • ... until 2 seconds

(screenshot)

Creating a chart to show percentiles over time

(screenshot)

I do not understand how these histogram buckets work with "aggregator", "aligner" and "alignment period".

The UI forces using an "aligner" and "alignment period".

Questions

  • A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?

  • B. Do the histogram buckets configured for the log-based metric affect these sums?

zino

2 Answers


I've been looking into the same question for a couple of days and found the Understanding distribution percentiles section in the official docs quite helpful.

The percentile value is a computed value. The computation takes into account the number of buckets, the width of the buckets, and the total count of samples. Because the actual measured values aren't known, the percentile computation can't rely on this data.

They have a good example with buckets [0, 1), [1, 2), [2, 4), [4, 8), [8, 16), [16, 32), [32, 64), [64, 128), [128, 256) and only one measurement in the last bucket [128, 256) (none in any other bucket).

  1. You use the bucket counts to determine that the [128, 256) bucket contains the 50th percentile.
  2. You assume that the measured values within the selected bucket are uniformly distributed and therefore the best estimate of the 50th percentile is the bucket midpoint.
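To make that concrete, here is a minimal Python sketch of the midpoint/interpolation logic described above (my own illustration of the idea, not Cloud Monitoring's actual implementation), reproducing the docs example:

```python
def estimate_percentile(bounds, counts, p):
    """Estimate the p-th percentile from histogram data alone.

    bounds: finite bucket boundaries, e.g. [0, 1, 2, ..., 256]
    counts: number of samples in each bucket [bounds[i], bounds[i+1])
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    # Rank of the sample that sits at the p-th percentile.
    target = total * p / 100.0
    cumulative = 0
    for i, count in enumerate(counts):
        cumulative += count
        if count and cumulative >= target:
            # Samples inside a bucket are assumed uniformly distributed,
            # so interpolate within the bucket; with a single sample this
            # degenerates to the bucket midpoint.
            lower, upper = bounds[i], bounds[i + 1]
            fraction = (target - (cumulative - count)) / count
            return lower + (upper - lower) * fraction
    return bounds[-1]

# Docs example: one measurement in [128, 256), none elsewhere.
bounds = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256]
counts = [0, 0, 0, 0, 0, 0, 0, 0, 1]
print(estimate_percentile(bounds, counts, 50))  # 192.0, the bucket midpoint
```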

A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?

I find the GCP Console UI for Metrics explorer a little misleading/confusing in its wording as well (but maybe that's just me being unfamiliar with their terms). The key concepts here are Alignment and Reduction, I think.

The aligner produces a single value placed at the end of each alignment period.

A reducer is a function that is applied to the values across a set of time series to produce a single value.

The difference between the two is horizontal vs. vertical aggregation. In the UI, the Aggregator (both primary and secondary) is a reducer.

Back to the question: a sum aligner applied before a percentile reducer seems more useful in use cases other than yours. In short, a mean or max aligner may be more useful for your "duration_ms" metric, but they're not available in the dropdown in the UI, and to be honest I haven't figured out how to implement them in the MQL editor either; I'm just referencing the docs here. There are other aligners that may also be useful, but I'm going to leave them out for now.
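To make the horizontal vs. vertical distinction concrete, here is a toy Python sketch (my own model of the concepts, not GCP code; the series names and values are made up): alignment first collapses each series' raw points into one value per alignment period, and reduction then collapses the aligned series into a single series.

```python
from collections import defaultdict

# Toy data points: (series_name, timestamp_seconds, duration_ms).
points = [
    ("instance-a", 5, 120), ("instance-a", 40, 300), ("instance-a", 70, 90),
    ("instance-b", 10, 500), ("instance-b", 65, 250),
]

ALIGNMENT_PERIOD = 60  # seconds

def align(points, aligner=sum):
    """Horizontal aggregation: one value per series per alignment period."""
    buckets = defaultdict(list)
    for series, ts, value in points:
        period_end = (ts // ALIGNMENT_PERIOD + 1) * ALIGNMENT_PERIOD
        buckets[(series, period_end)].append(value)
    return {key: aligner(values) for key, values in buckets.items()}

def reduce_across(aligned, reducer=max):
    """Vertical aggregation: one value per period across all series."""
    per_period = defaultdict(list)
    for (series, period_end), value in aligned.items():
        per_period[period_end].append(value)
    return {end: reducer(vals) for end, vals in per_period.items()}

aligned = align(points, aligner=sum)        # "sum" aligner
print(aligned)
# {('instance-a', 60): 420, ('instance-a', 120): 90,
#  ('instance-b', 60): 500, ('instance-b', 120): 250}
print(reduce_across(aligned, reducer=max))  # "max" reducer -> {60: 500, 120: 250}
```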


B. Do the histogram buckets configured for the log-based metric affect these sums?

Same as @Anthony, I'm not quite sure what the question is implying either. I'm going to assume you're asking whether you can align/reduce log-based metrics using these aligners/reducers, and the answer is yes. However, you'll need to know which metric type you're using (counter vs. distribution) and aggregate it in the corresponding way.
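For what it's worth, in the underlying Monitoring API (this is my reading of the docs): a DELTA counter is typically aligned with ALIGN_RATE or ALIGN_DELTA, while a distribution metric can be aligned with ALIGN_DELTA (which merges the per-period histograms) and then reduced with REDUCE_PERCENTILE_99, or aligned directly with ALIGN_PERCENTILE_99 to turn each histogram into a single percentile value.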

yut6CUZg

Before we look at your questions, we must understand histograms.

In the documentation you provided in the post, there is a section that explains histogram buckets. Looking at this section and at your setup, we can see that you are using the Linear type to specify the boundaries between histogram buckets for distribution metrics.

Furthermore, the Linear type has three values used in the calculation:

  1. offset value (start value [a])
  2. width value (bucket width [b])
  3. number of buckets value [N]

Every bucket has the same width, and the boundaries are calculated using the following formula: offset + width × i (where i = 0, 1, 2, ..., N).

For example, if the start value is 5, the number of buckets is 4, and the bucket width is 15, then the bucket ranges are as follows: [-INF, 5), [5, 20), [20, 35), [35, 50), [50, 65), [65, +INF]
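As a quick sanity check, here is a small Python sketch of that formula (my own illustration); it reproduces the docs example above as well as the 20 × 100 ms setup from the question:

```python
def linear_buckets(offset, width, num_finite_buckets):
    """Bucket ranges for a linear distribution, including the implicit
    underflow and overflow buckets."""
    bounds = [offset + width * i for i in range(num_finite_buckets + 1)]
    ranges = [(float("-inf"), bounds[0])]
    ranges += [(bounds[i], bounds[i + 1]) for i in range(num_finite_buckets)]
    ranges.append((bounds[-1], float("inf")))
    return ranges

print(linear_buckets(5, 15, 4))
# [(-inf, 5), (5, 20), (20, 35), (35, 50), (50, 65), (65, inf)]
print(linear_buckets(0, 100, 20))  # the 20 x 100 ms setup from the question
```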

Now that we understand the formula, we can look at your questions and answer them:

  1. How are percentile charts calculated?

If we look at the documentation on Selecting metrics, we can see that there is a section that explains how aggregation works in GCP; I would suggest reading that part.

The formula to calculate the percentile is the following:

R = (P / 100) × (N + 1)

where R is the rank order of the score, P is the desired percentile, and N is the number of scores in the distribution.
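To illustrate with made-up numbers: for the 90th percentile of N = 19 scores, R = (90 / 100) × (19 + 1) = 18, so the 18th score in sorted order is taken as the 90th percentile.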

  2. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?

In the same section, it also explains what the Alignment Period is, but for the most part, the alignment period determines the length of time for subdividing the time series. For example, you can break a time series into one-minute chunks or one-hour chunks. The data in each period is summarized so that a single value represents that period. The default alignment period is one minute.
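As a concrete illustration (my own numbers): with a one-minute alignment period and a sum aligner, durations of 120 ms, 300 ms, and 80 ms logged within the same minute are collapsed into a single point with value 500, placed at the end of that minute.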

Although you can set the alignment interval for your data, time series might be realigned when you change the time interval displayed on a chart or change the zoom level.

  3. Do the histogram buckets configured for the log-based metric affect these sums? I am not too sure what you are asking here; are you asking whether the sums would be altered by the logs being generated?

I hope this helps!

Anthony Leo
  • To be honest it's not clear, but thanks. I understand histogram buckets, and how percentiles work. What I do not understand is how my stream of log events (added to original post) gets charted. It seems there are three steps, and I do not understand how they work together: 1. Place into histogram buckets ("log-based distribution metric"). 2. Apply `sum` over 1 minute ("alignment period"). 3. Aggregate to the 99th percentile. What I want is to compute "99th percentiles every 1 minute" to chart. But why are there step 1 (histogram) and step 2 (sum)? – zino Jan 08 '20 at 16:44
  • Sorry for the confusion in the previous answer, but to get back on track: the important part here is that the Aligner has different options, including the sum, the mean, and so forth. It's not necessary for you to choose the sum if you do not want to. Furthermore, for the stream of log events that gets charted, the system only looks at the number of matching logs being produced. So, for example, if the system generated 40 logs that were the same (i.e. the example log event you included in your question), it would only sample those 40 logs. – Anthony Leo Jan 08 '20 at 18:17