Assume platform performance based on Nginx - ngx_http_stub_status_module

Question

Nginx is placed in front of a microservice architecture on which we don't have any insight. We retrieve the metrics exposed by the http stub status and would like to compute an indicator of the platform performance: We cannot use latency on a load test as we want to compare geographically different sites.

What we tried so far:

Compute a delta of total requests per unit of time. Problem: it doesn't reflect the performance, all sites treat the same requests amount (100req per 100ms)
Use waiting connections gauge*

*With this indicator, we observe different behaviors. The two extremes are:

2012 server (E5-2620 v1, 24 threads) : an average of 68,62 Waiting connections per 100ms

2021 server (AMD EPYC 7642, 96 threads) : an average of 91,96 Waiting connections per 100ms

First question. It seems that the gauge should be read as "the higher the better". Why? The documentation doesn't give details but, to our knowledge, a connection waiting for an answer should appear here. Or does this gauge is only composed of idle connections (i.e. already served ones)?

Second question. On a same load test, accepted/handled connection metrics are much higher on the most recent server (around the double). Why? Both served the same number of requests sent by a pool of 100 connections. What we see is that the amount of handled connections progress very quickly at the beginning, up to a different ceiling value depending the architecture, and afterward the progression is quite linear. We cannot find any explanation of this behavior shown on this graph: graph of handled connections

score 0 · Accepted Answer · answered Dec 22 '21 at 04:17

We cannot use latency on a load test as we want to compare geographically different sites.

Really? Response time for requests is a metric that actually corresponds to how slow a thing is to a user. Different geo regions might result in a more complex statistical distribution, sure, but its still useful to analyze.

[Waiting connections] gauge should be read as "the higher the better". Why?

Reading and Writing active connections are doing I/O, doing work. Waiting is keep alives, waiting for the client, after they already have completed a request.

At the same requests per second level, lower reading and writing is good, because that correlates to connections being serviced quickly. Probably that means more waiting on clients, so higher waiting numbers, but there are limits to the number of connections.

Second question. On a same load test, accepted/handled connection metrics are much higher on the most recent server (around the double). Why?

First few seconds of both connections over time are a bit of outlier, jumping up near instantly. I'm not entirely clear on why this happens, but perhaps nginx was running for longer before the test so the counters are higher.

I would ignore the first few seconds as a warm-up. And possibly graph requests per second over time, as it may be easier to see trends in what should be a straight line.

You are right, response time is a useful metric. We managed to avoid a most complex statistical distribution by computing the response time from nginx perspective (instead of the client perspective) using this [module](https://github.com/knyar/nginx-lua-prometheus) — PierreJ, Dec 29 '21 at 08:19

Assume platform performance based on Nginx - ngx_http_stub_status_module

1 Answers1