I have an application server which does nothing but send requests to an upstream service, wait, and then respond to the client with data received from the upstream service. The microservice takes X ms to respond, or sometimes Y ms, where X << Y. The client response time is (in steady state) essentially equal to the amount of time the upstream microservice takes to process the request - any additional latency is negligible, as the client, application server, and upstream microservice are all located in the same datacenter and communicate over private IPs with very large network bandwidth.
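For concreteness, the application server is logically just a thin pass-through; a minimal sketch of what its handler does is below (Python/aiohttp is used purely for illustration - the real server is a different stack, and the URL and names are placeholders):

```python
import aiohttp
from aiohttp import web

UPSTREAM_URL = "http://upstream.internal/process"  # placeholder, not the real address

async def handle(request: web.Request) -> web.Response:
    # Forward the request body to the upstream microservice over a keepalive
    # connection, wait for its reply, and relay the reply to the client.
    session: aiohttp.ClientSession = request.app["session"]
    body = await request.read()
    async with session.post(UPSTREAM_URL, data=body) as upstream_resp:
        payload = await upstream_resp.read()
        return web.Response(body=payload, status=upstream_resp.status)

async def make_app() -> web.Application:
    app = web.Application()
    # One shared keepalive client session so upstream connections are reused.
    app["session"] = aiohttp.ClientSession()

    async def close_session(app: web.Application) -> None:
        await app["session"].close()

    app.on_cleanup.append(close_session)
    app.router.add_post("/", handle)
    return app

if __name__ == "__main__":
    web.run_app(make_app(), port=8080)
```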
When the client starts sending requests at a rate of N, the application server becomes overloaded and response times spike dramatically as the server leaves steady state. The client and the microservice have minimal CPU usage, while the application server is at maximum CPU usage. (The application server runs on a much weaker bare-metal machine than the other two services - this is a testing environment used to monitor the application server's behavior under stress.)
Intuitively, I would expect N to be the same value regardless of how long the microservice takes to respond, but I'm finding that the maximum steady-state throughput is significantly lower when the microservice takes Y ms than when it takes only X ms. The number of ephemeral ports in use when this happens is also well below the limit. Since the amount of reading and writing being done is the same, and memory usage is the same, I can't figure out why N is a function of the microservice's execution time. Also, no, the input/output of the services is the same regardless of the execution time, so the number of bytes being written is the same in both cases. Since the only difference is the execution time - which only means more TCP connections are in use while responses are outstanding - I'm not sure why maximum throughput is affected. From my understanding, the cost of a TCP connection is negligible once it has already been established.
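To put rough numbers on the "more TCP connections" point: the number of requests (and thus keepalive connections) in flight at once is roughly the arrival rate times the upstream latency. The figures below are made-up placeholders, not my real X, Y, or N - but even the slow case stays far below the ephemeral port limit, which matches what I observe:

```python
# Back-of-the-envelope: concurrent in-flight requests ~= arrival_rate * latency.
# All numbers here are illustrative placeholders, not my real measurements.
rate_n = 2000          # requests per second (hypothetical N)
latency_x = 0.005      # 5 ms fast path (hypothetical X)
latency_y = 0.200      # 200 ms slow path (hypothetical Y)

print(f"in flight at X: {rate_n * latency_x:.0f}")   # ~10 concurrent requests
print(f"in flight at Y: {rate_n * latency_y:.0f}")   # ~400 concurrent requests
```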
Am I missing something?
Thanks,
Additional details:
The services use HTTP/1.1 with keepalive and no pipelining. I should also have mentioned that I'm using an IO-thread model. If I were using a thread per request I could understand this behavior, but with only a thread per core it's confusing.
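If it matters, upstream connection reuse is handled by a pooled keepalive HTTP/1.1 client. In the terms of the illustrative aiohttp sketch above, it would be configured roughly like this (the values shown are illustrative, not my actual settings):

```python
import aiohttp

async def make_upstream_session() -> aiohttp.ClientSession:
    # Pooled HTTP/1.1 keepalive client for the upstream service. With HTTP/1.1
    # and no pipelining, each pooled connection carries one outstanding request
    # at a time. Limits below are illustrative, not my real configuration.
    connector = aiohttp.TCPConnector(
        limit=100,             # total pooled connections
        limit_per_host=0,      # 0 means no per-host cap
        keepalive_timeout=30,  # seconds to keep idle connections open for reuse
    )
    return aiohttp.ClientSession(connector=connector)
```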