I have an application server which does nothing but send requests to an upstream service, wait, and then respond to the client with data received from the upstream service. The microservice takes X ms to respond, or sometimes Y ms, where X << Y. The client response time is (in steady state) essentially equal to the amount of time the upstream microservice takes to process the request - any additional latency is negligible, as the client, application server, and upstream microservice are all located in the same datacenter and communicate over private IPs with very large network bandwidth.
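For concreteness, the application server is logically just a thin pass-through; a minimal sketch of what its handler does is below (Python/aiohttp is used purely for illustration - the real server is a different stack, and the URL and names are placeholders):

```python
import aiohttp
from aiohttp import web

UPSTREAM_URL = "http://upstream.internal/process"  # placeholder, not the real address

async def handle(request: web.Request) -> web.Response:
    # Forward the request body to the upstream microservice over a keepalive
    # connection, wait for its reply, and relay the reply to the client.
    session: aiohttp.ClientSession = request.app["session"]
    body = await request.read()
    async with session.post(UPSTREAM_URL, data=body) as upstream_resp:
        payload = await upstream_resp.read()
        return web.Response(body=payload, status=upstream_resp.status)

async def make_app() -> web.Application:
    app = web.Application()
    # One shared keepalive client session so upstream connections are reused.
    app["session"] = aiohttp.ClientSession()

    async def close_session(app: web.Application) -> None:
        await app["session"].close()

    app.on_cleanup.append(close_session)
    app.router.add_post("/", handle)
    return app

if __name__ == "__main__":
    web.run_app(make_app(), port=8080)
```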
When the client starts sending requests at a rate of N, the application server becomes overloaded and response times spike dramatically as the server leaves steady state. The client and the microservice have minimal CPU usage, while the application server is at maximum CPU usage. (The application server runs on a much weaker bare-metal machine than the other two services - this is a testing environment used to monitor the application server's behavior under stress.)
Intuitively, I would expect N to be the same value regardless of how long the microservice takes to respond, but I'm finding that the maximum steady-state throughput is significantly lower when the microservice takes Y ms than when it takes only X ms. The number of ephemeral ports in use when this happens is also well below the limit. Since the amount of reading and writing being done is the same, and memory usage is the same, I can't figure out why N is a function of the microservice's execution time. Also, no, the input/output of the services is the same regardless of the execution time, so the number of bytes being written is the same in both cases. Since the only difference is the execution time - which only means more TCP connections are in use while responses are outstanding - I'm not sure why maximum throughput is affected. From my understanding, the cost of a TCP connection is negligible once it has already been established.
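To put rough numbers on the "more TCP connections" point: the number of requests (and thus keepalive connections) in flight at once is roughly the arrival rate times the upstream latency. The figures below are made-up placeholders, not my real X, Y, or N - but even the slow case stays far below the ephemeral port limit, which matches what I observe:

```python
# Back-of-the-envelope: concurrent in-flight requests ~= arrival_rate * latency.
# All numbers here are illustrative placeholders, not my real measurements.
rate_n = 2000          # requests per second (hypothetical N)
latency_x = 0.005      # 5 ms fast path (hypothetical X)
latency_y = 0.200      # 200 ms slow path (hypothetical Y)

print(f"in flight at X: {rate_n * latency_x:.0f}")   # ~10 concurrent requests
print(f"in flight at Y: {rate_n * latency_y:.0f}")   # ~400 concurrent requests
```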
Am I missing something?
Thanks,
Additional details:
The services use HTTP/1.1 with keepalive and no pipelining. I should also have mentioned that I'm using an IO-thread model. If I were using a thread per request I could understand this behavior, but with only a thread per core it's confusing.
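If it matters, upstream connection reuse is handled by a pooled keepalive HTTP/1.1 client. In the terms of the illustrative aiohttp sketch above, it would be configured roughly like this (the values shown are illustrative, not my actual settings):

```python
import aiohttp

async def make_upstream_session() -> aiohttp.ClientSession:
    # Pooled HTTP/1.1 keepalive client for the upstream service. With HTTP/1.1
    # and no pipelining, each pooled connection carries one outstanding request
    # at a time. Limits below are illustrative, not my real configuration.
    connector = aiohttp.TCPConnector(
        limit=100,             # total pooled connections
        limit_per_host=0,      # 0 means no per-host cap
        keepalive_timeout=30,  # seconds to keep idle connections open for reuse
    )
    return aiohttp.ClientSession(connector=connector)
```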