Problem Description
Currently, I see an SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time exception (full error log below) from the AWS SDK 2.0 Lambda client (with the Netty async HTTP client) in a service where several nodes poll SQS messages from N queues and invoke Lambdas at a very high (effectively unlimited) rate.
I tried applying back-pressure based on per-node CPU usage. This didn't really help: consuming SQS messages at a high rate still opened a large number of network connections per host while CPU usage stayed low, so the back-pressure never kicked in and the same error occurred.
Increasing the connection acquisition timeout doesn't help either (it actually makes things worse), since a backlog of pending connection acquisitions builds up while new Lambda invocation requests keep coming in. The same applies to increasing the maximum number of connections (my current max connections value is 120,000).
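For reference, this is roughly how those two knobs are configured on the Netty async client (the acquisition timeout value below is just illustrative, not my exact setting):

```java
import java.time.Duration;

import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.lambda.LambdaAsyncClient;

public class LambdaClientFactory {

    public static LambdaAsyncClient build() {
        return LambdaAsyncClient.builder()
                .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                        // current setting mentioned above; each leased connection is another
                        // socket/file descriptor the node has to service
                        .maxConcurrency(120_000)
                        // raising this only lets the acquisition backlog grow longer
                        // (illustrative value)
                        .connectionAcquisitionTimeout(Duration.ofSeconds(10)))
                .build();
    }
}
```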
Thus, I'm building an SQS back-pressure mechanism that stops a node from polling for more messages once the number of network connections open on that node gets too high.
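To make the intent concrete, here is a minimal sketch of that gate (class and parameter names are hypothetical; the openConnectionCount supplier is exactly the piece I don't know how to implement):

```java
import java.util.function.IntSupplier;

import software.amazon.awssdk.services.sqs.SqsAsyncClient;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class ConnectionAwareSqsPoller {

    private final SqsAsyncClient sqs;
    private final IntSupplier openConnectionCount; // <-- the missing piece
    private final int maxOpenConnections;

    public ConnectionAwareSqsPoller(SqsAsyncClient sqs,
                                    IntSupplier openConnectionCount,
                                    int maxOpenConnections) {
        this.sqs = sqs;
        this.openConnectionCount = openConnectionCount;
        this.maxOpenConnections = maxOpenConnections;
    }

    public void pollOnce(String queueUrl) {
        // Back-pressure: skip this poll cycle if the node already has too many connections open.
        if (openConnectionCount.getAsInt() >= maxOpenConnections) {
            return;
        }
        sqs.receiveMessage(ReceiveMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .maxNumberOfMessages(10)
                        .waitTimeSeconds(20)
                        .build())
           .thenAccept(response -> {
               // hand the messages off to the part of the service that invokes Lambdas
           });
    }
}
```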
The questions are:
- How can I get the number of open connections on a host (besides the options considered below)?
- Are there any Java libs/frameworks that cover the options mentioned below, so I don't have to implement them with custom code?
Considered Solutions
- Get it from the LeasedConcurrency metric (emitted as part of the SDK metrics, through CloudWatchMetricPublisher) - rough sketch below
- Get it from the JMX FileDescriptorUse metric - rough sketch below
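For the first option, enabling SDK metrics would look roughly like this (it needs the cloudwatch-metric-publisher module). My concern with this route is that LeasedConcurrency ends up in CloudWatch, so the node would have to read it back via the CloudWatch API with some lag, rather than in-process:

```java
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.metrics.publishers.cloudwatch.CloudWatchMetricPublisher;
import software.amazon.awssdk.services.lambda.LambdaAsyncClient;

public class MetricsEnabledLambdaClient {

    public static LambdaAsyncClient build() {
        // Periodically publishes SDK metrics, including HttpMetric.LEASED_CONCURRENCY
        // (connections currently checked out of the Netty pool), to CloudWatch.
        CloudWatchMetricPublisher publisher = CloudWatchMetricPublisher.create();

        return LambdaAsyncClient.builder()
                .overrideConfiguration(ClientOverrideConfiguration.builder()
                        .addMetricPublisher(publisher)
                        .build())
                .build();
    }
}
```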
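For the second option, the underlying number can also be read directly from the OperatingSystem MXBean; note it counts all file descriptors (sockets plus regular files, pipes, etc.), so it over-approximates the number of open connections:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public final class OpenFileDescriptors {

    /**
     * Number of file descriptors currently open in this JVM process, or -1 if the
     * platform MXBean doesn't expose it (e.g. on Windows). On Linux every open
     * socket is a file descriptor, so this tracks open connections plus regular
     * files, pipes, etc.
     */
    public static long count() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1;
    }
}
```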
Full Error Log
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.
Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.
If the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98) ~[AwsJavaSdk-Core-2.0.jar:?]
PS
Links to any related networking/OS/back-pressure resources (including low-level details such as why CPU stays low while a host has a large number of connections to handle) would be appreciated.