
I have a Java AWS Lambda function serving as an API via API Gateway. For the past few months, it's been running 24/7 and hasn't had this particular error before.

Today, I did an update to add Elasticache, which required me to put the Lambda into the same VPC as the Elasticache. Before this, the Lambda was not assigned to any VPC, just running as normal.

After lots of config adjustments, it seemed like I finally got it working - the Lambda JAR is now able to connect to Elasticache while still having connectivity to the other things it needs.

But, a few minutes after deployment, I started getting this error from an Algorithmia call:

java.util.concurrent.ExecutionException: java.io.IOException: Connection reset by peer
at org.apache.http.concurrent.BasicFuture.getResult(BasicFuture.java:71)
at org.apache.http.concurrent.BasicFuture.get(BasicFuture.java:102)
at com.algorithmia.algo.FutureAlgoResponse.get(FutureAlgoResponse.java:41)
at <place that we invoke it>

The invoking code where the error occurs is very straightforward:

        FutureAlgoResponse futureAlgoResponse = algo.pipeAsync(<stuff>);
        AlgoResponse result = futureAlgoResponse.get(3L, TimeUnit.SECONDS);

And more importantly, it has been in production for nearly a year without ever having this error.

So I guess it must have something to do with the VPC! But, it works most of the time. We're running that code every few seconds, and it only fails every few minutes. When it fails, it usually fails for 1-3 requests in a row.

Our Lambda is set to a 15s timeout and the requests that fail come back after ~1s, so these aren't Lambda timeouts. And to reiterate, we had never seen this error until we moved the Lambda into a VPC today.

The Lambda VPC configuration felt fairly messy and involved, so I'm sure I messed up something somewhere. But the fact that it only happens a few times every few minutes makes it hard for me to debug with my limited AWS knowledge. I'm hoping someone can share some possible causes!

Here is how I did the setup (a rough SDK sketch of the same steps follows the list):

  • Create a new VPC
  • Create 2 subnets (and corresponding route tables) in the VPC, one public and one private
  • Create an internet gateway for the VPC and a NAT gateway for the public subnet.
  • Assign an elastic IP to the NAT gateway.
  • Enable all inbound and outbound traffic for the security group (inbound probably isn't needed, but we'll go back and tighten that)
  • Spin up an Elasticache in that VPC
  • Assign the Lambda to that VPC - specifically the private subnet + aforementioned security group
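
To make the intended topology concrete, here's roughly the equivalent of those steps expressed with the AWS SDK for Java v2. This is just a sketch for clarity: all IDs and the function name are made-up placeholders, and in reality you'd wait for the NAT gateway to become available before adding the route through it.

    import software.amazon.awssdk.services.ec2.Ec2Client;
    import software.amazon.awssdk.services.ec2.model.CreateNatGatewayRequest;
    import software.amazon.awssdk.services.ec2.model.CreateRouteRequest;
    import software.amazon.awssdk.services.lambda.LambdaClient;
    import software.amazon.awssdk.services.lambda.model.UpdateFunctionConfigurationRequest;
    import software.amazon.awssdk.services.lambda.model.VpcConfig;

    public class VpcSetupSketch {
        public static void main(String[] args) {
            try (Ec2Client ec2 = Ec2Client.create(); LambdaClient lambda = LambdaClient.create()) {
                // The NAT gateway goes in the PUBLIC subnet and uses the Elastic IP allocation.
                String natGatewayId = ec2.createNatGateway(CreateNatGatewayRequest.builder()
                        .subnetId("subnet-0publicplaceholder")        // placeholder public subnet
                        .allocationId("eipalloc-0placeholder")        // placeholder Elastic IP allocation
                        .build())
                    .natGateway().natGatewayId();

                // The PRIVATE subnet's route table sends internet-bound traffic through the NAT gateway.
                ec2.createRoute(CreateRouteRequest.builder()
                        .routeTableId("rtb-0privateplaceholder")      // placeholder private route table
                        .destinationCidrBlock("0.0.0.0/0")
                        .natGatewayId(natGatewayId)
                        .build());

                // The Lambda is attached to the PRIVATE subnet plus the security group above.
                lambda.updateFunctionConfiguration(UpdateFunctionConfigurationRequest.builder()
                        .functionName("my-api-function")              // placeholder function name
                        .vpcConfig(VpcConfig.builder()
                                .subnetIds("subnet-0privateplaceholder")
                                .securityGroupIds("sg-0placeholder")
                                .build())
                        .build());
            }
        }
    }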

I honestly haven't the slightest clue how to investigate this further, so I'm really hoping someone just knows "oh yeah connections can time out in a VPC because _____". Alternatively, would appreciate any tips on how to debug this better.

Edit: Some more searching suggests it may have to do with the NAT setup? I basically just did a default "Create NAT gateway" and threw it onto the private subnet.

rococo
  • NAT gateway should be in the public subnet. Can you double check where you created the NAT? From your description it seems like it is in the private subnet. – Marcin Aug 24 '20 at 02:00
  • This is not the answer you want to hear, but I feel like I have seen these types of network blips many times in the past; the way that we've dealt with it is by adding retry logic. – JD D Aug 24 '20 at 03:11
  • Whoops, I had misremembered @Marcin, it was indeed on the public subnet. – rococo Aug 24 '20 at 03:47
  • Thanks for the suggestion @JD D, we do have retry logic (roughly the shape sketched after these comments), so we haven't been in a major panic rolling back our production deployment, but since there were apparently no blips before, we're hoping we can prevent them entirely. – rococo Aug 24 '20 at 03:47
  • Found a possibly relevant hint here: https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-troubleshooting.html. Quote: "Problem: Your instances can access the internet, but the connection drops after 350 seconds. Cause: If a connection that's using a NAT gateway is idle for 350 seconds or more, the connection times out. Solution: To prevent the connection from being dropped, you can initiate more traffic over the connection. Alternatively, you can enable TCP keepalive on the instance with a value less than 350 seconds." – rococo Aug 24 '20 at 04:05
  • 350s seems to roughly match our times, though sometimes we get another error sooner than 350 seconds. – rococo Aug 24 '20 at 04:05
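
For context, the retry logic mentioned in the comments is just a thin wrapper around the call shown earlier, roughly this shape (an illustrative sketch, not our exact production code; the supplier would be something like () -> algo.pipeAsync(<stuff>)):

    import java.util.concurrent.TimeUnit;
    import java.util.function.Supplier;
    import com.algorithmia.algo.AlgoResponse;
    import com.algorithmia.algo.FutureAlgoResponse;

    public final class RetryingAlgoCall {
        private static final int MAX_ATTEMPTS = 3;

        public static AlgoResponse callWithRetry(Supplier<FutureAlgoResponse> call) throws Exception {
            Exception last = null;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                    // Same 3-second client-side timeout as the snippet above.
                    return call.get().get(3L, TimeUnit.SECONDS);
                } catch (Exception e) {
                    last = e;                      // remember the failure in case every attempt fails
                    Thread.sleep(200L * attempt);  // brief backoff; the blips usually clear within 1-3 requests
                }
            }
            throw last;
        }
    }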

1 Answer


Amazon support comes through with a diagnosis and solution!

tl;dr Yes, timeouts were the issue. The suggested fix is to implement TCP keep-alives so that the 350-second idle timeout is never reached (or just to send more traffic over the connection, which doesn't really work for us).

What we actually did in the end is just move off of Elasticache. That was the only reason we needed to put our Lambda in a VPC, and after thinking about it, we decided it's going to be a while before our traffic reaches levels where Elasticache's benefits are really tangible to us (vs. a simple EC2-hosted Redis instance). So now our cache is just a regular Redis instance running on EC2.

Here's the full response:

"<first talking through each step of my setup and how those appear to be correct>... However, for the past two days, I do see a number of NAT gateway idle timeouts, which you suspect could be the issue. Please refer to the NAT gateway metrics below.

With this said, the IdleTimeoutCount metric counts the number of connections that transitioned from the active state to the idle state. An active connection transitions to idle if it was not closed gracefully and there was no activity for the last 350 seconds. A value greater than zero indicates that there are connections that have been moved to an idle state. If the value for IdleTimeoutCount increases, it may indicate that clients behind the NAT gateway are re-using stale connections.

As mentioned in the troubleshooting documentation, to prevent the connection from being dropped, you can initiate more traffic over the connection. Alternatively, you can also enable TCP keepalive on the instance with a value less than 350 seconds, if possible. Sending keepalive probes at a fixed interval will ensure there is some traffic going through the connection between the NAT gateway and the remote end server. The keepalive packets will reset the 350 seconds idle timeout counters, causing the connection to stay alive for as long as needed by the application.

To answer your question: “Is this what's going on here?”

Answer: After verifying that everything from a VPC perspective is in order for the Lambda functions (SG, NACLs, route tables), the NAT gateway idle timeouts are a definite possibility here. This is also confirmed by the IdleTimeoutCount metric provided above showing that connections are timing out due to inactivity."
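
To make the keepalive suggestion concrete, here is a minimal sketch of what "TCP keepalive with a value less than 350 seconds" looks like at the Java socket level. This assumes JDK 11+, which exposes the keepalive timing options on Linux; note that the Algorithmia client goes through Apache's async HTTP stack (per the stack trace above), so in practice the setting would have to be applied to that client's connections rather than to a hand-rolled socket like this.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.StandardSocketOptions;
    import jdk.net.ExtendedSocketOptions;

    public class KeepAliveSketch {
        public static void main(String[] args) throws IOException {
            try (Socket socket = new Socket()) {
                // Turn keep-alive on, then tune the probe timing below the NAT gateway's 350s idle limit.
                socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
                socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 60);     // first probe after 60s of idle time
                socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 60); // then one probe every 60s
                socket.connect(new InetSocketAddress("example.com", 443));    // placeholder endpoint
                // ... use the connection; the kernel now keeps it from idling out behind the NAT ...
            }
        }
    }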

rococo