I have a Java AWS Lambda function serving as an API via API Gateway. For the past few months, it's been running 24/7 and hasn't had this particular error before.
Today, I deployed an update that adds ElastiCache, which required putting the Lambda into the same VPC as the ElastiCache cluster. Before this, the Lambda was not assigned to any VPC.
After lots of config adjustments, it seemed like I finally got it working: the Lambda JAR can now connect to ElastiCache while still reaching the other services it needs.
But, a few minutes after deployment, I started getting this error from an Algorithmia call:
java.util.concurrent.ExecutionException: java.io.IOException: Connection reset by peer
at org.apache.http.concurrent.BasicFuture.getResult(BasicFuture.java:71)
at org.apache.http.concurrent.BasicFuture.get(BasicFuture.java:102)
at com.algorithmia.algo.FutureAlgoResponse.get(FutureAlgoResponse.java:41)
at <place that we invoke it>
The invoking code where the error occurs is very straightforward:
FutureAlgoResponse futureAlgoResponse = algo.pipeAsync(<stuff>);
AlgoResponse result = futureAlgoResponse.get(3L, TimeUnit.SECONDS);
And more importantly, it has been in production for nearly a year without ever having this error.
So I guess it must have something to do with the VPC! But, it works most of the time. We're running that code every few seconds, and it only fails every few minutes. When it fails, it usually fails for 1-3 requests in a row.
Our Lambda has a 15s timeout, and the failing requests return after ~1s, well under both that and the 3s timeout on the future. To reiterate, we had never seen this error until we moved the Lambda into a VPC today.
The Lambda VPC configuration felt fairly messy and involved, so I'm sure I messed up something somewhere. But the fact that it only happens a few times every few minutes makes it hard for me to debug with my limited AWS knowledge. I'm hoping someone can share some possible causes!
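To gather more data in the meantime, here is a minimal sketch of how I could wrap the call so that each failed attempt is logged with its elapsed time and retried a bounded number of times. `ResetDiagnostics` and `callWithRetry` are hypothetical names of my own, not part of the Algorithmia client, and the sketch assumes `FutureAlgoResponse` can be treated as a `Future<AlgoResponse>`. It only hides the symptom; it does not explain the resets.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper (not part of the Algorithmia client): retries a
// Future-returning call when it fails with an IOException and logs how
// long each failed attempt took.
class ResetDiagnostics {

    static <T> T callWithRetry(Supplier<? extends Future<T>> call, int maxAttempts,
                               long timeout, TimeUnit unit)
            throws ExecutionException, InterruptedException, TimeoutException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ExecutionException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long start = System.nanoTime();
            try {
                // Start a fresh request on every attempt and wait for its result.
                return call.get().get(timeout, unit);
            } catch (ExecutionException e) {
                long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                if (!(e.getCause() instanceof IOException)) {
                    throw e; // not a network-level failure, so do not retry
                }
                System.err.printf("attempt %d failed after %d ms: %s%n",
                        attempt, elapsedMs, e.getCause());
                last = e;
            }
        }
        // Every attempt failed with an IOException; surface the last one.
        throw last;
    }
}
```

In our code this would be used roughly as `ResetDiagnostics.callWithRetry(() -> algo.pipeAsync(<stuff>), 3, 3L, TimeUnit.SECONDS)`. Timeouts and interruptions are deliberately not retried, so they still surface exactly as before.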
Here is how I did the setup:
- Create a new VPC
- Create 2 subnets (and corresponding route tables) in the VPC, one public and one private
- Create an internet gateway for the VPC and a NAT gateway for the public subnet.
- Assign an elastic IP to the NAT gateway.
- Allow all inbound and outbound traffic on the security group (inbound might not be needed, but we'll go back and fix that)
- Spin up an ElastiCache cluster in that VPC
- Assign the Lambda to that VPC - specifically the private subnet + aforementioned security group
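For reference, my understanding of what this layout is supposed to look like is: the NAT gateway sits in the public subnet (the one whose route table sends 0.0.0.0/0 to the internet gateway), and the private subnet's route table sends 0.0.0.0/0 to the NAT gateway. Below is a small self-contained sketch of that check, written out so I can compare it against the console. `VpcRoutingCheck`, `Subnet`, and `DefaultRoute` are hypothetical modeling types of my own, not AWS SDK classes, and the sketch does not query AWS at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the VPC layout described above; these are NOT AWS SDK types.
class VpcRoutingCheck {

    // Where a subnet's route table sends 0.0.0.0/0 traffic.
    enum DefaultRoute { INTERNET_GATEWAY, NAT_GATEWAY, NONE }

    static final class Subnet {
        final String name;
        final DefaultRoute defaultRoute;

        Subnet(String name, DefaultRoute defaultRoute) {
            this.name = name;
            this.defaultRoute = defaultRoute;
        }
    }

    // Lists routing problems for a Lambda attached to lambdaSubnet when the
    // NAT gateway has been created in natSubnet.
    static List<String> findProblems(Subnet lambdaSubnet, Subnet natSubnet) {
        List<String> problems = new ArrayList<>();
        if (natSubnet.defaultRoute != DefaultRoute.INTERNET_GATEWAY) {
            problems.add("NAT gateway subnet '" + natSubnet.name
                    + "' has no default route to an internet gateway");
        }
        if (lambdaSubnet.defaultRoute != DefaultRoute.NAT_GATEWAY) {
            problems.add("Lambda subnet '" + lambdaSubnet.name
                    + "' has no default route to the NAT gateway");
        }
        return problems;
    }
}
```

With the Lambda in the private subnet and the NAT gateway in the public one, `findProblems` returns an empty list. If the NAT gateway were created in the private subnet instead, it would report that the NAT gateway's subnet has no route to the internet gateway.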
I honestly haven't the slightest clue how to investigate this further, so I'm really hoping someone just knows "oh yeah connections can time out in a VPC because _____". Alternatively, would appreciate any tips on how to debug this better.
Edit: Some more searching suggests it may have to do with the NAT setup? I basically just did a default "Create NAT gateway" and threw it onto the private subnet.