I'm new to AWS.

I have a Lambda written in Java that processes S3 events from an SQS queue. The events are triggered by the creation of files in a specified directory in the S3 bucket.

The Lambda's processing of a single S3 event received from the queue (i.e. creating one file) works as expected.

If I create a batch of between 5 and 10 files at the same time, multiple instances of the Lambda (usually between 3 and 5) are started to process the resulting events. Some work without issue, but at least one of them (and sometimes more than one) fails. The behaviour is (somewhat frustratingly) inconsistent.

During the execution of a Lambda that fails, the first error occurs when it tries to connect to the AWS Secrets Manager:

com.amazonaws.http.conn.ssl.SdkTLSSocketFactory - connecting to secretsmanager.ap-southeast-2.amazonaws.com/<ip>:<port>
c.a.http.conn.ClientConnectionManagerFactory - java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
... stack trace...
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to secretsmanager.ap-southeast-2.amazonaws.com:<port> [secretsmanager.ap-southeast-2.amazonaws.com/<ip>, secretsmanager.ap-southeast-2.amazonaws.com/<ip>, secretsmanager.ap-southeast-2.amazonaws.com/<ip>] failed: connect timed out
... stack trace...
Caused by: java.net.SocketTimeoutException: connect timed out

The Lambda retries the connection a couple more times, but it always fails. The Lambda code catches the exception and tries to clean up, but it then cannot connect to the S3 bucket either:

com.amazonaws.http.conn.ssl.SdkTLSSocketFactory - Connecting socket to <s3 bucket>.s3.ap-southeast-2.amazonaws.com/<ip>:<port> with timeout 10000
c.a.http.conn.ClientConnectionManagerFactory - java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
... stack trace...
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to <s3 bucket>.s3.ap-southeast-2.amazonaws.com:<port> [<s3 bucket>.s3.ap-southeast-2.amazonaws.com/<ip>] failed: connect timed out
... stack trace...
Caused by: java.net.SocketTimeoutException: connect timed out

As this behaviour is inconsistent, I am not sure of an approach to identifying what the issue is - I can't work out why some instances of the Lambda would fail completely when others running at the same time work without any problems.

I am using the following libraries from com.amazonaws in my Java project:

aws-lambda-java-core: 1.2.0
aws-java-sdk-s3: 1.11.714
aws-java-sdk-events: 1.11.714
aws-java-sdk-secretsmanager: 1.11.718
aws-java-sdk-sqs: 1.11.719

Thanks in advance for any assistance.

GarlicBread
  • Is the Lambda function configured to use a VPC, or is it set to "No VPC"? If it is set to "VPC", take a look at the Subnets that are configured. Is there one subnet, or multiple subnets? If there are multiple subnets, it is possible that they are a mix of public / private subnets and the behaviour is inconsistent because it is using different subnets, some of which work and some of which do not work. – John Rotenstein Feb 17 '20 at 23:14
  • @JohnRotenstein Thanks for your thoughts. The Lambda uses a VPC with three subnets, all of which are private. – GarlicBread Feb 17 '20 at 23:22
  • As an experiment, can you change that to use just one subnet, and see whether that fixes things? – John Rotenstein Feb 17 '20 at 23:27
  • @JohnRotenstein: The Lambda works perfectly processing a large number of files when using a single subnet. If I add a second subnet to it, the timeout issue is re-introduced. – GarlicBread Feb 18 '20 at 00:21
  • This suggests that the subnets are configured differently. Check the Route Tables on both subnets to look for differences. You could rotate through the subnets (one at a time) to identify which one(s) is causing the problem, then drill-down to investigate the cause. – John Rotenstein Feb 18 '20 at 00:27
  • @JohnRotenstein I believe you are correct - I can add 2 of the subnets (private subnets 1 and 3) and process all files correctly. There's something wrong with private subnet 2. – GarlicBread Feb 18 '20 at 00:30

1 Answer

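The issue was a networking one: one of the private subnets that the Lambda's VPC configuration uses had a misconfigured route table that pointed to a non-existent NAT gateway. Any Lambda instance placed in that subnet had no route out to Secrets Manager or S3, which is why only some of the concurrent instances timed out.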

Once the correct NAT gateway was added, the Lambda worked as expected.
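
For anyone chasing a similar intermittent timeout, here is a minimal sketch of a fail-fast connectivity check that could be called at the start of the handler while attaching one subnet at a time. It uses only the JDK and is not part of my actual Lambda code. The class name `ConnectivityProbe`, its method names, port 443 and the 3-second timeout are all assumptions for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not from the original Lambda): checks whether a
// plain TCP connection to an endpoint can be opened within a short timeout.
final class ConnectivityProbe {

    private ConnectivityProbe() {
    }

    /**
     * Returns true if a TCP connection to host:port succeeds within
     * timeoutMillis; returns false on timeout, refusal, or DNS failure.
     */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // SocketTimeoutException and UnknownHostException are both IOExceptions.
            return false;
        }
    }

    /** Builds a one-line summary suitable for writing to the Lambda log. */
    static String describe(String host, int port, int timeoutMillis) {
        boolean reachable = canConnect(host, port, timeoutMillis);
        return "probe host=" + host + " port=" + port + " reachable=" + reachable;
    }
}
```

Logging something like `ConnectivityProbe.describe("secretsmanager.ap-southeast-2.amazonaws.com", 443, 3000)` (port 443 assumed for HTTPS) before making any SDK calls should make a subnet with no outbound route show up in seconds rather than after the SDK's retries are exhausted. It only shows that the network path is broken, so the subnet's route table still needs to be inspected to find the cause.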

Many thanks to John Rotenstein for his help with diagnosing this issue.

GarlicBread