2

I have a set of lambda functions that processes messages on an SQS stack. They take data sets, process them and store the results in an RDS MySQL database, which it connects to via VPC. Both the Lambda functions and the RDS database are in the same availability zone.

This has been working for the last couple of months without any issues, but early this morning (2019-01-12) at 01:00 I started seeing lambda timeouts and messages being moved into the dead letter queue.

I've done some troubleshooting and confirmed the reason for the timeouts is the inability for Lambda to establish a connection to the database server.

The RDS server is public, but locked down to allow access only through VPC and 2 public IPs.

I've taken the following steps so far to try and resolve the issue:

  • Given the lambda service role admin rights to rule out IAM issues
  • Unassigned VPC from the lambda functions and opened up RDC inbound access from 0.0.0.0/0 to rule out VPC issues.
  • Restarted the RDS hosts, the good ol' off'n'on again.
  • Used serverless to invoke the lambda functions locally with test data (worked). My local machine connects to the public RDS IP, not through VPC.
  • Changed the runtime environment from 3.6 to 3.7

It doesn't appear to be a code issue, as it's been working flawlessly for the past couple of months and I can invoke locally without issue and my Elastic Beanstalk instance, which sits on the same VPC subnet continues to connect through VPC without issue.

Here's the code I'm using to connect:

connectionString = 'mysql+pymysql://{0}:{1}@{2}/{3}'.format(os.environ['DB_USER'], os.environ['DB_PASSWORD'], os.environ['DB_HOST'], os.environ['DB_SCHEMA'])
        engine = create_engine(connectionString, poolclass=NullPool)
        with engine.connect() as con: <--- breaking here
            meta = MetaData(engine, reflect=True) <-- never gets to here

I double checked the connection string & user accounts, both are correct/working locally.

If someone could point me in the right direction, I'd be grateful!

  • *"...inability for Lambda to establish a connection to the database server."* What is the exact error message? You may need to increase your Lambda function timeout in order to discover what that is... but without it, you are guessing. – Michael - sqlbot Jan 13 '19 at 00:13
  • @Michael-sqlbot we are getting a connection timeout after 10 seconds (default for pyMySQL) – user10905510 Jan 13 '19 at 00:26
  • Okay, so not a Lambda execution timeout, but an actual exception? What is the *exact* wording of the exception (including any alpha/numeric codes)? Is this RDS instance standard MySQL or Aurora for MySQL? – Michael - sqlbot Jan 13 '19 at 00:47

2 Answers2

0

My first guess is that you've hit a connection limit on the RDS database. Because Lambdas can be executed concurrently (this could easily be the case if there were suddenly a lot of messages in your SQS queue), and each execution opens a new connection to your DB, the connection pool can get saturated.

If this is the case, you can set a concurrent execution limit on your Lambda function to prevent this.

A side note - it is not recommended to use a database with a persistent connection in a serverless architecture exactly for this reason. AFAIK, AWS is working on a better solution to use RDS from Lambda, but it's not available yet.

Milan Cermak
  • 7,476
  • 3
  • 44
  • 59
  • Connection count is very low, mainly because the Lambda workers can't connect. I cleared down the connections anyway, and no luck unfortunately. Thanks though. – user10905510 Jan 12 '19 at 19:19
0

So...

I was changing security groups and it was having no effect on the RDS host, at one point I removed all access and I could still connect, which is crazy. At this point I started to think the outage on Friday night put the underlying RDS host into a weird state. I put the Security Groups back to the way they should be, stopped & started (restart had no effect) the RDS host and everything started to work again.

Very frustrating, but happy it's finally resolved.