2

I have a Lambda function with a reserved concurrency of 1 and an SQS trigger configured with a batchSize of 10. The function only publishes whatever it receives to an SNS topic (the code is just a couple of lines). I'm using it to throttle a massive volume of incoming messages so that my backend can process them without choking.
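For context, the function is essentially nothing more than the following (a minimal Python sketch, not the actual code; the environment variable and handler name are assumptions):

```python
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["TOPIC_ARN"]  # assumed configuration, not the real variable name


def handler(event, context):
    # The SQS trigger delivers up to batchSize (10) records per invocation;
    # each record's body is forwarded to SNS unchanged.
    for record in event["Records"]:
        sns.publish(TopicArn=TOPIC_ARN, Message=record["body"])
```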

Theoretically this Lambda should never send anything to the queue's dead-letter queue, but 80% of the messages end up there! I don't understand why, since the Lambda logs show that no execution fails at all: no exceptions are thrown and only successful executions appear in the logs.

At which point does Lambda decide that a particular message should go to the dead-letter queue? (My redrive policy has a maxReceiveCount of 3.)

Julian
  • Hi Julian, if one of the responses below answered your question please upvote and accept it. That's the ServerFault's way to say thank you for the time and effort someone took to help you. Thanks! – MLu Nov 03 '18 at 03:55
  • @MLu they don’t, unfortunately. – Julian Nov 12 '18 at 19:16
  • Hi, @Julian. Have you found any solution? By the way, here's a great [article](https://www.jeremydaly.com/10-things-to-know-when-building-serverless/) about this problem (see the comments too). Unfortunately, they don't have any solution too. – Vladyslav Turak Nov 14 '18 at 11:52

4 Answers

0

Julian, the messages end up in the DLQ despite the Lambda not having any errors or timeouts because the Lambda is being throttled. The Lambda service's SQS poller keeps retrieving batches and putting them in flight faster than your single concurrent execution can process them; a throttled delivery counts as a failed processing attempt, so once a message has been received maxReceiveCount times it is moved to the DLQ.

One possible solution to your problem is to use the SQS event source's maximum concurrency setting, which was introduced in early 2023. See Introducing maximum concurrency of AWS Lambda functions when using Amazon SQS as an event source.
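If it helps, this is roughly how that setting can be applied with boto3 (a minimal sketch; the event source mapping UUID is a placeholder, and note that the minimum allowed value is 2):

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap how many concurrent invocations the SQS event source may use so the
# poller stops putting more messages in flight than the function can handle.
lambda_client.update_event_source_mapping(
    UUID="your-event-source-mapping-uuid",  # placeholder
    ScalingConfig={"MaximumConcurrency": 2},  # allowed range is 2-1000
)
```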

0

Not directly answering your question, but why doesn't your backend poll the SQS queue and process one message at a time at its own pace? That would be a more common pattern.

You could then also scale the backend processing (if applicable) by adding more nodes based on the SQS queue depth. If your messages arrive more often during business hours, for example, and less often at night, your backend should be able to catch up with the stream during the quieter times.

Alternatively, if you are only interested in the newest messages, you can set the message retention period to something like 1 minute, after which unprocessed messages will disappear from the queue and your backend will only retrieve the more recent ones.

I think that's a better architecture than trying to rate-limit the messages through Lambda to SNS and hope that the backend keeps up.

If doing the SQS polling in the backend isn't possible let us know and we'll revisit your Lambda / DLQ issue ;)

Hope that helps :)

MLu
  • I can't change the backend. Whatever I do needs to solve the problem without touching the backend code. The *actual* problem is not the backend but the database: the backend floods the RDS database with requests, which is why I need to throttle them. I need to keep every single one of the SQS messages but, as I said, I can't change anything in either the database or the backend code. – Julian Oct 06 '18 at 06:24
0

Sounds like you are not deleting the retrieved messages in your Lambda function after processing.

I assume this is what happens:

  1. Message M1 arrives in SQS.
  2. Your Lambda picks it up, publishes it to SNS, doesn't delete it from SQS, and exits.
  3. After a while (once the default visibility timeout of 30s expires) the same message M1 becomes visible in the queue again, because it was retrieved but never deleted.
  4. That happens 3 times (per your redrive policy's maxReceiveCount) and the message is then sent to the dead-letter queue.

Ultimately, all messages end up in the DLQ because of this.
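If you were consuming the queue yourself (rather than via the Lambda trigger, which deletes successfully processed batches for you), the delete step would look roughly like this (a minimal boto3 sketch with a placeholder queue URL):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
for message in response.get("Messages", []):
    print(message["Body"])  # stand-in for the actual processing
    # Without this delete the message becomes visible again after the
    # visibility timeout, and after maxReceiveCount receives it goes to the DLQ.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```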

Am I right? :)

MLu
  • This is not the case with Lambda which deletes messages from the queue once you return successfully from the handler. https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html Even if that were the case, 100% of messages would go to the dead letter queue. One odd thing is that if I increase the maximum receive on the redrive policy to something like 10, everything works fine for queues of up to 1000k or so messages. – Julian Oct 06 '18 at 06:21
  • @Julian hmm, I guess it's got something to do with the concurrency=1 and batchSize=10 - perhaps Lambda runtime is still too slow to deal with the demand? – MLu Oct 06 '18 at 07:48
  • _It appears that the “Lambda service” polls the queue, and puts messages “in flight” without consideration of the concurrency limits._ - please, see this [thread](https://www.jeremydaly.com/serverless-consumers-with-lambda-and-sqs-triggers/#comment-3482). It looks like Lambda's throttling is counted like a failed processing attempt. If we have redrive policy set to 1, we will face this issue. – Vladyslav Turak Nov 14 '18 at 12:03
0

Another idea: messages stay in SQS for 4 days by default (configurable up to 14 days), so you can have some process poll the queue at a sustainable rate (as dictated by your RDS throughput) and re-publish to SNS. That process will implement the required throttling: keep a counter of messages processed over the last minute and delay the next SQS poll until the throughput drops below the limit. A simple sliding-window algorithm, like the sketch below, should do the trick. You may get some inspiration from network rate limiting, which has the same goal of limiting the throughput to the recipient.
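Something along these lines (a minimal sketch; the queue URL, topic ARN, and rate limit are placeholders to tune for your RDS throughput):

```python
import time
from collections import deque

import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:my-topic"  # placeholder
MAX_PER_MINUTE = 60  # tune to what your backend / RDS can absorb

sent_timestamps = deque()  # publish times within the last 60 seconds


def wait_for_capacity():
    """Sliding window: block until fewer than MAX_PER_MINUTE publishes happened in the last minute."""
    while True:
        now = time.monotonic()
        while sent_timestamps and now - sent_timestamps[0] > 60:
            sent_timestamps.popleft()
        if len(sent_timestamps) < MAX_PER_MINUTE:
            return
        time.sleep(1)


while True:
    messages = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    ).get("Messages", [])
    for message in messages:
        wait_for_capacity()
        sns.publish(TopicArn=TOPIC_ARN, Message=message["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
        sent_timestamps.append(time.monotonic())
```

One thing to keep an eye on: if the limiter holds a batch longer than the queue's visibility timeout, those messages become visible again and get re-delivered, so keep the visibility timeout comfortably larger than the worst-case delay per batch.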

That will be much easier to implement than having the Lambda triggered by SQS and trying to throttle it by way of concurrency and batch size limits - such a method may have quite an unpredictable throughput profile.

You can do the polling in a long-running Lambda (the maximum timeout is 15 minutes per run) or, perhaps better, as a service in a container running on Fargate or ECS, whichever is cheaper.

Could this be an answer?

MLu