26

A message is added to an SQS queue, and the queue is configured to trigger a Lambda function (Node.js).

When the Lambda function is triggered, I may want to retry the same message after 5 minutes without deleting it from the queue. The reason: if the Lambda cannot connect to an external host (e.g. an API), I'd like to try again after 5 minutes, for 3 attempts only.

How can that be written in Node.js?

For example, Laravel lets you specify max job attempts: the number of times the job may be attempted is set with `public $tries = 5;`.

Source: https://laravel.com/docs/5.7/queues#max-job-attempts-and-timeout

How can we do something similar in Node.js?

I am thinking of adding the message to another queue (for retry). A Lambda function reads all the messages from that queue after 5 minutes and sends them back to the main queue, where they trigger the Lambda function again.

I'll-Be-Back

4 Answers

66

Retries and the retry "timeout" can both be configured directly on the SQS queue.

When you create a queue, set up the following attributes:

SQS Queue Attributes

The Default Visibility Timeout is the time the message stays hidden once it has been received by your application. If the message fails during the Lambda run and an exception is thrown, Lambda will not delete any of the messages in the batch, and all of them will eventually re-appear in the queue.

If you only want to try 3 times, you must set the SQS re-drive policy (AKA Dead Letter Queue)

Dead Letter Queue Settings

The re-drive policy will enable your queue to redirect messages to a Dead Letter Queue (DLQ) after the message has re-appeared in the queue N number of times, where N is a number between 1 and 1000.
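As a sketch, the re-drive policy is just a queue attribute: a JSON string naming the DLQ and the receive limit. The helper name and the ARN below are placeholders; the resulting object is what you would pass as `Attributes` to SQS `SetQueueAttributes` or `CreateQueue`:

```javascript
// Build the queue attributes for an N-attempt re-drive policy.
// maxReceiveCount: 3 means the message moves to the DLQ after its
// third failed receive.
function redrivePolicyAttributes(dlqArn, maxReceiveCount) {
  return {
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: String(maxReceiveCount),
    }),
    VisibilityTimeout: "300", // 5 minutes between attempts
  };
}
```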

It is essential to understand that lambda will continue to process a failed message (a message that generates an exception in the code) until:

  1. It is processed without any errors (lambda deletes the message)
  2. The Message Retention Period expires (SQS deletes the message)
  3. It is sent to the DLQ set in the SQS queue re-drive policy (SQS "moves" the message to the DLQ)
  4. You delete the message from the queue directly in your code (User deletes the message)

Lambda will not dispose of this bad message otherwise.


Important observations

Lambda will not deal with failed messages

This is based on several experiments I ran to understand the behavior of the SQS integration (the documentation on retries can be ambiguous).

Lambda will not delete failed messages and will continue to re-try them. Even if you have a Lambda DLQ setup, failed messages will not be sent to the lambda DLQ. Lambda fully relies on the configuration of the SQS queue for this purpose as stated in the lambda DLQ documentation.

Recommendation:

  • Always use a re-drive policy in your SQS queue.

Exceptions will fail a whole batch of messages

As I stated earlier, if there is an exception in your code while processing a message, the whole batch of messages is retried; it doesn't matter that some of the messages were processed correctly. If a downstream service is failing, you may end up with correctly processed messages in the DLQ.

Recommendation:

  • Manually delete messages that have been processed correctly
  • Ensure that your lambda function can process the same message more than once
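A sketch of both recommendations together — the SQS client is passed in (anything exposing a promise-returning `deleteMessage`, e.g. a thin wrapper over the AWS SDK, or a stub in tests), and `handle`, `queueUrl`, and the function name are assumptions:

```javascript
// Process a batch, deleting each successfully handled message
// immediately so it is not retried when a later message fails.
async function handleBatch(event, sqs, queueUrl, handle) {
  const failed = [];
  for (const record of event.Records) {
    try {
      // handle() must be idempotent: the same message can arrive twice.
      await handle(record);
      await sqs.deleteMessage({ QueueUrl: queueUrl, ReceiptHandle: record.receiptHandle });
    } catch (err) {
      failed.push(record.messageId);
    }
  }
  // Re-throw so only the remaining (undeleted) messages are retried.
  if (failed.length > 0) throw new Error(`failed: ${failed.join(",")}`);
}
```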

Lambda concurrency limits and SQS side effects

The blog post "Lambda Concurrency Limits and SQS Triggers Don’t Mix Well (Sometimes)" describes how, if your concurrency limit is set too low, lambda may cause batches of messages to be throttled and the received attempt to be incremented without ever being processed.

Recommendation:

The post and Amazon's recommendations are:

  • Set the queue’s visibility timeout to at least 6 times the timeout that you configure on your function.
  • The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch.
  • Set the maxReceiveCount on the queue’s re-drive policy to at least 5. This will help avoid sending messages to the dead-letter queue due to throttling.
  • Configure the dead-letter queue to retain failed messages long enough that you can move them back later to be reprocessed.
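Put together, these recommendations can be sketched as a small helper. The function name and the 14-day retention choice are mine; it returns plain numbers to plug into your queue and re-drive configuration, not a literal API payload:

```javascript
// Compute recommended queue settings relative to the function timeout.
function recommendedQueueSettings(functionTimeoutSeconds) {
  return {
    visibilityTimeout: functionTimeoutSeconds * 6, // at least 6x the function timeout
    maxReceiveCount: 5,                            // re-drive policy floor, absorbs throttling
    messageRetentionPeriod: 14 * 24 * 60 * 60,     // 14 days (the SQS maximum), time to re-drive from the DLQ
  };
}
```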
Onema
16

Here is how I did it.

  1. Create a normal queue (immediate delivery): Q1
  2. Create a delay queue (5-minute delay): Q2
  3. Create a DLQ (after retries): DLQ1

(Q1/Q2) SQS Trigger --> Lambda L1 (if failed, delete on (Q1/Q2), drop it on Q2) --> On Failure DLQ

When a message arrives on Q1, it triggers Lambda L1. On success, it proceeds from there. On failure, L1 drops it onto Q2 (the delay queue). Every message that arrives on Q2 is delayed by 5 minutes.

If your initial message can tolerate a 5-minute delay, you might not need two queues; one queue is enough. If an initial delay is not acceptable, you need two queues. Another reason to have two queues: new messages always have a clear path in.

If you have a code failure while handling Q1/Q2, the AWS infrastructure will retry immediately up to 3 times before sending the message to DLQ1. If you handle the error in your code instead, you can get the pipeline to work with the timings you mentioned.
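The failure path of this scheme can be sketched as follows. The SQS client is passed in (anything with promise-returning `sendMessage`/`deleteMessage`, e.g. a wrapper over the AWS SDK or a stub); the queue URLs, `MAX_RETRIES = 3` (matching the question's 3 attempts), and the `retries` attribute name are all assumptions:

```javascript
const MAX_RETRIES = 3;

// On failure: push the message to the delay queue Q2 with an incremented
// retry counter, then delete it from the source queue. Returns false once
// the retry budget is spent, letting the re-drive policy move it to the DLQ.
async function retryLater(record, sqs, { sourceQueueUrl, delayQueueUrl }) {
  const retries = Number(record.messageAttributes?.retries?.stringValue || 0);
  if (retries >= MAX_RETRIES) return false;
  await sqs.sendMessage({
    QueueUrl: delayQueueUrl,
    MessageBody: record.body,
    DelaySeconds: 300, // Q2's 5-minute delay, can also be set per message
    MessageAttributes: {
      retries: { DataType: "Number", StringValue: String(retries + 1) },
    },
  });
  await sqs.deleteMessage({ QueueUrl: sourceQueueUrl, ReceiptHandle: record.receiptHandle });
  return true;
}
```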

SQS Delay Queues:

https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html

SQS Lambda Architecture:

https://nordcloud.com/amazon-sqs-as-a-lambda-event-source/

Hope it helps.

Kannaiyan
  • Great answer. When you mention to drop it to Q2 (Delay Queue) - this has to be done manually in the node script Lambda? – I'll-Be-Back Oct 01 '18 at 19:59
  • You wrote `Lambda L1 (if failed, delete on (Q1/Q2)` Do you mean delete on Q1, not Q2? – I'll-Be-Back Oct 01 '18 at 20:00
  • 1
    Yes, deletion needs to be handled manually in the Lambda code. You need to delete from Q1 on the first retry/success and delete from Q2 after retries. Set a variable in the SQS payload to indicate the number of retries. If you have a bug in the code, the message will end up in the DLQ, and you need a separate process to move messages from the DLQ back to Q2 for further processing. – Kannaiyan Oct 01 '18 at 20:04
  • 2
    I am pretty sure Lambda will delete the SQS message automatically on success if there is no error? When you mention about Set a variable in the SQS payload to indicate the number of retries - doesn't SQS have retry option using `Maximum Receives` ? – I'll-Be-Back Oct 01 '18 at 20:08
  • 3
    https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html `Lambda takes care of: Deleting them once your Lambda function successfully completes.` – I'll-Be-Back Oct 01 '18 at 20:19
  • Good one. You still need to keep the messages until you are sure the process did its work. If you have a bug, you will have lost your messages and cannot recover them. – Kannaiyan Oct 01 '18 at 20:22
  • 1
    Isn't better to use the VisibilityTimeout property to space out retries? That way you wouldn't need an intermediary queue as far as I understand this. – Julian Go Apr 08 '20 at 13:53
  • @JulianGo That is another possible option. – Kannaiyan Apr 09 '20 at 06:07
  • 1
    The advantage of the multi-queue approach is it adds a delay. You can also push on the same queue and add a delay. DelaySeconds is a parameter to sendMessage. Otherwise the next time the message will be retried is after Default Visibility Timeout. – Todd Hoff May 07 '20 at 21:55
  • I like this idea. Just bear in mind that it will process messages out of order. Previous messages will still be in Q2 when the issue is resolved and messages in Q1 will start to process immediately. – Lee Oades Apr 12 '21 at 10:55
4

Fairly simple (if you execute the Lambda asynchronously), and without the need to write any code. First of all: if your code throws an error, AWS Lambda will retry executing it 3 more times. In this case, if the external API was not accessible, there is a big chance that by the third retry the API will work. Plus, the delay between retries is random-ish, meaning there is some spacing between attempts.

If the worst happens and the external API is still not up, you can take advantage of the dead-letter queue (DLQ) feature that each Lambda has, which will push a message to SQS saying what went wrong so you can take additional action. In this case, keep retrying until you make it.

You can read more here: https://docs.aws.amazon.com/lambda/latest/dg/dlq.html

David Gatti
  • What is the timing for each retry? I'd like it to retry after 5 minutes – I'll-Be-Back Sep 30 '18 at 20:25
  • It is random: it can be 10 seconds, it can be 30 the first time, then the time increases with each retry. But you have no control over it, and you can't disable it, so each time there is an error Lambda does this whether you like it or not. If you'd like to use the DLQ, you can add the message to a new queue, which you process with a cron job (CloudWatch) every 5 min: check the date of each message, skip the fresh ones, and process only those 5 minutes old or older. – David Gatti Oct 01 '18 at 07:28
  • 1
    For the SQS integration, lambda will only re-try as many times as configured in the SQS queue re-drive policy. Also, the lambda DQL setup is not used. The documentation link you provided states that `If you are using Amazon SQS as an event source, configure a DLQ on the Amazon SQS queue itself and not the Lambda function`. – Onema Mar 27 '19 at 21:07
  • If you click the link at the end of the sentence, you read the following in the first sentence: `You can use an AWS Lambda function to process messages in a standard Amazon Simple Queue Service (Amazon SQS) queue`. Meaning this is when you use SQS to collect messages processed by Lambda, and not the other way around. You got tricked by the poor AWS sentence :) – David Gatti Mar 29 '19 at 08:13
1

According to this blog:

https://www.lucidchart.com/blog/cloud/5-reasons-why-sqs-lambda-triggers-are-a-big-deal

Leverage existing retry logic and dead letter queues. If the Lambda function does not return success, the message will not be deleted from the queue and will reappear after the visibility timeout has expired.
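A minimal sketch of leaning on the visibility timeout directly, as suggested in the comments above: extend a failed message's timeout so it re-appears after 5 minutes, with no intermediary queue. The client is assumed to expose a promise-returning `changeMessageVisibility` (a thin wrapper over the AWS SDK, or a stub); `queueUrl` is an assumption:

```javascript
// Extend the visibility timeout of a failed message so SQS re-delivers
// it after 5 minutes instead of after the queue's default timeout.
async function retryInFiveMinutes(record, sqs, queueUrl) {
  await sqs.changeMessageVisibility({
    QueueUrl: queueUrl,
    ReceiptHandle: record.receiptHandle,
    VisibilityTimeout: 300, // seconds until the message becomes visible again
  });
}
```

After calling this, the handler would still throw so Lambda does not delete the message.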

Spiff