With Lambda workers processing batches from an SQS queue, is there a way to monitor the workers' failure rate (with respect to processing jobs) and block further dequeueing (and, as a result, Lambda invocations) when the failure rate crosses a threshold? I can monitor Lambda's error/invocation rate, but how would the dequeue halting be implemented? I don't want to empty the queue and lose the data.
1 Answer
The first thing is to understand why your Lambda could be failing:
1) If it is failing because of throttling (more messages to be processed than available concurrent Lambda executions), the message (or the whole batch) will return to the queue and will be tried again once the visibility timeout expires, so the retry logic is already built in for you and scales well.
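2) If it is failing because of bad messages or some error in the code, you can configure a dead-letter queue (DLQ) to receive the failed messages. This is easy to set up: for an SQS-triggered function, attach a redrive policy to the source queue that names the DLQ and a maxReceiveCount. (The DLQ setting on the Lambda function itself only applies to asynchronous invocations.)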
If your scenario is 1), rest assured your messages won't be lost. If your scenario is 2), just configure a DLQ for further analysis of the failed messages.
You can also check the official docs to understand Lambda's retry behaviour.
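For scenario 2), here is a minimal sketch of attaching a DLQ with boto3. It assumes the DLQ is wired to the source SQS queue through a redrive policy; the helper names `build_redrive_attributes` and `attach_dlq`, and the default of 3 receives, are illustrative choices, not something from the question.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count=3):
    """Return the SQS attributes that send repeatedly failing messages to a DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS expects the RedrivePolicy attribute as a JSON-encoded string.
    return {"RedrivePolicy": json.dumps(policy)}


def attach_dlq(sqs_client, queue_url, dlq_arn, max_receive_count=3):
    """Apply the redrive policy to the source queue (the one that triggers the Lambda)."""
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=build_redrive_attributes(dlq_arn, max_receive_count),
    )
```

You would call it as `attach_dlq(boto3.client("sqs"), queue_url, dlq_arn)`. A message moves to the DLQ once it has been received more than `max_receive_count` times without being deleted.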

-
3) My workers can fail because of a third party. At that point, I need a cooldown period and there is no point in processing further. – susdu Mar 05 '19 at 09:35
-
By default, they will be retried three times. But if you're afraid one of your workers could be down for a long time, you could then configure a Delay Queue (https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html) – Thales Minussi Mar 05 '19 at 09:36
-
Let's say it failed because one of them is down: configure a Delay Queue (which has the same retry behaviour built in). If it fails three times, it will then be sent to a DLQ for further analysis – Thales Minussi Mar 05 '19 at 09:37
-
@susdu you can also check the first answer on this question (which is close to what you want to achieve): https://stackoverflow.com/questions/52581618/sqs-lambda-retry-logic – Thales Minussi Mar 05 '19 at 09:55
-
Thanks, the link you provided indeed looks similar to my problem. – susdu Mar 05 '19 at 11:21
-
*"By default, they will be retried three times."* I'm not sure of your source, here. Async Lambda events are tried three times but SQS Lambda invocations are synchronous. – Michael - sqlbot Mar 05 '19 at 16:27
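For the cooldown case raised in the comments, one possible approach (not proposed anywhere in this thread, so treat it as a sketch) is to disable the SQS event source mapping when the failure rate crosses a threshold, then re-enable it later. While the mapping is disabled, Lambda stops polling and the messages stay in the queue, subject to its retention period. The function names, the 0.5 threshold, and the idea of passing in error/invocation counts (for example, read from CloudWatch) are all assumptions; `update_event_source_mapping` is the boto3 Lambda client call.

```python
def failure_rate(errors, invocations):
    """Fraction of invocations that failed; 0.0 when nothing ran."""
    if invocations <= 0:
        return 0.0
    return errors / invocations


def set_polling(lambda_client, mapping_uuid, enabled):
    """Enable or disable the SQS event source mapping for the function."""
    lambda_client.update_event_source_mapping(UUID=mapping_uuid, Enabled=enabled)


def apply_circuit_breaker(lambda_client, mapping_uuid, errors, invocations, threshold=0.5):
    """Pause polling when the failure rate exceeds the threshold.

    Returns True if polling was paused, False otherwise.
    """
    if failure_rate(errors, invocations) > threshold:
        set_polling(lambda_client, mapping_uuid, False)
        return True
    return False
```

This would have to run outside the worker itself, for instance from a scheduled function or an alarm handler, and something has to call `set_polling(client, mapping_uuid, True)` once the cooldown is over.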