
I've been reading a lot about error handling for AWS Lambdas, and nothing covers the topic of a running Lambda container simply crashing.

Is this a possibility? It seems like one. I'm building an event-driven system using Lambdas, triggered by a file upload to S3, and am uncertain whether I should bother building in logic to pick up processing if a Lambda has died.

e.g. A file object is created on S3 -> S3 notifies Lambda of the event -> the Lambda instance happens to crash before it can start processing -> the event is now gone forever* (an assumption here; I'm unsure if that's true, but can't find anything to the contrary).

I'm debating building in logic to reconcile what is on S3 and what was processed each day so I can detect the (albeit rare) scenario where a Lambda died (died and couldn't write a failure to a DLQ) and we need to process these files. Is this worth it? Would S3 somehow know that the lambda died and it needs to put the event on a DLQ of its own?
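A minimal sketch of what that daily reconciliation could look like, assuming the set of already-processed keys is recorded somewhere (a database table, a log, a manifest; that store is not specified here). The function names `list_uploaded_keys` and `find_unprocessed` are hypothetical, and the bucket and prefix are placeholders.

```python
from typing import Iterable, Iterator, List


def list_uploaded_keys(bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under the prefix, using boto3's paginator."""
    # Imported lazily so the pure comparison below needs no AWS setup.
    import boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def find_unprocessed(uploaded: Iterable[str], processed: Iterable[str]) -> List[str]:
    """Return the keys that exist on S3 but were never marked as processed."""
    return sorted(set(uploaded) - set(processed))
```

A daily job could then call `find_unprocessed(list_uploaded_keys("my-bucket", "incoming/"), processed_keys)` and re-submit whatever comes back. This only works if processing is idempotent, since a file that is still in flight when the job runs may be picked up twice.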

AfterWorkGuinness
  • I think it's a great question, and we can have different opinions about it. What's the ROI on it? How much time do you need to invest? On the other hand, let's quote the [12 factor app](https://12factor.net/): "Maximize robustness with fast startup and `graceful shutdown`" – Sándor Bakos Sep 04 '20 at 19:48

1 Answer


From https://docs.aws.amazon.com/fr_fr/lambda/latest/dg/with-s3.html, S3 invokes Lambda asynchronously.

Next, from https://docs.aws.amazon.com/lambda/latest/dg/invocation-retries.html, asynchronous Lambda invocations are retried twice, without you having to set up any queuing.

I guess if more retries are needed, it's better to set up SNS/SQS queuing.
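As a rough sketch of the queuing idea, the asynchronous invocation settings of a function can point failed events at an SQS queue through boto3's `put_function_event_invoke_config`. The helper names, function name, and queue ARN below are hypothetical placeholders. The retry count is capped at 2 by Lambda, so anything beyond that has to be handled by whatever consumes the queue.

```python
from typing import Any, Dict


def build_async_config(function_name: str, failure_queue_arn: str,
                       max_retries: int = 2,
                       max_event_age_seconds: int = 3600) -> Dict[str, Any]:
    """Build the keyword arguments for put_function_event_invoke_config."""
    if not 0 <= max_retries <= 2:
        raise ValueError("Lambda allows between 0 and 2 async retry attempts")
    if not 60 <= max_event_age_seconds <= 21600:
        raise ValueError("event age must be between 60 and 21600 seconds")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_seconds,
        # Events that still fail after the retries are sent here.
        "DestinationConfig": {"OnFailure": {"Destination": failure_queue_arn}},
    }


def apply_async_config(config: Dict[str, Any]) -> None:
    """Apply the settings to the function (needs AWS credentials)."""
    import boto3  # imported lazily so the builder can be used without AWS

    boto3.client("lambda").put_function_event_invoke_config(**config)
```

For example, `apply_async_config(build_async_config("process-upload", "arn:aws:sqs:us-east-1:123456789012:failed-uploads"))` would route events that exhaust their retries to that queue, assuming the function's execution role is allowed to send messages to it.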

  • Hi Jérémie, I'm familiar with this, my question is: is it possible for a lambda function to crash (container process dies) and thus it cannot gracefully error out by retrying / writing to a DLQ? If the container process died, would the lambda run time / platform / whatever still be able to do a retry with another hot container or a newly initialized container ? Or is what I'm talking about so far in left field that it doesn't warrant the effort to handle? – AfterWorkGuinness Sep 05 '20 at 14:57
  • At least, I just ran a test with a very basic bootstrap that simply downloads the invocation and then exits immediately (simulating a hard crash without sending any response). In CloudWatch I see that the async invocation is resent every minute, so a container crash is covered by the retries in this case. The container was reused by restarting the bootstrap. In any case, if critical business data is involved, I would still set up a DLQ to avoid any data loss. – Jérémie Leclercq Sep 05 '20 at 18:29
  • A scenario in my case: running a NodeJS lambda with a library that has a C++ addon. If there's a segmentation fault when using the library (no exception), it crashes the process. I'm looking into the possibility of the lambda itself being able to respond to events within its runtime... – glimmbo Nov 10 '22 at 18:53