I'm working on a simple microservice we've put together to queue and send emails. In live environments the queue uses SQS at the moment, via the latest Symfony Messenger component (v5.2.x) and its SQS bridge.

This mostly works nicely, but occasionally (roughly every few weeks) we've seen SQS return a rogue 500 server error to the consumer/worker, which is an ECS Service running Messenger's off-the-shelf ConsumeMessagesCommand. The error causes the consumer to exit completely. That's not the end of the world, as ECS spins up another, but it feels like we should be able to do better!

The last trace I looked at was with Messenger v5.1.5 but I don't think the Messenger code involved has changed substantively since. The error is from AmazonSqsReceiver::get() on this line and the consumer app crashes reporting a PHP Fatal error: Uncaught AsyncAws\Core\Exception\Http\ServerException. I've pasted the full trace with log timestamps at the bottom of this question.

Since ServerException implements HttpException which is caught, as far as I can tell the code is throwing a Symfony-native TransportException next, but passes in the original AWS exception for Messenger to handle as it sees fit – and then something (I've not managed to figure this out exactly yet) seems to re-throw that later, leading to the fatal unhandled exception.

It feels like there may be a different behaviour that could be used instead of forcing ConsumeMessagesCommand to exit, perhaps by configuring a slightly different Receiver, or by proposing a change to how the SQS one handles this out of the box if there's agreement that something else is better for most use cases. I'm happy to attempt the latter, but my understanding of some of Messenger's classes and their intended use feels a bit tenuous for that so far. I noticed the recently added RecoverableExceptionInterface, but I don't know if using it for a Receiver like this is within its intended scope.

I've had a quick look at extending AmazonSqsReceiver to tweak only get() without maintaining a totally separate Receiver, but since properties like Connection are private this gets messy fast.

I think my ideal outcome in the error case would be either:

  • a single HttpException from SQS would lead to the get() being retried X times, perhaps with a configurable pause Y ms in between
  • only after X successive failures would we throw a TransportException

or

  • failures would throw a TransportException but this would put message IDs into some kind of failure/dead letter queue for X retries via the same Connection which the original Receiver was using – but I'm not sure Messenger's re-queueing model works this way around, when the message fetch itself in the consumer is what failed? It feels like we probably don't have the requisite information to re-queue it in the organised Messenger queues way, if we were unable to read the message details beyond what ID we asked SQS for.
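The first option above (retry the fetch X times with a pause of Y ms, and only then give up) can be sketched in plain PHP as below. This is a minimal illustration, not Messenger code: `retryWithPause()` is a hypothetical helper, and `RuntimeException` is only a stand-in for the SQS `HttpException` that a real receiver would catch before finally throwing a `TransportException`.

```php
<?php
declare(strict_types=1);

/**
 * Calls $operation until it succeeds or $maxAttempts is reached,
 * sleeping $pauseMs milliseconds between failed attempts.
 *
 * Hypothetical helper for illustration; not part of Symfony Messenger or async-aws.
 */
function retryWithPause(callable $operation, int $maxAttempts, int $pauseMs)
{
    if ($maxAttempts < 1) {
        throw new InvalidArgumentException('maxAttempts must be at least 1.');
    }

    $lastError = null;
    for ($attempt = 1; $attempt <= $maxAttempts; ++$attempt) {
        try {
            return $operation();
        } catch (RuntimeException $e) {
            // Stand-in for catching the HTTP exception raised by the SQS client.
            $lastError = $e;
            if ($attempt < $maxAttempts) {
                usleep($pauseMs * 1000); // usleep() takes microseconds
            }
        }
    }

    // Every attempt failed: surface a single exception, keeping the last cause.
    throw new RuntimeException(
        sprintf('Gave up after %d attempts.', $maxAttempts),
        0,
        $lastError
    );
}
```

A call such as `retryWithPause(fn () => $connection->get(), 3, 200)` would be the general shape, where `$connection` is assumed to be whatever object performs the fetch. Since Connection is private in AmazonSqsReceiver, one way to apply this without subclassing might be a decorator that implements Messenger's ReceiverInterface and wraps the original receiver's get().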

Any ideas much appreciated – on whether this is behaving as designed, and what I might do to work around it if so!

2020-11-29T09:14:04.977+02:00   [29-Nov-2020 07:14:04 UTC] PHP Fatal error: Uncaught AsyncAws\Core\Exception\Http\ServerException: HTTP 500 returned for "https://sqs.eu-west-1.amazonaws.com/".
2020-11-29T09:14:04.977+02:00   Code: InternalError
2020-11-29T09:14:04.977+02:00   Message: We encountered an internal error. Please try again.
2020-11-29T09:14:04.977+02:00   Type: Receiver
2020-11-29T09:14:04.977+02:00   Detail:
2020-11-29T09:14:04.977+02:00   in /var/www/html/vendor/async-aws/core/src/Response.php:358
2020-11-29T09:14:04.977+02:00   Stack trace:
2020-11-29T09:14:04.977+02:00   #0 /var/www/html/vendor/async-aws/core/src/Response.php(117): AsyncAws\Core\Response->getResolveStatus()
2020-11-29T09:14:04.977+02:00   #1 /var/www/html/vendor/async-aws/core/src/Result.php(63): AsyncAws\Core\Response->resolve(0.1)
2020-11-29T09:14:04.977+02:00   #2 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/Connection.php(202): AsyncAws\Core\Result->resolve(0.1)
2020-11-29T09:14:04.977+02:00   #3 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/Connection.php(193): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\Connection->fetchMessage()
2020-11-29T09:14:04.977+02:00   #4 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/Connection.php(165): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\Connection->getNewMessages()
2020-11-29T09:14:04.977+02:00   #5 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/Connection.php(152): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\Connection->getNextMessages()
2020-11-29T09:14:04.977+02:00   #6 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/AmazonSqsReceiver.php(44): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\Connection->get()
2020-11-29T09:14:04.977+02:00   #7 /var/www/html/vendor/symfony/messenger/Worker.php(74): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\AmazonSqsReceiver->get()
2020-11-29T09:14:04.977+02:00   #8 /var/www/html/vendor/symfony/messenger/Command/ConsumeMessagesCommand.php(197): Symfony\Component\Messenger\Worker->run(Array)
2020-11-29T09:14:04.977+02:00   #9 /var/www/html/vendor/symfony/console/Command/Command.php(258): Symfony\Component\Messenger\Command\ConsumeMessagesCommand->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
2020-11-29T09:14:04.977+02:00   #10 /var/www/html/vendor/symfony/console/Application.php(916): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
2020-11-29T09:14:04.977+02:00   #11 /var/www/html/vendor/symfony/console/Application.php(264): Symfony\Component\Console\Application->doRunCommand(Object(Symfony\Component\Messenger\Command\ConsumeMessagesCommand), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
2020-11-29T09:14:04.977+02:00   #12 /var/www/html/vendor/symfony/console/Application.php(140): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
2020-11-29T09:14:04.977+02:00   #13 /var/www/html/mailer-cli.php(18): Symfony\Component\Console\Application->run()
2020-11-29T09:14:04.977+02:00   #14 {main}
2020-11-29T09:14:04.977+02:00   thrown in /var/www/html/vendor/async-aws/core/src/Response.php on line 358
2020-11-29T09:14:04.981+02:00   Script php mailer-cli.php messenger:consume -vv --time-limit=86400 handling the messenger:consume event returned with error code 1
NoelLH
  • To me it makes sense to use `RecoverableExceptionInterface` there. My understanding is, that it will not be used when the outcome is expected to be the same anyway, e.g. an unprocessable message (e.g. invalid json) does not need to be retried, because it will remain unprocessable. This is not necessarily the case here, so it would make sense. I am not sure if its intentionally not used. I think you should definitely open an issue in symfony/symfony. Any chance you can figure out what the underlying server error was, that AsyncAWS reported? – dbrumann Jan 07 '21 at 20:25
  • Thanks @dbrumann, good to hear I might be on the right track! I've made https://github.com/symfony/symfony/issues/39784 Re the underlying issue, I'm not sure there is any more info available. It looks like the "We encountered an internal error. ..." note comes from AWS and is about all the info they give in these cases – https://forums.aws.amazon.com/message.jspa?messageID=899968 – NoelLH Jan 11 '21 at 10:52
  • I did try to make a PR to implement this (with some tweaks initially recommended by maintainers) – https://github.com/symfony/symfony/pull/39813/files – but on further review it didn't look like the Worker would catch that new exception type. I guess it's designed to be thrown somewhere else in the stack. What does look like it'll work is updating the async-aws lib; I will post that as an answer. – NoelLH Jan 13 '21 at 12:59
  • Oh, yes. Jeremy Derusse definitely knows what's happening there in detail. I only looked at custom Handlers and Middleware before and missed the bit where Senders/Receivers do not really care about Retryable, but it makes sense. If an update of async-aws solves the issue I guess it's fine. Thanks for taking the time to investigate and contribute. – dbrumann Jan 13 '21 at 13:25

1 Answer

It's now looking like this can be fixed by using the latest stable async-aws/sqs and async-aws/core (in particular v1.7.0 or newer of the latter), without changes to Symfony Messenger itself.

After I tried to patch Messenger in a PR, @jderusse (who I think has worked on the above libraries) suggested that the update should resolve the blips, because the newer library uses RetryableHttpClient.

Since the lib's standard retry strategy includes repeating failed calls that get HTTP 500 responses, this seems like it should catch the edge case and is likely the best fix.
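For illustration only, the general idea of a retrying HTTP client can be sketched by wiring one up by hand. This is not what the updated library does internally (it applies its own retry strategy, and no app code was needed for the fix); the status codes, delay, retry count, and region below are assumptions chosen for the example, and it presumes symfony/http-client 5.2 or newer is installed alongside async-aws/sqs.

```php
<?php
declare(strict_types=1);

use AsyncAws\Sqs\SqsClient;
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\HttpClient\Retry\GenericRetryStrategy;
use Symfony\Component\HttpClient\RetryableHttpClient;

require __DIR__.'/vendor/autoload.php';

// Retry on these server-side status codes, starting at 200 ms and
// doubling the delay each time (illustrative values, not library defaults).
$strategy = new GenericRetryStrategy([500, 502, 503, 504], 200, 2.0);

// Wrap the default HTTP client so a failed request is retried up to 3 times.
$httpClient = new RetryableHttpClient(HttpClient::create(), $strategy, 3);

// Hand the retrying client to the SQS client (region is an assumption).
$sqs = new SqsClient(['region' => 'eu-west-1'], null, $httpClient);
```

With a client like this, a single transient HTTP 500 from SQS would be retried inside the HTTP layer and would not surface to the Messenger worker unless all retries fail.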

We already had the library updates on our development branch, so will prioritise releasing the changes live to verify.

Edit: I can confirm this sorted it with no app code changes.

NoelLH