
Problem Statement

Informal State

We have some scenarios where the integration layer (a combination of AWS SNS/SQS components, etc.) is also responsible for distributing data to target systems. These are mostly async flows. In such cases we send a confirmation to the caller that we have received the data and will take responsibility for its delivery. Although the data does not originate from the integration layer, we are still holding it and need to make sure it is not lost, for example if the consumers are down, or if messages are sent to the DLQs on error and then automatically deleted after the retention period.

Solution Design

My current idea is to back up the SQS/DLQ queues based on CloudWatch alarms configured on the `ApproximateAgeOfOldestMessage` metric with some applied threshold (something like the below):

Msg Expiration Event if `ApproximateAgeOfOldestMessage / MessageRetentionPeriod > Threshold`
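For illustration, a minimal sketch of how such an alarm could be configured (AWS SDK for Java v1; the alarm name, queue name, and SNS topic are hypothetical placeholders):

```java
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class ExpiryAlarm {
    public static void main(String[] args) {
        // Fire when the oldest message reaches 80% of a 14-day retention period,
        // i.e. ApproximateAgeOfOldestMessage / MessageRetentionPeriod > 0.8.
        int retentionSeconds = 14 * 24 * 3600;
        double threshold = 0.8 * retentionSeconds;

        AmazonCloudWatch cw = AmazonCloudWatchClientBuilder.defaultClient();
        cw.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("msg-expiration-warning")            // hypothetical name
                .withNamespace("AWS/SQS")
                .withMetricName("ApproximateAgeOfOldestMessage")
                .withDimensions(new Dimension()
                        .withName("QueueName").withValue("my-dlq")) // hypothetical queue
                .withStatistic(Statistic.Maximum)
                .withPeriod(300)                                    // evaluate every 5 minutes
                .withEvaluationPeriods(1)
                .withThreshold(threshold)
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions("arn:aws:sns:..."));              // notify or trigger a backup job
    }
}
```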

Now, the more I go forward with this idea, the more I doubt that it is actually the right approach…

In particular, I would like to build something unobtrusive that can be "attached" to our SQS queues and that dumps the messages that are about to expire into some repository, for example AWS S3. Then have a procedure to recover the messages from S3 back to the same original queue.
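As a rough sketch of the dump half of that idea (hypothetical class, AWS SDK for Java v1, error handling and message attributes omitted): a "drainer" that is kicked off by the expiration alarm, receives whatever it can from the queue, and copies each message to S3 under a key that references the source queue. Recovery would then be a `SendMessage` per S3 object back to the original queue. Note that this necessarily consumes messages as they come; it cannot target specific ones.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

import java.util.List;

public class QueueDrainer {
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public void drain(String queueUrl, String bucket, String queueName) {
        while (true) {
            List<Message> messages = sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)
                    .withMaxNumberOfMessages(10)
                    .withWaitTimeSeconds(5)).getMessages();
            if (messages.isEmpty()) {
                break; // queue drained, as far as this poller can see
            }
            for (Message m : messages) {
                // The object key keeps a reference to the source queue for later recovery.
                s3.putObject(bucket, queueName + "/" + m.getMessageId(), m.getBody());
                sqs.deleteMessage(queueUrl, m.getReceiptHandle());
            }
        }
    }
}
```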

The above procedure contains many challenges: message identification and consumption (`ReceiveMessage` is not designed to "query" for specific messages), dumping the message into the repository with a reference to the source queue, etc. This suggests to me that the approach might be complex overkill.

That being said, I'm aware of other "alternatives" (such as this), but I would appreciate it if you could answer the specific technical details described above, rather than challenging the "need" itself.

Indrit
  • Do you have the ability to change the code that inserts into the queue, and the code that deletes messages from the queue? If so, I would also insert a copy of the message into a DynamoDB table at the time you add it to the queue, and delete the record from the DynamoDB table when you delete it from the queue. Then all messages that timed out and were deleted from SQS would still be in the DynamoDB table, and you could query the table for records older than 14 days to find them. (A sketch of this idea appears after this comment thread.) – Mark B Sep 17 '19 at 16:32
  • Hi @MarkB, thanks for the answer. Yes, I have that option; indeed, I have full control over the component(s) that write to the SQS queues. This is in fact the "alternative" approach, similar to the one suggested in the link I mentioned above, though your suggestion is slightly different and I like it. The only concern I have with this approach is that it is a little more intrusive, since the code handling the business logic would now also have to take care of another, orthogonal concern. – Indrit Sep 17 '19 at 17:03
  • I understand, but I'm not aware of any other way to detect that an SQS message has been deleted due to the 14 day expiration. An expired message isn't moved into the DLQ is it? It's just deleted I think. So something has to happen at the time your code deletes the message from the queue to keep your tracking accurate. You could create an API that the business logic calls instead of deleting the SQS message directly. The API it calls could handle both steps, deleting the message from the queue as well as the DynamoDB table. – Mark B Sep 17 '19 at 17:09
  • Hi @MarkB, yes, I could build a sort of wrapper component that encapsulates the double-call logic, but I would still need to place this component in front of the communication with the queue, intercepting the 'delete' call. I'm afraid this would put this new, custom-built component on the critical path as far as reliability is concerned... which I would like to avoid. Yes, an expired msg is just deleted, and if I'm not wrong the age is cumulative when a message is moved into the DLQ. Not sure whether message deletion (or expiration) itself can be intercepted; so far I haven't found any evidence that it can. – Indrit Sep 17 '19 at 17:15
  • Moreover, what happens if the delete from DynamoDB fails? It would inject further complexity into the system. – Indrit Sep 17 '19 at 17:20
  • Unfortunately I see no way to add a backup of your expired SQS messages without injecting further complexity into the system. – Mark B Sep 17 '19 at 17:25
  • Can you change the code in your message consumer so that messages can be pushed to it via a direct API call rather than polling SQS? – Matthew Pope Sep 18 '19 at 20:03
  • Sorry @MatthewPope, I'm afraid I didn't understand... may I kindly ask you to explain a little more? Please note that I have full control over the consumer code; I can probably also control the way the producer sources messages; I can do whatever I want. – Indrit Sep 19 '19 at 10:15
  • @MarkB there is no issue with complexity in itself if the task at hand intrinsically requires it. The problem with complexity starts when the complexity of the solution is not aligned with the expected complexity of the problem. Sometimes this depends on missing initial knowledge about the problem, but it can also be an alarm bell providing valuable information about the solution design, or even about the approach in general. In this case it might, for example, be related to the fact that SQS queues are not supposed to be backed up? – Indrit Sep 23 '19 at 09:29
  • @Indrit SQS queues are definitely not designed to be backed up. I think you should probably be triggering an alarm off the `ApproximateAgeOfOldestMessage` metric instead, and fixing your queue consumer if that alert happens. You should also have a consumer for the DLQ that alerts you that a message was never consumed properly. – Mark B Sep 23 '19 at 12:55
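For completeness, a minimal sketch of the shadow-table idea discussed in this comment thread (hypothetical table and class names, AWS SDK for Java v1, no retries or error handling): a wrapper that pairs each SQS call with a DynamoDB call, so the business logic never performs the double call itself. Any row still in the table after the retention period belongs to a message that was never deleted, i.e. one that expired.

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

import java.util.Map;

public class TrackedQueueClient {
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final AmazonDynamoDB ddb = AmazonDynamoDBClientBuilder.defaultClient();
    private final String table = "sqs-message-shadow"; // hypothetical table name

    public String send(String queueUrl, String body) {
        String messageId = sqs.sendMessage(queueUrl, body).getMessageId();
        // Shadow copy of the message, keyed by SQS message ID.
        ddb.putItem(table, Map.of(
                "messageId", new AttributeValue(messageId),
                "queueUrl", new AttributeValue(queueUrl),
                "body", new AttributeValue(body),
                "sentAt", new AttributeValue().withN(Long.toString(System.currentTimeMillis()))));
        return messageId;
    }

    public void delete(String queueUrl, String receiptHandle, String messageId) {
        sqs.deleteMessage(queueUrl, receiptHandle);
        // If this second call fails, the shadow row is orphaned. A periodic
        // reconciliation (or a DynamoDB TTL slightly longer than the SQS
        // retention) is needed; this is exactly the extra complexity noted above.
        ddb.deleteItem(table, Map.of("messageId", new AttributeValue(messageId)));
    }
}
```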

1 Answer


Similar to Mark B's suggestion, you can use the SQS extended client (https://github.com/awslabs/amazon-sqs-java-extended-client-lib) to send all your messages through S3 (which is a configuration knob: https://github.com/awslabs/amazon-sqs-java-extended-client-lib/blob/master/src/main/java/com/amazon/sqs/javamessaging/ExtendedClientConfiguration.java#L189).

The extended client is a drop-in replacement for the AmazonSQS interface so it minimizes the intrusion on business logic - usually it's a matter of just changing your dependency injection.
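To make that concrete, a sketch of the wiring (version 1.x of the extended client library; the bucket name is a hypothetical placeholder, and method names may differ in later versions):

```java
import com.amazon.sqs.javamessaging.AmazonSQSExtendedClient;
import com.amazon.sqs.javamessaging.ExtendedClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

public class ExtendedClientWiring {
    public static AmazonSQS buildSqsClient() {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        ExtendedClientConfiguration config = new ExtendedClientConfiguration()
                // Store payloads in this (hypothetical) bucket...
                .withLargePayloadSupportEnabled(s3, "my-sqs-payload-bucket")
                // ...for every message, not just those over the 256 KB limit.
                .withAlwaysThroughS3(true);
        // Implements the same AmazonSQS interface, so it can be swapped in via DI.
        return new AmazonSQSExtendedClient(AmazonSQSClientBuilder.defaultClient(), config);
    }
}
```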

rsalkeld
  • Thank you @rsalkeld. If I understand correctly, this would dump all the messages to S3, regardless of whether they are about to expire or not, right? If so, how could I distinguish the messages that have already been processed from the ones that are stuck in the queue? – Indrit Sep 27 '19 at 13:42
  • That's correct. The extended client will also delete the S3 object before deleting the SQS message (unless you configure it not to), so any S3 objects left behind will be ones that haven't been successfully processed. – rsalkeld Sep 30 '19 at 21:51
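Building on that comment, a rough way to surface the stuck messages (hypothetical bucket name, AWS SDK for Java v1, pagination omitted for brevity): scan the payload bucket for objects older than the queue's retention period, since those belong to messages no consumer ever deleted and whose SQS copy has therefore already expired.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.time.Duration;
import java.time.Instant;

public class LeftoverPayloadScan {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Anything older than the queue's retention (14 days max) has expired in SQS.
        Instant cutoff = Instant.now().minus(Duration.ofDays(14));
        for (S3ObjectSummary obj : s3.listObjectsV2("my-sqs-payload-bucket").getObjectSummaries()) {
            if (obj.getLastModified().toInstant().isBefore(cutoff)) {
                System.out.println("Expired, unprocessed payload: " + obj.getKey());
            }
        }
    }
}
```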