0

The setup is pretty simple:

  • one backend service (e.g. Java)
  • one lambda (e.g. JavaScript)
  • one SQS

backend service sends a message to SQS -> SQS triggers lambda -> lambda code is executed

I added a log right after the backend service sent the message to SQS.
Then I viewed the logs for the lambda using CloudWatch.

To my surprise there is a gap of around 10 seconds between the backend service log and first lambda log, which means that SQS took around 10s to receive, process and send the message out again.

Question

Is it normal that SQS takes around 10s to process an incoming message? If yes, can it be made faster?

UPDATE

Thanks to the comment I was able to track this down a bit more. In my case we use TypeScript CDK and the lambda is created with following SqsEventSource:

import eventSources = require("@aws-cdk/aws-lambda-event-sources")

const eventSource = new eventSources.SqsEventSource(
  sqs.Queue.fromQueueAttributes(scope, "SomeId", {
    queueArn: "someArn",
    queueName: "someQueueName",
  }),
  {
    batchSize: 100,
    maxBatchingWindow: cdk.Duration.seconds(5),
  },
)

As far as I understand

  • batchSize: 100 and
  • maxBatchingWindow: cdk.Duration.seconds(5)

together essentially mean: "Execute lambda immediately if there are 100 events or at latest after 5s".

So 10s is still 5s too much (it sometimes even takes 15-20s).

Is this a bug?

UPDATE 2

To make the hole process more testable I decided to directly send the message to the queue via CLI (so the backend service is not relevant from here anymore):

aws sqs --endpoint <endpoint> send-message --queue-url <queue-url>  --message-body '{"test":"test1"}'

I executed the above command 10 times in a row (it took max. 7s to execute all of them) and then checked the lambda logs in CloudWatch:

enter image description here

(I could check how many messages were processed by each request with application logs, which are not visible here)

So as you can see there is a gap of 16s between the first and second lambda execution (the rest looks fine). Since I sent all 10 SQS message within max. 7 seconds, that should never be possible. There should be a gap of maximum 5 seconds or worst 10s.

What could be the cause here? Is this a bug?

Side note:

I actually came across this because a system test of my application was failing since it was so slow. For the system test it happens like 50% of the time that it's slow and the other 50% it's working as expected.

ysfaran
  • 5,189
  • 3
  • 21
  • 51
  • 1
    SQS doesn't send messages anywhere. The bottleneck isn't SQS here. Lambda polls SQS for messages. You might try changing the batch size in the Lambda configuration to `1` so it doesn't spend time waiting for more messages. – Mark B Jul 14 '21 at 20:36
  • @MarkB thanks for your hint! I added an update to my question – ysfaran Jul 14 '21 at 21:05
  • 2
    English note: "fast" has two meanings, the more common "rapid" and also "securely attached". The word "fasten" implies the second meaning. For the first meaning, you can say "make faster", "speed up", etc. – Nate Eldredge Jul 14 '21 at 21:12
  • I've never experienced a delay like this with SQS, and we process thousands of messages every hour. Are you sure you aren't just seeing a delay in the logs showing up in CloudWatch? Are you saying the time on the log is 10 seconds after the message is sent? Are you sending from a machine in AWS? Are you sure the clocks are synced? There are things that will lead to delays, like the batching, but in general SQS is pretty fast. I don't believe it has an SLA, but I'd be willing to bet that what you believe you are seeing isn't an issue with SQS. – Jason Wadsworth Jul 14 '21 at 21:39
  • 1
    If you send multiple messages at once, do all of them experience this delay or only the first message does? – jingx Jul 15 '21 at 03:01
  • @JasonWadsworth update my questions (UPDATE 2 and Side note) – ysfaran Jul 15 '21 at 13:56
  • @jingx update my questions (UPDATE 2 and Side note) – ysfaran Jul 15 '21 at 13:56
  • My guess is that this is Lambda scaling up. Are those invocations coming from one instance of the Lambda, or multiples? What is your typical cold start time on your Lambda? – Jason Wadsworth Jul 15 '21 at 14:53
  • It is puzzling that the gap exists between the first and second message. I would have expected it to be before the first, which would be attributable to lambda cold start. – jingx Jul 16 '21 at 03:57
  • @JasonWadsworth if you mean the "Initialization" phase in a lambda trace it's round about 400ms. It even delays with one request sometimes so I don't think it's scaling at all. I just rewrote my test a bit to get it fixed for now, which obviously not an answer to this issue. – ysfaran Jul 19 '21 at 19:02

1 Answers1

0

I could finally identify the issue for me, although I could not identify why there is 16s diff as seen in my question.

The general problem is that if you execute a lambda function with a batch of SQS messages and one of those messages can't be processed (-> lambda throws an error) all SQS messages are retryed again with a delay of few seconds.

So if there is a valid SQS message in the same batch as an invalid/failing message it might not be processed at all because the failing message is processed first. That can happen multiple times in a row which led to delays of more than 1 minute in my case.

Unfortunatly there is no way to tell AWS that few messages of a batch are valid and others aren't. If the lambda throws an error all messages in the batch are retryed always.

There are some workarounds and I actually chose to surround the processing of each SQS message with a try/catch and only throw an error after all messages have been processed and if one of them failed previously. This still means that successful messages are retried but at least they are processed in time (note that your lambda needs to be idempotent for this approach). I would not suggest this workarround at all and it was just a hotfix with a todo comment :)

ysfaran
  • 5,189
  • 3
  • 21
  • 51