My understanding of the rate limit and burst limit differs a bit from the explanation given by Tobias Geiselmann (the most upvoted answer).
I don't think there is any concept of concurrency per se in the way API Gateway throttling works. Requests are simply processed as fast as possible, and if your API implementation takes a long time to handle each request, there will just be more requests executing concurrently; that number of concurrent executions can easily exceed the throttling limits you set in API Gateway.
The rate limit determines the maximum sustained request rate; requests arriving faster than that start filling up your "burst bucket". The bucket behaves like a leaky bucket: it fills with one token per incoming request and drains at the rate you set as the rate limit.
So if requests keep coming in faster than the bucket's "output", the bucket eventually becomes full, and from then on the excess requests are throttled with "Too many requests" (HTTP 429) errors.
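To make this concrete, here is a minimal Python sketch of such a leaky-bucket throttle. It only illustrates the algorithm as described above; the class name, the continuous drain, and the one-token-per-request accounting are my assumptions, not AWS's actual implementation:

```python
import time


class LeakyBucketThrottle:
    """Sketch of the leaky-bucket throttle described above (not AWS code)."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate        # drain rate: the configured rate limit (RPS)
        self.burst = burst      # bucket capacity: the configured burst limit
        self.tokens = 0.0       # current fill level of the bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if a request may proceed, False if it is throttled."""
        now = time.monotonic()
        # Drain the bucket at `rate` tokens/second for the elapsed time.
        self.tokens = max(0.0, self.tokens - (now - self.last) * self.rate)
        self.last = now
        if self.tokens + 1.0 > self.burst:
            return False        # bucket full -> "Too many requests"
        self.tokens += 1.0      # each accepted request adds one token
        return True
```

Notice that the bucket tracks nothing about how long a request takes to execute once admitted, which is exactly why concurrency never enters the picture.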
For example, say you set a rate limit of 10 requests per second (RPS) with a burst limit of 100 (a short simulation replaying these numbers follows the walkthrough):

If requests keep coming in at 10 RPS or lower, the burst bucket simply stays empty; its input and output are both below the set rate limit.

Now let's say the request rate goes above 10 RPS:

The first second, 18 requests come in. The bucket drains 10 per second, so 18 - 10 = 8 tokens accumulate in the bucket.

The second second, 34 more requests come in. The bucket still drains 10 per second, so 34 - 10 = 24 more tokens accumulate, and the bucket now holds 8 + 24 = 32 tokens.

The third second, 85 more requests come in and again 10 are drained, so 85 - 10 = 75 additional tokens would accumulate. But the bucket already held 32 tokens, and 32 + 75 = 107 is higher than 100, so the last 7 requests are throttled and get a "Too many requests" response. The bucket is now full with 100 tokens.

The fourth second, 5 more requests come in. The bucket drains 10 tokens, ending up with 100 + 5 - 10 = 95 tokens. No more throttling happens.

And so on.
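To check the arithmetic, here is the same bucket replayed in discrete one-second steps with the numbers from the walkthrough (again just a sketch, using the per-second bookkeeping of the example rather than a continuous drain):

```python
rate, burst = 10, 100
tokens = 0
for second, arriving in enumerate([18, 34, 85, 5], start=1):
    # Arrivals add tokens, the drain removes `rate` tokens per second.
    level = max(0, tokens + arriving - rate)
    throttled = max(0, level - burst)   # the requests that no longer fit
    tokens = min(level, burst)          # bucket level is capped at `burst`
    print(f"second {second}: {arriving} arrived, {throttled} throttled, "
          f"bucket = {tokens}")
```

This prints bucket levels of 8, 32, 100 and 95, with 7 requests throttled in the third second, matching the walkthrough above.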
So concurrency is not really relevant here. If the requests take 15 seconds each to execute, you could very well end up with 10 RPS * 15 seconds = 150 concurrent requests even if your set limit is just 10 RPS with a burst limit of 100.
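In queueing terms this is just Little's Law: average concurrency ≈ arrival rate × average time per request. API Gateway throttling caps the arrival rate, not the time per request, so it puts no direct ceiling on concurrency.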