
I'm sorry that I can't post my code snippets. I have a Go script that scans through a DynamoDB table and modifies its entries. Everything is done sequentially (no goroutines are involved). However, when I ran this on a large table, I got a ProvisionedThroughputExceededException. I'm running the script locally.

I'm using aws-sdk-go-v2, which should apply exponential back-off for up to 20 seconds when this error is triggered. Since provisioned write capacity is allocated on a per-second basis, shouldn't the SDK automatically make the script wait when capacity is exhausted, until the next second when fresh capacity is allocated? I'm using the UpdateItem, PutItem, and DeleteItem operations.

One guess I have is that many requests in a short amount of time actually consume future capacity, because the database is still busy processing requests made in the past. However, I got the exception after only a few seconds of execution, which is far shorter than 20 seconds.

What's the proper way of handling this exception? Catching it, waiting a few seconds and retrying it feels a bit arbitrary. I don't understand why the SDK isn't taking care of this already.

Steve Han
    In one place in your question you are saying "Everything is done sequentially", and in another place you are saying "I have many requests at once". So it's not clear what you are really doing. Do you actually have many threads sending these requests in parallel? How many? Is there a fixed number of such threads or can it grow? – Nadav Har'El Jul 24 '22 at 09:12
  • Sorry for the confusion. My code is indeed sequential, and no threads are involved. By "at once", I mean making a lot of requests in a short amount of time. I'll rephrase the question! – Steve Han Jul 24 '22 at 18:19

2 Answers


You can implement a token bucket in your script to keep your RCU and WCU consumption within an acceptable range, based on your table configuration and other clients' usage of the table. If you are processing every item and speed is not a concern, try not to exceed 1,000 WCU and 3,000 RCU per second, to ensure you won't get throttled at the partition level.
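A minimal sketch of that approach, using golang.org/x/time/rate as the token bucket (the 500 writes/sec budget, the writeAll helper, and the assumption that each item costs about 1 WCU are illustrative, not from the question):

```go
package scan

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"golang.org/x/time/rate"
)

// writeAll performs the sequential writes, blocking on a token bucket so
// the script stays under a self-imposed WCU budget.
func writeAll(ctx context.Context, client *dynamodb.Client, items []*dynamodb.PutItemInput) error {
	// ~500 writes/sec with a burst of 25, comfortably under the
	// 1,000 WCU/sec per-partition ceiling. Tune to your table's capacity.
	limiter := rate.NewLimiter(rate.Limit(500), 25)

	for _, item := range items {
		// Block until a token is available. Assumes each item is under
		// 1 KB, i.e. costs ~1 WCU; use WaitN for larger items.
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if _, err := client.PutItem(ctx, item); err != nil {
			return err
		}
	}
	return nil
}
```

Because Wait blocks the caller, the sequential loop never issues more writes per second than the bucket refills, regardless of how fast DynamoDB responds.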

The reason it is not in the SDK is that there is no universal best way to handle this situation. Having the SDK "wait" might mean there is too much work in the queue and it won't get processed before a Lambda timeout. Or the throttling is happening at a partition level and not at the table level, so the SDK should not wait, as future requests may not hit that partition. Or it is not clear how long to wait, as other clients are also consuming capacity. Or the throttling is happening at a GSI level, and future requests may not impact the GSI.

Ross Williams
  • Ok, so you are saying that the reason I got this exception is that the SDK doesn't handle ProvisionedThroughputExceededException at a per-partition level? It is weird because my script is the only client accessing the database, so the SDK should definitely try exponential backoff when encountering this error. There's definitely a hot-partition situation going on, since the script only modifies one part of the table at a time, but I expect to see the SDK wait 20 seconds before giving up. In that time, there would be more WCUs allocated, right? – Steve Han Jul 24 '22 at 18:23
  • I'm saying the reason why you don't see this functionality is that there is no universal best way to handle these errors. For your issue - DynamoDB does not expose partition internals to customers, so the API does not have a way of knowing that a specific partition is throttling and whether your next request will use that same partition. – Ross Williams Jul 24 '22 at 18:45
  • That makes sense, and I don’t expect the SDK to predict whether or not a partition would throttle in the future. However, based on my understanding, the SDK should try exponential back-off for up to 20 seconds whenever it sees any ProvisionedThroughputExceededException - whether it’s because of hot partition or not. When this happens, since nothing else is accessing the database, I expect the retries to succeed, as I would have more WCU in the next second. I still don’t understand why I would get that exception after 5 seconds of executing that script. – Steve Han Jul 24 '22 at 23:57
  • The SDK will retry a certain number of these, but it has a limit of how many requests it will retry before just giving up and failing immediately. It is likely you have exceeded the retry queue limit. – Ross Williams Jul 25 '22 at 09:45

The Go API (e.g., see https://github.com/aws/aws-sdk-go/blob/main/service/dynamodb/errors.go) claims that "The Amazon Web Services SDKs for DynamoDB automatically retry requests that receive this exception [ProvisionedThroughputExceededException]. Your request is eventually successful, unless your retry queue is too large to finish." In your case there is no parallelism and just one outstanding request at a time, so the retry queue only ever holds one item. So you are right: you should not be seeing ProvisionedThroughputExceededException at all - or at least, not without a 20-second delay first.
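For reference, if the SDK does exhaust its retries, the throttling error surfaces to your code. A minimal sketch of detecting it, assuming aws-sdk-go-v2's service/dynamodb/types package (the isThrottled helper is my own naming):

```go
package scan

import (
	"errors"

	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// isThrottled reports whether err is the throttling error that surfaces
// once the SDK's built-in retries have been exhausted.
func isThrottled(err error) bool {
	var pte *types.ProvisionedThroughputExceededException
	return errors.As(err, &pte)
}
```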

My only guess about why you're seeing this is the parameter DefaultMaxAttempts int = 3. My guess (which I can't base on any code - I'm not familiar with this Go library) is that the code does not actually reach a full 20-second wait: three retry attempts cover much less than 20 seconds. If this is the case, can you please try increasing this "max attempts" parameter and see if it helps (at least to stretch the retry period to the full 20 seconds)?
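If that turns out to be the cause, something like the following should raise the retry budget - a minimal sketch, assuming aws-sdk-go-v2's config and aws/retry packages; the value 10 and the 20-second backoff cap are illustrative:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
)

func main() {
	ctx := context.Background()

	// Swap the default retryer (3 attempts) for one that keeps backing
	// off for longer before surfacing the throttling error.
	cfg, err := config.LoadDefaultConfig(ctx,
		config.WithRetryer(func() aws.Retryer {
			return retry.NewStandard(func(o *retry.StandardOptions) {
				o.MaxAttempts = 10              // default is 3
				o.MaxBackoff = 20 * time.Second // cap a single backoff at 20s
			})
		}),
	)
	if err != nil {
		log.Fatal(err)
	}

	client := dynamodb.NewFromConfig(cfg)
	_ = client // use for UpdateItem / PutItem / DeleteItem as before
}
```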

Nadav Har'El