DynamoDB best practice for querying attribute with few values

Question

I have a table in DynamoDB which contains attributes like this:

OrderId, OrderJson,OrderStatus.

The value of order status can be 0 or 1. I need to be able to update the status of the specified order and also to fetch orders based on the status field. one of the options is to use scan , the other one is to have a secondary index with status as partition key, but status field has small range of values. Please suggest what is the best practice for described requirements? Thanks!

tleef · Accepted Answer · 2018-07-17T15:26:53.897

I wouldn't go with scan since it's not cost effective or particularly efficient unless you have very few Orders.

In short, you were on the right track with Global Secondary Indexes. (I assume you were talking about Global Secondary Indexes. There are Local Secondary Indexes also but I don't see how those will be of much help for this case.

Anyway, I would create a GSI with OrderStatus as the Hash key and OrderId as the Range key. There are a couple of things you need to be careful of though.

1) Write throughput. Remember that Orders with the same OrderStatus will be written to the same disk on the GSI. This is just the way Dynamo works, documents with the same Hash key go to the same place. This means that no matter what your write throughput it set to for the table, there is an upper limit to the write throughput on a single disk. Make sure you won't exceed that upper limit.

2) Read throughput. Pretty much the same thing as write throughput but for reads. the read limit is higher than the write limit but it is still something to be aware of.

3) Paging. Whenever you Query a Dynamo table using a Hash key, in this case, OrderStatus, it will automatically limit the size of the response to 1 MB. Because of this, you might need to make multiple sequential Query requests to read all of the Orders for a particular OrderStatus.

The nice thing is all of these problems have basically the same solution, "sharding". What I mean by sharding in this case is adding a suffix to your OrderStatus. For example, if OrderStatus can be either 1 or 0, you would create another field, e.g. OrderShard, which can be 1_0, 1_1, 1_2, ..., 1_9, 0_0, 0_1, 0_2, ..., 0_9. We basically just add a random integer between 0 and 9 to the end of the OrderStatus to create more possible Hash key values on the GSI. This will mean that your data gets spread out over more disks, solving 1 and 2, and you can make parallel Query requests, solving 3 for the most part.

Instead of using OrderStatus as your Hash key on the GSI, now you will use OrderShard. Still use OrderId as the Range key. Also, if 10 shards per OrderStatus value isn't enough, just increase the number of shards. For example, add a random number between 0-99. How many shards you will need depends of your scale and throughput.

@tleef If I have to query based on one status value, how will the query be performed? e.g. If I wanted to get data of all 0 status, what will be the filter expression then? I can not query directly using 0. The data is spread from 0_1, 0_2....0_10 etc. — pango89, Feb 10 '23 at 15:03
@pango89 I believe you would need to make a separate query for each OrderShard value starting with "0" and then aggregate the results. — mrwnt10, Jun 04 '23 at 20:53
@pango89 You would query the index with "Begins with" instead of "Equal to" 0. — Mathieu Gilbert, Jun 06 '23 at 21:47

DynamoDB best practice for querying attribute with few values

1 Answers1