DynamoDB best practice for retrieving a large subset of data

Question

Say I have one table with 50,000 items and the PK for each record is a unique number. Half of these items has a property "published" set to "1" and the other set to "0".

Most of the time I'll be retrieving individual items using the hash key, but sometimes I want to be able to get ALL the items where published = 1 or 0 (ideally paginated batches).

I could have a GSI with a PK on the "published" attribute but then I'd have 25,000 records per value, which I understand would be bad because PKs should be more unique than this (please let me know if I've understood this incorrectly).

I could have separate tables for published/unpublished but in my common use case of retrieving individual items, I'd rather not have to know in advance if the item was published or not (also Amazon says that well-designed apps usually have just one table).

Any advise or suggestions would be much appreciated.

What would you do with 25k items when you retrieve them. Its too much. It would consume extremely expensive read capacity. It doesn't need to be unique. Such GSI's are called sparse indexes. Instead of 0 just write null and it won't be in indexed. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-general-sparse-indexes.html — Can Sahin, Sep 04 '18 at 12:04
@CanSahin 1. I was hoping I could "paginate" over them (get them in batches) - have updated the question to include this. 2. Good idea thanks - if I can use a sparse index to get the unpublished items too (the inverse of getting the published items), I could do that. — rangfu, Sep 04 '18 at 14:27

score 1 · Answer 1 · answered Sep 04 '18 at 19:44

DynamoDB is not to be used for bulk updates or bulk reads. It is meant for transactional reads or writes. If you are dealing with bulk updates RDS will be a good choice for transactional data.

If you want to use only a limited set of data, you can read a set at a given time, but the number you will be requesting will not be honored all the time. Whatever is available at that time will be delivered to you along with a marker known as lastEvaluatedKey.

Also alrernatively, you can use published as a range key, will can help to read by the partition, still reading bulk read / write to dynamodb will be taking long time and not a good architecture.

Hope it helps.

score 1 · Answer 2 · answered Sep 05 '18 at 04:41

A couple of things:

25,000 is not that many items to have in a single partition. But if your table grows to tens of millions of items you will have a problem
Don’t be scared of scans - if you expect to be retrieving half the items in your table a scan really is very efficient!
If you know that only a small fraction of item will be published (or unpublished) then a sparse GSI will be very efficient - but if the distribution is about half and half, then it doesn’t make a lot of sense: just scan the table!

DynamoDB best practice for retrieving a large subset of data

2 Answers2