
Keeping in mind the best practices of having a single table and evenly distributing items across partitions by using partition keys that are as unique as possible in DynamoDB, I am stuck on one problem.

Say my table stores items such as users, items and devices. I am storing the id for each of these items as the partition key. Each id is prefixed with its type such as user-XXXX, item-XXXX & device-XXXX.

Now the problem is how can I query only a certain type of object? For example I want to retrieve all users, how do I do that? It would have been possible if the begins_with operator were allowed for partition keys, so that I could search by prefix, but partition keys only allow the equality operator.

If I instead use my types as partition keys (for example, user as the partition key and the user id as the sort key), it would work, but it would leave only a few partition keys and so cause the hot key issue. And creating multiple tables is a bad practice.

Any suggestions are welcome.

Syed Waqas

2 Answers


This is a great question. I'm also interested to hear what others are doing to solve this problem.

If you're storing your data with a Partition Key of <type>-<id>, you're supporting the access pattern "retrieve an item by ID". You've correctly noted that you cannot use begins_with on a Partition Key, which leaves you without a clear-cut way to get a collection of items of that type.

I think you're on the right track with creating a Partition Key of <type> (e.g. Users, Devices, etc) with a meaningful Sort Key. However, since your items aren't evenly distributed across the table, you're faced with the possibility of a hot partition.
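As a rough sketch of that layout, here is how the request parameters for "get all items of a type" could be built. The attribute names `pk` and `sk`, the table name `app-table`, and the helper `type_query_params` are all made up for illustration. The dictionary follows the shape of a low-level DynamoDB Query request, and nothing is sent to AWS here.

```python
def type_query_params(entity_type, table_name="app-table", id_prefix=None):
    """Build Query parameters for all items whose partition key equals a type.

    Partition keys only accept equality; begins_with is allowed on the sort key.
    Attribute names pk/sk and the table name are hypothetical.
    """
    params = {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :type",
        "ExpressionAttributeNames": {"#pk": "pk"},
        "ExpressionAttributeValues": {":type": {"S": entity_type}},
    }
    if id_prefix is not None:
        # Optionally narrow the result set by a prefix of the sort key.
        params["KeyConditionExpression"] += " AND begins_with(#sk, :prefix)"
        params["ExpressionAttributeNames"]["#sk"] = "sk"
        params["ExpressionAttributeValues"][":prefix"] = {"S": id_prefix}
    return params


all_users = type_query_params("USER")
users_with_prefix = type_query_params("USER", id_prefix="user-12")
```

Every such query lands on the single `USER` partition key, which is exactly where the hot partition concern comes from.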

One way to solve the problem of a hot partition is to use an external cache, which would prevent your DB from being hit every time. This comes with added complexity that you may not want to introduce to your application, but it's an option.

You also have the option of distributing the data across partitions in DynamoDB, effectively implementing your own cache. For example, let's say you have a web application that has a list of "top 10 devices" directly on the homepage. You could create partitions DEVICES#1, DEVICES#2, DEVICES#3, ..., DEVICES#N that each store a copy of the top 10 devices. When your application needs to fetch the top 10 devices, it could randomly select one of these partitions to get the data. This may not work for a partition as large as Users, but it is a pretty neat pattern to consider.
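Here is a minimal sketch of that replicated-partition idea, using a plain dict as a stand-in for the table so it runs without AWS. The helper names and the replica count are assumptions for illustration, not anything DynamoDB provides.

```python
import random


def replica_keys(base, n):
    """Partition keys DEVICES#1 .. DEVICES#N for n copies of the same list."""
    return [f"{base}#{i}" for i in range(1, n + 1)]


def write_to_all_replicas(table, base, n, items):
    """Store an identical copy of the items under every replica key."""
    for key in replica_keys(base, n):
        table[key] = list(items)


def read_from_random_replica(table, base, n, rng=random):
    """Spread reads by picking one replica key at random."""
    key = f"{base}#{rng.randint(1, n)}"
    return table[key]


table = {}  # in-memory stand-in for the DynamoDB table
write_to_all_replicas(table, "DEVICES", 3, ["device-1", "device-2"])
top_devices = read_from_random_replica(table, "DEVICES", 3)
```

The trade-off is that every update to the list has to be written N times, so this suits small, read-heavy collections.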

Extending this idea further, you could partition Devices by some other meaningful metric (e.g. <manufactured_date> or <created_at>). This would distribute your Device items more uniformly throughout the database. Your application would be responsible for querying all the partitions and merging the results, but you'd reduce or eliminate the hot partition problem. The AWS DynamoDB docs discuss this pattern in greater depth.
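A small sketch of the query-all-partitions-and-merge step, again with an in-memory dict standing in for the table. Bucketing by creation month and the `DEVICE#<YYYY-MM>` key format are my own assumptions. The application also has to know which buckets exist, so here they are passed in explicitly.

```python
from collections import defaultdict


def partition_key(created_at):
    """Bucket an ISO date string such as '2020-09-11' by month."""
    return f"DEVICE#{created_at[:7]}"


def put_device(table, device_id, created_at):
    """Append the device to the partition for its creation month."""
    table[partition_key(created_at)].append(
        {"id": device_id, "created_at": created_at}
    )


def all_devices(table, months):
    """Query every monthly partition and merge the results in date order."""
    results = []
    for month in months:
        results.extend(table.get(f"DEVICE#{month}", []))
    return sorted(results, key=lambda d: (d["created_at"], d["id"]))


table = defaultdict(list)  # in-memory stand-in for the DynamoDB table
put_device(table, "device-b", "2020-09-11")
put_device(table, "device-a", "2020-08-01")
devices = all_devices(table, ["2020-08", "2020-09"])
```

Against a real table, each iteration of the loop would be one Query call, so listing everything costs one request per bucket (more if results are paginated).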

There's hardly a one-size-fits-all approach to DynamoDB data modeling, which can make it super tricky! Your specific access patterns will dictate which solution fits best for your scenario.

Seth Geoghegan

Keeping in mind the best practices of having a single table and evenly distributing items across partitions

Quickly highlighting the two things mentioned here.

  1. Even distribution of partition keys is definitely a best practice.
  2. Keeping the records in a single table is, in a general sense, about avoiding the need to normalize as you would in a relational database. In other words, it's fine to build with duplicate or redundant information. So it's not necessarily about clubbing all possible data into a single table.

Now the problem is how can I query only a certain type of object? For example I want to retrieve all users, how do I do that?

Let's imagine that you had this table with only "user" data in it. Would that allow you to retrieve all users? Of course not, unless there is a single partition key for the type user, with the rest of the data behind a sort key of user id.

And creating multiple tables is a bad practice

I don't think it's considered bad to have more than one table. It's bad if we store the data as normalized tables and then have to JOIN to bring it back together.

Having said that, what would be a better approach to follow?

  1. The fundamental difference is to think about the queries first and derive the table design from them. That will even suggest whether DynamoDB is the right choice. For example, the requirement to select every user might be a bad use case for DynamoDB altogether.
  2. The query patterns will further suggest what the best partition key is. Is DynamoDB being chosen here because of high ingest and mostly immutable writes?
  3. Do I always have the partition key in hand for the select that I need to perform?
  4. What would the update statements look like? Will they also have the partition key available?
  5. Do I need to filter further by additional columns, and can that be the default sort order?

As you start answering some of these questions, a better model might appear altogether.

dilsingi
  • The different entities are going to have no relationship with each other, and there are no JOINs between them. We will have the partition key id whenever we update an item. I think in that case, going for multiple tables isn't a bad choice after all. – Syed Waqas Sep 11 '20 at 06:46
  • Yeah @SyedWaqas, in that case multiple tables is the right choice. In fact, it will be more flexible from a DynamoDB perspective, since you can choose different read and write units per table and adjust them accordingly. – dilsingi Sep 11 '20 at 14:36
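
To illustrate the per-table capacity point, here is a sketch of CreateTable request parameters for two separate tables, each with its own provisioned throughput. The table names, the `id` key attribute, and the capacity numbers are hypothetical. The dictionaries follow the shape of the low-level CreateTable request and are only built here, not sent.

```python
def create_table_params(table_name, read_units, write_units):
    """Build CreateTable parameters with table-specific provisioned capacity."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [{"AttributeName": "id", "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],
        "BillingMode": "PROVISIONED",
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


# Hypothetical numbers: users are read often, devices are written often.
users_table = create_table_params("users", read_units=50, write_units=10)
devices_table = create_table_params("devices", read_units=10, write_units=40)
```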