49

I'm trying to understand how partitions are created for DynamoDB tables.

According to this blog, "All items with the same partition key are stored together", so if I have a table with user IDs from 1 to 1000, does that mean I will have 1000 partitions? Or is it up to the "internal hash function"? If so, how do we know how many partitions there will be?

It later suggests adding a random suffix from 1-10 to distribute data evenly across partitions, but how does it know to query 10 times for a given invoice number? Is that only when you have 10 partitions? In this case you could have thousands of invoice numbers; does that mean the same number of partitions will be created, and that many queries made to look up one invoice number?

Ash Oldershaw
user1883793
  • 2
    I'm also confused with this comment, "All items with the same partition key are stored together" If the partition key is a primary key(and they say so in the article), you cannot have multiple items with the same value for the partition key. They must be talking about the result of the hash function. – AlbertoAndreotti Aug 27 '21 at 20:27
  • 1
    I'm glad to know that I'm not alone re-learning data modeling in NoSql, quite hard to get around coming from years of relation modeling – JobaDiniz Feb 26 '22 at 17:23

3 Answers

76

When an Amazon DynamoDB table is created, you can specify the desired throughput in Reads per second and Writes per second. The table will then be provisioned across multiple servers (partitions) sufficient to provide the requested throughput.

You do not have visibility into the number of partitions created -- it is fully managed by DynamoDB. Additional partitions will be created as the quantity of data increases or when the provisioned throughput is increased.

Let's say you have requested 1000 Reads per second and the data has been internally partitioned across 10 servers (10 partitions). Each partition will provide 100 Reads per second. If all Read requests are for the same partition key, the throughput will be limited to 100 Reads per second. If the requests are spread over a range of different values, the throughput can be the full 1000 Reads per second.
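The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: it assumes, as the example does, that provisioned throughput is divided evenly among partitions, and the function name and parameters are made up for this sketch. Real DynamoDB behaviour may be more forgiving than this simple model.

```python
def max_reads_per_second(provisioned_reads, partition_count, partitions_hit):
    """Upper bound on read throughput when requests touch only some partitions.

    Assumes throughput is split evenly across partitions (a simplification).
    """
    per_partition = provisioned_reads / partition_count
    return per_partition * min(partitions_hit, partition_count)


# All requests target one partition key, so only one partition does the work.
hot = max_reads_per_second(1000, 10, partitions_hit=1)

# Requests are spread over keys that land on all ten partitions.
spread = max_reads_per_second(1000, 10, partitions_hit=10)

print(hot, spread)
```

Under these assumptions the single-key workload is capped at one partition's share, while the spread workload can use the full provisioned amount.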

If many queries are made for the same Partition Key, it can result in a Hot Partition that limits the total available throughput.

Think of it like a bank with lines in front of teller windows. If everybody lines up at one teller, fewer customers can be served. It is more efficient to distribute customers across many different teller windows. A good partition key for distributing customers might be the customer number, since it is different for each customer. A poor partition key might be their zip code, because they all live in the same area near the bank.

The simple rule is that you should choose a Partition Key that has a range of different values.

See: Partitions and Data Distribution

John Rotenstein
  • What if I want to get the records of how much work a single bank teller has done? Can that be calculated as well (from the same table) having Customer Number as partition keys. – Mukesh Kumar Oct 14 '17 at 07:09
  • 1
    So this implies that any particular partition key value is on one partition as long as there are not so many records with that partition key that it blows through the max partition size, right? So is it best to have a unique partition key per record? And where does the sort key enter in? Apparently not for partition reasoning. – Samantha Atkins Jun 30 '19 at 19:04
  • 1
    @SamanthaAtkins SortKey is optional. However, the rule is that either the partitionKey or the combination of partitionKey+SortKey should be unique. So, if a table has a SortKey, then it is allowed to have multiple items with the same partitionKey (provided the sortKey is different). If there's no sortKey, then the partitionKey should be unique. – Nishit Mar 18 '20 at 16:14
36

Point of confusion:

Other answers already explain in detail how partitions are created by DynamoDB. So, without going into those details, let me explain the root cause of the confusion when trying to understand the relationship between Partition Keys and Partitions in DynamoDB.

  • IMHO, naming the key "Partition Key" is the cause of the confusion. It should just be called Primary Key. On hearing "Partition Key", our minds start relating each Partition Key to one Partition, a one-to-one relationship, which is not the case. As mentioned in the question itself, the key is an input to the "internal hash function". The output of that function is the actual reference to the partition.

  • Thus, for a table with 1000 user IDs (Partition Keys), DynamoDB need not have 1000 partitions. It may have 1, 5, 10, or any number of partitions; that is decided by the throughput (capacity unit) setting you have specified.

  • Partitions may increase when you increase the throughput setting.

  • The number of partitions can also increase with increasing volume of your data, when the existing partitions can not handle it.

  • Hence, what we call the Partition Key in DynamoDB is nothing but the Primary Key representing a unique item in the table (with the help of the sort key, in the case of a composite key). It does not relate one-to-one to a partition (which is a storage allocation unit for the table, backed by SSD). The actual key to a partition is obtained by passing this partition key to an internal hash function.
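The points above can be illustrated with a rough Python sketch. It is not DynamoDB's actual algorithm: the real hash function and partition layout are internal and not visible to users. MD5 with a modulo, the `partition_for` helper, and `PARTITION_COUNT = 4` are all stand-ins chosen only to show that many keys map onto a few partitions.

```python
import hashlib
from collections import Counter


def partition_for(partition_key: str, partition_count: int) -> int:
    """Map a partition key to a partition index using a stand-in hash (MD5)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


PARTITION_COUNT = 4  # hypothetical; DynamoDB chooses and hides the real number

# 1000 distinct user IDs, as in the question.
user_ids = [str(i) for i in range(1, 1001)]

# Count how many keys land on each partition.
placement = Counter(partition_for(uid, PARTITION_COUNT) for uid in user_ids)

print(len(user_ids), "keys stored on", len(placement), "partitions")
print(dict(sorted(placement.items())))
```

The same key always hashes to the same partition, and the number of partitions is independent of the number of distinct keys.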

More details here.

Dexter
  • 2
    This explains my struggles yesterday with trying to model my "partition keys" like you would do in other NoSQL databases. It should be the best answer. – DaJackal Mar 11 '21 at 08:24
  • Agree with DaJackal, this is probably the most intuitive and succinct explanation – Wilson Urdaneta Feb 23 '22 at 00:25
  • The number of partitions is given by the size of the table, not by the cardinality of the partition key. In general, all the items with the same partition key will be stored in the same partition. This is not quite right, because when a partition reaches a certain limit, it will be split even within the same partition key – Filippo De Luca Apr 07 '23 at 20:09
25

As per the AWS DynamoDB blog post Choosing the Right DynamoDB Partition Key:

Choosing the Right DynamoDB Partition Key is an important step in the design and building of scalable and reliable applications on top of DynamoDB.

What is a partition key?

DynamoDB supports two types of primary keys:

Partition key: Also known as a hash key, the partition key is composed of a single attribute. Attributes in DynamoDB are similar in many ways to fields or columns in other database systems.

Partition key and sort key: Referred to as a composite primary key or hash-range key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key. Here is an example:

enter image description here

Why do I need a partition key?

DynamoDB stores data as groups of attributes, known as items. Items are similar to rows or records in other database systems. DynamoDB stores and retrieves each item based on the primary key value, which must be unique. Items are distributed across 10 GB storage units, called partitions (physical storage internal to DynamoDB). Each table has one or more partitions, as shown in Figure 2. For more information, see Understand Partition Behavior in the DynamoDB Developer Guide.

DynamoDB uses the partition key’s value as an input to an internal hash function. The output from the hash function determines the partition in which the item will be stored. Each item’s location is determined by the hash value of its partition key.

All items with the same partition key are stored together, and for composite partition keys, are ordered by the sort key value. DynamoDB will split partitions by sort key if the collection size grows bigger than 10 GB.
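As a small illustration of "stored together and ordered by the sort key", here is an in-memory Python sketch. It only models the behaviour described above; the `table`, `put_item`, and `query` names are hypothetical helpers for this example and are not the DynamoDB API.

```python
from collections import defaultdict

# In-memory stand-in for a table: partition key -> {sort key: item}
table = defaultdict(dict)


def put_item(partition_key, sort_key, item):
    """Store an item; the (partition key, sort key) pair acts as the primary key."""
    table[partition_key][sort_key] = item  # same pair overwrites the old item


def query(partition_key):
    """Return all items for one partition key, ordered by sort key."""
    items = table.get(partition_key, {})
    return [items[sort_key] for sort_key in sorted(items)]


# Several orders share the partition key "customer-1" but differ in sort key.
put_item("customer-1", "2023-03-01", {"total": 30})
put_item("customer-1", "2023-01-15", {"total": 10})
put_item("customer-2", "2023-02-01", {"total": 20})
put_item("customer-1", "2023-01-15", {"total": 12})  # replaces the earlier item

print(query("customer-1"))
```

This also shows why several items can share one partition key when a sort key is present: uniqueness applies to the combination of the two.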

enter image description here

Recommendations for partition keys

Use high-cardinality attributes. These are attributes that have distinct values for each item like e-mail id, employee_no, customerid, sessionid, orderid, and so on.

Use composite attributes. Try to combine more than one attribute to form a unique key, if that meets your access pattern. For example, consider an orders table with customerid+productid+countrycode as the partition key and order_date as the sort key.

Cache the popular items when there is a high volume of read traffic. The cache acts as a low-pass filter, preventing reads of unusually popular items from swamping partitions. For example, consider a table that has deals information for products. Some deals are expected to be more popular than others during major sale events like Black Friday or Cyber Monday.

Add random numbers/digits from a predetermined range for write-heavy use cases. If you expect a large volume of writes for a partition key, use an additional prefix or suffix (a fixed number from a predetermined range, say 1-10) and add it to the partition key. For example, consider a table of invoice transactions. A single invoice can contain thousands of transactions per client.
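The suffix technique can be sketched in Python with an in-memory stand-in for the table. The key format `invoice#suffix`, the helper names, and the invoice number `INV-42` are assumptions made for this example and do not come from the blog post. The point it tries to show is that the reader loops over the suffix range because the application fixed that range itself, not because the table has that many partitions.

```python
import random
from collections import defaultdict

SHARD_COUNT = 10  # the predetermined range 1-10, chosen by the application

# In-memory stand-in for a table: partition key -> list of items
table = defaultdict(list)


def write_transaction(invoice_number, transaction):
    """Write under one of SHARD_COUNT suffixed keys, picked at random."""
    suffix = random.randint(1, SHARD_COUNT)
    table[f"{invoice_number}#{suffix}"].append(transaction)


def read_invoice(invoice_number):
    """Query every suffixed key for the invoice and merge the results."""
    items = []
    for suffix in range(1, SHARD_COUNT + 1):
        items.extend(table.get(f"{invoice_number}#{suffix}", []))
    return items


# A single invoice with many transactions is spread over up to ten keys.
for line in range(1000):
    write_transaction("INV-42", {"line": line})

print(len(table), "partition keys used")
print(len(read_invoice("INV-42")), "transactions read back")
```

The trade-off is that writes for one invoice are spread over several partition key values, while a full read of that invoice costs one query per suffix.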

Read More @ Choosing the Right DynamoDB Partition Key

LuFFy
  • 12
    This page seems quite contradictory to me. It recommends using "high-cardinality attributes" such as employeeID, customerID, orderID etc … however further down the page it describes sequential IDs or unique IDs generated by the relational DB engines (which seem to me like perfect high cardinality attributes) as an "Antipatterns for partition keys". I'm confused! – Mike Jun 12 '19 at 14:59
  • 1
    The examples given on the linked material seem to be worried about dependency on some other generating mechanism during porting that would be messy to support after porting or that are particular to say some external transactions. I don't see why an app generated unique id would not work though. – Samantha Atkins Jun 30 '19 at 19:11
  • I don't understand how to model with `one-table` only all (or several data types). It seems to me that I should have a `uuid` for Partition Key and `type` for Sort Key. For example, `PK=uuid-1 | SK=Book`, `PK=uuid-2 | SK=User`, `PK=uuid-3 | SK=Invoice`. Any real examples? – JobaDiniz Feb 26 '22 at 17:35