16

Our Azure Cosmos DB collection has gotten large enough to require a partition key. In doing some reading about this, I get the impression that the best partition key is one that provides for even distribution and higher cardinality. This article from Microsoft discusses it.

Using a primary key as a partition key provides for even distribution, but a cardinality of only 1. If this is my only option, is this a bad thing? The aforementioned article gives a few examples and seems to indicate that the primary key should be used as a partition key in those instances. In the case of Azure Cosmos DB, the partitions are logical, not physical. So it wouldn't lead to having each document on its own disk, but it seems like it could lead to a bloated index.

Is using a primary key as a partition key a common practice? Are there any downsides to it?

Scotty H

4 Answers

6

The choice of partition key deserves careful and repeated consideration. Since using the primary key as the partition key is your only option, I will just list some of the possible downsides for your reference.

In terms of performance, any query that filters on a field other than the partition key has to cross partitions, which reduces query performance. Arguably, if the amount of data is small, this won't have much effect.

In terms of cost, Cosmos DB charges primarily for storage and RU consumption. As you said, choosing the primary key as the partition key will lead to more index storage. If most of your queries are cross-partition, it also leads to higher RU consumption.

In terms of stored procedures, triggers, and UDFs, you can't run cross-partition transactions via stored procedures and triggers. They are scoped to a single partition, so you need to specify the partition key when you use them, and with the primary key as the partition key each partition holds only one document.

Note that once a partition key is set, it cannot be deleted or modified later. So think it through before you choose, and make sure you back up your data.

For more details, refer to the official documentation.

Ruben Bartelink
Jay Gong
4

No, there is no downside to it. Strive for a partition key with high cardinality. Don't worry about indexes, physical partitions, and so on.

You can have millions of partition key values and 10 physical partitions. Physical partitions are created behind the scenes by Cosmos DB. You should never have to worry about them.
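To make the logical-versus-physical distinction concrete, here is a minimal sketch. It is not how Cosmos DB is actually implemented: the MD5 hash, the `physical_partition_for` helper, and the fixed count of 10 are all assumptions chosen for illustration. It only shows that many distinct partition key values can be spread fairly evenly over a handful of buckets.

```python
import hashlib
from collections import Counter

PHYSICAL_PARTITIONS = 10  # assumed fixed count, for illustration only


def physical_partition_for(partition_key: str) -> int:
    """Map a partition key value to a bucket.

    MD5 is a stand-in here; the real service uses its own hashing.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PHYSICAL_PARTITIONS


# 100,000 distinct partition key values, e.g. one per document id.
keys = [f"doc-{i}" for i in range(100_000)]
load = Counter(physical_partition_for(k) for k in keys)

# Print how many logical partitions landed in each bucket.
for bucket in sorted(load):
    print(bucket, load[bucket])
```

Each bucket should end up with roughly a tenth of the keys, which is the sense in which a high-cardinality key gives even distribution.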

Rafat Sarosh
3

You could say that the primary key is the safest and probably the most appropriate choice for a partition key.

It guarantees that the value is unique, which, other than unique keys, is the only way to achieve that. The distribution will be even, and because the primary key is your partition key, you can retrieve a document by reading it directly instead of querying, which reduces latency and cost.
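The difference between reading and querying can be sketched with a toy in-memory model. This is not the Cosmos DB SDK: the `partitions` dictionary, `point_read`, and `query_by_customer` are hypothetical names, and the sample documents are made up. The point is only that a read by id and partition key touches one partition, while a filter on any other field has to visit all of them.

```python
from typing import Optional

# Toy model: logical partitions keyed by partition key value.
# With id as the partition key, each logical partition holds one document.
partitions: dict[str, dict[str, dict]] = {
    "order-1": {"order-1": {"id": "order-1", "customer": "alice"}},
    "order-2": {"order-2": {"id": "order-2", "customer": "bob"}},
    "order-3": {"order-3": {"id": "order-3", "customer": "alice"}},
}


def point_read(doc_id: str, partition_key: str) -> Optional[dict]:
    """Direct lookup: exactly one partition is touched."""
    return partitions.get(partition_key, {}).get(doc_id)


def query_by_customer(customer: str) -> tuple[list[dict], int]:
    """Filter on a non-key field: every logical partition is visited."""
    hits: list[dict] = []
    visited = 0
    for docs in partitions.values():
        visited += 1
        hits.extend(d for d in docs.values() if d["customer"] == customer)
    return hits, visited


doc = point_read("order-2", "order-2")
matches, partitions_visited = query_by_customer("alice")
```

Here `point_read` goes straight to one entry, whereas `query_by_customer` reports that it visited all three partitions to find two matches.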

Nick Chapsas
1

I think that MS does not do a great job of describing how best to determine a partition key for Cosmos DB, especially if folks are generally suggesting using the primary key of the database as the partition key (which may be perfectly acceptable sometimes, but I can't see how it would be the norm).

In a recent project, this is how we decided to identify a partition key and item id for the objects in our system. I think this would apply to many systems that have natural composite primary key candidates on their objects.

In our system, every object is restricted to a state (StateCode) and vendor (VendorId). From there, we have multiple entities like Sales Orders, Customers, Widgets, ... In our SQL Server implementation, every table had an obvious natural composite primary key of StateCode, VendorId, EntityId. In the Cosmos DB scenario, we chose the Partition Key to be StateCode-Vendor-EntityType with an Item Id of EntityId. This allows all the entities of a specific type to be queried within a single partition (saving RUs) while keeping queries within that partition very simple (e.g., homogeneous entities). You end up using all parts of the composite natural key this way, while still getting actual partitioning of entities.
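As a rough illustration of that scheme, the sketch below builds the partition key and item id from the parts of the natural key. The `EntityKey` class, the `to_document` helper, the `partitionKey` property name, and the sample values are all hypothetical; only the StateCode-Vendor-EntityType layout with EntityId as the item id comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityKey:
    """Natural composite key: state, vendor, entity type, entity id."""
    state_code: str
    vendor_id: str
    entity_type: str
    entity_id: str

    @property
    def partition_key(self) -> str:
        # StateCode-Vendor-EntityType groups same-type entities together.
        return f"{self.state_code}-{self.vendor_id}-{self.entity_type}"

    @property
    def item_id(self) -> str:
        # EntityId alone identifies the item within its partition.
        return self.entity_id


def to_document(key: EntityKey, body: dict) -> dict:
    """Attach id and partition key fields to a document body."""
    return {**body, "id": key.item_id, "partitionKey": key.partition_key}


key = EntityKey("WA", "V42", "SalesOrder", "SO-1001")
document = to_document(key, {"total": 10})
```

With these made-up values, every sales order for vendor V42 in WA shares the partition key `WA-V42-SalesOrder`, and the item id stays the plain entity id.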

In more complicated scenarios, where we want to query across entities for a given vendor, we can remove EntityType from the partition key and either move it into the item id or use it to filter the objects being searched. This allows cross-entity querying within a partition, but the query itself is slightly more complicated because the entities are heterogeneous.
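A small in-memory sketch of that variant, under assumed names: the partition key drops EntityType, the item id carries it as a prefix, and an `entityType` field is used for filtering. The field names, `in_partition`, and the sample documents are hypothetical, and a plain list stands in for the container.

```python
from typing import Optional

# Partition key is StateCode-Vendor only; EntityType moves into the item id.
docs = [
    {"id": "SalesOrder-SO-1001", "partitionKey": "WA-V42", "entityType": "SalesOrder"},
    {"id": "Customer-C-7", "partitionKey": "WA-V42", "entityType": "Customer"},
    {"id": "Widget-W-3", "partitionKey": "WA-V42", "entityType": "Widget"},
    {"id": "Customer-C-9", "partitionKey": "OR-V42", "entityType": "Customer"},
]


def in_partition(partition_key: str, entity_type: Optional[str] = None) -> list[dict]:
    """Return documents in one partition, optionally filtered by entity type."""
    return [
        d
        for d in docs
        if d["partitionKey"] == partition_key
        and (entity_type is None or d["entityType"] == entity_type)
    ]


everything_for_vendor = in_partition("WA-V42")
customers_only = in_partition("WA-V42", entity_type="Customer")
```

The unfiltered call returns a mix of entity types from the same partition, which is where the extra query complexity comes from; the filtered call narrows it back down to one type.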

If the entire ID of the entity is in the Partition Key, then you pretty much have to always look up the item individually or search every partition when not looking up by ID - at which point who cares how evenly your data is distributed across partitions if you have to search them all anyway.

Perhaps the OP can describe more about the entities - do they have natural composite key candidates (regardless of whether they're being used or not in SQL implementation)? If not, what does the current persistence layer look like in terms of identifying items in the system by some id?

MPavlak