29

In the scenario where we have 1000 entries (unique keys) entering Cosmos DB per minute, is it safe to use /id as the partition key?

In particular, there is the concept of logical partitions (https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data). The graphic there scares me a little, showing that logical partitions are actual entities (e.g. "city": "London"). If I have an 8-hour TTL and 1000 entries per minute, I don't necessarily want 480,000 logical partitions that Cosmos DB needs to manage.
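For reference, the 480,000 figure is the steady-state number of live items (and so of distinct /id values) once the TTL window has filled. A quick check of that arithmetic, assuming a constant write rate and that every item expires exactly at its TTL; `live_items` is just a name made up for this sketch:

```python
# Steady-state number of live items when every item has a unique key.
WRITES_PER_MINUTE = 1000  # from the question
TTL_HOURS = 8             # from the question


def live_items(writes_per_minute: int, ttl_hours: int) -> int:
    """Items alive at any moment once the TTL window has filled."""
    return writes_per_minute * 60 * ttl_hours


print(live_items(WRITES_PER_MINUTE, TTL_HOURS))  # 480000
```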

What I imagine happens is that the value of the partition key is simply hashed and taken modulo the number of physical partitions; the "Logical Partition Management" section of https://learn.microsoft.com/en-us/azure/cosmos-db/partitioning-overview#choose-partitionkey indicates that this is true. Furthermore, the "Choosing a Partition Key" section suggests (but does not actually state) that /id would be a fantastic partition key: it doesn't have to worry about the 10GB limit or the throughput limit, it has no hot spots and a wide (huge) range of values, and since the application doesn't need to filter on anything except the id, cross-partition queries won't be an issue for this use case.
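To make that mental model concrete, here is a minimal sketch of "hash the key, then pick a physical partition". It is only an illustration of the idea as I imagine it: SHA-256 and modulo are stand-ins I chose, not Cosmos DB's actual hash function or its internal assignment scheme, and `NUM_PHYSICAL_PARTITIONS`, `physical_partition_for` and `spread` are hypothetical names.

```python
import hashlib
import uuid
from collections import Counter

NUM_PHYSICAL_PARTITIONS = 4  # hypothetical; the service decides the real number


def physical_partition_for(partition_key_value: str,
                           num_partitions: int = NUM_PHYSICAL_PARTITIONS) -> int:
    """Map a partition key value to a physical partition index.

    Stand-in logic only: SHA-256 plus modulo, NOT Cosmos DB's real hashing.
    """
    digest = hashlib.sha256(partition_key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def spread(keys, num_partitions: int = NUM_PHYSICAL_PARTITIONS) -> Counter:
    """Count how many key values land on each physical partition."""
    return Counter(physical_partition_for(k, num_partitions) for k in keys)


if __name__ == "__main__":
    # Many unique ids, a handful of physical partitions.
    keys = [str(uuid.uuid4()) for _ in range(100_000)]
    print(spread(keys))
```

Under this model the number of distinct key values costs nothing by itself: each value is just an input to the hash, and only the few physical partitions are "real" entities.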

In summary, do I need to worry about the memory/CPU/etc. overhead of hundreds of thousands of partition key values (logical partitions)? The docs indicate that more partition key values are better, but don't say whether it's possible to have too many.

David Makogon
user2770791
  • 5
    While there's no "worry" per se, regarding storage when using `/id` as partition key (given that max logical partition size is 10GB), keep in mind that if you are searching for documents based on a property other than `id`, and `id` is partition key, you will be forced to do a cross-partition query (in other words, the query will be applied to every single logical partition). This is due to a query being scoped to a single partition. If you only retrieve documents by `id` then this isn't an issue. Just think about that when it comes to query performance. – David Makogon Feb 11 '19 at 20:26
  • 1
    If you do query on anything other than `id`: Before committing to `id` as partition key, it might be worth benchmarking to see what your RU costs will be, when querying against a property other than `id` (you'll need to enable cross-partition query in the query options). You might find that a different partition key suits your query use-cases better. – David Makogon Feb 11 '19 at 20:34
  • Currently yes, it will just be against the ID. Regarding your first comment, are you saying the cross-partition query is run against every logical partition (a huge number), as opposed to every physical partition (just a few)? That is somewhat concerning. – user2770791 Feb 12 '19 at 21:44

2 Answers

32

I am from the Cosmos DB engineering team.

You don't have to worry about the number of logical partition keys that are created on a Cosmos DB collection/container. As long as the partition key is an appropriate choice for your writes (subject to a per-logical partition key cap of 10GB) and queries, you should be good.
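For the id-only access pattern in the question, a minimal sketch of what reads and writes could look like. It is written against the shape of the `azure-cosmos` Python SDK's container client (its `upsert_item` and `read_item` methods); the helper names `save_entry` and `load_entry` are hypothetical, and the container is passed in so the sketch does not depend on any particular account or setup.

```python
from typing import Any, Dict


def save_entry(container: Any, entry_id: str, payload: Dict[str, Any]) -> None:
    """Write one entry; with a /id partition key, the id is also the partition key value."""
    container.upsert_item({"id": entry_id, **payload})


def load_entry(container: Any, entry_id: str) -> Dict[str, Any]:
    """Point read: the item id and the partition key value are the same string."""
    return container.read_item(item=entry_id, partition_key=entry_id)
```

This assumes the container was created with `/id` as its partition key path and, for the 8-hour expiry, a default TTL of 28,800 seconds. Because every lookup supplies both the id and the partition key value, no cross-partition query is involved.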

Krishnan Sundaram
    If they use `id` as partition key, they'll need to worry if they are querying data on a property other than `id`, since they'd be forced to do a cross-partition query. – David Makogon Feb 11 '19 at 20:28
  • 5
    Agreed, @DavidMakogon, which is why I've indicated that the partition key would have to be an appropriate choice for both writes and queries. – Krishnan Sundaram Feb 13 '19 at 05:38
  • 5
    Just to be clear, if a `Customer` table - which has an `Id` UUID property - is mostly queried by `emailAddress`, then it would be a good idea to use `/emailAddress` as the partition key? – Aaron Newton Mar 16 '19 at 11:36
  • 1
    I should add that I will need to be able to query by `Id` sometimes. There's a decent chance I will already know the `emailAddress` at that point, e.g. for a logged-in user where I want to get some additional details. I would make `emailAddress` the `Id`, but there's a reasonable chance a user will need to change `emailAddress`. It seems like a non-trivial issue. Will Cosmos DB's auto-indexing come to my rescue here? – Aaron Newton Mar 17 '19 at 04:59
  • Hi Krishnan, I am having issues integrating my current deployment, specifically because of partitioning. I would like to enable cross-partition queries, but I think I cannot since I am using the MongoDB driver for Java. The Cosmos DB SDKs, from what I've seen, use the Apache HTTP client and specify the HTTP header. Is there any way I can enable cross-partition queries from the mongo shell? – Roman Gherta Aug 19 '19 at 14:50
7

Implications are:

  1. best cardinality
  2. easy, fast and cheap document reads
  3. no transactions across documents, as the transaction scope is the partition key
  4. queries by anything other than id will be cross-partition
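A toy in-memory model can make points 2 to 4 tangible. This is purely an illustration, not how Cosmos DB is implemented: `ToyContainer` and its methods are hypothetical names, and the "partitions touched" count is over logical partitions for simplicity, with no claim about how the real service fans a query out.

```python
from typing import Any, Dict, List, Tuple


class ToyContainer:
    """In-memory stand-in for a container partitioned on /id (illustration only)."""

    def __init__(self) -> None:
        # One logical partition per id, each holding exactly one document.
        self.partitions: Dict[str, Dict[str, Any]] = {}

    def upsert(self, doc: Dict[str, Any]) -> None:
        self.partitions[doc["id"]] = doc

    def read_by_id(self, doc_id: str) -> Tuple[Dict[str, Any], int]:
        """Return the document and how many logical partitions were touched."""
        return self.partitions[doc_id], 1

    def query_by(self, field: str, value: Any) -> Tuple[List[Dict[str, Any]], int]:
        """Filter on a non-key property: no single partition can be targeted."""
        hits = [d for d in self.partitions.values() if d.get(field) == value]
        return hits, len(self.partitions)

    def transactional_upsert(self, docs: List[Dict[str, Any]]) -> None:
        """All-or-nothing write, allowed only within one partition key value."""
        if len({d["id"] for d in docs}) > 1:
            raise ValueError("transaction spans more than one partition key value")
        for d in docs:
            self.upsert(d)
```

With this model, `read_by_id` always touches one partition, `query_by("city", "London")` has to consider all of them, and `transactional_upsert` with two different ids is rejected, which mirrors the trade-offs in the list.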

PS. I can hardly imagine a case where nothing but reads/queries by id is needed, except maybe document caching (combined with TTL).

dee zg