I have a question about DocumentDB partition key choise.
I have data with UserId, DeviceId and WhateverId. UserId parameter will be in queries always, so I have chosen UserId as a partition key. But I have a lot of data for one user (millions of entities) and when I made a quety like "SELECT * FROM c WHERE c.DeviceId = @DeviceId"
with partition key specified it takes a lot of time(about 6 minutes for about 220 000 returned entities).
Maybe it would be more efficient to choose for example DeviceId as a partition key and make queries against a few partitions in parallel
(specifying EnableCrossPartitionQuery = true and MaxDegreeOfParallelism = partition count)?
Or maybe it is a good idea to use separate collection for every user?

- 976
- 1
- 11
- 20
-
Not that this answers your question, but... I think any time you're trying to retrieve a quarter-million entities, you might want to rethink your data access pattern. Also, "`SELECT *`" is yet another code-smell. I don't see how your choice of partition key is going to make a difference if you're trying to move that much data to your app tier. – David Makogon Nov 03 '16 at 12:25
-
Thanks. `SELECT *` was just for quick example, sorry. I'll use `SELECT c.Value`. And this question is just about choosing partition key, because information on azure documentation site is a little bit abstract as for me. All this measurements is just for performance comparison depending on query. – Paval Nov 03 '16 at 13:21
2 Answers
It might help a little but I don't think a partition for each user will solve your problem because you essentially have that under the covers.
You could experiment with the partition key to improve the parrallism but, at best that would give you 2x to 5x improvement in my experience. Is that enough?
For more dramatic improvements you usually have to resort to selective denormalization and/or caching.

- 9,393
- 3
- 27
- 43
-
I've changed partition key to DeviceId and tried to make query like `SELECT c.Value FROM c WHERE c.UserId = @userId and c.WhateverId = @WhateverId`. It took 4.6 for 19845 returned entities. That is ok. But when I tried to query with partition key like `SELECT c.Value FROM c WHERE c.UserId = @userId and c.DeviceId = @DeviceId` it took about 27 seconds for almost the same amount of returned entities. And it is not good because queries with DeviceId is more frequently used. I understand that it happened because when we specifying partition key there is no parallelism. Should I consider another pk – Paval Nov 03 '16 at 13:14
-
1The key is that you'll have to keep experimenting. Don't forget to include index tweaking in your experiments. The default index keys off of the first 3 bytes of the data. If that does not vary enough, you could have an index hot spot. – Larry Maccherone Nov 03 '16 at 13:46
-
You mean it will happen if I have a lot of keys start with the same characters, right. – Paval Nov 03 '16 at 15:33
-
I know this is a bit old, but for the benefit of others coming to this topic...
From your description I assume that the devices are mostly unique to the user. It is often advised to partition on something like userid which is good if you have, say a call centre application, with many queries for a given userid and want to look up no more than a few hundred entries. In such cases the data can be quickly extracted from a single partition without the overhead of having to collate data across partitions. However, if you have millions of records for the user then partitioning on User Id is perhaps the worst option as extracting large volumes of data from a single partition will soon exceed the overhead of collation. In such cases you want to distribute user data as evenly as possible over all partitions. Unless each user has 25+ devices with similar usage then Device Id is probably not a good choice either.
In cases such as yours, I generally find a system generated incrementing key (e.g. Event Id or Transaction Id) to be the best choice.

- 123
- 1
- 9