7

I got a 16GB collection with 2 partitions. When I query a document by it's Id, it is very slow. But querying by an indexed field is fast. Both are cross-partition query and if I pass partition key with the query it is fast but partition key is not always available for my query. Got similar results using .NET SDK and Document Explorer Query in Azure Portal.

The collection has custom indexing policy but as far as I know you don't need to index Id or it may not be even possible.

Here are my queries and their corresponding request charges.

SELECT * FROM c where c.id = 'unique-id-123'
-- Request Charge: 344940.79 RUs, Document Count: 1

SELECT * FROM c WHERE c.otherId = 'NOT-so-uniqueId-123'
-- Request Charge: 5.08 RUs, Document Count: 3

As you know, Id is unique so the query returns 1 document while the second query is filtered by otherId which is not so unique and returns 3 documents. Also notice the insanely high RUs consumption with the first query.

So why the 2nd query is faster than by Id?


Update:
Here are the gathered metrics for the above queries.

Query By Id:

Read 1 records in 1497 ms, 339173.109 RU, Size: 6873022 KB
QueryPreparationTime(ms): CompileTime = 2, LogicalBuildTime = 0,
     PhysicalPlanBuildTime = 0, OptimizationTime = 0
QueryEngineTime(ms): DocumentLoadTime = 1126, IndexLookupTime = 0,
     RuntimeExecutionTimes = 356, WriteOutputTime = 0

Query by indexed field:

Read 4 records in 2 ms, 7.56 RU, Size: 9 KB
QueryPreparationTime(ms): CompileTime = 0, LogicalBuildTime = 0, 
     PhysicalPlanBuildTime = 0, OptimizationTime = 0
QueryEngineTime(ms): DocumentLoadTime = 0, IndexLookupTime = 1, 
     RuntimeExecutionTimes = 0, WriteOutputTime = 0

These proves that query by Id is doing table scan as most of the time spent was from DocumentLoadTime and no value for IndexLookupTime.
But I thought Id should be the primary key and indexed by default as per this answer by @andrew-liu.

Mark Trinidad
  • 281
  • 4
  • 13
  • In a partitioned collection, primary key lookups (fast) can only be performed if the id AND partition key are provided. If id is provided without partition key, then the query reverts to full scan hence why it is slow and is more expensive. Are you able to provide a set of potential partition keys to search from? eg. "SELECT * FROM c where c.id = 'id' AND c.pk IN ('pk1', 'pk2', ...)" – Oliver Towers Apr 06 '18 at 22:40
  • 2
    Thanks for the reply @OliverTowers. Unfortunately, it'll be hard to predict the partition keys. The query is for an API that searches by Id. I've added optional parameter for providing the partition key but not all the expected clients would be able to provide that. It is just feels really weird as searching by an indexed field (not the Id) is fast even with cross partition searches. – Mark Trinidad Apr 08 '18 at 22:59

2 Answers2

8

Microsoft support responded and they've resolved the issue. They've added IndexVersion 2 for the collection. Unfortunately, it is not yet available from the portal and newly created accounts/collection are still not using the new version. You'll have to contact Microsoft Support to made changes to your accounts.

Here are the new results from a collection with index version 2 and there's a massive improvement.

SELECT * FROM c where c.id = 'uniqueValue'
-- Index Version 1: Request Charge: 344,940.79 RUs
-- Index Version 2: Request Charge: 3.31 RUs

SELECT * FROM c WHERE c.indexedField = 'value' AND c.id = 'uniqueValue'
-- Index Version 1: Request Charge: 150,666.22 RUs 
-- Index Version 2: Request Charge: 5.65 RUs
Mark Trinidad
  • 281
  • 4
  • 13
  • 3
    Thanks! Could you also share the detailed response from Microsoft Support? Have they mentioned any release date for the official fix? – driAn Jul 16 '18 at 20:22
0

The "Id" field is only unique within a partition key. This is probably party of why your query is so costly, if you have manually configured indexing.

Unfortunately, it is not possible to control the indexing of the 'id' field. You could try checking if the query performance improves if you index everything. It would be interesting if it changed anything for your data, altough it changed nothing for my small sample set.

The specified path '/id/?' could not be accepted because it overrides system property 'id'.

In my experience, DocumentDB queries actually can get cheaper if you have a couple of results in each partition. They can be very costly if there no results within a partition. Try putting a second document with the same id in a different partition and look how the performance changes. Without the cross-partition-query the responses are always extremely fast when querying with an indexed field, regardless of the result count.

I never investigated more since it never bothered me for real use cases. It might also be that the amount of items per partition has no real impact and my data itself is responsible.

Alex AIT
  • 17,361
  • 3
  • 36
  • 73
  • Thank you for your answer. I've tried adding a new document with the same Id in different partition key but it is still very slow. It even doubled the RUs consumption to `690955.94 RUs`. I'll try to experiment with the default indexing policy. – Mark Trinidad Apr 08 '18 at 23:22