How to index for missing values in cosmosDB?

Question

Let's assume I need to verify every document in a CosmosDB collection has AnImportantProperty set up (= property exists, may have explicit value null). Most of them do, but for "reasons", some of them may not.

I can include the new property to indexing policy, so I can easily find which documents are OK with an index-covered query:

select * from c where is_defined(c.AnImportantProperty)

But the opposite query (which is what I'm really interested in) does not seem to benefit from the index:

select * from c where NOT is_defined(c.AnImportantProperty)

Is there a way to write an index/query to find documents with missing property without a full scan?

EDIT: For example, I've heared some second-hand rumors about negated indexes and a mysterious "v2 index". Both may indicate there is (or will be) a solution to this scenario.

Imre Pühvel · Accepted Answer · 2020-12-21T10:14:55.850

Currently (July 2019) there does NOT seem to be a way to write such a query.

All hope is not lost though, as according to Azure Cosmos DB Team comment for the feature request it is in planned stage.

UPDATE: The feature is now completed as reported by https://devblogs.microsoft.com/cosmosdb/april-query-improvements/:

Queries with inequality filters or filters on undefined values can now be run more efficiently. Previously, these filters did not utilize the index. When executing a query, Azure Cosmos DB would first evaluate other less expensive filters (such as =, >, or <) in the query. If there were inequality filters or filters on undefined values remaining, the query engine would be required to load each of these documents. Since inequality filters and filters on undefined values now utilize the index, we can avoid loading these documents and see a significant improvement in RU charge.

Here’s a full list of query filters with improvements:

Inequality comparison expression (e.g. c.age != 4)

NOT IN expression (e.g. c.name NOT IN (‘Luis’, ‘Andrew’, ‘Deborah’))

NOT IsDefined

Is expressions (e.g. NOT IsDefined(c.age), NOT IsString(c.name))

Coalesce operator expression (e.g. (c.name ?? ‘N/A’) = ‘Thomas’)

Ternary operator expression (e.g. c.name = null ? ‘N/A’ : c.name)

If you have queries with these filters, you should add an index for the relevant properties.

score 0 · Answer 2 · answered Jul 16 '19 at 09:17

0

hardly. i cannot think of any since you are effectively listing all except known value. but, i would suggest a bit different approach.

introduce type property on all of your documents
classes/objects that differ in some property can (i won’t say should although i mean it) be considered of different type (even if one just inherits other)
store those without property as type=“someType” and those with type=“someOtherType"
query for those of according type you need
if its not enough, introduce subtype

i’d probably try something like that. anything that avoids scanning too much.

answered Jul 16 '19 at 09:17

dee zg

13,793
10
42
82

It could be worked around like this, but it is indirect knowledge as type property and presence of property may be itself out of sync. Verifying it's correctness is a full scan. I've added rumor links to OP why I hope there could be a better, automatic index-level solution. – Imre Pühvel Jul 17 '19 at 07:21
@ImrePühvel i would argue that that depends exclusively on how you model your domain/app. if your property is required in your model then there cannot be any discrepancy. the question is really where is it better to check for property (and act on it): upon creation or at query time? i would have no doubt enforcing it at creation time since that gives predictible query performance/cost and i would argue scales better. even if they come with such index, i doubt i would rely on it for this use case. just my 2c – dee zg Jul 17 '19 at 08:47

How to index for missing values in cosmosDB?

2 Answers2