Why is getting COUNT of results so expensive?

Question

If I run a query, such as this one below:

SELECT 
DISTINCT c.vin,
c.year-model-trim, 
c.make
FROM c
WHERE c.make= @make    // make is partition key
AND c.manufactured > '2020-01-01'

I'll get 25 documents back, with query stats resembling these:

Request Charge 5.79 RUs
Retrieved document count 70
Retrieved document size 25736 bytes
Output document count 26
Output document size 1423 bytes
Index hit document count 25.9
Index lookup time 0.6 ms
Document load time 0.5900000000000001 ms
Query engine execution time 0.26 ms
System function execution time 0 ms
User defined function execution time 0 ms
Document write time 0.01 ms
Round Trips 1

However, if I want to get only the COUNT results of that query, I can nest the above query in a count query:

SELECT COUNT(m.vin)
FROM
(
    SELECT 
    DISTINCT c.vin,
    c.year-model-trim, 
    c.make
    FROM c
    WHERE c.make= @make    // make is partition key
    AND c.manufactured > '2020-01-01'
)m

And my RU's go from ~5 to ~150.

Request Charge 147.79 RUs
Showing Results 1 - 1
Retrieved document count 0
Retrieved document size 0 bytes
Output document count 1
Output document size 79 bytes
Index hit document count 0
Index lookup time 124.97000000000001 ms
Document load time 0 ms

Why is getting the COUNT of a result set in Cosmos so expensive in terms of RU's used?

Did you intend to check for the literal `'@date'` in `AND c.manufactured > '@date'` instead of the parameter `@date`? — Andrew Morton, Feb 14 '20 at 22:22
@AndrewMorton Neither `COUNT(*)` nor `COUNT(DISTINCT ...)` are Cosmos-supported syntax at present. — David Rubin, Feb 14 '20 at 22:23
@AndrewMorton COUNT(1) is inexpensive in the abstract, but as the OP states, a "simple" count of the results of the cheap subquery turns out to be very expensive and he's trying to understand why. — David Rubin, Feb 14 '20 at 22:30
@DavidRubin I thought that maybe the greater index lookup time of the expensive query was something to do with the `COUNT(m.vin)` apparently contributing to the cost, and thought that perhaps the `m.vin` instead of `*` could be significant. Would it make a difference if `vin` was `NOT NULL`, or is that not available in Cosmos? — Andrew Morton, Feb 14 '20 at 22:34
Cosmos DB is schemaless, and therefore doesn't have a notion of non-nullable properties. — David Rubin, Feb 14 '20 at 22:42
@AndrewMorton updated the 'date' param to an actual date for clarity. — todon2, Feb 15 '20 at 19:12
Hi @todon2. We actually do not support doing an aggregate over DISTINCT. This query should have returned an exception so this is a bug what you are seeing. We're working on making updates and will update our docs to explain this better. Will post this and the link to that as an answer when we get that published. Our apologies for the confusion. — Mark Brown, Feb 18 '20 at 17:59
@MarkBrown can you elaborate? The OP's query doesn't look like an aggregate over DISTINCT, it looks like counting the result-set of a subquery. Why should that be an exception? — David Rubin, Feb 18 '20 at 20:41
Same difference. Doing a count or aggregate, etc. There are multiple reasons why we don't support this. In this case here, the query has multiple round-trips and we don't support a count over these. --Thanks. — Mark Brown, Feb 18 '20 at 21:32

Why is getting COUNT of results so expensive?

0 Answers0