16

Is there a direct function to count distinct elements in a CosmosDb query?

This is the default count:

SELECT value count(c.id) FROM c

And distinct works without count:

SELECT distinct c.id FROM c

But this returns a Bad Request - Syntax Error: SELECT value count(distinct c.id) FROM c

How would count and distinct work together?

Ovi
  • 2,620
  • 7
  • 34
  • 51

7 Answers7

11

[Update 19 Nov 2020]

Here is another query that solves the district issue and works for count. Basically, you need to encapsulate the distinct and then count. We have tested it with paging for cases that you want the unique records and not just the count and it's working as well.

select value count(1) from c join (select distinct value c from p in c.products)

You can also use where clause inside and outside of the bracket depending on what your condition is based on.

This is also mentioned slightly differently in another answer here.

Check the select clause documentation for CosmosDB.

@ssmsexe brought this to my attention and I wanted to update the answer here.

[Original Answer]

Support for distinct has been added on 19th Oct 2018

The following query works just fine

SELECT distinct value c FROM c join p in c.products

However, it still doesn't work for count.

The workaround for counting distinct is to create a stored procedure to perform the distinct count. It will basically query and continue until the end and return the count.

If you pass a distinct query like above to the stored procedure below you will get a distinct count

function count(queryCommand) {
  var response = getContext().getResponse();
  var collection = getContext().getCollection();
  var count = 0;

  query(queryCommand);

  function query(queryCommand, continuation){
    var requestOptions = { continuation: continuation };
    var isAccepted = collection.queryDocuments(
        collection.getSelfLink(),
        queryCommand,
        requestOptions,
        function (err, feed, responseOptions) {
            if (err) {
                throw err;
            }

            //  Scan results
            if (feed) {
                count+=feed.length;
            }

            if (responseOptions.continuation) {
                //  Continue the query
                query(queryCommand, responseOptions.continuation)
            } else {
                //  Return the count in the response
                response.setBody(count);
            }
        });
    if (!isAccepted) throw new Error('The query was not accepted by the server.');
  }
}

The issue with that workaround is that it can potentially cross the RU limit on your collection and be unsuccessful. If that's the case you can implement a similar code on the server side which is not that great.

Aboo
  • 2,314
  • 1
  • 18
  • 23
9

How about SELECT COUNT(1) FROM (SELECT distinct c.id FROM c) AS t;? – Evaldas Buinauskas May 30 '18 at 14:44

On 15 May 2019, The comment above is working with Where condition, I didn't try with a Join but the request does return the answer I'm looking for.

And it is working with the 100 elements limitation in CosmosDB.

If I make an example with Product it should be : SELECT COUNT(1) FROM (SELECT DISTINCT c.Id FROM c WHERE c.Brand = 'Coca')

Azutanguy
  • 141
  • 1
  • 6
  • Works perfectly. Thanks :) – mohdnaveed Jul 12 '19 at 05:32
  • 5
    This does not always work perfectly. I have found it can appear to work but then give incorrect results. This is a known issue with it to the product team – Martin Smith Oct 11 '19 at 07:13
  • Yup, it's still not implemented by azure cosmos db team! known issue it is. – Arulmouzhi Jul 28 '20 at 14:10
  • Answer from Visakh lower works correctly https://stackoverflow.com/a/63185726/957386 – Tiny Oct 08 '20 at 23:08
  • I have a similar problem. I can use this query almost as-is. When I run the inner subquery without "distinct", I get 96 items as result. When I run it with "distinct", I get 88. That is as expected. But when I run the full query, it returns 96. Even if I am using "distinct" in the inner subquery. So it seems that the outer COUNT will actually show "Retrieved document count" instead of the "Output document count". That's what I can understand after looking at the Query Statistics. I'm querying in the Azure Portal. – Mats Magnem Feb 19 '21 at 21:22
  • This query will work wrong when it's cross-partition. It will calculate the sum of distinct counts over the partitions instead. – gukoff Jul 17 '23 at 07:51
8

I know this is an old thread.

However, just to keep the topic updated, currently (Jul 2020) you are able to do SELECT DISTINCT over Cosmos DB table. However directly applying COUNT(DISTINCT..) doesnt give correct results. Hence, you need to apply a workaround as below using a subquery based approach to get the correct distinct count results

SELECT COUNT(UniqueIDValues) AS UniqueCount
FROM (SELECT Id FROM c GROUP BY Id) AS UniqueIDValues
Visakh
  • 81
  • 1
  • 1
  • 1
    If one prefers DISTINCT keyword to GROUP BY, I found this one works a well: SELECT COUNT(UniqueIDValues) AS UniqueCount FROM (SELECT DISTINCT Id FROM c) AS UniqueIDValues – Tiny Oct 08 '20 at 23:06
  • @Tiny this query will work wrong when cross-partition: https://github.com/Azure/azure-cosmos-dotnet-v3/issues/3986 – gukoff Jul 19 '23 at 11:01
4

I did some investigation and found solution for it. In order to get count of distinct results you can not use count(1). You need to "wrap" subquery with AS subqueryName and then use count(subqueryName) like below:

select count(subqueryName) from (SELECT distinct r.x FROM r) as subqueryName

Cheers!

gorrch
  • 521
  • 3
  • 16
  • This query will work wrong when cross-partition: github.com/Azure/azure-cosmos-dotnet-v3/issues/3986 – gukoff Jul 19 '23 at 11:02
2

As I know, for now, Cosmos DB does not support nested queries.

The only way to do what u want is to return all distinct ids as a query result and then do count on them. You can either do in directly in code or with help of stored procedure(that should be more efficient on big number of docs).

Olha Shumeliuk
  • 720
  • 7
  • 14
1

To count distinct elements you have to use COUNT and GROUP BY together. You don't need need subqueries, it works in a very simple query like this example where we want to list all the family unique last names in our container and the count of families having the same name:

select count(1) as numfam, f.lastName from f group by f.lastName

result:

[
    {
        "numfam": 1
    },
    {
        "numfam": 1,
        "lastName": "Wakefield"
    },
    {
        "numfam": 2,
        "lastName": "Andersen"
    }
]

Notice I have one item in my collection without lastName, freedom of being schemaless.

unfortunately today is not yet possible to add an "order by" clause to the query to sort for example the most common names in descending order. The cosmos team declared is working on it though, so this feature is expected at some point. You can always sort the result in your client code.

If you want to know the count for a specific name you can use this query (you can use a parametrized query to input the name in one place) :

select "Andersen" as lastName, count(1) as numfam from f  
where  f.lastName = "Andersen"

result:

[
    {
        "lastName": "Andersen",
        "numfam": 2
    }
]
Anton M
  • 798
  • 7
  • 17
0

Azure cosmos DB doesn't support the distinct keyword yet as part of the SQL API. The best way to achieve this is by using the stored procedure with custom code. Please find more details regarding the custom stored proc here.

It seems the distinct keyword is underdevelopment.
Please find the reference link here.

cosmos DB supports most of the aggregate functions, please see the list of of supported aggregate functions here.

Please find the more details in the following link.

Ravikumar B
  • 779
  • 1
  • 14
  • 25
  • Thanks Ravi, nice repo of links. I've gone through most of them before posting though so it's not that helpfull right now. I'm sure that as CosmosDb evolves there will be an acceptable answer to this. – Ovi Jun 29 '18 at 07:46
  • Its my pleasure :) – Ravikumar B Jun 29 '18 at 08:47
  • I am also trying to use count with distinct... I am using a distinct query to get a paged list of results, then want to get a count of the same list to get the "Total Count" It seems that we will need to list everything in the query into memory before getting both the page that we need and the total count.. this seems very inefficient? Is this the case? – Mark Redman Jul 01 '18 at 17:40