
I have about 100 thousand records in Cosmos DB. I want to get the distinct records by some property. I am using a stored procedure to achieve this, and I set the page size to -1 to get the maximum number of records. When I run a query without DISTINCT, I get about 19 thousand records. When I run the DISTINCT query, it does return distinct records, but DISTINCT is applied only within those same 19 thousand records instead of across the entire 100 thousand records.

Below are the queries I used:

SELECT r.[[FieldName]] FROM r -> returns about 19,000 records, with duplicates

SELECT DISTINCT r.[[FieldName]] FROM r -> returns distinct records (about 5,000), but they are distinct only within the 19,000 records above, not within the full 100 thousand records

ewramner
Naveen Prasath
  • Selecting 100k records without Cosmos' pagination logic counts as insane. The RU/s should hit the limit, and the query should return whatever it was able to retrieve before it was capped. – Nick Chapsas Jul 09 '18 at 17:31
  • I think you need to figure out why only 19,000 of the 100,000 records are retrieved without any filters, right? – Jay Gong Jul 10 '18 at 06:47
  • No @Jay Gong, I think DocumentDB returns only about 4 MB of data per response, regardless of the record count. But when we run a DISTINCT query, it should return distinct records against the entire collection (100K+ records). In my case it applies DISTINCT only against about 4 MB of data, i.e. 19K records. – Naveen Prasath Jul 10 '18 at 07:07
  • Hi Nick Chapsas, I don't want to select 100K records in a single query. I am using pagination. But applying DISTINCT within a page is wrong. For example, I may also have data starting with 'A' in the last page. If DISTINCT is applied within the page, the result is wrong; it should be applied across the entire 100K+ records. Am I right? – Naveen Prasath Jul 10 '18 at 07:10
  • @NaveenPrasath, when you set maxItemCount on a document feed request, it is the maximum number of items that can be retrieved from CosmosDB in one response. It is not guaranteed that exactly this number will be retrieved. You need to continue querying using the continuation token. – Olha Shumeliuk Jul 10 '18 at 08:11
  • Didn't you say that "sets the page size to -1 to get the maximum records"? How are you using pagination then? – Nick Chapsas Jul 10 '18 at 09:34
  • @OlgaShumeliuk Agreed. I have to continue querying records using the continuation token. But when I run a DISTINCT query, DISTINCT should be applied to the entire document collection. For example, I have 2 records starting with A, one at the top and the other at the very end. When I apply DISTINCT, I get 1 record in the first page and then get the second record again in my last page. This is how Cosmos DB works. But that is wrong, right? – Naveen Prasath Aug 06 '18 at 15:00
  • @NickChapsas I don't think setting the page size to -1 will affect the pagination. In Cosmos DB, pagination is based on the continuation token, right? – Naveen Prasath Aug 06 '18 at 15:01
  • @NaveenPrasath Yeah but the continuation token is used because of the pagination. If you say that you want your page to have unlimited items then you should get everything in one page at least for that partition. – Nick Chapsas Aug 06 '18 at 15:27
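
The behaviour discussed in the comments above (DISTINCT applied per page, with more pages reachable through a continuation token) can be illustrated with a small sketch. This is not Cosmos DB code: `make_fake_fetcher` and `fetch_page` are hypothetical stand-ins that simulate a paged query returning per-page distinct values, and the token format is invented for the illustration. It only shows the general workaround of following the continuation token to the end and de-duplicating on the caller's side.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page shape: (values on this page, continuation token or None).
Page = Tuple[List[str], Optional[str]]
Fetcher = Callable[[Optional[str]], Page]


def make_fake_fetcher(values: List[str], page_size: int) -> Fetcher:
    """Simulated paged query; NOT a Cosmos DB API.

    Each call returns one page with duplicates removed within that page only,
    plus a token pointing at the next page (None when there is no next page).
    """
    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token is not None else 0
        end = start + page_size
        # dict.fromkeys keeps first occurrences in order: per-page distinct.
        page_distinct = list(dict.fromkeys(values[start:end]))
        next_token = str(end) if end < len(values) else None
        return page_distinct, next_token

    return fetch_page


def iter_pages(fetch_page: Fetcher) -> Iterator[List[str]]:
    """Follow the continuation token until the source reports no more pages."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        yield items
        if token is None:
            break


def distinct_across_pages(fetch_page: Fetcher) -> List[str]:
    """Merge every page and keep each value once, in first-seen order."""
    seen = set()
    result: List[str] = []
    for items in iter_pages(fetch_page):
        for value in items:
            if value not in seen:
                seen.add(value)
                result.append(value)
    return result


# 'A' appears on the first page and again on later pages.
values = ["A", "B", "A", "C", "B", "A"]
fetch = make_fake_fetcher(values, page_size=2)

first_page, _ = fetch(None)                  # distinct within page 1 only
all_distinct = distinct_across_pages(fetch)  # distinct over all pages
```

In this simulation the first page alone contains only the values seen in its own slice, while the merged result covers every page. The cost is that the caller has to hold the set of values seen so far in memory.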

0 Answers