4

I am using Azure Cosmos DB SQL API to try to achieve the following;

We have device data stored within a collection and would love to retrieve the latest event data per device serial effectively without having to do N queries for each device separately.

SELECT * 
FROM c
WHERE c.serial IN ('V55555555','synap-aim-g1') ORDER BY c.EventEnqueuedUtcTime DESC

Im assuming I would need to use Group By - https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-group-by

Any assistance would be greatly appreciated

Rough example of data :

[
    {
        "temperature": 25.22063251827873,
        "humidity": 71.54208429695204,
        "serial": "V55555555",
        "testid": 1,
        "location": {
            "type": "Point",
            "coordinates": [
                30.843687,
                -29.789895
            ]
        },
        "EventProcessedUtcTime": "2020-09-07T12:04:34.5861918Z",
        "PartitionId": 0,
        "EventEnqueuedUtcTime": "2020-09-07T12:04:34.4700000Z",
        "IoTHub": {
            "MessageId": null,
            "CorrelationId": null,
            "ConnectionDeviceId": "V55555555",
            "ConnectionDeviceGenerationId": "637323979596346475",
            "EnqueuedTime": "2020-09-07T12:04:34.0000000"
        },
        "Name": "admin",
        "id": "6dac491e-1f28-450d-bf97-3a15a0efaad8",
        "_rid": "i2UhAI7ofAo3AQAAAAAAAA==",
        "_self": "dbs/i2UhAA==/colls/i2UhAI7ofAo=/docs/i2UhAI7ofAo3AQAAAAAAAA==/",
        "_etag": "\"430131c1-0000-0100-0000-5f5621d80000\"",
        "_attachments": "attachments/",
        "_ts": 1599480280
    }
]

UPDATE: So doing the following returns the correct data but sadly you can only return data thats inside your group by or an aggregate function (i.e. cant do select *)

SELECT c.serial, MAX(c.EventProcessedUtcTime)
FROM c
WHERE c.serial IN ('V55555555','synap-aim-g1') 
GROUP BY c.serial

[
    {
        "serial": "synap-aim-g1",
        "$1": "2020-09-09T06:29:42.6812629Z"
    },
    {
        "serial": "V55555555",
        "$1": "2020-09-07T12:04:34.5861918Z"
    }
]
David
  • 693
  • 7
  • 20
  • 1
    As per 'latest event data per device serial' you mean to say you can have multiple documents lets say for 'V55555555' with different 'EventProcessedUtcTime' and we need to select the latest one of it, and this has to be done for every device? – AnuragSharma-MSFT Sep 09 '20 at 09:23
  • That is correct @AnuragSharma-MSFT we will have infinite documents per device serial (partition). I would like to get the latest document (by IoT Hub time as you stated correctly) efficiently – David Sep 09 '20 at 09:59
  • 2
    I am afraid there is no direct way to achieve it using a query in cosmos db. However you can refer to below link for the same topic. If you are using any sdk, this would help in achieving the desired functionality: https://learn.microsoft.com/en-us/answers/questions/38454/index.html – AnuragSharma-MSFT Sep 09 '20 at 10:27
  • Ok cool, thought so. Thanks! – David Sep 09 '20 at 10:44
  • Hi @David, really thanks for AnuragSharma-MSFT's help! We're glad that you resolved it. The update should be posted as the answer, I just help you post it. You can accept it as answer. This can be beneficial to other community members. Thank you. – Leon Yue Sep 10 '20 at 06:01

2 Answers2

3

Thanks for @AnuragSharma-MSFT's help:

We're glad that you resolved it in this way, thanks for sharing the update:

So doing the following returns the correct data but sadly you can only return data thats inside your group by or an aggregate function (i.e. cant do select *)

SELECT c.serial, MAX(c.EventProcessedUtcTime)
FROM c
WHERE c.serial IN ('V55555555','synap-aim-g1') 
GROUP BY c.serial

[
    {
        "serial": "synap-aim-g1",
        "$1": "2020-09-09T06:29:42.6812629Z"
    },
    {
        "serial": "V55555555",
        "$1": "2020-09-07T12:04:34.5861918Z"
    }
]
Leon Yue
  • 15,693
  • 1
  • 11
  • 23
  • Perfect thanks, I will mark the answer. Just a question on doing this as parallel as possible. Would I just get away with like a parallel.foreach vibe? Is there a smarter (RU) way of doing these N queries (using serial.. the partition key and Id) in parallel? These queries will of course be lightweight but just would like to know if there is a clean/proper way of handling these N queries via cosmosDbClient.ReadDocumentsByQueryAsync() – David Sep 10 '20 at 08:38
2

If the question is really about an efficient approach to this particular query scenario, we can consider denormalization in cases where the query language itself doesn't offer an efficient solution. This guide on partitioning and modeling has a relevant section on getting the latest items in a feed.

We just need to get the 100 most recent posts, without the need to paginate through the entire data set.

So to optimize this last request, we introduce a third container to our design, entirely dedicated to serving this request. We denormalize our posts to that new feed container.

Following this approach, you could create a "Feed" or "LatestEvent" container dedicated to the "latest" query which uses the device serial as id and having a single partition key in order to guarantee that there is only one (the most recent) event item per device, and that it can be fetched by the device serial or listed with least possible cost using a simple query:

SELECT *
FROM c
WHERE c.serial IN ('V55555555','synap-aim-g1')

The change feed could be used to upsert the latest event, such that the latest event is created/overwritten in the "LatestEvent" container as its source item is created in the main.

Noah Stahl
  • 6,905
  • 5
  • 25
  • 36
  • Oh my word of course! Never thought of this! Thanks so much. Will implement this today. Superb! – David Sep 10 '20 at 17:42