
I have a single collection (observation) in a DocumentDB cluster. The collection was recently purged, but it often exceeds 200 GB of documents. Below is an example of a document.

Example document

{
    "_id" : ObjectId("5edfe9eb8b9b6d37ffc2b9ec"),
    "deviceId" : "5e86371746e0fb0001cbbf9b",
    "data" : {
        "type" : "HUMIDITY",
        "reading" : 20.1,
        "units" : "rh"
    },
    "timestamp" : ISODate("2020-06-10T05:00:00.000Z")
}

The poorly performing query is:

db.observation.aggregate([
    {
        "$match": {
            "deviceId": req.params.deviceId,
            "timestamp": { "$gte": new Date(req.params.timestamp), "$lt": new Date() }
        }
    },
    {
        "$group": {
            // Bucket timestamps into aggMins-minute intervals:
            // (ms since epoch) minus (ms since epoch mod bucket size), converted back to a date.
            "_id": {
                "$add": [
                    { "$subtract": [
                        { "$subtract": [ "$timestamp", new Date(0) ] },
                        { "$mod": [ { "$subtract": [ "$timestamp", new Date(0) ] }, 1000 * 60 * aggMins ] }
                    ] },
                    new Date(0)
                ]
            },
            "timestamp": { "$first": "$timestamp" },
            "units": { "$first": "$data.units" },
            "avg": { "$avg": "$data.reading" }
        }
    },
    {
        "$project": {
            "_id": 0,
            "timestamp": 1,
            "avg": 1,
            "units": 1
        }
    }
])

Running explain() on the query results in:

{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "observation",
        "winningPlan" : {
            "stage" : "HASH_AGGREGATE",
            "inputStage" : {
                "stage" : "IXSCAN",
                "indexName" : "deviceId_1",
                "direction" : "forward"
            }
        }
    },
    "executionStats" : {
        "executionSuccess" : true,
        "executionTimeMillis" : "13092.126",
        "planningTimeMillis" : "8.470",
        "executionStages" : {
            "stage" : "HASH_AGGREGATE",
            "nReturned" : "1",
            "executionTimeMillisEstimate" : "13083.523",
            "inputStage" : {
                "stage" : "IXSCAN",
                "nReturned" : "240266",
                "executionTimeMillisEstimate" : "12915.796",
                "indexName" : "deviceId_1",
                "direction" : "forward"
            }
        }
    },
    "ok" : 1
}

I am seeking advice on how to optimize the above query.

David J Barnes
  • You are asking to average readings over a range of keys; the database appears to be doing exactly that. What are you hoping to optimize? – D. SM Jul 20 '20 at 18:33
  • I am hoping to reduce query time from ~13s to something more reasonable. – David J Barnes Jul 20 '20 at 18:35
  • Maybe reduce the data set size by pre-aggregating readings for whatever intervals make sense in your application (hourly/daily/monthly), then everything will be faster. – D. SM Jul 20 '20 at 18:35
  • 1. Create TTL index, so the purge is automatic 2. Pre-aggregate if you can 3. __Create an index on deviceId and timestamp__ 4. Increase the size of your cluster – Yahya Jul 20 '20 at 20:07
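
The compound index from the last comment would look something like this (a sketch using the field names from the example document):

db.observation.createIndex({ "deviceId": 1, "timestamp": 1 })

With only deviceId_1, the plan above scans every index entry for the device (240,266 keys returned by the IXSCAN) and filters by timestamp afterwards; a compound index lets the timestamp range bound the index scan as well.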

1 Answer


I'm not an expert on DocumentDB, but here are some thoughts. You'll need to try them one by one and see which helps you the most:

  1. In the first stage ($match), use a projection so that only the fields you actually need are passed along to the later stages (a sketch follows this list).

  2. Do not get the unit in the same call; get the timestamp only. The reason is that you are already using the timestamp in the $match, so you have it at hand, whereas fetching the unit makes the work harder for DocumentDB (this is my understanding, I might be wrong here). In classic MongoDB it would be fine to get everything in a single call, but in DocumentDB the data is distributed differently.
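
A rough sketch of both points against the pipeline from the question (which fields to keep depends on what the later stages actually use; not tested on DocumentDB):

db.observation.aggregate([
    { "$match": {
        "deviceId": req.params.deviceId,
        "timestamp": { "$gte": new Date(req.params.timestamp), "$lt": new Date() }
    } },
    // Pass along only what $group needs; drop data.units here if the unit
    // can be looked up separately (for example, if it is fixed per device).
    { "$project": { "_id": 0, "timestamp": 1, "data.reading": 1 } },
    // ... the $group and final $project stages stay as in the question
])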

Last, one of the comments above suggests creating a TTL index. That is definitely not the best approach: TTL indexes require a ton of I/O operations in DocumentDB, and those are expensive. AWS suggests dropping the collection and creating a new one instead of expiring documents.
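
In shell terms, that purge pattern boils down to something like the following sketch (the application would need to tolerate the collection briefly disappearing, or write to a new name and switch over):

// drop the old data, then recreate the collection and its indexes
db.observation.drop()
db.createCollection("observation")
db.observation.createIndex({ "deviceId": 1, "timestamp": 1 })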

Keep in mind that DocumentDB performs much better with smaller collections than with huge ones. So possibly the best approach here would be one collection per device-week, something like metrics_device1_20200701, so that the indexes fit fully in memory.
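
As a rough sketch of that naming scheme (the helper below and the Monday-based week boundary are illustrative assumptions, not part of the original answer or any AWS guidance):

// Hypothetical helper: map a deviceId and a date to a per-device, per-week
// collection name such as metrics_device1_20200608 (the Monday of that week).
function weeklyCollectionName(deviceId, date) {
    const d = new Date(date);
    const daysSinceMonday = (d.getUTCDay() + 6) % 7;   // 0 when the date is a Monday
    d.setUTCDate(d.getUTCDate() - daysSinceMonday);    // roll back to that Monday
    const y = d.getUTCFullYear();
    const m = String(d.getUTCMonth() + 1).padStart(2, "0");
    const day = String(d.getUTCDate()).padStart(2, "0");
    return `metrics_${deviceId}_${y}${m}${day}`;
}

// e.g. weeklyCollectionName("device1", "2020-06-10T05:00:00Z") -> "metrics_device1_20200608"

The application would then read from and write to db.getCollection(weeklyCollectionName(deviceId, timestamp)) instead of the single observation collection, so each aggregation runs against one small collection.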

Nico Andrade