1

This is a follow up question on my original post here: Improve MongoDB query in Python

My collection of hiking trails looks like this: Each trail might be available multiple times for the same date based on comments set by users.

{
    "_id": ..,
    "trailName": "Trail1"
    "update": ISODate("2023-07-30T08:00:00Z"),
    "user": "Tom",
    "comment": "xy"
  },
  {
    "_id": ..,
    "trailName": "Trail1"
    "update": ISODate("2023-07-31T08:00:00Z"),
    "user": "Mary",
    "comment": "xy"
  },
  {
    "_id": ..,
    "trailName": "Trail1"
    "update": ISODate("2023-07-31T08:00:00Z"),
    "user": "Joe",
    "comment": "xyz"
  },
  {
    "_id": ..,
    "trailName": "Trail2"
    "update": ISODate("2023-07-10T08:00:00Z"),
    "user": "Tom",
    "comment": "xy"
  }

In python I query all trail names and iterate them. For each trail I want to query:

  • It's name
  • The number of elements that are available (updateCount)
  • The last update (mostRecentUpdate)
  • All users and their comments for the last update

So far I could query everything except getting the users and the comments. My query looks like this:

db.mycollection.aggregate([
  {
    "$group": {
      "_id": "$trailName",
      "updateCount": {
        "$count": {}
      },
      "mostRecentUpdate": {
        "$max": "$update"
      }
    }
  }
])

As a result I get back:

{
    "_id": "Trail1",
    "updateCount": 2,
    "mostRecentUpdate": ISODate("2023-07-31T08:00:00Z")
  },
  {
    "_id": "Trail2",
    "updateCount": 1,
    "mostRecentUpdate": ISODate("2023-07-10T08:00:00Z")
  }

What I would like to get back is something like this:

{
    "_id": "Trail1",
    "updateCount": 2,
    "mostRecentUpdate": ISODate("2023-07-31T08:00:00Z"),
    "users": "Mary", "Joe"
    "comments": "xy", "xyz"
  },
  {
    "_id": "Trail2",
    "updateCount": 1,
    "mostRecentUpdate": ISODate("2023-07-10T08:00:00Z")
    "users": "Tom"
    "comments": "xy"
  }

How can I achieve this to get ONLY the users and comments for the most recent date? When I add in the $group step something like:

"comments": { "$push": { "comment": "$comment", "user": "$user" } } 

I will get all comments and users from ALL dates. But I only want it for the last date. I could not set a condition or filter for that. Also, I cannot first filter only the most recent date, because I need to get the number of updates that have been.

So my question is: How can I set a filter to get the comments for the most recent date?

I need to accomplish this task using PyMongo.

ray
  • 11,310
  • 7
  • 18
  • 42
Robbert
  • 109
  • 5
  • 1
    Might have to append a `$lookup` stage after the `$group` which goes and (re)grabs the docs for the matching `trailName` and associated `update` field pairs. But just to confirm - you're saying this `update` 'timestamp' is in fact only representative of a less granular date hence the possibility of non-uniqueness? – user20042973 Aug 02 '23 at 13:32
  • the update timestamp is indeed not unique. – Robbert Aug 02 '23 at 14:39

1 Answers1

1

For Mongo v4.2 that OP is using, you can use a longer syntax that perform a $group to get the max timestamp first, then $lookup to fetch back latest documents. Then, continue with the existing $group.

db.collection.aggregate([
  {
    "$group": {
      "_id": "$trailName",
      "maxUpdate": {
        "$max": "$update"
      }
    }
  },
  {
    "$lookup": {
      "from": "collection",
      "let": {
        "trail": "$_id",
        "maxUpdate": "$maxUpdate"
      },
      "pipeline": [
        {
          "$match": {
            $expr: {
              $and: [
                {
                  $eq: [
                    "$trailName",
                    "$$trail"
                  ]
                },
                {
                  $eq: [
                    "$update",
                    "$$maxUpdate"
                  ]
                },
                
              ]
            }
          }
        }
      ],
      "as": "mostRecentDocs"
    }
  },
  {
    "$unwind": "$mostRecentDocs"
  },
  {
    "$replaceRoot": {
      "newRoot": "$mostRecentDocs"
    }
  },
  {
    "$group": {
      "_id": "$trailName",
      "updateCount": {
        "$count": {}
      },
      "mostRecentUpdate": {
        "$max": "$update"
      },
      "users": {
        "$push": "$user"
      },
      "comments": {
        "$push": "$comment"
      }
    }
  }
])

Mongo Playground


You can just use $setWindowFields with $denseRank to get the most recent record. $match with the computed rank then you can reuse your existing query.

db.collection.aggregate([
  {
    "$setWindowFields": {
      "partitionBy": "$trailName",
      "sortBy": {
        "update": -1
      },
      "output": {
        "rank": {
          "$denseRank": {}
        }
      }
    }
  },
  {
    "$match": {
      "rank": 1
    }
  },
  {
    "$group": {
      "_id": "$trailName",
      "updateCount": {
        "$count": {}
      },
      "mostRecentUpdate": {
        "$max": "$update"
      },
      "users": {
        "$push": "$user"
      },
      "comments": {
        "$push": "$comment"
      }
    }
  }
])

Mongo Playground

ray
  • 11,310
  • 7
  • 18
  • 42
  • thanks so much for your answer. It would work perfectly fine, unfortunately I have version 4.2 and setWindowsFields was introduced with version 5. – Robbert Aug 02 '23 at 14:43
  • @Robbert updated the answer to include a v4.2 solution – ray Aug 02 '23 at 15:43
  • I highly appreciate your effort to make a working solution for 4.2. Unfortunately I get errors with the "let" statement and if I uncomment that, it says "pipeline not supported". I am not sure, maybe it is because my MongoDB Data Source is Azure's CosmosDB (Azure Cosmos DB-API for MongoDB)? I am quite new, sorry for not providing this information earlier. – Robbert Aug 03 '23 at 04:18
  • @Robbert Now worries. CosmosDB is out of my wit for now. While I am thinking about other workarounds, I have put the cosmosdb tag to increase the question's exposure. HTH – ray Aug 03 '23 at 14:04