How to use MongoDB aggregation query to group documents by status and create ranges of id with each group?

Question

How to use MongoDB aggregation query to obtain range of documents based on filtered status?

Consider the following example data:

[
  {
    "visible_id": 1,
    "status": "INITIAL",
    "name": "ab"
  },
  {
    "visible_id": 2,
    "status": "INITIAL",
    "name": "aab"
  },
  {
    "visible_id": 3,
    "status": "SCANNED",
    "name": "bba"
  },
  {
    "visible_id": 4,
    "status": "SCANNED",
    "name": "cca"
  },
  {
    "visible_id": 5,
    "status": "INITIAL",
    "name": "dds"
  },
  {
    "visible_id": 6,
    "status": "INITIAL",
    "name": "aax"
  },
  {
    "visible_id": 7,
    "status": "INITIAL",
    "name": "bsd"
  },
  {
    "visible_id": 8,
    "status": "PRINTED",
    "name": "ads"
  }
]

The aggregated query should list me all the available sticker in the filter status in a range value of visible_id. If a different status is found in between, a new range must be projected from the next filter status onward appended to the result list.

Example Filter and Results according to the above CSV data: if filter status = "INITIAL" result:

[
    {status: "INITIAL", range: "1-2"},
    {status: "INITIAL", range: "5-7"}
]

if filter status = "SCANNED" result:

[
    {status: "SCANNED", range: "3-4"},
]

if filter status = "PRINTED" result:

[
    {status: "PRINTED", range: "8-8"},
]

yes. `visible_id` is a strict sequence and unique without any gabs. — T4puSD, May 31 '23 at 09:30

Wernfried Domscheit · Accepted Answer · 2023-05-31T09:42:52.610

Try this one:

db.collection.aggregate([
   { // Get the status from previous document
      $setWindowFields: {
         sortBy: { visible_id: 1 },
         output: {
            lag: { $shift: { output: "$status", by: -1 } }
         }
      }
   },
   {
      $set: { // Check if status is different to previous status
         group: {
            $cond: {
               if: { $eq: ["$status", "$lag"] },
               then: "$$REMOVE",
               else: "$visible_id"
            }
         }
      }
   },
   {
      $fill: { // Fill missing values with "last observation carried forward"
         sortBy: { visible_id: 1 },
         output: { group: { method: "locf" } }
      }
   },
   { $match: { status: "SCANNED" } }, // Limit the output
   {
      $group: { // Get min and max values
         _id: "$group",
         min: { $min: "$visible_id" },
         max: { $max: "$visible_id" },
         status: { $first: "$status" }
      }
   },
   {
      $project: { // some output cosmetics
         _id: 0,
         status: 1,
         range: { $concat: [{ $toString: "$min" }, "-", { $toString: "$max" }] }
      }
   }
])

Mongo Playground

In case visible_id is a strict sequence and you have an index on status, then this version would be much faster:

db.collection.aggregate([
   { $match: { status: "INITIAL" } },
   {
      $setWindowFields: {
         sortBy: { visible_id: 1 },
         output: {
            lag: { $shift: { output: "$visible_id", by: -1 } }
         }
      }
   },
   {
      $set: {
         group: {
            $cond: {
               if: { $eq: [{$subtract: ["$visible_id",1]}, "$lag"] },
               then: "$$REMOVE",
               else: "$visible_id"
            }
         }
      }
   },
   {
      $fill: {
         sortBy: { visible_id: 1 },
         output: { group: { method: "locf" } }
      }
   },
   {
      $group: {
         _id: "$group",
         min: { $min: "$visible_id" },
         max: { $max: "$visible_id" },
         status: { $first: "$status" }
      }
   },
   {
      $project: {
         _id: 0,
         status: 1,
         range: { $concat: [{ $toString: "$min" }, "-", { $toString: "$max" }] }
      }
   }
])

Thanks. But it would be nice if you could explain what this query does and how it achieved the desired output? Also, It would be great if you can explain the performance hit of this query if ran on huge collection. — T4puSD, May 31 '23 at 08:24
Have a look at documentation and then ask question if you don't understand something. In your question you did not show any effort from your side to get a solution, not even a trial (you are even too lazy to provide correct format for input data). Don't expect me to do your job. — Wernfried Domscheit, May 31 '23 at 09:02
Performance wise it depends on the input data (see my comment at your question). In order to find `status` changes, MonogDB has to scan the entire collection, thus the performance could be limited. Having strict sequence of `visible_id` it would be possible to use `$match` at first stage, then this aggregation pipeline can be slightly re-written and the performance would be better. — Wernfried Domscheit, May 31 '23 at 09:06
I am a total noob at mongodb. If you don't mind helping me a bit more. Can you please explain how I can improve the performance of the query by re-writing the pipeline? The `visible_id` property is unique and is a strict sequence without any gabs. Thanks in advance. — T4puSD, May 31 '23 at 09:36
Please see my update. Instead of checking the previous `status` it checks whether "current visible_id minus 1 = previous visible_id". Then it knows to start a new group. — Wernfried Domscheit, May 31 '23 at 09:37

How to use MongoDB aggregation query to group documents by status and create ranges of id with each group?

1 Answers1