
I have documents stored in MongoDB like this:

{
    "_id" : "XBpNKbdGSgGfnC2MJ",
    "po" : 72134185,
    "machine" : 40940,
    "location" : "02A01",
    "inDate" : ISODate("2017-07-19T06:10:13.059Z"),
    "requestDate" : ISODate("2017-07-19T06:17:04.901Z"),
    "outDate" : ISODate("2017-07-19T06:30:34Z")
}

And I want to get, per day, the count of documents by inDate and by outDate.

I can retrieve, on one side, the count of documents per inDate day and, on the other side, the count per outDate day, but I would like both counts in a single result.

Currently, I use this pipeline:

      $group: {
        _id: {
          year: { $year: '$inDate' },
          month: { $month: '$inDate' },
          day: { $dayOfMonth: '$inDate' },
        },
        count: { $sum: 1 },
      },

and I get:

{ "_id" : { "year" : 2017, "month" : 7, "day" : 24 }, "count" : 1 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 21 }, "count" : 11 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 19 }, "count" : 20 }

But I would like, if it's possible:

{ "_id" : { "year" : 2017, "month" : 7, "day" : 24 }, "countIn" : 1, "countOut" : 4 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 21 }, "countIn" : 11, "countOut" : 23 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 19 }, "countIn" : 20, "countOut" : 18 }

Any idea? Many thanks :-)

Aurélien
  • Best practice here would be to run each aggregation separately and then "combine" the results in post processing. It's fairly simple and common practice to run "parallel" processes and essentially "combine" on the output key that is in common. Running "parallel" is your better option here, with different handling depending on the language choice and environment, as well as the size of the overall results. So ideally use nodejs or something that allows parallel execution of the queries in an "async" manner, rather than blocking on each separately. – Neil Lunn Jul 26 '17 at 04:38
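
For illustration, a minimal sketch of the approach described in that comment, assuming the plain Node.js driver (the `collection` handle, the helper name and the merge logic are illustrative, not code from the thread):

// Build the per-day $group pipeline for a given date field
const groupByDay = (field, countName) => [
  { "$match": { [field]: { "$exists": true, "$ne": null } } },
  { "$group": {
    "_id": {
      "year": { "$year": `$${field}` },
      "month": { "$month": `$${field}` },
      "day": { "$dayOfMonth": `$${field}` }
    },
    [countName]: { "$sum": 1 }
  }}
];

async function countsPerDay(collection) {
  // Run both aggregations in parallel rather than blocking on each
  const [inDocs, outDocs] = await Promise.all([
    collection.aggregate(groupByDay("inDate", "countIn")).toArray(),
    collection.aggregate(groupByDay("outDate", "countOut")).toArray()
  ]);

  // "Combine" the two result sets on the common _id key
  const merged = new Map();
  for (const doc of [...inDocs, ...outDocs]) {
    const key = JSON.stringify(doc._id);
    const entry = merged.get(key) || { _id: doc._id, countIn: 0, countOut: 0 };
    Object.assign(entry, doc);
    merged.set(key, entry);
  }
  return [...merged.values()];
}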

2 Answers


You can also split the documents at the source, by essentially combining each value into an array of entries by "type" for "in" and "out". You can do this using $map and $cond to select the fields, $filter to discard entries where the date is missing, then $unwind the array and determine which field to "count" by inspecting the "type" with $cond:

collection.aggregate([
  // Build an array of { type, date } entries, one per date field,
  // keeping only entries where the date is actually present
  { "$project": {
    "dates": {
      "$filter": {
        "input": { 
          "$map": {
            "input": [ "in", "out" ],
            "as": "type",
            "in": {
              "type": "$$type",
              "date": {
                "$cond": {
                  "if": { "$eq": [ "$$type", "in" ] },
                  "then": "$inDate",
                  "else": "$outDate"
                }
              }
            }
          }
        },
        "as": "dates",
        "cond": { "$ne": [ "$$dates.date", null ] }
      }
    }
  }},
  // One document per { type, date } entry
  { "$unwind": "$dates" },
  // Group by day, counting "in" and "out" entries separately
  { "$group": {
    "_id": {
      "year": { "$year": "$dates.date" },
      "month": { "$month": "$dates.date" },
      "day": { "$dayOfMonth": "$dates.date" }
    },
    "countIn": {
      "$sum": {
        "$cond": {
          "if": { "$eq": [ "$dates.type", "in" ] },
          "then": 1,
          "else": 0
        }
      }
    },
    "countOut": {
      "$sum": {
        "$cond": {
          "if": { "$eq": [ "$dates.type", "out" ] },
          "then": 1,
          "else": 0
        }
      }
    }
  }}
])

That's a safe way to do this that does not risk breaking the BSON limit, no matter what size of data you throw at it.

Personally I would rather run as separate processes and "combine" the aggregated results separately, but that would depend on the environment you are running in, which is not mentioned in the question.


For an example of "parallel" execution, you can structure it in Meteor somewhere along these lines:

import { Meteor } from 'meteor/meteor';
import { Source } from '../imports/source';
import { Target } from '../imports/target';

Meteor.startup(async () => {
  // code to run on server at startup

  await Source.remove({});
  await Target.remove({});

  console.log('Removed');

  Source.insert({
    "_id" : "XBpNKbdGSgGfnC2MJ",
    "po" : 72134185,
    "machine" : 40940,
    "location" : "02A01",
    "inDate" : new Date("2017-07-19T06:10:13.059Z"),
    "requestDate" : new Date("2017-07-19T06:17:04.901Z"),
    "outDate" : new Date("2017-07-19T06:30:34Z")
  });

  console.log('Inserted');

  // Run both aggregations in parallel, one per date field,
  // and upsert each result into the common target collection
  await Promise.all(
    ["In","Out"].map( f => new Promise((resolve,reject) => {
      let cursor = Source.rawCollection().aggregate([
        { "$match": { [`${f.toLowerCase()}Date`]: { "$exists": true } } },
        { "$group": {
          "_id": {
            "year": { "$year": `$${f.toLowerCase()}Date` },
            "month": { "$month": `$${f.toLowerCase()}Date` },
            "day": { "$dayOfYear": `$${f.toLowerCase()}Date` }
          },
          [`count${f}`]: { "$sum": 1 }
        }}
      ]);

      // Stream the results and "merge" on the shared date key
      cursor.on('data', async (data) => {
        cursor.pause();
        data.date = data._id;
        delete data._id;
        await Target.upsert(
          { date: data.date },
          { "$set": data }
        );
        cursor.resume();
      });

      cursor.on('end', () => resolve('done'));
      cursor.on('error', (err) => reject(err));
    }))
  );

  console.log('Mapped');

  let targets = await Target.find().fetch();
  console.log(targets);

});

Which is essentially going to output to the target collection, as mentioned in the comments, documents like:

{
        "_id" : "XdPGMkY24AcvTnKq7",
        "date" : {
                "year" : 2017,
                "month" : 7,
                "day" : 200
        },
        "countIn" : 1,
        "countOut" : 1
}
Neil Lunn
  • @dnickless Typo corrected. Should have been a clear typo. Just pointing out correctness here. – Neil Lunn Jul 26 '17 at 05:41
  • Many thanks, I use NodeJs with the Meteor framework. I will create a post-processing step to combine both results. Maybe I can try to use a local collection on the client. – Aurélien Jul 26 '17 at 06:21
  • @Aurélien That could be viable, and there you could effectively `"upsert"` the data into a result collection. It would take some tweaking with meteor to do "proper" parallel execution, but it is possible. The code above though should work at a pinch directly with meteor. – Neil Lunn Jul 26 '17 at 06:25
  • OK, I'm trying to execute your query. I didn't mention that in many cases `outDate` is `null`. I modified the query with this line: `"if": { $and: [ { "$eq": [ "$dates.type", "in" ] }, {"$ne": [ "$dates.date", null] } ] },` but I get this error: _"can't convert from BSON type EOO to Date"_ – Aurélien Jul 26 '17 at 06:48
  • @Aurélien It was presumed in your question that both fields would be present. You can use `$filter`, preferably as shown, to remove anything where the field is not present. Alternately `$match` after the `$unwind` as the MongoDB version permits (see the sketch just after this thread). MongoDB 3.2 should be shipping standard with recent Meteor releases – Neil Lunn Jul 26 '17 at 07:44
  • @Aurélien Finally got around to adding a parallel execution sample. By using output streams and running parallel through `Promise.all` you can deal with large output and even much more complex processing. Here we simply `.upsert()` into a target collection and "merge" the keys. So it's really just another alternate example. – Neil Lunn Jul 26 '17 at 10:35
  • @Aurélien Is there something in the provided answer that you believe does not address your question? If so then please comment on the answer to clarify what exactly needs to be addressed that has not. If it does in fact answer the question you asked then please note to [Accept your Answers](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) to the questions you ask – Neil Lunn Aug 03 '17 at 03:16
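
For reference, a sketch of the `$match`-after-`$unwind` alternative mentioned in that thread; it replaces the `$filter` wrapper from the answer above, and the `$group` stage stays exactly as shown there:

collection.aggregate([
  { "$project": {
    "dates": {
      "$map": {
        "input": [ "in", "out" ],
        "as": "type",
        "in": {
          "type": "$$type",
          "date": {
            "$cond": [ { "$eq": [ "$$type", "in" ] }, "$inDate", "$outDate" ]
          }
        }
      }
    }
  }},
  { "$unwind": "$dates" },
  // Drop entries whose source field was null or missing, so $year
  // never sees a non-date value
  { "$match": { "dates.date": { "$ne": null } } },
  // ... followed by the same $group stage as in the answer above
])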

Riiiight. I came up with the following query. Admittedly, I have seen simpler and nicer ones in my life but it certainly gets the job done:

db.getCollection('test').aggregate(
[
  {
    $facet: // split aggregation into two pipelines
    {
      "in": [
        { "$match": { "inDate": { "$ne": null } } }, // get rid of null values
        { $group: { "_id": { "y": { "$year": "$inDate" }, "m": { "$month": "$inDate" }, "d": { "$dayOfMonth": "$inDate" } }, "cIn": { $sum : 1 } } } // compute sum per inDate
      ],
      "out": [
        { "$match": { "outDate": { "$ne": null } } }, // get rid of null values
        { $group: { "_id": { "y": { "$year": "$outDate" }, "m": { "$month": "$outDate" }, "d": { "$dayOfMonth": "$outDate" } }, "cOut": { $sum : 1 } } } // compute sum per outDate
      ]
    }
  },
  { $project: { "result": { $setUnion: [ "$in", "$out" ] } } }, // merge results into new array
  { $unwind: "$result" }, // unwind array into individual documents
  { $replaceRoot: { newRoot: "$result" } }, // get rid of the additional field level
  { $group: { _id: { year: "$_id.y", "month": "$_id.m", "day": "$_id.d" }, "countIn": { $sum: "$cIn" }, "countOut": { $sum: "$cOut" } } } // group into final result
])

As always with MongoDB aggregations, you can get an idea of what's going on by simply removing the stages step by step, starting from the end of the query.
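
For instance, keeping only the `$facet` stage shows the single intermediate document that the comments below discuss, shaped like this (the counts reuse the illustrative numbers from the question):

{
  "in" : [
    { "_id" : { "y" : 2017, "m" : 7, "d" : 19 }, "cIn" : 20 },
    { "_id" : { "y" : 2017, "m" : 7, "d" : 21 }, "cIn" : 11 },
    { "_id" : { "y" : 2017, "m" : 7, "d" : 24 }, "cIn" : 1 }
  ],
  "out" : [
    { "_id" : { "y" : 2017, "m" : 7, "d" : 19 }, "cOut" : 18 },
    { "_id" : { "y" : 2017, "m" : 7, "d" : 21 }, "cOut" : 23 },
    { "_id" : { "y" : 2017, "m" : 7, "d" : 24 }, "cOut" : 4 }
  ]
}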

EDIT:

As you can see in the comments below there was a bit of a discussion around document size limits and the general applicability of this solution.

So let's look at those aspects in greater detail, and let's also compare the performance of the $facet-based solution to the $map-based one (suggested by @NeilLunn to avoid potential document size issues).

I created 2 million test records that have random dates assigned to both the "inDate" and the "outDate" field:

{
    "_id" : ObjectId("597857e0fa37b3f66959571a"),
    "inDate" : ISODate("2016-07-29T22:00:00.000Z"),
    "outDate" : ISODate("1988-07-14T22:00:00.000Z")
}
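
The generation script itself isn't shown; a minimal mongo shell sketch that produces records of this shape could look like this (the batch size and the exact date handling are assumptions):

// Insert 2 million documents with random dates between 1970-01-01 and 2050-01-01
var start = new Date("1970-01-01").getTime();
var end   = new Date("2050-01-01").getTime();
var batch = [];
for (var i = 0; i < 2000000; i++) {
  batch.push({
    inDate:  new Date(start + Math.random() * (end - start)),
    outDate: new Date(start + Math.random() * (end - start))
  });
  if (batch.length === 1000) {  // insert in batches of 1000
    db.getCollection('test').insertMany(batch);
    batch = [];
  }
}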

The date range covered ran from 01.01.1970 all the way to 01.01.2050, a total of 29,220 distinct days. Given the random distribution of the 2 million test records across this range, both queries could be expected to return the full 29,220 possible results (which they both did).

Then I ran both queries five times each, after freshly restarting my single MongoDB instance, and got the following results in milliseconds:

$facet: 5663, 5400, 5380, 5460, 5520

$map: 9648, 9134, 9058, 9085, 9132

I also measured the size of the single document returned by the $facet stage: it was 3.19 MB, reasonably far from the MongoDB document size limit (16 MB at the time of writing), which, however, only applies to the result document anyway and wouldn't be a problem during pipeline processing.
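
The answer doesn't show how the size was measured; in the mongo shell, one way would be the following sketch, where `facetStage` is assumed to hold the `$facet` stage object from the query above:

// Keep only the $facet stage and measure the single document it produces
var facetDoc = db.getCollection('test').aggregate([ facetStage ]).next();
print(Object.bsonsize(facetDoc) + " bytes"); // ~3.19 MB in the author's test, well under 16 MB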

Bottom line: if you want performance, use the solution suggested here. Be careful about the document size limit, though, in particular if your use case is not exactly the one described in the question above (e.g. when you need to collect even more/bigger data). Also, I am not sure whether both solutions would still expose the same performance characteristics in a sharded scenario...

dnickless
  • [`$facet`](https://docs.mongodb.com/manual/reference/operator/aggregation/facet/) means that you are putting **ALL** results into a **single document**. In just about every real world case this means you would break the BSON Limit. It's only real intended usage is to return already "reduced" numbers. Such as a "count" result from a separate pipeline process. It is not to be used to "shoehorn" running multiple queries in one process. If someone needs "multiple queries" then they should run them as "multiple queries" instead. That's good practice. – Neil Lunn Jul 26 '17 at 04:32
  • That's not at all how I read the documentation... https://docs.mongodb.com/manual/reference/operator/aggregation/facet/ `$facet` has nothing to do with a single document - it simply allows you to create several aggregation pipelines for the same set of documents. Besides, I *am* aggregating in the `$facet` stages so the output will be small. Oh well, I suppose, @Aurélien can try some runs and see how memory/performance behave. – dnickless Jul 26 '17 at 05:20
  • You are of course incorrect and should perhaps actually try it yourself. `$facet` does in fact produce **one** document as pipeline output. It does actually say this even if it is not very prominent. But the output examples on the documentation page also all show **one** document. Hence there is a problem with the BSON Limit and it should be avoided for anything that does not intentionally "reduce". So my point here is to "highlight" that this suggestion is **bad practice** and not to be followed. – Neil Lunn Jul 26 '17 at 05:39
  • You're right about the single document statement. Mind you, the output of my facet stage is going to be relatively tiny in size and numbers so I personally do not see the risk of running into the document limit that easily. I'd really like to see some performance comparison here! – dnickless Jul 26 '17 at 06:09
  • Understand that the point here is anyone coming along and seeing this, who actually has "large output" requirements, needs to understand not to use `$facet` for this purpose. There are other perfectly reasonable approaches that do not risk breaching the BSON limit. So just because a question shows "one document", it should not be presumed that output in actual use cases will stay under that constraint. It's simply not advisable to use the operator in this way. I'm sure people will abuse it, but at least I know I said something about it. – Neil Lunn Jul 26 '17 at 06:14
  • Thanks! I will combine the results, it seems safer ;-) – Aurélien Jul 26 '17 at 06:22