
I am referring to an uncorrelated $lookup. Each document passing through this stage receives the same array computed by the $lookup. What if the actual size of this array exceeds 100MB? Does it matter? Since an array is a reference type, does only the size of the reference count toward the 100MB limit, or does the size of the array's actual contents count?

db.absences.insertMany( [
   { "_id" : 1, "student" : "Ann Aardvark", sickdays: [ new Date ("2018-05-01"),new Date ("2018-08-23") ] },
   { "_id" : 2, "student" : "Zoe Zebra", sickdays: [ new Date ("2018-02-01"),new Date ("2018-05-23") ] },
] )
db.holidays.insertMany( [
   { "_id" : 1, year: 2018, name: "New Years", date: new Date("2018-01-01") },
   { "_id" : 2, year: 2018, name: "Pi Day", date: new Date("2018-03-14") },
   { "_id" : 3, year: 2018, name: "Ice Cream Day", date: new Date("2018-07-15") },
   { "_id" : 4, year: 2017, name: "New Years", date: new Date("2017-01-01") },
   { "_id" : 5, year: 2017, name: "Ice Cream Day", date: new Date("2017-07-16") }
] )
db.absences.aggregate( [
   {
      $lookup:
         {
           from: "holidays",
           pipeline: [
              { $match: { year: 2018 } },
              { $project: { _id: 0, date: { name: "$name", date: "$date" } } },
              { $replaceRoot: { newRoot: "$date" } }
           ],
           as: "holidays"
         }
   }
] )

output of $lookup:

{
  _id: 1,
  student: 'Ann Aardvark',
  sickdays: [
    ISODate("2018-05-01T00:00:00.000Z"),
    ISODate("2018-08-23T00:00:00.000Z")
  ],
  holidays: [
    { name: 'New Years', date: ISODate("2018-01-01T00:00:00.000Z") },
    { name: 'Pi Day', date: ISODate("2018-03-14T00:00:00.000Z") },
    { name: 'Ice Cream Day', date: ISODate("2018-07-15T00:00:00.000Z") }
  ]
},
{
  _id: 2,
  student: 'Zoe Zebra',
  sickdays: [
    ISODate("2018-02-01T00:00:00.000Z"),
    ISODate("2018-05-23T00:00:00.000Z")
  ],
  holidays: [
    { name: 'New Years', date: ISODate("2018-01-01T00:00:00.000Z") },
    { name: 'Pi Day', date: ISODate("2018-03-14T00:00:00.000Z") },
    { name: 'Ice Cream Day', date: ISODate("2018-07-15T00:00:00.000Z") }
  ]
}

1 Answer


The $lookup result is subject to the 16MB single-document size limit, so a 100MB $lookup result cannot pass through without the $lookup + $unwind coalescence optimization.
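As a minimal sketch of that optimization (reusing the collections from your question, not taken verbatim from any documentation): when an $unwind on the "as" field immediately follows the $lookup, the optimizer can coalesce the two stages so matched holidays are streamed one per output document instead of being materialized as one large array inside a single 16MB-bounded document.

db.absences.aggregate( [
   {
      $lookup: {
         from: "holidays",
         pipeline: [ { $match: { year: 2018 } } ],
         as: "holidays"
      }
   },
   // $unwind placed directly after $lookup can be coalesced into it,
   // so the joined results never have to fit into one "holidays" array.
   { $unwind: "$holidays" }
] )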

The 100MB limit you mention should be the memory restriction on an aggregation pipeline stage. From the official documentation:

Starting in MongoDB 6.0, the allowDiskUseByDefault parameter controls whether pipeline stages that require more than 100 megabytes of memory to execute write temporary files to disk by default.

So in your case of exceeding the 100MB memory limit, if you can work around the 16MB single-document size limit mentioned above, the pipeline will likely succeed prior to v6.0 with the allowDiskUse option.
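For example (a sketch, not part of the original post), allowDiskUse is passed as an option in the second argument to aggregate():

db.absences.aggregate(
   [
      { $lookup: { from: "holidays", pipeline: [ { $match: { year: 2018 } } ], as: "holidays" } }
   ],
   // Before MongoDB 6.0, spilling to disk for memory-hungry stages had to be
   // requested per command; from 6.0, allowDiskUseByDefault controls this.
   { allowDiskUse: true }
)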


Regarding caching the output of the subquery, according to the official $lookup documentation:

Starting in MongoDB 5.0, for an uncorrelated subquery in a $lookup pipeline stage containing a $sample stage, the $sampleRate operator, or the $rand operator, the subquery is always run again if repeated. Previously, depending on the subquery output size, either the subquery output was cached or the subquery was run again.

So in your case, since none of $sample, $sampleRate, or $rand is involved in your subquery, the subquery result can be cached and reused.

Diving deeper into the source code:

// When local/foreignFields are included, we cannot enable the cache because the $match
// is a correlated prefix that will not be detected. Here, local/foreignFields are absent,
// so we enable the cache.
_cache.emplace(internalDocumentSourceLookupCacheSizeBytes.load());
// Add the user pipeline to '_resolvedPipeline' after any potential view prefix and $match
_resolvedPipeline.insert(_resolvedPipeline.end(), pipeline.begin(), pipeline.end());

As your $lookup has neither localField nor foreignField, the lookup result is cached and will be reused.
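For contrast, a hypothetical correlated version of the same join (the "year" field on absences is invented purely for illustration): as the source comment above notes, once the subquery is correlated with the outer document via localField/foreignField, the cache is not enabled and the lookup is evaluated per input document.

db.absences.aggregate( [
   {
      $lookup: {
         from: "holidays",
         localField: "year",      // hypothetical field on absences documents
         foreignField: "year",
         as: "holidays"
      }
   }
] )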

ray
  • If the $lookup is an uncorrelated subquery, will the pipeline only perform one lookup? Will each document get a deep copy of the result of the lookup? – Bear Bile Farming is Torture Jan 26 '23 at 00:14
  • @BigCatPublicSafetyLaw updated the answer to include a reference from official doc about sub-query result cache. – ray Jan 26 '23 at 07:25
  • Ok, so the subquery will be reused. What about the question of each document keeping a deep copy? Is that the case? Or will each document get a reference to the subquery? – Bear Bile Farming is Torture Jan 26 '23 at 07:36
  • @BigCatPublicSafetyLaw From [an answer from MongoDB Employee](https://www.mongodb.com/community/forums/t/mongodb-cache-how-does-it-work/101929), the result is cached in RAM. Not sure if that answers your question though. – ray Jan 26 '23 at 17:19