
I'm using a script to remove duplicates in MongoDB. It worked on a test collection with 10 documents, but when I ran it against the real collection with 6 million documents, I got an error.

This is the script which I ran in Robomongo (now known as Robo 3T):

var bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();
var count = 0;

db.getCollection('RAW_COLLECTION').aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();     // takes the first _id from the array

  bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
  count++;

  if ( count % 500 == 0 ) {  // only actually write per 500 operations
      bulk.execute();
      bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();  // re-init after execute
  }
});

// Clear any queued operations
if ( count % 500 != 0 )
    bulk.execute();

This is the error message:

Error: command failed: {
    "errmsg" : "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.",
    "code" : 16945,
    "ok" : 0
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:23:13
doassert@src/mongo/shell/assert.js:13:14
assert.commandWorked@src/mongo/shell/assert.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1215:5
@(shell):1:1

So do I need to set allowDiskUse: true to make this work? Where do I set that in the script, and is there any problem with doing this?

gmauch
Carlos Siestrup

4 Answers

{ allowDiskUse: true }

should be passed right after the aggregation pipeline, as the options argument of aggregate().

In your code it goes like this:

db.getCollection('RAW_COLLECTION').aggregate([
  // Group on unique value storing _id values to array and count 
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  // Only return things that matched more than once. i.e a duplicate
  { "$match": { "count": { "$gt": 1 } } }
], { allowDiskUse: true } )

Note: Using { allowDiskUse: true } may hurt performance, because the aggregation pipeline will read data from temporary files on disk. The impact also depends on disk performance and the size of your working set, so test performance for your use case.
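One rough way to gauge that before running the real thing is to ask for the explain output of the same pipeline instead of its results. This is only a sketch; the exact fields reported vary with the MongoDB version:

db.getCollection('RAW_COLLECTION').aggregate([
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }
  }},
  { "$match": { "count": { "$gt": 1 } } }
], { allowDiskUse: true, explain: true })  // returns the query plan instead of the documents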

rofrol
Atish
  • But is it safe to set it to true? I didn't understand why this is necessary – Carlos Siestrup May 24 '17 at 15:17
  • Aggregation pipeline stages have a maximum memory limit. To handle large datasets, set the allowDiskUse option to true to enable writing data to temporary files. Performance will vary compared to reading entirely from memory, and it also depends on the size of the data set – Atish May 24 '17 at 15:19
  • 2
    This is giving me an error “AttributeError: 'dict' object has no attribute '_txn_read_preference' ". This is becuase it should not be a dict. it should be like this : ```findings = list(collection.aggregate(aggr, allowDiskUse=True))``` – Pawan bisht Aug 25 '21 at 11:48

It is always better to use $match before $group when you have large data. If you filter with $match before $group, the group stage has far fewer documents to process and you are much less likely to hit this memory limit.

db.getCollection('sample').aggregate([
   { $match: { State: 'TAMIL NADU' } },
   { $group: {
       _id: { DiseCode: "$code", State: "$State" },
       totalCount: { $sum: 1 }
   }},
   { $project: {
       Code: "$_id.DiseCode",
       totalCount: "$totalCount",
       _id: 0
   }}
])

If you really cannot narrow the data down with $match, then the solution is { allowDiskUse: true }, as shown in the sketch below.
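Applied to the pipeline from the question, this could look like the following. The $match on Region is hypothetical: it assumes you can process the collection one Region at a time, and "NORTH" is just a placeholder value.

var region = "NORTH";  // hypothetical placeholder: process one region at a time
db.getCollection('RAW_COLLECTION').aggregate([
  { "$match": { "Region": region } },  // shrink the input before the $group stage
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }
  }},
  { "$match": { "count": { "$gt": 1 } } }
], { allowDiskUse: true })  // keep allowDiskUse as a safety net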

Thavaprakash Swaminathan

Here is a simple undocumented trick that can help in a lot of cases to avoid disk usage.

You can use an intermediate $project stage to reduce the size of the documents passed into the memory-hungry stage (the $group in this case).

In this example it becomes:

var bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();
var count = 0;

db.getCollection('RAW_COLLECTION').aggregate([
  // here is the important stage
  { "$project": { "_id": 1, "RegisterNumber": 1, "Region": 1 } }, // this will reduce the records size
  { "$group": {
    "_id": { RegisterNumber: "$RegisterNumber", Region: "$Region" },
    "ids": { "$push": "$_id" },
    "count": { "$sum": 1 }      
  }},
  { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
  var keep = doc.ids.shift();     // takes the first _id from the array

  bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
  count++;

  if ( count % 500 == 0 ) {  // only actually write per 500 operations
      bulk.execute();
      bulk = db.getCollection('RAW_COLLECTION').initializeOrderedBulkOp();  // re-init after execute
  }
});

See the first $project stage, which is there only to keep the documents small and avoid the disk usage.

This is especially useful for collections with large documents where most of the fields are unused in the aggregation.

Julien TASSIN

From the MongoDB docs:

The $group stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $group will produce an error. However, to allow for the handling of large datasets, set the allowDiskUse option to true to enable $group operations to write to temporary files. See db.collection.aggregate() method and the aggregate command for details.
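As a small sketch of the two forms the documentation refers to (the pipeline below is trimmed down and reuses the field names from the question purely for illustration):

// Shell helper: allowDiskUse goes in the options document after the pipeline
db.getCollection('RAW_COLLECTION').aggregate(
  [ { "$group": { "_id": "$RegisterNumber", "count": { "$sum": 1 } } } ],
  { allowDiskUse: true }
)

// aggregate database command: allowDiskUse is a top-level field
// (the cursor document is required on recent server versions)
db.runCommand({
  aggregate: "RAW_COLLECTION",
  pipeline: [ { "$group": { "_id": "$RegisterNumber", "count": { "$sum": 1 } } } ],
  allowDiskUse: true,
  cursor: {}
})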

Sercan Ozdemir