
I'm trying to remove duplicate documents in MongoDB in a large collection according to the approach described here:

db.events.aggregate([
    { "$group": {
        "_id": { "firstId": "$firstId", "secondId": "$secondId" },
        "dups": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } }}
], {allowDiskUse:true, cursor:{ batchSize:100 } }).forEach(function(doc) {
    doc.dups.shift();
    db.events.remove({ "_id": {"$in": doc.dups }});
});

I.e. I want to remove events that have the same "firstId - secondId" combination. However, after a while MongoDB responds with this error:

2016-11-30T14:13:57.403+0000 E QUERY    [thread1] Error: getMore command failed: {
    "ok" : 0,
    "errmsg" : "BSONObj size: 17582686 (0x10C4A5E) is invalid. Size must be between 0 and 16793600(16MB)",
    "code" : 10334
}

Is there any way to get around this? I'm using MongoDB 3.2.6.

– Johan

1 Answer


The error message indicates that some part of the process is attempting to create a document that is larger than the 16 MB document size limit in MongoDB.

Without knowing your data set, my guess is that the collection is large enough that either the full result set, or a single group document whose "dups" array holds a very large number of _id values, is growing past the 16 MB document size limit.
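
If it is the per-group "dups" arrays that are pushing a single document past the limit, one possible variation (a sketch, untested against your data) is to group only the keys and counts, then look up and delete the duplicates for each key with separate queries, so no stage ever accumulates a large array of _id values:

db.events.aggregate([
    { "$group": {
        "_id": { "firstId": "$firstId", "secondId": "$secondId" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } }}
], { allowDiskUse: true, cursor: { batchSize: 100 } }).forEach(function(doc) {
    // Pick one surviving document for this key pair...
    var keep = db.events.findOne(
        { "firstId": doc._id.firstId, "secondId": doc._id.secondId },
        { "_id": 1 }
    );
    // ...and remove every other document sharing the same key pair.
    db.events.remove({
        "firstId": doc._id.firstId,
        "secondId": doc._id.secondId,
        "_id": { "$ne": keep._id }
    });
});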

If the size of the collection prevents finding all duplicate values in one operation, you may want to break the work up: iterate through the collection and run a query per document to count its duplicates:

db.events.find({}, { "_id" : 0, "firstId" : 1, "secondId" : 1 }).forEach(function(doc) {
  // Explicitly select only the key fields so an index can cover the query.
  var cnt = db.events.find(
    { "firstId" : doc.firstId, "secondId" : doc.secondId },
    { "_id" : 0, "firstId" : 1, "secondId" : 1 }
  ).count();

  if (cnt > 1)
    print('Dupe Keys: firstId: ' + doc.firstId + ', secondId: ' + doc.secondId);
});

It's probably not the most efficient implementation, but you get the idea.

Note that this approach relies heavily on the existence of a compound index on { 'firstId' : 1, 'secondId' : 1 }.
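
If that index is not already in place, it can be created from the shell:

// Compound index on the key fields, so the count query above can be
// covered by the index instead of fetching full documents.
db.events.createIndex({ "firstId" : 1, "secondId" : 1 })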

– Adam Harrison
  • But what's your suggestion on how to actually remove the duplicates later, since this one only prints them to the console afaict? (there are indeed a lot of duplicates) – Johan Dec 01 '16 at 06:30
  • The script above could be modified to save the records to some separate collection, after which you could perform any cleanup necessary to remove or modify the duplicate records. The "print" line would just be changed to a `db.dupe_keys.insert( { "firstId" : doc.firstId, "secondId" : doc.secondId } )`. – Adam Harrison Dec 01 '16 at 21:57
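
For completeness, here is a sketch of that two-pass cleanup (building on the comment above; the dupe_keys collection name comes from the comment, and this is untested): first record each duplicate key pair, then keep one document per pair and delete the rest.

// Pass 1 (replacing the print line in the script above):
// if (cnt > 1)
//   db.dupe_keys.insert({ "firstId" : doc.firstId, "secondId" : doc.secondId });

// Pass 2: for each recorded key pair, keep one event and delete the others.
// Re-running this pass is harmless: once only one document per pair remains,
// deleteMany matches nothing.
db.dupe_keys.find().forEach(function(key) {
  var keep = db.events.findOne(
    { "firstId" : key.firstId, "secondId" : key.secondId },
    { "_id" : 1 }
  );
  db.events.deleteMany({
    "firstId" : key.firstId,
    "secondId" : key.secondId,
    "_id" : { "$ne" : keep._id }
  });
});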