Remove all Duplicates except the most recent document

Question

I would like to clear all duplicated of a specific field in a collection. leaving only the earliest entry of the duplicates.

Here is my aggregate query which works great for finding the duplicates:

db.History.aggregate([
  { $group: {
_id: { name: "$sessionId" },  
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 } 
  } }, 
  { $match: { 
count: { $gte: 2 } 
  } },
  { $sort : { count : -1} }
 ],{ allowDiskUse:true,
  cursor:{}});

Only problem is that i need to execute a remove query as well and keep for each of the duplicates the youngest entry (determined by the field 'timeCreated':

"timeCreated" : ISODate("2016-03-07T10:48:43.251+02:00")

How exactly do i do that?

Please updated your question with supported input and output documents. — Saleem, Mar 09 '16 at 13:21
Is there one field indicate the youngest entry? one field type is date named `updateDate`? — zangw, Mar 09 '16 at 13:22

score 2 · Answer 1 · edited May 23 '17 at 10:28

Personally I would take advantage of the fact that the ObjectId values themselves are "monotonic" or therefore "ever increasing in value" which means that the "youngest" or "most recent" would come at the end of a naturally sorted list.

So rather than force the aggregation pipeline to do the sorting, the most logical and efficient thing to do is simply sort the list of unique _id values returned per document as you process each response.

So basically working with the listing that you must have found:

Remove Duplicates from MongoDB

And is actually my answer ( and your the second person to reference this week, and yet no votes received for useful! Hmm! ), where it's just a simple .sort() applied within the cursor iteration for the returned array:

Using the _id Value

var bulk = db.History.initializeOrderedBulkOp(),
    count = 0;

// List "all" fields that make a document "unique" in the `_id`
// I am only listing some for example purposes to follow
db.History.aggregate([
    { "$group": {
        "_id": "$sessionId",
        "ids": { "$push": "$_id" }, // _id values are already unique, so $addToSet adds nothing
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
],{ "allowDiskUse": true}).forEach(function(doc) {
    doc.ids.sort().reverse();    // <-- this is the only real change
    doc.ids.shift();     // remove first match, which is now youngest
    bulk.find({ "_id": { "$in": doc.ids } }).remove();  // removes all $in list
    count++;

    // Execute 1 in 1000 and re-init
    if ( count % 1000 == 0 ) {
       bulk.execute();
       bulk = db.History.initializeOrderedBulkOp();
    }
});

if ( count % 1000 != 0 ) 
    bulk.execute();

Using a specific field

If you "really" are set on adding another date value on which to determine which is youngest then just add to the array in $push first, then apply the client side sort function. Again just a really simple change:

var bulk = db.History.initializeOrderedBulkOp(),
    count = 0;

// List "all" fields that make a document "unique" in the `_id`
// I am only listing some for example purposes to follow
db.History.aggregate([
    { "$group": {
        "_id": "$sessionId",
        "ids": { "$push": { 
            "_id": "$_id",
            "created": "$timeCreated"
        }},
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
],{ "allowDiskUse": true}).forEach(function(doc) {
    doc.ids = doc.ids.sort(function(a,b) {   // sort dates and just return _id
        return a.created.valueOf() < a.created.valueOf()
    }).map(function(el) { return el._id });
    doc.ids.shift();     // remove first match, which is now youngest
    bulk.find({ "_id": { "$in": doc.ids } }).remove();  // removes all $in list
    count++;

    // Execute 1 in 1000 and re-init
    if ( count % 1000 == 0 ) {
       bulk.execute();
       bulk = db.History.initializeOrderedBulkOp();
    }
});

if ( count % 1000 != 0 ) 
    bulk.execute();

So it's a really simple process with no "real" alteration to the original process used to identify the duplicates and then remove all but one of them.

Always the best approach here to just let the server do the job of finding the duplicates, then client side when iterating the cursor you can then work out from the returned array which document is going to be kept and which ones you are going to remove.

there's a typo: "a.created.valueOf() < a.created.valueOf()" should be "a.created.valueOf() < b.created.valueOf()" — Graeme Moss, Jul 04 '18 at 09:27
I am using MongoDB version 3.2.13, but I am receiving error : "results_history2.default.initializeOrderedBulkOp is not a function" Do I need to import any library for this? — Hemant Tripathi, Mar 20 '20 at 15:24
sorting with specific field, date in my case is not working. The resulted array of ids is not sorted — springcloudlearner, Oct 03 '21 at 12:47

Remove all Duplicates except the most recent document

1 Answers1

Using the _id Value

Using a specific field