Remove duplicate documents based on field

Question

I've seen a number of solutions on this, however they are all for Mongo v2 and are not suitable for V3.

My document looks like this:

    { 
    "_id" : ObjectId("582c98667d81e1d0270cb3e9"), 
    "asin" : "B01MTKPJT1", 
    "url" : "https://www.amazon.com/Trump-President-Presidential-Victory-T-Shirt/dp/B01MTKPJT1%3FSubscriptionId%3DAKIAIVCW62S7NTZ2U2AQ%26tag%3Dselfbalancingscooters-21%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB01MTKPJT1", 
    "image" : "http://ecx.images-amazon.com/images/I/41RvN8ud6UL.jpg", 
    "salesRank" : NumberInt(442137), 
    "title" : "Trump Wins 45th President Presidential Victory T-Shirt", 
    "brand" : "\"Getting Political On Me\"", 
    "favourite" : false, 
    "createdAt" : ISODate("2016-11-16T17:33:26.763+0000"), 
    "updatedAt" : ISODate("2016-11-16T17:33:26.763+0000")
}

and my collection contains around 500k documents. I want to remove all duplicate documents (except for 1) where the ASIN is the same

How can I achieve this?

Did you try this [answer](http://stackoverflow.com/a/36099237/122005)? — chridam, Nov 17 '16 at 12:09
Which of the document to you intend to keep? The last created or updated? — styvane, Nov 17 '16 at 12:12
@Styvane It doesn't really matter to be honest as the content for the rest of it is identical (apart from id, createdAt, updatedAt) — K20GH, Nov 17 '16 at 12:17
@Styvane Not the best considering the OP's db size but the concept of sorting could suffice. Could be better optimised by leveraging Bulk API methods instead of the collection's `remove()` method. — chridam, Nov 17 '16 at 12:56
@chridam Would I not have to find all the duplicate asin's with that first? I ran the following but I just get ... ```db.merch-tracker.find({},{asin:1}).sort({_id:1}).forEach(function(doc){db.tee.remove({_id:{$gt:doc._id},asin:doc.asin});``` — K20GH, Nov 17 '16 at 13:13
I think the best way to go is use the aggregation framework to get a list of `_id`s of the duplicate documents, iterate the list and use the bulk API's `removeOne()` method to delete each duplicate document in batches. By the way, what version is your MongoDB server? — chridam, Nov 17 '16 at 13:22
Are you aware of the different collections in this operation `db.merch-tracker.find({},{asin:1}).sort({_id:1}).forEach(fun‌ction(doc){db.tee.re‌move({_id:{$gt:doc._‌id},asin:doc.asin});`? You are trying to delete documents in the `tee` collection with id's from `tracker` collection which obviously will fail since the `_id`s are unique in both collections. — chridam, Nov 17 '16 at 13:27
Yeah I've just changed it as I noticed that. I now have ```db.tee.find({}, {asin:1}).sort({_id:1}).forEach(function(doc){db.tee.remove{(_id:{$gt:doc._id},asin:doc.asin})``` Same thing though :( — K20GH, Nov 17 '16 at 13:34
OK, that function does not work. It seems to remove everything, even none dupes — K20GH, Nov 17 '16 at 16:58

styvane · Accepted Answer · 2017-02-09T09:09:00.240

28

This is something we can actually do using the aggregation framework and without client side processing.

MongoDB 3.4

db.collection.aggregate(
    [ 
        { "$sort": { "_id": 1 } }, 
        { "$group": { 
            "_id": "$asin", 
            "doc": { "$first": "$$ROOT" } 
        }}, 
        { "$replaceRoot": { "newRoot": "$doc" } },
        { "$out": "collection" }
    ]

)

MongoDB version <= 3.2:

db.collection.aggregate(
    [ 
        { "$sort": { "_id": 1 } }, 
        { "$group": { 
            "_id": "$asin", 
            "doc": { "$first": "$$ROOT" } 
        }}, 
        { "$project": { 
            "asin": "$doc.asin", 
            "url": "$doc.url", 
            "image": "$doc.image", 
            "salesRank": "$doc.salesRank", 
            "title": "$doc.salesRank", 
            "brand": "$doc.brand", 
            "favourite": "$doc.favourite", 
            "createdAt": "$doc.createdAt", 
            "updatedAt": "$doc.updatedAt" 
        }},
        { "$out": "collection" }
    ]
)

edited Feb 09 '17 at 09:09

answered Nov 17 '16 at 13:53

styvane

59,869
19
150
156

5

Thanks for the answer. I guess I did something wrong in the query - I ran the first one and all my collection is gone :/ Saying this just in case somebody runs it without backup – Hakan Mar 22 '18 at 22:25
I use the MongoDB 3.4 and it works like a charm, thanks a lot. Very easy to add more keys like this `'_id': {"country":"$country", "city":"$city", 'language':"$language" }` – Florent Jul 03 '18 at 16:58
Drop [`{ "$out": "collection" }`](https://docs.mongodb.com/manual/reference/operator/aggregation/out/) if you don't want to overwrite the current collection (e.g. to use as a query), or use `{"$out": "newcollection"}` to create a new collection (named "newcollection"). – Skippy le Grand Gourou Jan 15 '20 at 16:02
@styvane Why the `{ "$sort": { "_id": 1 }}` step ? It doesn't seem necessary ? – Skippy le Grand Gourou Jan 15 '20 at 16:48
1

@SkippyleGrandGourou this is because we want to maintain the oldest version for each document. And the `_id` which in this case is an [`ObjectId`](https://docs.mongodb.com/manual/reference/bson-types/#objectid) is based on timestamp. – styvane Jan 15 '20 at 18:59
1

Spectacular answer! Saved a ton of my time. To be honest, a more concrete example should be placed on MongoDB's docs in implementing $replaceRoot and $$ROOT – johnmikelridzz Apr 13 '20 at 12:42

score 3 · Answer 2 · answered Aug 05 '18 at 20:58

Use a for loop, it will take time but will do the work

db.amazon_sales.find({}, {asin:1}).sort({_id:1}).forEach(function(doc){
    db.amazon_sales.remove({_id:{$gt:doc._id}, asin:doc.asin});
})

Then and this index

db.amazon_sales.createIndex( { "asin": 1 }, { unique: true } )

Remove duplicate documents based on field

2 Answers2

MongoDB 3.4

MongoDB version <= 3.2:

Linked