10

I've seen a number of solutions on this, however they are all for Mongo v2 and are not suitable for V3.

My document looks like this:

    { 
    "_id" : ObjectId("582c98667d81e1d0270cb3e9"), 
    "asin" : "B01MTKPJT1", 
    "url" : "https://www.amazon.com/Trump-President-Presidential-Victory-T-Shirt/dp/B01MTKPJT1%3FSubscriptionId%3DAKIAIVCW62S7NTZ2U2AQ%26tag%3Dselfbalancingscooters-21%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB01MTKPJT1", 
    "image" : "http://ecx.images-amazon.com/images/I/41RvN8ud6UL.jpg", 
    "salesRank" : NumberInt(442137), 
    "title" : "Trump Wins 45th President Presidential Victory T-Shirt", 
    "brand" : "\"Getting Political On Me\"", 
    "favourite" : false, 
    "createdAt" : ISODate("2016-11-16T17:33:26.763+0000"), 
    "updatedAt" : ISODate("2016-11-16T17:33:26.763+0000")
}

and my collection contains around 500k documents. I want to remove all duplicate documents (except for 1) where the ASIN is the same

How can I achieve this?

Neil Lunn
  • 148,042
  • 36
  • 346
  • 317
K20GH
  • 6,032
  • 20
  • 78
  • 118
  • Did you try this [answer](http://stackoverflow.com/a/36099237/122005)? – chridam Nov 17 '16 at 12:09
  • @chridam I don't like that answer. It is not the way to go( – styvane Nov 17 '16 at 12:10
  • Which of the document to you intend to keep? The last created or updated? – styvane Nov 17 '16 at 12:12
  • @Styvane It doesn't really matter to be honest as the content for the rest of it is identical (apart from id, createdAt, updatedAt) – K20GH Nov 17 '16 at 12:17
  • @Styvane Not the best considering the OP's db size but the concept of sorting could suffice. Could be better optimised by leveraging Bulk API methods instead of the collection's `remove()` method. – chridam Nov 17 '16 at 12:56
  • @chridam Would I not have to find all the duplicate asin's with that first? I ran the following but I just get ... ```db.merch-tracker.find({},{asin:1}).sort({_id:1}).forEach(function(doc){db.tee.remove({_id:{$gt:doc._id},asin:doc.asin});``` – K20GH Nov 17 '16 at 13:13
  • I think the best way to go is use the aggregation framework to get a list of `_id`s of the duplicate documents, iterate the list and use the bulk API's `removeOne()` method to delete each duplicate document in batches. By the way, what version is your MongoDB server? – chridam Nov 17 '16 at 13:22
  • Are you aware of the different collections in this operation `db.merch-tracker.find({},{asin:1}).sort({_id:1}).forEach(fun‌​ction(doc){db.tee.re‌​move({_id:{$gt:doc._‌​id},asin:doc.asin});`? You are trying to delete documents in the `tee` collection with id's from `tracker` collection which obviously will fail since the `_id`s are unique in both collections. – chridam Nov 17 '16 at 13:27
  • Yeah I've just changed it as I noticed that. I now have ```db.tee.find({}, {asin:1}).sort({_id:1}).forEach(function(doc){db.tee.remove{(_id:{$gt:doc._id},asin:doc.asin})``` Same thing though :( – K20GH Nov 17 '16 at 13:34
  • OK, that function does not work. It seems to remove everything, even none dupes – K20GH Nov 17 '16 at 16:58

2 Answers2

28

This is something we can actually do using the aggregation framework and without client side processing.

MongoDB 3.4

db.collection.aggregate(
    [ 
        { "$sort": { "_id": 1 } }, 
        { "$group": { 
            "_id": "$asin", 
            "doc": { "$first": "$$ROOT" } 
        }}, 
        { "$replaceRoot": { "newRoot": "$doc" } },
        { "$out": "collection" }
    ]

)

MongoDB version <= 3.2:

db.collection.aggregate(
    [ 
        { "$sort": { "_id": 1 } }, 
        { "$group": { 
            "_id": "$asin", 
            "doc": { "$first": "$$ROOT" } 
        }}, 
        { "$project": { 
            "asin": "$doc.asin", 
            "url": "$doc.url", 
            "image": "$doc.image", 
            "salesRank": "$doc.salesRank", 
            "title": "$doc.salesRank", 
            "brand": "$doc.brand", 
            "favourite": "$doc.favourite", 
            "createdAt": "$doc.createdAt", 
            "updatedAt": "$doc.updatedAt" 
        }},
        { "$out": "collection" }
    ]
)
styvane
  • 59,869
  • 19
  • 150
  • 156
  • 5
    Thanks for the answer. I guess I did something wrong in the query - I ran the first one and all my collection is gone :/ Saying this just in case somebody runs it without backup – Hakan Mar 22 '18 at 22:25
  • I use the MongoDB 3.4 and it works like a charm, thanks a lot. Very easy to add more keys like this `'_id': {"country":"$country", "city":"$city", 'language':"$language" }` – Florent Jul 03 '18 at 16:58
  • Drop [`{ "$out": "collection" }`](https://docs.mongodb.com/manual/reference/operator/aggregation/out/) if you don't want to overwrite the current collection (e.g. to use as a query), or use `{"$out": "newcollection"}` to create a new collection (named "newcollection"). – Skippy le Grand Gourou Jan 15 '20 at 16:02
  • @styvane Why the `{ "$sort": { "_id": 1 }}` step ? It doesn't seem necessary ? – Skippy le Grand Gourou Jan 15 '20 at 16:48
  • 1
    @SkippyleGrandGourou this is because we want to maintain the oldest version for each document. And the `_id` which in this case is an [`ObjectId`](https://docs.mongodb.com/manual/reference/bson-types/#objectid) is based on timestamp. – styvane Jan 15 '20 at 18:59
  • 1
    Spectacular answer! Saved a ton of my time. To be honest, a more concrete example should be placed on MongoDB's docs in implementing $replaceRoot and $$ROOT – johnmikelridzz Apr 13 '20 at 12:42
3

Use a for loop, it will take time but will do the work

db.amazon_sales.find({}, {asin:1}).sort({_id:1}).forEach(function(doc){
    db.amazon_sales.remove({_id:{$gt:doc._id}, asin:doc.asin});
})

Then and this index

db.amazon_sales.createIndex( { "asin": 1 }, { unique: true } )
JavaSheriff
  • 7,074
  • 20
  • 89
  • 159