
I am trying to remove duplicates from MongoDB, but all the solutions I find fail. Given the current JSON structure:

```json
{
    "_id": { "$oid": "5cee31bbca8a185b76a692db" },
    "date": { "$date": "2018-10-07T19:11:38.000Z" },
    "id": "1049014405130858496",
    "username": "chrisoldcorn",
    "text": "“The #UK can rest now. The Orange Buffoon is back in his xenophobic #WhiteHouse!” #news #politics #trump #populist #uspoli #ukpolitics #ukpoli #london #scotland #TrumpBaby #usa #america #canada #eu #europe #brexit #maga #msm #gop #elections #election2018 https://medium.com/@chrisoldcorn/trump-babys-uk-visit-a-reflection-1c2aa4ad942 …pic.twitter.com/Y6Yihs9g6K",
    "retweets": 1,
    "favorites": 0,
    "mentions": "@chrisoldcorn",
    "hashtags": "#UK #WhiteHouse #news #politics #trump #populist #uspoli #ukpolitics #ukpoli #london #scotland #TrumpBaby #usa #america #canada #eu #europe #brexit #maga #msm #gop #elections #election2018",
    "geo": "",
    "replies": 0,
    "to": null,
    "lan": "en"
}
```

I need to remove all duplicates based on the `id` field in the collection.

I have tried `db.tweets.ensureIndex( { id: 1 }, { unique: true, dropDups: true } )`, but I am not sure this is the correct way. I obtain this output:

[screenshot of the shell output]

Can anyone help me?


1 Answer


It looks like you are running MongoDB version 3.0 or newer, and hence cannot remove duplicates by creating a unique index with `dropDups`.

According to the docs:

> Changed in version 3.0: The `dropDups` option is no longer available.
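
For context, `ensureIndex` is a deprecated alias for `createIndex` on these versions, and without `dropDups` the index build simply fails while duplicates exist. A minimal illustration in the mongo shell:

```javascript
// On MongoDB 3.0+ there is no dropDups; building a unique index over a
// collection that still contains duplicate "id" values is rejected with
// a duplicate key (E11000) error instead of deleting the duplicates.
db.tweets.createIndex({ id: 1 }, { unique: true })
```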

The fastest way to do this would be to:

  1. Create a Dump
  2. Drop the collection
  3. Create the new Index
  4. Restore the Dump

All duplicate documents will be dropped during the restore inserts, since each insert that violates the unique index fails and is skipped. A sketch of these steps follows.
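
A minimal sketch of those four steps, assuming the database is named `twitterdb` (substitute your own database name and dump path):

```bash
# 1. Dump only the affected collection
mongodump --db twitterdb --collection tweets --out ./dump

# 2. Drop the collection
mongo twitterdb --eval 'db.tweets.drop()'

# 3. Create the new unique index on "id" before restoring
mongo twitterdb --eval 'db.tweets.createIndex({ id: 1 }, { unique: true })'

# 4. Restore the dump; each insert that violates the unique index fails
#    with a duplicate key error and is skipped, the rest are inserted
mongorestore --db twitterdb --collection tweets ./dump/twitterdb/tweets.bson
```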

The next best solution would be to run a script that collects all duplicate `id` values and removes the extra documents, for example the aggregation sketch below.
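
A minimal sketch of such a script for the mongo shell, assuming a server and shell recent enough for `deleteMany` (MongoDB 3.2+). It groups tweets by `id`, keeps the first document of each group, and deletes the rest:

```javascript
db.tweets.aggregate([
  // Group documents by the duplicated field and collect their ObjectIds
  { $group: { _id: "$id", dups: { $push: "$_id" }, count: { $sum: 1 } } },
  // Keep only ids that occur more than once
  { $match: { count: { $gt: 1 } } }
], { allowDiskUse: true }).forEach(function (group) {
  group.dups.shift();                                  // keep one document
  db.tweets.deleteMany({ _id: { $in: group.dups } });  // remove the rest
});
```

Once the collection is clean, the unique index can be created normally so any new duplicates are rejected on insert.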

  • Indeed, `mongorestore` will complain about index violations, but will drop the offending documents all the same (and continue with the process) – Sergio Tulentsev May 30 '19 at 20:54