62

I have a collection in MongoDB with around 3 million records. A sample record looks like this:

 {
   "_id" : ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
   "source_references" : [
     {
       "_id" : ObjectId("5045xxxxxxxxxxxxxx"),
       "name" : "xxx",
       "key" : 123
     }
   ]
 }

There are a lot of duplicate records in the collection with the same source_references.key. (By duplicate I mean documents sharing the same source_references.key, not the same _id.)

I want to remove the duplicate records based on source_references.key. I'm thinking of writing some PHP code to traverse each record and remove any duplicate it finds.

Is there a way to remove the duplicates from within the Mongo shell itself?

Somnath Muluk
user1518659

8 Answers

83

This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach will be required in most cases. For example, you could use aggregation as suggested in: MongoDB duplicate documents even after adding unique key.

If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:

db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})

This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.

Important Note: Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so the index only applies to documents with a source_references.key field.
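
For example, a minimal variant of the command above with sparse added (same collection and field names as in the snippet above):

db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true, sparse : true})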

Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
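
For the aggregation-based alternative mentioned in the note above, a minimal sketch in the shell (assuming source_references.key resolves to a single value per document, and keeping an arbitrary first document per key) could look like this:

db.things.aggregate([
  // Group documents by the duplicate key and collect their _ids.
  { $group: { _id: "$source_references.key", dups: { $push: "$_id" }, count: { $sum: 1 } } },
  // Keep only the groups that actually contain duplicates.
  { $match: { count: { $gt: 1 } } }
], { allowDiskUse: true }).forEach(function (group) {
  // Keep the first _id in each group and remove the rest.
  group.dups.slice(1).forEach(function (id) {
    db.things.remove({ _id: id });
  });
});

Add a $sort stage before the $group if you care which duplicate survives.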

Skippy le Grand Gourou
Stennie
  • 5
    Hands-on explanations like this should be mandatory in all docs – Erik Nov 16 '12 at 19:46
  • Can we delete only the newest duplicates? I prefer to keep the old ones; how can I do this? – Aysennoussi Feb 19 '14 at 09:08
  • @Sekai: If you want to delete the newest duplicates (or have more control) you'll have to write a bit of custom script/code to find duplicates and work out which document(s) you want to delete. – Stennie Feb 19 '14 at 13:24
  • I'm doing that, I just thought there would be a one-line solution! – Aysennoussi Feb 19 '14 at 13:58
  • I don't want to remove the duplicates, but I want to update some status on the duplicates, e.g. `duplicate : true`. Is there any way to do this? – GGGGG Mar 06 '14 at 07:10
  • Does having an index with `dropDups` mean that future inserts are silently ignored, or will they still send back a unique constraint error to the client? – Nic Cottrell Sep 04 '14 at 09:49
  • 1
    @NicCottrell: the `dropDups` option only applies when the unique index is created. Future inserts with duplicate keys will generate a duplicate key error. – Stennie Sep 04 '14 at 11:26
  • 1
    @wordsforthewise Obvious reasons to deprecate `dropDups`: it was not clear which duplicate would be removed and more importantly deleting documents is an unexpected side effect of creating an index. Worst case scenario: if you create a unique index with `dropDups:true` on a field that doesn't exist (eg. due to a typo) the indexed value will be `null` and you'll be left with exactly one document in the collection. – Stennie Oct 26 '16 at 06:20
  • @wordsforthewise I assumed you're referring to the complexity of the `dropDups` option for unique indexes rather than deprecation of `ensureIndex` itself, but FYI `ensureIndex` actually calls `createIndex` and only returns a result if there is an error: https://github.com/mongodb/mongo/blob/v3.2/src/mongo/shell/collection.js#L699. There's no need for two methods when both are actually using the `createIndex` command and the only difference is whether a calling program cares about the return code. – Stennie Oct 26 '16 at 06:34
  • I see, in my case I don't care about which duplicate is dropped because by definition they are equal. I guess as long as you remember that one caveat with fields that don't exist, it should be fine, right? – wordsforthewise Oct 26 '16 at 15:24
  • How do we drop duplicates while creating an index in Mongo, given that `ensureIndex` has been deprecated and replaced by `createIndex`? – Jee Mok May 26 '21 at 06:25
71

This is the easiest query I used, on MongoDB 3.2:

db.myCollection.find({}, {myCustomKey: 1}).sort({_id: 1}).forEach(function(doc) {
    // For each document, remove every later document (greater _id) that shares the same key.
    db.myCollection.remove({_id: {$gt: doc._id}, myCustomKey: doc.myCustomKey});
})

Index your myCustomKey field before running this to speed it up.
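
A one-liner for that, using the collection and field names from the snippet above:

db.myCollection.createIndex({ myCustomKey: 1 })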

Kanak Singhal
  • If I have multiple duplicates, will it remove all of the duplicates? – sara Apr 06 '16 at 15:05
  • Yes @sara, it will delete all duplicates unless you specify a limit in the remove query – Kanak Singhal Apr 20 '16 at 05:10
  • 2
    How would this work if I need to search on multiple keys instead of just one? – eozzy May 07 '17 at 00:40
  • 2
    FYI, if you want to leave the newest record change the $gt to $lt db.myCollection.find({}, {myCustomKey:1}).sort({_id:1}).forEach(function(doc){ db.myCollection.remove({_id:{$lt:doc._id}, myCustomKey:doc.myCustomKey}); }) – SteveO7 Jun 19 '17 at 20:44
  • 1
    This seemed to be working very slowly. I found this to work much better, but took a lot of work to get working: https://stackoverflow.com/a/44522593/4549682 – wordsforthewise Nov 12 '17 at 21:13
  • @3zzy To search on multiple keys, you would add another key to the projection. Then, in the function, you would add the same key in the query. – vaer-k Feb 14 '18 at 00:11
  • @KanakSinghal $gt:doc._id Is the document created with a greater id guaranteed to be the elder one? – javadev Feb 28 '18 at 17:11
  • I get this error: "Executor error during find command :: caused by :: Sort exceeded memory limit of XX bytes, but did not opt in to external sorting." – Coconut Oct 07 '20 at 20:23
10

While @Stennie's answer is valid, it is not the only way. In fact, the MongoDB manual asks you to be very cautious while doing that. There are two other options:

  1. Let MongoDB do it for you using map-reduce (see the sketch after this list).
  2. Do it programmatically, which is less efficient.
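
A minimal map-reduce sketch of option 1 in the mongo shell (the collection name things and the flat field key are assumptions; adapt them to your schema):

db.things.mapReduce(
  // Map: emit each document's duplicate key together with its _id.
  function () { emit(this.key, { ids: [this._id], count: 1 }); },
  // Reduce: merge the _id lists and counts for documents sharing a key.
  function (key, values) {
    var result = { ids: [], count: 0 };
    values.forEach(function (v) {
      result.ids = result.ids.concat(v.ids);
      result.count += v.count;
    });
    return result;
  },
  { out: "duplicate_report" }
);

// Remove all but the first document for every key that occurs more than once.
db.duplicate_report.find({ "value.count": { $gt: 1 } }).forEach(function (doc) {
  doc.value.ids.slice(1).forEach(function (id) {
    db.things.remove({ _id: id });
  });
});
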
Aravind Yarram
8

Here is a slightly more 'manual' way of doing it:

Essentially, first, get a list of all the unique keys you are interested in.

Then perform a search using each of those keys and delete all but the first document returned.

    db.collection.distinct("key").forEach((num) => {
      var i = 0;
      db.collection.find({key: num}).forEach((doc) => {
        // Skip the first document for this key, then remove one duplicate per later iteration.
        if (i) db.collection.remove({key: num}, { justOne: true })
        i++
      })
    });
Fernando
  • I like this straightforward and simple method, but I think it would be more logical to remove documents by _id rather than by key. So something like this (inside if statement): `db.collection.remove({ _id: doc._id })` – salkcid Jun 14 '21 at 06:33
6

I had a similar requirement but I wanted to retain the latest entry. The following query worked with my collection which had millions of records and duplicates.

/** Create an array to store the ids of all duplicate records */
var duplicates = [];

/** Start Aggregation pipeline*/
db.collection.aggregate([
  {
    $match: { /** Add any filter here. Add index for filter keys*/
      filterKey: {
        $exists: false
      }
    }
  },
  {
    $sort: { /** Sort it in such a way that you want to retain first element*/
      createdAt: -1
    }
  },
  {
    $group: {
      _id: {
        key1: "$key1", key2: "$key2" /** These are the keys which define a duplicate. Here documents with the same value for key1 and key2 will be considered duplicates */
      },
      dups: {
        $push: {
          _id: "$_id"
        }
      },
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        "$gt": 1
      }
    }
  }
],
{
  allowDiskUse: true
}).forEach(function(doc){
  doc.dups.shift();
  doc.dups.forEach(function(dupId){
    duplicates.push(dupId._id);
  })
})

/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
    temparray = duplicates.slice(i,i+chunk);
    db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}
Mayank Patel
4

Expanding on Fernando's answer, I found that it was taking too long, so I modified it.

var x = 0;
db.collection.distinct("field").forEach(fieldValue => {
  var i = 0;
  db.collection.find({ "field": fieldValue }).forEach(doc => {
    if (i) {
      db.collection.remove({ _id: doc._id });
    }
    i++;
    x += 1;
    if (x % 100 === 0) {
      print(x); // Every time we process 100 docs.
    }
  });
});

The improvement is basically using the document _id for removal, which should be faster, and adding progress output; you can change the reporting interval to whatever value you like.

Also, indexing the field before the operation helps.

Computer's Guy
1

pip install mongo_remove_duplicate_indexes

  1. Create a script in any language.
  2. Iterate over your collection.
  3. Create a new collection and create a unique index on it. The index has to be on the same field you wish to remove duplicates from in your original collection. For example, if you have a collection gaming with a field genre that contains duplicates you wish to remove, create a new collection (db.createCollection("cname")) and a unique index on it (db.cname.createIndex({'genre': 1}, {unique: true})). Now when you insert documents, only the first one with a given genre will be accepted; the others will be rejected with a duplicate key error.
  4. Insert the documents you read from the original collection into the new collection and handle the duplicate key exceptions (e.g. pymongo.errors.DuplicateKeyError). A mongo-shell sketch of the same idea follows this list.
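
A minimal mongo-shell sketch of steps 3–4, reusing the gaming/cname/genre names from the example above (in pymongo you would catch pymongo.errors.DuplicateKeyError instead):

// New collection with a unique index on the field that defines a duplicate.
db.createCollection("cname");
db.cname.createIndex({ genre: 1 }, { unique: true });

// Copy documents across; duplicates on "genre" raise a duplicate key error (code 11000) and are skipped.
db.gaming.find().forEach(function (doc) {
  try {
    db.cname.insertOne(doc);
  } catch (e) {
    if (e.code !== 11000) throw e;
  }
});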

Check out the source code of the mongo_remove_duplicate_indexes package for a better understanding.

1

If you have enough memory, you can do something like this in Scala:

cole.find().groupBy(_.customField).filter(_._2.size > 1).map(_._2.tail).flatten.map(_.id)
  .foreach(x => cole.remove({id $eq x}))
Dan Lowe
gilcu2