
In the documentation for MongoDB it says: "Changed in version 3.0: The dropDups option is no longer available."

Is there anything I can do (other than downgrading) if I actually want to create a unique index and destroy duplicate entries?

Please keep in mind that I receive about 300 inserts per second, so I can't just delete all duplicates and hope no new ones come in by the time I'm done indexing.

Alonzorz
  • I don't understand your question. Do I get it right that you have existing documents including dupes, and now you want to put a unique index on the field containing duplicates while at the same time potential new dupes come in? – Markus W Mahlberg May 12 '15 at 16:36
  • Yes. I want to get rid of the dups, and if new ones come in, reject them. – Alonzorz May 13 '15 at 08:41
  • I'm stuck with this issue too; is there any alternative to get rid of duplicates without `dropDups` in MongoDB >= 3.*? – dr.dimitru May 14 '15 at 18:23

3 Answers


Yes, `dropDups` has been deprecated since version 2.7.5 because it was not possible to predict correctly which document would be deleted in the process.

Typically, you have two options:

  1. Use a new collection:

    • Create a new collection,
    • Create the unique index on this new collection,
    • Run a batch to copy all the documents from the old collection to the new one, making sure you ignore duplicate key errors during the process.
  2. Deal with it in your current collection manually:

    • Make sure your code won't insert any more duplicate documents,
    • Run a batch on your collection to delete the duplicates (making sure you keep the right one if they are not completely identical),
    • Then add the unique index.

For your particular case, I would recommend the first option, but with a trick:

  • Create a new collection with the unique index,
  • Update your code so that you now insert documents into both collections,
  • Run a batch to copy all documents from the old collection to the new one (ignoring duplicate key errors),
  • Rename the new collection to match the old name,
  • Re-update your code so that you write only to the "old" collection again.
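
The migration above can be sketched in the mongo shell as follows. The collection names (`old_coll`, `new_coll`), the indexed field (`value`), and the batch size are assumptions for illustration, not part of the original answer:

```javascript
// Step 1: create the new collection with the unique index up front,
// so duplicates are rejected as they arrive.
db.new_coll.createIndex({ "value": 1 }, { unique: true })

// Step 3: copy in unordered batches; with { ordered: false } a duplicate
// key error does not stop the remaining inserts in the batch.
var batch = [];
db.old_coll.find().forEach(function (doc) {
  batch.push(doc);
  if (batch.length === 1000) {
    try {
      db.new_coll.insertMany(batch, { ordered: false });
    } catch (e) {
      // duplicate key errors (code 11000) are expected here; inspect
      // e.writeErrors if you want to fail on anything else
    }
    batch = [];
  }
});
if (batch.length > 0) {
  try {
    db.new_coll.insertMany(batch, { ordered: false });
  } catch (e) {
    // duplicate key errors are expected here
  }
}

// Step 4: swap the collections; the second argument (dropTarget)
// drops the old collection in the same operation.
db.new_coll.renameCollection("old_coll", true)
```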
Maxime Beugnet
    Option 1 is probably the best way to go for recreating indexes as well, since a live system will have to wait for indexes to rebuild which can slow it down. – Pykler Jul 16 '15 at 17:50
  • `make sure you ignore duplicated key error during the process.` How would you do this? It seems the errors stop the transaction mid-process – Quest Jul 19 '16 at 19:43
  • Use MongoDB unordered bulk inserts: "If an error occurs during the processing of one of the write operations, MongoDB will continue to process remaining write operations in the list." Example: db.persons.insert([{"_id" : "Bob"}, {"_id" : "John"}, {"_id" : "Bob"}, {"_id" : "Marc"}], { ordered : false }) will insert 3 documents and show one duplicate key error. With {ordered : true}, only the first 2 would be inserted. More doc [here](https://docs.mongodb.com/manual/reference/method/db.collection.initializeUnorderedBulkOp/#db.collection.initializeUnorderedBulkOp) – Maxime Beugnet Jul 20 '16 at 00:00
  • @Quest: Remember there are no transactions in MongoDB: it's not ACID. If you run 10 inserts, it won't roll back if the fifth one returns an error. Don't confuse this with document-level locking, which is a kind of "mini" transaction true only for one document. – Maxime Beugnet Jul 20 '16 at 12:47
  • DropDups is deprecated from 2.6 version of MongoDB. https://docs.mongodb.com/v2.6/tutorial/create-a-unique-index/#drop-duplicates – Akansh Sep 13 '16 at 07:01
  • Just a small update regarding my previous comment: now MongoDB does support multi-document ACID transactions even across a sharded cluster (v4.2): https://www.mongodb.com/transactions – Maxime Beugnet Dec 17 '19 at 16:04

As highlighted by @Maxime-Beugnet, you can create a batch script to remove duplicates from a collection. I have included my approach below, which is relatively fast if the number of duplicates is small in comparison to the collection size. For demonstration purposes, this script will de-duplicate the collection created by the following script:

db.numbers.drop()

var counter = 0;
while (counter <= 100000) {
  db.numbers.insert({ "value": counter });
  db.numbers.insert({ "value": counter });
  if (counter % 2 === 0) {
    db.numbers.insert({ "value": counter });
  }
  counter = counter + 1;
}

You can find the duplicates in this collection by writing an aggregate query that returns every value appearing in more than one document, along with the `_id`s of the documents that share it:

var cur = db.numbers.aggregate([
  { $group: { _id: { value: "$value" }, uniqueIds: { $addToSet: "$_id" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
]);

Using the cursor you can then iterate over the duplicate records and implement your own business logic to decide which of the duplicates to remove. In the example below I simply keep the first `_id` in the `uniqueIds` array (note that `$addToSet` does not guarantee any particular order, so this is not necessarily the oldest document):

while (cur.hasNext()) {
    var doc = cur.next();
    var index = 1;
    while (index < doc.uniqueIds.length) {
        db.numbers.remove({ _id: doc.uniqueIds[index] });
        index = index + 1;
    }
}

After removing the duplicates you can add a unique index:

db.numbers.createIndex({ "value": 1 }, { unique: true })
Alex

pip install mongo_remove_duplicate_indexes

The best way is to create a script in Python or any language you prefer: iterate over the collection, create a new collection with a unique index via `db.collectionname.createIndex({'indexname': 1}, {unique: true})`, and insert the documents from the previous collection into the new one. Documents with a duplicate value for the indexed key will not be inserted into the new collection, and you can handle the resulting exception easily with exception handling.
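
As a sketch of this approach in the mongo shell (the collection names `source` and `target` and the field `email` are assumptions, not from the package above), the duplicate key error can be caught per insert by its error code, 11000:

```javascript
// Target collection rejects duplicates on the chosen field.
db.target.createIndex({ "email": 1 }, { unique: true })

// Copy documents one by one; a duplicate raises a duplicate key
// error (code 11000) that we skip instead of aborting the copy.
db.source.find().forEach(function (doc) {
  try {
    db.target.insertOne(doc);
  } catch (e) {
    if (e.code !== 11000) throw e;  // re-throw anything unexpected
  }
});
```

Inserting one document at a time is slower than unordered bulk inserts, but it makes it easy to log or otherwise handle each rejected duplicate individually.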

Check out the package source code for an example.