As highlighted by @Maxime-Beugnet you can create a batch script to remove duplicates from a collection. I have included my approach below that is relatively fast if the number of duplicates are small in comparison to the collection size. For demonstration purposes this script will de-duplicate the collection created by the following script:
db.numbers.drop()
var counter = 0
while (counter<=100000){
db.numbers.save({"value":counter})
db.numbers.save({"value":counter})
if (counter % 2 ==0){
db.numbers.save({"value":counter})
}
counter = counter + 1;
}
You can remove the duplicates in this collection by writing an aggregate query that returns all records with more than one duplicate.
var cur = db.numbers.aggregate([{ $group: { _id: { value: "$value" }, uniqueIds: { $addToSet: "$_id" }, count: { $sum: 1 } } }, { $match: { count: { $gt: 1 } } }]);
Using the cursor you can then iterate over the duplicate records and implement your own business logic to decide which of the duplicates to remove. In the example below I am simply keeping the first occurrence:
while (cur.hasNext()) {
var doc = cur.next();
var index = 1;
while (index < doc.uniqueIds.length) {
db.numbers.remove(doc.uniqueIds[index]);
index = index + 1;
}
}
After removal of the duplicates you can add an unique index:
db.numbers.createIndex( {"value":1},{unique:true})