I have a very large collection in MongoDB and I want to remove the duplicate records from it. My first thought was to drop the index and rebuild it with dropDups. However, there are too many duplicates for MongoDB to handle that way.
So I turned to MapReduce for help. Here is my current progress:
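var m = function () {
    // Emit one count per document, keyed by myid.
    emit(this.myid, 1);
};
var r = function (k, vals) {
    // Sum the counts, so a value greater than 1 marks a duplicated myid.
    return Array.sum(vals);
};
var res = db.userList.mapReduce(m, r, { out: "myoutput" });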
Now the "myoutput" collection holds each "myid" (stored as _id) together with its count, so the duplicated ones are those with a value greater than 1. However, I don't know how to remove the records from userList by referencing the keys in myoutput. It is supposed to be something like this:
db.myoutput.find({ value: { $gt: 1 } }).forEach(function (obj) {
    db.userList.remove(xxxxxxxxx); // I don't know how to do so
});
Btw, removing by myid inside forEach seems like it would wipe all records with the same myid, but I only want to remove the extra copies. For example:
{ "_id" : ObjectId("4edc6773e206a55d1c0000d8"), "myid" : 0 }
{ "_id" : ObjectId("4edc6780e206a55e6100011a"), "myid" : 0 }
{ "_id" : ObjectId("4edc6784e206a55ed30000c1"), "myid" : 0 }
The final result should keep only one of these records. Can someone help me with this?
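To make the intended behaviour concrete, here is a rough sketch of the "keep one, remove the rest" logic in plain JavaScript. It is only an illustration and has not been run against the real collection: idsToRemove is a hypothetical helper name, plain strings stand in for the ObjectId values, and it assumes myid values can be compared after converting them to strings.

```javascript
// Hypothetical helper (not a MongoDB function): given documents shaped like
// the example above, return the _id of every document except the first one
// seen for each myid.
function idsToRemove(docs) {
    var seen = {};   // myid values for which one document is already kept
    var extra = [];  // _id values of the surplus copies
    docs.forEach(function (doc) {
        // Assumption: myid values are distinguishable as strings.
        var key = String(doc.myid);
        if (Object.prototype.hasOwnProperty.call(seen, key)) {
            extra.push(doc._id);
        } else {
            seen[key] = true;
        }
    });
    return extra;
}

// Plain strings stand in for ObjectId values so this runs outside the shell.
var sample = [
    { _id: "4edc6773e206a55d1c0000d8", myid: 0 },
    { _id: "4edc6780e206a55e6100011a", myid: 0 },
    { _id: "4edc6784e206a55ed30000c1", myid: 0 }
];
console.log(idsToRemove(sample));
```

Inside the forEach above, I imagine the list could come from something like db.userList.find({ myid: obj._id }).toArray(), and the result could then be passed to db.userList.remove({ _id: { $in: ids } }), but I am not sure this is the right approach for a collection this large.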
Thank you. :)