9

I'm using MongoDB and need to remove duplicate records. I have a listing collection that looks like so: (simplified)

[
  { "MlsId": "12345" },
  { "MlsId": "12345" },
  { "MlsId": "23456" },
  { "MlsId": "23456" },
  { "MlsId": "0" },
  { "MlsId": "0" },
  { "MlsId": "" },
  { "MlsId": "" }
]

A listing is a duplicate if the MlsId is not "" or "0" and another listing has that same MlsId. So in the example above, the 2nd and 4th records would need to be removed.

How would I find all duplicate listings and remove them? I started looking at MapReduce but couldn't find an example that fit my case.

Here is what I have so far, but it doesn't check if the MlsId is "0" or "":

m = function () { 
    emit(this.MlsId, 1); 
} 

r = function (k, vals) { 
   return Array.sum(vals); 
} 

res = db.Listing.mapReduce(m,r); 
db[res.result].find({value: {$gt: 1}}); 
db[res.result].drop();
Justin

4 Answers

2

I have not used MongoDB, but I have used MapReduce, and I think you are on the right track with the map/reduce functions. To exclude the "0" and empty-string ids, you can add a check in the map function itself, something like:

m = function () {
  // strict (!==) comparison against the string "0" avoids JS
  // type-coercion surprises such as "" == 0 being true
  if (this.MlsId !== "0" && this.MlsId !== "") {
    emit(this.MlsId, 1);
  }
}

And the reduce function should simply return the summed count for each key (reduce returns a value; it does not call emit):

r = function (k, vals) {
  return Array.sum(vals);
}

After this, you should have a set of key-value pairs in the output, where the key is an MlsId and the value is the number of times that particular ID occurs. I am not sure about the db.drop() part. As you pointed out, it will most probably delete all documents with a given MlsId instead of removing only the duplicate ones. To get around this, maybe you can call drop() first and then re-insert each MlsId once. Will that work for you?
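To see what that output amounts to, the map/reduce counting can be sketched in plain JavaScript over the sample data from the question (an in-memory stand-in, not the shell API):

```javascript
// Sample documents from the question
var listings = [
  { MlsId: "12345" }, { MlsId: "12345" },
  { MlsId: "23456" }, { MlsId: "23456" },
  { MlsId: "0" },     { MlsId: "0" },
  { MlsId: "" },      { MlsId: "" }
];

// Equivalent of the map/reduce pair: skip "0" and "", then
// count how many times each remaining MlsId occurs
var counts = {};
listings.forEach(function (doc) {
  if (doc.MlsId !== "0" && doc.MlsId !== "") {
    counts[doc.MlsId] = (counts[doc.MlsId] || 0) + 1;
  }
});

// Keys whose count is greater than 1 are the duplicated MlsIds
```

Here `counts` ends up as `{ "12345": 2, "23456": 2 }`; the "0" and "" listings never enter the output, mirroring the guard in the map function.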

Hari Menon
  • You may not be able to answer this, but if I declare the m and r functions and then execute the mapReduce function and then run the db[res.result].drop(); command, is it going to drop all listings or only ones that are duplicates? I don't understand mapReduce and emit so don't know how this all works... – Justin Apr 03 '11 at 16:18
  • I have made a small correction for reduce(). I am not sure how db.drop() works but yea, I guess it'll delete all entries for that particular Id. But I hope your understanding of the map-reduce part is clear. I am not sure of mongodb syntax though.. will see if I can figure something out.. – Hari Menon Apr 03 '11 at 16:51
  • You could try adding an index with the remove duplicates option as mentioned here: http://www.mongodb.org/display/DOCS/Indexes It will automatically drop duplicate values. Another way is to first drop() all values and then add one value. – Hari Menon Apr 03 '11 at 17:02
  • I can't use the unique index because in my case there will be multiple valid listings with an MlsId of "" or "0". If I'm going to use the MapReduce route then I guess I'll need to know how to drop only the duplicate records, not all of them. – Justin Apr 03 '11 at 19:54
  • I thought about it some more and decided that I could use the unique index, just on address properties rather than mlsid, since some listings don't have an mlsid but every listing has an address. Thanks for the help! – Justin Apr 03 '11 at 20:12
  • Removed accepted answer, as creating the index generated the error - "too may dups on index build with dropDups=true". – Justin Apr 03 '11 at 21:20
2

In MongoDB you can pass a query to mapReduce to restrict the documents that get mapped. You probably want to do that for the ones you don't care about. Then in the reduce function you can ignore the dups and only return one of the docs for each duplicate key.

I'm a little confused about your goal though. If you just want to find duplicates and remove all but one of them then you can just create a unique index on that field and use the dropDups option; the process of creating the index will drop duplicate docs. Keeping the index will ensure that it doesn't happen again.

http://www.mongodb.org/display/DOCS/Indexes#Indexes-DuplicateValues
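For reference, that approach amounts to `db.Listing.ensureIndex({ MlsId: 1 }, { unique: true, dropDups: true })` in the shell of that era. What dropDups effectively does — keep the first document per key value and discard the rest — can be sketched in plain JavaScript over the question's sample data, which also shows why it doesn't fit this particular case:

```javascript
var listings = [
  { MlsId: "12345" }, { MlsId: "12345" },
  { MlsId: "23456" }, { MlsId: "23456" },
  { MlsId: "0" },     { MlsId: "0" },
  { MlsId: "" },      { MlsId: "" }
];

// dropDups keeps the FIRST document for each key value and
// silently discards every later one
var seen = {};
var kept = listings.filter(function (doc) {
  if (seen[doc.MlsId]) return false; // later duplicate: dropped
  seen[doc.MlsId] = true;
  return true;                       // first occurrence: kept
});

// kept holds one "12345" and one "23456" -- but also only ONE
// "0" and ONE "" listing, which is why dropDups can't be used
// when all of the "0"/"" listings must be preserved.
```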

Scott Hernandez
  • As mentioned above, I can't use the unique index with dropDups because it would only keep one listing with the value "" and one listing with the value "0", when I want to keep all of these. – Justin Apr 03 '11 at 20:02
-1

You can use an aggregation operation to remove duplicates: $unwind the array, introduce a dummy $group with a $sum stage, and ignore the counts in your next stage. Something like this,

db.myCollection.aggregate([
  { $unwind: '$list' },
  {
    $group: {
      '_id': { 'listing_id': '$_id', 'MlsId': '$list.MlsId' },
      'count': { '$sum': 1 }
    }
  },
  {
    $group: {
      '_id': '$_id.listing_id',
      'list': { '$addToSet': { 'MlsId': '$_id.MlsId' } }
    }
  }
]);
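Note that the question's collection is flat, so there is no embedded array to $unwind; the essential stages reduce to a $group on MlsId and a $match on the count. Their effect can be simulated in plain JavaScript over the sample data (document positions stand in for `_id`s):

```javascript
var listings = [
  { MlsId: "12345" }, { MlsId: "12345" },
  { MlsId: "23456" }, { MlsId: "23456" },
  { MlsId: "0" },     { MlsId: "0" },
  { MlsId: "" },      { MlsId: "" }
];

// $group stage: collect document positions per MlsId, skipping
// the "0" and "" values the asker wants to keep
var groups = {};
listings.forEach(function (doc, i) {
  if (doc.MlsId === "0" || doc.MlsId === "") return;
  (groups[doc.MlsId] = groups[doc.MlsId] || []).push(i);
});

// $match { count: { $gt: 1 } } stage: in each group, everything
// after the first entry is a duplicate to delete
var toRemove = [];
Object.keys(groups).forEach(function (key) {
  toRemove = toRemove.concat(groups[key].slice(1));
});
// toRemove is [1, 3]: the 2nd and 4th sample documents,
// matching the question's expected result
```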
SKP
-1

This is how I followed @Hari Menon's answer to remove the duplicates:

// contains duplicated document ids and the number of duplicates
res = db.sampledDB.mapReduce(m, r, { out: "myDupesCollection" });

// iterate through duplicated docs and remove duplicates (keep one)
db.myDupesCollection.find({ value: { $gt: 1 } }).forEach(function (myDoc) {
    var u_id = myDoc._id;   // the mapReduce key, i.e. the MlsId itself
    var count = myDoc.value;
    // remove(query, true) deletes a single matching document, so
    // loop count - 1 times to leave exactly one copy behind
    for (var i = 0; i < count - 1; i++) {
        db.sampledDB.remove({ MlsId: u_id }, true);
    }
});
Moj