
I have a collection of md5 hashes in MongoDB and I'd like to find all duplicates. The md5 field is indexed. Do you know a fast way to do that using map-reduce, or should I just iterate over all records and check for duplicates manually?

My current approach using map-reduce iterates over the collection almost twice (assuming there is a very small number of duplicates):

res = db.files.mapReduce(
    function () {
        // map: emit a count of 1 for each md5 value
        emit(this.md5, 1);
    },
    function (key, vals) {
        // reduce: sum the counts per md5
        return Array.sum(vals);
    }
)

// res.result holds the name of the map-reduce output collection;
// a value greater than 1 means the md5 occurred more than once
db[res.result].find({value: {$gt: 1}}).forEach(
    function (obj) {
        db.duplicates.insert(obj);
    }
);
Piotr Czapla
  • OP, please consider changing the accepted answer to @expert's. More concise and more efficient. – Guy Mar 25 '22 at 10:45

3 Answers


I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. Example below:

db.places.aggregate([
    { $group : { _id : "$extra_info.id", total : { $sum : 1 } } },  // count occurrences per id
    { $match : { total : { $gte : 2 } } },                          // keep only duplicated ids
    { $sort : { total : -1 } },                                     // most duplicated first
    { $limit : 5 }                                                  // show the top 5
]);

It searches for documents whose extra_info.id occurs two or more times, sorts the results in descending order of that count, and returns the first five.

expert
  • I don't quite see how your solution works on the question data. Should the group line be `{ $group : {'md5' : "$extra_info.md5", total : { $sum : 1}}},`? – zhon Sep 30 '13 at 15:49
  • @zhon No. Have you read the documentation? It says `For this _id field, you can specify various expressions, including a single field from the documents in the pipeline, a computed value from a previous stage, a document that consists of multiple fields, and other valid expressions, such as constant or subdocument fields. You can use $project operators in expressions for the _id field.` – expert Sep 30 '13 at 16:48
  • For the question data, the group line should be: { $group : { _id : "$md5", total : { $sum : 1 } } } – kdkeck Sep 26 '14 at 21:45
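
As the comments point out, the pipeline needs the question's field name. A minimal sketch, assuming the collection is db.files and the hash field is md5 as in the question:

// Group by md5, count occurrences, and keep only hashes seen more than once.
db.files.aggregate([
    { $group : { _id : "$md5", total : { $sum : 1 } } },
    { $match : { total : { $gt : 1 } } }
]);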

The easiest way to do it in one pass is to sort by md5 and then process appropriately.

Something like:

var previous_md5;

// Scan the md5 values in sorted order; duplicates end up back-to-back.
db.files.find( { "md5" : { $exists : true } }, { "md5" : 1 } ).sort( { "md5" : 1 } ).forEach( function(current) {

  if (current.md5 == previous_md5) {
    // Upsert one document per duplicated md5 and count the extra copies.
    db.duplicates.update( { "_id" : current.md5 }, { "$inc" : { count : 1 } }, true );
  }

  previous_md5 = current.md5;

});

That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, its occurrences will be "back-to-back" after sorting, so we just keep a pointer to previous_md5 and compare it to current.md5. If we find a duplicate, we drop it into the duplicates collection (using $inc with an upsert to count the number of duplicates).

This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
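
What the clean-up looks like depends on your policy. As a rough sketch (the keep-the-first-copy rule here is an assumption, not part of the answer):

// For each duplicated md5, keep the first document found and remove the rest.
db.duplicates.find().forEach( function(dup) {
  var kept = false;
  db.files.find( { "md5" : dup._id } ).forEach( function(doc) {
    if (!kept) {
      kept = true;                               // spare the first copy
    } else {
      db.files.remove( { "_id" : doc._id } );    // drop the extras
    }
  });
});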

Gates VP

You can do a group by that field and then query to get the duplicates (those with a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group

Although the fastest thing might be to do a query which returns only that field and then to do the aggregation in the client. Group/map-reduce need access to the whole document, which is much more costly than providing the data straight from the index (such queries are covered by the index in 1.7.3+).
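
A minimal sketch of that client-side approach, assuming the projection lets the md5 index cover the query once _id is excluded:

// Covered query: fetch only md5, then tally duplicates on the client.
var counts = {};
db.files.find( {}, { "md5" : 1, "_id" : 0 } ).forEach( function(doc) {
  counts[doc.md5] = (counts[doc.md5] || 0) + 1;
});
for (var md5 in counts) {
  if (counts[md5] > 1) print(md5 + ": " + counts[md5]);
}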

If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5:value, count:value} so you can skip the aggregation, and it will be extremely fast when you need to cull duplicates.
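
One way to maintain such a collection, sketched here with a hypothetical newFile document being inserted (md5_counts is an assumed collection name):

// On every insert into files, bump the counter for that hash (upsert creates it).
db.md5_counts.update( { "_id" : newFile.md5 }, { "$inc" : { count : 1 } }, true );

// Culling duplicates then reduces to a cheap indexed query:
db.md5_counts.find( { count : { $gt : 1 } } );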

Scott Hernandez
  • I can't use gruping because it is limited to 10k elements (I have 3M). But the note that MR will return just data from the index is interesting. I didn't know that. Thanks! (+1) – Piotr Czapla Nov 21 '10 at 19:14