
I have data in MongoDB. The structure of one object is like this:

{
    "_id" : ObjectId("5395177980a6b1ccf916312c"),
    "institutionId" : "831",
    "currentObject" : {
          "systemIdentifiers" : [
            {
                "value" : "24387",
                "system" : "ABC"
            }]
      }
}

I need to know how many objects share the same institutionId and systemIdentifiers[0].value, and I want to return only the ones duplicated in that way. To do that, I group the objects by these two IDs and count the occurrences.

An object (a pair of IDs) should be returned only when its count is greater than 1.

This is the code that does the grouping using MapReduce:

// Emit one {count: 1} per document, keyed by the pair of IDs.
var map = function() {
    var key = this.institutionId;
    var val = this.currentObject.systemIdentifiers[0].value;
    emit({"institutionId": key, "workId": val}, {count: 1});
};
// Sum the counts emitted for each key.
var reduce = function(key, values) {
    var count = 0;
    values.forEach(function(v) {
        count += v.count;
    });
    return {count: count};
};
db.name.mapReduce(map, reduce, {out: "grouped"});
db.grouped.find();

To get only those with a count greater than 1, I run:

db.grouped.aggregate([{$match:{"value.count":{$gt: 1}}}])

An example result then looks like this:

{
    "_id" : {
        "institutionId" : "1004",
        "workId" : "591426"
    },
    "value" : {
        "count" : 2
    }
}

But I am curious whether it is possible to do this with MapReduce alone, as one statement, for example by adding a finalizer or something similar.

chridam
Szymon Roziewski

2 Answers


If only a single document has a given key, that key never goes through reduce; it is considered reduced already. That is how MongoDB map-reduce behaves:

MongoDB will not call the reduce function for a key that has only a single value.

Using finalize doesn't help much either: if the finalize function returns reducedVal when count > 1 and null otherwise, the key still appears in the result, with null (instead of a count of 1) as its value.

I am afraid that with a single map-reduce, documents with a count of 1 will always be in the result, since they are emitted from map.
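To make the two points above concrete, here is a small in-memory sketch in plain JavaScript. It is a simplified model written for illustration, not MongoDB's implementation: `simulateMapReduce` is a hypothetical helper, `emit` is passed to `map` as an argument (in MongoDB it is a global), and the three sample documents are made up.

```javascript
// Simplified in-memory model of a single map-reduce pass (illustration only).
function simulateMapReduce(docs, map, reduce, finalize) {
  const groups = new Map(); // serialized key -> { key, values }
  const emit = (key, value) => {
    const id = JSON.stringify(key);
    if (!groups.has(id)) groups.set(id, { key, values: [] });
    groups.get(id).values.push(value);
  };
  docs.forEach((doc) => map.call(doc, emit));

  const out = [];
  for (const { key, values } of groups.values()) {
    // Mirrors the documented behaviour: reduce is skipped for a single value.
    let value = values.length > 1 ? reduce(key, values) : values[0];
    // finalize can only replace the value; it cannot drop the key.
    if (finalize) value = finalize(key, value);
    out.push({ _id: key, value });
  }
  return out;
}

// Made-up sample data: one duplicated pair and one unique pair.
const docs = [
  { institutionId: "1004", currentObject: { systemIdentifiers: [{ value: "591426", system: "ABC" }] } },
  { institutionId: "1004", currentObject: { systemIdentifiers: [{ value: "591426", system: "ABC" }] } },
  { institutionId: "831", currentObject: { systemIdentifiers: [{ value: "24387", system: "ABC" }] } },
];

function map(emit) {
  emit(
    { institutionId: this.institutionId, workId: this.currentObject.systemIdentifiers[0].value },
    { count: 1 }
  );
}

function reduce(key, values) {
  return { count: values.reduce((sum, v) => sum + v.count, 0) };
}

// Attempt to filter in finalize: keep the value only when count > 1.
function finalize(key, reduced) {
  return reduced.count > 1 ? reduced : null;
}

const result = simulateMapReduce(docs, map, reduce, finalize);
console.log(JSON.stringify(result, null, 2));
```

In this model the unique pair is still present in `result`; only its value has been replaced by null, which is why a follow-up filter is still needed.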

You can chain two map-reduce operations, where the second map does not emit the documents with count < 2. But I do not think that is better than an extra query, as in your example.
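The chained variant could look roughly like the sketch below. It assumes the "grouped" collection from the question as the intermediate output; the names `mapDuplicates`, `reduceDuplicates`, `findDuplicates` and the output collection "duplicates" are hypothetical and chosen only for illustration.

```javascript
// Second-pass map: its input documents are rows of "grouped",
// shaped like { _id: { institutionId, workId }, value: { count } }.
var mapDuplicates = function () {
    if (this.value.count > 1) {
        emit(this._id, this.value);
    }
};

// Each key is emitted at most once in the second pass, so this reduce is
// not expected to be invoked; mapReduce still requires a reduce function.
var reduceDuplicates = function (key, values) {
    return values[0];
};

// Hypothetical helper: `db` is the mongo shell database handle, and
// `map` / `reduce` are the first-pass functions from the question.
function findDuplicates(db, map, reduce) {
    db.name.mapReduce(map, reduce, { out: "grouped" });
    db.grouped.mapReduce(mapDuplicates, reduceDuplicates, { out: "duplicates" });
    return db.duplicates.find();
}
```

In the mongo shell this would be called as `findDuplicates(db, map, reduce)`. It still runs two operations, so it trades the extra query for an extra map-reduce pass rather than removing a step.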

sergiuz

A much simpler and more efficient approach is to use the aggregation framework: the $arrayElemAt operator returns the first subdocument from the array, a $group stage then aggregates the counts, and a final $match stage filters the results by the given criterion.

The following example shows this faster approach:

db.name.aggregate([
    {
        "$project": {
            "key": "$institutionId",
            "val": {
                "$arrayElemAt": ["$currentObject.systemIdentifiers", 0]
            }
        }
    },
    {
        "$group": {
            "_id": {
                "institutionId": "$key",
                "workId": "$val.value"
            },
            "count": { "$sum": 1 }
        }
    },
    { "$match": { "count": { "$gt": 1 } } }
])
chridam