
First, the background. I used to have a single collection, logs, and used map/reduce to generate various reports. Most of these reports were based on data from a single day, so I always had a condition d: SOME_DATE. When the logs collection grew extremely big, inserting became extremely slow (slower than the app we were monitoring was generating logs), even after dropping lots of indexes. So we decided to keep each day's data in a separate collection, logs_YYYY-mm-dd. That way the indexes are smaller, and we don't even need an index on the date. This works well since most reports (and thus most map/reduce jobs) run on daily data. However, we have one report that needs to cover multiple days.

And now the question. Is there a way to run a map/reduce (or more precisely, the map) over multiple collections as if it were only one?

ibz

2 Answers


A reduce function may be called once, with a key and all corresponding values (but only if there are multiple values for the key - it won't be called at all if there's only 1 value for the key).

It may also be called multiple times, each time with a key and only a subset of the corresponding values, and the previous reduce results for that key. This scenario is called a re-reduce. In order to support re-reduces, your reduce function should be idempotent.

There are two key features of an idempotent reduce function:

  • The return value of the reduce function should be in the same format as the values it takes in. So, if your reduce function accepts an array of strings, the function should return a string. If it accepts objects with several properties, it should return an object containing those same properties. This ensures that the function doesn't break when it is called with the result of a previous reduce.
  • Don't make assumptions based on the number of values it takes in. It isn't guaranteed that the values parameter contains all the values for the given key. So using values.length in calculations is very risky and should be avoided.
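
To make the two rules above concrete, here is a minimal sketch in plain JavaScript that runs without a database. The function names, the `{ sum, count }` value shape, and the sample values are all assumptions made up for illustration. It compares a reduce that can safely be re-reduced with one that breaks both rules.

```javascript
// Re-reduce-safe: the return value has the same shape as each input value.
function reduceStats(key, values) {
  var out = { sum: 0, count: 0 };
  values.forEach(function (v) {
    out.sum += v.sum;
    out.count += v.count;
  });
  return out;
}

// Unsafe: it returns a number although it receives objects, and it
// relies on values.length covering every value for the key.
function reduceAverageUnsafe(key, values) {
  var total = 0;
  values.forEach(function (v) { total += v.sum; });
  return total / values.length;
}

// Hypothetical values emitted by a map function for a single key.
var key = "2010-10-01";
var emitted = [
  { sum: 10, count: 1 },
  { sum: 20, count: 1 },
  { sum: 30, count: 1 },
  { sum: 40, count: 1 }
];

// One call with all values.
var single = reduceStats(key, emitted);

// Two partial reduces, followed by a re-reduce of their results.
var partA = reduceStats(key, emitted.slice(0, 3));
var partB = reduceStats(key, emitted.slice(3));
var rereduced = reduceStats(key, [partA, partB]);

// The average is derived only at the very end, from the final sum and count.
var average = rereduced.sum / rereduced.count;

// The unsafe version breaks on a re-reduce: its partial results are
// plain numbers, so v.sum is undefined and the result is NaN.
var unsafe = reduceAverageUnsafe(key, [
  reduceAverageUnsafe(key, emitted.slice(0, 3)),
  reduceAverageUnsafe(key, emitted.slice(3))
]);
```

Here `single` and `rereduced` hold the same sum and count, which is the property a re-reduce depends on. In MongoDB itself, a final division like the one for `average` would typically go in a finalize function.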

Update: The two steps below aren't required (or even possible, I haven't checked) on more recent MongoDB releases. MongoDB can now handle these steps for you if you specify an output collection with the reduce action in the map-reduce options:

{ out: { reduce: "tempResult" } }
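
As a rough sketch of how this option could be applied to the daily collections from the question, the helper below runs the same map and reduce over a list of collections and folds every result into one output collection. The helper name `mapReduceIntoOne` and its parameters are hypothetical; it assumes a mongo shell `db` object that provides `getCollection` and a release that still supports `mapReduce`.

```javascript
// Hypothetical helper: db is passed in explicitly so there are no hidden
// dependencies on the shell's global state.
function mapReduceIntoOne(db, collectionNames, map, reduce, outName) {
  return collectionNames.map(function (name) {
    // out: { reduce: ... } re-reduces new results with any existing
    // document that has the same key in the output collection.
    return db.getCollection(name).mapReduce(map, reduce, {
      out: { reduce: outName }
    });
  });
}

// Usage in the mongo shell (collection names follow the question's scheme):
//   db.tempResult.drop();  // start clean so old results are not reduced in
//   mapReduceIntoOne(db, ["logs_2010-09-30", "logs_2010-10-01"],
//                    map, reduce, "tempResult");
//   db.tempResult.find();
```

Because results accumulate in the output collection, running the helper twice over the same day would count that day twice, so the output collection should be dropped (or the day skipped) before a rerun.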

If your reduce function is idempotent, you shouldn't have any problems map-reducing multiple collections. Just re-reduce the results of each collection:

Step 1

Run the map-reduce on each required collection and save the results in a single, temporary collection. You can store the results using a finalize function:

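finalize = function (key, value) {
  // Store each result in the shared temporary collection.
  // Note: newer MongoDB releases do not allow database access here
  // (see the update above).
  db.tempResult.save({ _id: key, value: value });
  return value;
}

db.someCollection.mapReduce(map, reduce, { finalize: finalize })
db.anotherCollection.mapReduce(map, reduce, { finalize: finalize })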

Step 2

Run another map-reduce on the temporary collection, using the same reduce function. The map function is a simple function that selects the keys and values from the temporary collection:

map = function () {
  emit(this._id, this.value);
}

db.tempResult.mapReduce(map, reduce)

This second map-reduce is basically a re-reduce and should give you the results you need.
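
The two steps can also be tried out without a database. The sketch below is an in-memory stand-in, not MongoDB's implementation: `runMapReduce`, the sample documents, and the `url` field are assumptions for illustration, and `emit` is passed to the map function as an argument instead of being a global. Unlike a real collection, the temporary array may hold the same `_id` twice; with a real collection, storing the key in a separate field avoids overwriting results that share a key.

```javascript
// Minimal in-memory stand-in for mapReduce: groups emitted pairs by key,
// then calls reduce only for keys that have more than one value.
function runMapReduce(docs, map, reduce) {
  var groups = {};
  function emit(key, value) {
    (groups[key] = groups[key] || []).push(value);
  }
  docs.forEach(function (doc) { map.call(doc, emit); });
  return Object.keys(groups).map(function (key) {
    var values = groups[key];
    return {
      _id: key,
      value: values.length > 1 ? reduce(key, values) : values[0]
    };
  });
}

// Map for the daily data, pass-through map for step 2, and a sum reduce
// whose output has the same format as its inputs.
var countByUrl = function (emit) { emit(this.url, 1); };
var passThrough = function (emit) { emit(this._id, this.value); };
var sum = function (key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
};

// Hypothetical documents from two daily collections.
var day1 = [{ url: "/a" }, { url: "/b" }, { url: "/a" }];
var day2 = [{ url: "/a" }, { url: "/c" }];

// Step 1: map-reduce each collection and gather the results.
var tempResult = runMapReduce(day1, countByUrl, sum)
  .concat(runMapReduce(day2, countByUrl, sum));

// Step 2: re-reduce the gathered results with the same reduce function.
var combined = runMapReduce(tempResult, passThrough, sum);
```

The key "/a" appears in both days, so step 2 is where its two partial counts are merged into one total.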

Josh Kodroff
Niels van der Rest
  • How can I store the results of all the map/reduce in a single collection that I can map/reduce later? – ibz Oct 01 '10 at 09:43
  • @ionut bizau: You can use a finalize function for that. See my updated answer for details. – Niels van der Rest Oct 01 '10 at 10:06
  • Niels, your answer is pretty good. But what if we have duplicate keys in both reduces? I suggest saving the tempResult data as {id, value} under the usual ids and mapping them for the reduce with map = function () { emit(this.id, this.value); } Oh! I've found a useful feature of mapReduce starting from MongoDB 1.7.4 http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Outputoptions > { reduce : "collectionName" } - If documents exists for a given key in the result set and in the old collection, then a reduce operation will be performed on the two values and the result will – lig Jan 16 '11 at 19:06
  • @lig: Good catch, I didn't think about duplicate keys! The new `out` options of v1.7.4 look very useful in this scenario, especially `{ reduce: "collectionName" }`. Feel free to add an answer using this new feature, I'll vote it up ;) – Niels van der Rest Jan 17 '11 at 10:42
  • Now in 1.8 you can use {out:{reduce: 'collectionName'}} just like you mentioned. It's perfect for aggregating stats together incrementally. Check out this tutorial: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/ – Clint Apr 20 '11 at 16:49
  • According to http://docs.mongodb.org/manual/reference/method/db.collection.mapReduce/#requirements-for-the-map-function you cannot access the database. When I tried to do it myself, I found that the finalize command does not have access to the db variable – Archimedes Trajano Mar 11 '13 at 05:05
  • @ArchimedesTrajano: Good find, thanks for letting me know. I have added the solution that was mentioned in the comments to the answer, as it apparently is the only way to do this now. – Niels van der Rest Mar 11 '13 at 08:02

I used the map-reduce method. Here is an example.

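var mapemployee = function () {
    // Emit the same shape that reduceF returns, so re-reduces work.
    emit(this.jobid, { Name: this.Name, Designation: null });
};

var mapdesignation = function () {
    emit(this.jobid, { Name: null, Designation: this.Designation });
};

var reduceF = function (key, values) {
    var outs = { Name: null, Designation: null };
    values.forEach(function (v) {
        // Keep the first non-null value seen for each field.
        if (outs.Name == null) {
            outs.Name = v.Name;
        }
        if (outs.Designation == null) {
            outs.Designation = v.Designation;
        }
    });
    return outs;
};

// Both runs reduce into the same 'output' collection, joined on jobid.
result = db.employee.mapReduce(mapemployee, reduceF, { out: { reduce: 'output' } });
result = db.designation.mapReduce(mapdesignation, reduceF, { out: { reduce: 'output' } });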

Reference: http://www.itgo.me/a/x3559868501286872152/mongodb-join-two-collections

Lasith Niroshan