6

I've been using MapReduce before to perform classical MR operation, the equivalent of GROUP BY in SQL.

I was wondering if it would be conceptually possible to perform a JOIN operation with MapReduce. Any idea how that could be implemented? Does it make sense to use MapReduce for this kind of operation?

Thanks!

Chris Fulstow
  • 41,170
  • 10
  • 86
  • 110
Johanisma
  • 2,058
  • 3
  • 22
  • 25

2 Answers2

4

MongoDB doesn't support relational operations likes joins. Instead, you can denormalise your data by embedding the rows you'd JOIN on inside the outer document. So instead of joining Products to Sales, you could have a products collection with this schema:

products

{
    _id: 123,
    name: "Widget",
    price: 9.99
    sales:
    [ 
        { id:1, date: "20100316", howMany: 2 },
        { id:2, date: "20100316", howMany: 5 }
    ]
}

Then whenever you retrieve a product, you also get its sales data so there's no need to join or lookup the info somewhere else.

Alternatively, you could split into two collections as you might with a relational database, then use an additional query to get a product's sales, something like this:

SQL: SELECT Sales WHERE ProductId = 123

MongoDB: db.sales.find( { productid: 123 } )

products

{
    _id: 123,
    name: "Widget",
    price: 9.99
}

sales

{
    id: 1,
    productid: 123,
    date: "20100316",
    howMany: 2 
}

{
    id: 2,
    productid: 123,
    date: "20100316",
    howMany: 5
}
Chris Fulstow
  • 41,170
  • 10
  • 86
  • 110
  • I like MongoDB but have this kind of data structure feels incorrect to me. Because if you just want to show a page with product informations what do you care about the Gigs of data with all the informations about the sales of this product. isn't it a kind of perormance killer? – Maximilian Ruta Nov 14 '12 at 11:56
  • 2
    @MaximilianRuta If you're working with a lot of **relational data** your better of using a SQL database. Diaspora for example migrated from MongoDB to MySQL for this very reason. See [MongoDB to MySQL the How and the Why by Sarah Mei](https://www.youtube.com/watch?v=OqBAVC9GGeI) – Alex Bitek Jan 15 '13 at 06:33
3

My approach is below :

having a look to hadoop I have find CompositeInputFormat approach brefily, it takes two or more collections as an input for map-reduce job

according to my investigation mongodb dont provide this yet. mongodb mapReduce is performed on one colletion at a time.(please correct if I am worng)

so I have decided to put the collections that need to be joined in one collection on wich I will perform the mapreduce for "sql right join"

this is from my log reporter project. the first phase map-reduce is enough to perform right join in case "no clock". the second phase map-reduce has the aim to exclude superfluous right join caused by clock field.

db.test.drop();
db.test.insert({"username" : 1, "day" : 1, "clock" : 0 });
db.test.insert({"username" : 1, "day" : 1, "clock" : 1 });
db.test.insert({"username" : 1,  startDay : 1,endDay:2, "table" : "user" });

//startDay : 1,endDay:2 are used to define the employers working day (join to company - left the company)
//you can use an array instedad of array here. for example day:[1,2,3, ...]

m1 = function(){
   if( typeof this.table!= "undefined" && this.table!=null){
       username = this.username;
       startDay = this.startDay;
       endDay   = this.endDay;
       while(startDay<=endDay){
           emit({username:username,day:startDay},{clocks:["join"]});
          // emit({username:username,day:startDay},1);
           startDay++;
       }
   }else{
       emit({username:this.username,day:this.day},{clocks:[this.clock]});
   }
}
r1 = function(key,values){
    result = {clocks:[]}
    values.forEach(function(x){
        result.clocks = x.clocks.concat(result.clocks);
        result.clocks=result.clocks.filter(function(element, index, array){
            return element!="join";            
        })
    })
    return result;
}

db.test.mapReduce(m1,r1,{out:"result1"})
db.test.find();
db.result1.find();

m2=function(){
   key=this._id;
   this.value.clocks.forEach(function(x){
       key.clock=x;
       emit(key,1);       
   })   
}
r2 = function(key,values){
    value=0;
    values.forEach(function(x){
        value+=1;      
    })
    return result;
}

db.result1.mapReduce(m2,r2,{out:"result2"})
db.test.find();
db.result2.find();