
Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?

More specifically, here's exactly what I need to do:

I have two datasets:

  1. Point information, stored as (tile_number, point_id:point_info). This is a 1:n key-value mapping: for every tile_number there may be several point_id:point_info pairs.

  2. Line information, stored as (tile_number, line_id:line_info). This is likewise a 1:m mapping: for every tile_number there may be more than one line_id:line_info pair.

As you can see, the tile_numbers are shared between the two datasets. What I need is to join these two datasets on tile_number. In other words, for every tile_number with n point_id:point_info pairs and m line_id:line_info pairs, I want to pair every point_id:point_info with every line_id:line_info.


In order to clarify, here's an example:

For point pairs:

(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)

for line pairs:

(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)

what I want is the following:

for tile 0:

 (tile0, point0:line0)
 (tile0, point0:line1)
 (tile0, point1:line0)
 (tile0, point1:line1)

for tile 1:

 (tile1, point1:line2)
 (tile1, point1:line3)
 (tile1, point2:line2)
 (tile1, point2:line3)

2 Answers


Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values; for instance, you can prefix each value with a special character (even though a binary flag would be better).

So the map output will be something like:

 tile0, _point0
 tile1, _point0
 tile2, _point1 
 ...
 tileX, *lineL
 tileY, *lineK
 ...
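As a sketch, the tagging step might look like this in plain Java (the Hadoop `Mapper` boilerplate is omitted, and the class name, method names, and tag characters are illustrative):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class TagMapperSketch {
    // Tag convention from above: '_' marks a point value, '*' marks a line value.
    // In a real Hadoop job these would be two Mapper classes (or one mapper
    // branching on the input file path).
    static Map.Entry<String, String> mapPoint(String tile, String point) {
        return new SimpleEntry<>(tile, "_" + point);
    }

    static Map.Entry<String, String> mapLine(String tile, String line) {
        return new SimpleEntry<>(tile, "*" + line);
    }

    public static void main(String[] args) {
        System.out.println(mapPoint("tile0", "point0")); // tile0=_point0
        System.out.println(mapLine("tile0", "line0"));   // tile0=*line0
    }
}
```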

Then, at the reducer, your input will have this structure:

 tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]

and you will have to separate the values into points and lines, compute their cross product, and output each pair of the cross product, like this:

tileX (lineK, pointP)
tileX (lineK, pointR)
...

If you can already easily differentiate between the point values and the line values (depending on your application's data format), you don't need the special characters (*, _).

Regarding the cross product which you have to do in the reducer: first iterate through the entire values list, separating the values into two lists:

 List<String> points;
 List<String> lines;

Then compute the cross product using two nested for loops, iterate through the resulting list, and for each element output:

tile(current key), element_of_the_resulting_cross_product_list
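A minimal plain-Java sketch of that reduce step (the Hadoop `Reducer` boilerplate is omitted; the class name and the tab-separated output format are illustrative, while the '_'/'*' tags follow the convention above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrossProductReducer {
    // One reduce call: all tagged values for a single tile arrive together.
    static List<String> reduce(String tile, List<String> values) {
        List<String> points = new ArrayList<>();
        List<String> lines = new ArrayList<>();
        // First pass: split the tagged values into the two lists.
        for (String v : values) {
            if (v.startsWith("_")) points.add(v.substring(1));
            else if (v.startsWith("*")) lines.add(v.substring(1));
        }
        // Cross product: every point paired with every line for this tile.
        List<String> out = new ArrayList<>();
        for (String p : points)
            for (String l : lines)
                out.add(tile + "\t" + p + ":" + l);
        return out;
    }

    public static void main(String[] args) {
        List<String> vals = Arrays.asList("_point0", "_point1", "*line0", "*line1");
        reduce("tile0", vals).forEach(System.out::println);
    }
}
```

Note that one of the two lists must fit in memory, which is exactly the limitation raised in the comments below.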
  • great. But how should I do the cross product in the reduce section? – reza Aug 03 '12 at 21:55
  • great, but this only works if you can fit all the points/lines into memory in order to store them in the two mentioned lists. I am afraid that is not the case for my large datasets :( – reza Aug 03 '12 at 22:10
  • This post is very old, but since I have a very similar problem I'm posting here. Can someone provide a sample of code showing how to do this in map/reduce form? I am trying to understand and learn this. Thanks in advance – user1188611 Mar 28 '13 at 13:34
  • @reza have you figured out the above solution for large datasets? – Bruce_Wayne Sep 23 '14 at 05:01

So basically you have two options here: a reduce-side join or a map-side join.

Here your group key is "tile". In a single reduce call you will receive all the values from both the point pairs and the line pairs, but you will have to cache either the points or the lines in a list. If even the smaller of the two sides for a single group key (each unique tile) cannot fit in memory, this method will not work for you. Remember that you don't have to hold both sides for a single group key ("tile") in memory; one is sufficient, and you can stream the other side and pair it against the cached list.

If both sides for a single group key are large, then you will have to try a map-side join. It has some peculiar requirements: both inputs must be sorted on the join key and partitioned identically. You can fulfill those requirements by pre-processing your data with map/reduce jobs that run an equal number of reducers for both datasets.
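For intuition, a map-side join is essentially a sorted merge join over the two pre-sorted inputs. This plain-Java sketch (class and method names are illustrative; Hadoop's actual map-side-join machinery is omitted) shows why both inputs must arrive sorted on the join key:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeJoinSketch {
    // Each record is a {key, value} pair. Both inputs MUST be sorted by key,
    // which is the same precondition Hadoop's map-side join places on its
    // input files. The two cursors only ever move forward.
    static List<String> mergeJoin(List<String[]> points, List<String[]> lines) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < points.size() && j < lines.size()) {
            int cmp = points.get(i)[0].compareTo(lines.get(j)[0]);
            if (cmp < 0) i++;          // point key has no matching line yet
            else if (cmp > 0) j++;     // line key has no matching point yet
            else {
                // Keys match: emit the cross product of this key's run.
                String key = points.get(i)[0];
                int jStart = j;
                while (i < points.size() && points.get(i)[0].equals(key)) {
                    for (j = jStart; j < lines.size() && lines.get(j)[0].equals(key); j++)
                        out.add(key + "\t" + points.get(i)[1] + ":" + lines.get(j)[1]);
                    i++;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> points = Arrays.asList(
            new String[]{"tile0", "point0"}, new String[]{"tile0", "point1"},
            new String[]{"tile1", "point1"}, new String[]{"tile1", "point2"});
        List<String[]> lines = Arrays.asList(
            new String[]{"tile0", "line0"}, new String[]{"tile0", "line1"},
            new String[]{"tile1", "line2"}, new String[]{"tile1", "line3"});
        mergeJoin(points, lines).forEach(System.out::println);
    }
}
```

Running it on the question's example data produces the eight (tile, point:line) pairs listed in the question.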
