Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?
More specifically, here's what I exactly need to do:
I am having two sets of data:
point information which is stored as (tile_number, point_id:point_info) , this is a 1:n key-value pairs. This means for every tile_number, there might be several point_id:point_info
Line information which is stored as (tile_number, line_id:line_info) , this is again a 1:m key-value pairs and for every tile_number, there might be more than one line_id:line_info
As you can see the tile_numbers are the same between the two datasets. now what I really need is to join these two datasets based on each tile_number. In other words for every tile_number, we have n point_id:point_info and m line_id:line_info. What I want to do is to join all pairs of point_id:point_info with all pairs of line_id:line_info for every tile_number
In order to clarify, here's an example:
For point pairs:
(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)
for line pairs:
(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)
what I want is as following:
for tile 0:
(tile0, point0:line0)
(tile0, point0:line1)
(tile0, point1:line0)
(tile0, point1:line1)
for tile 1:
(tile1, point1:line2)
(tile1, point1:line3)
(tile1, point2:line2)
(tile1, point2:line3)