
I have two input files - let's call them master and slave. Based on a common key, I want to join them (a full outer join: retain records from both sides, with nulls where the other side has no match), keeping the original order.

So basically, end result is any kind of Java RDD which looks like

<master record, slave record>

with null values where the corresponding record does not exist on the other side.

I don't want to use lengthy operations like sort, zip by key, or join itself. Instead, I'm looking for a custom reader I can write so that I don't have to read these files separately and join them afterwards. Any ideas on how I could write one?

Thanks!

edit: I am not looking for ready-made code. Just a rough guideline/outline would also help.
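One possible outline, assuming both files are already sorted by the join key and keys are unique within each file: the reader would do a classic sorted-merge full outer join while scanning both inputs once. The sketch below shows that core merge logic in plain Java (no Spark API); the class and method names (`MergeJoin`, `mergeJoin`) are illustrative. In a real Spark setup this logic would live inside a custom Hadoop `InputFormat`/`RecordReader` (or a `zipPartitions`-style operation), with care taken that both inputs are split on the same key boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a sorted-merge full outer join: the logic a custom reader
// could run while scanning both inputs in a single pass.
// Assumptions: each element is [key, value]; both lists are sorted by
// key; keys are unique within each list.
public class MergeJoin {
    public static List<String[]> mergeJoin(List<String[]> master, List<String[]> slave) {
        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < master.size() && j < slave.size()) {
            int cmp = master.get(i)[0].compareTo(slave.get(j)[0]);
            if (cmp == 0) {
                // key present on both sides: emit the matched pair
                out.add(new String[]{master.get(i)[0], master.get(i)[1], slave.get(j)[1]});
                i++; j++;
            } else if (cmp < 0) {
                // key only in master: null on the slave side
                out.add(new String[]{master.get(i)[0], master.get(i)[1], null});
                i++;
            } else {
                // key only in slave: null on the master side
                out.add(new String[]{slave.get(j)[0], null, slave.get(j)[1]});
                j++;
            }
        }
        // drain whichever input still has records
        while (i < master.size()) {
            out.add(new String[]{master.get(i)[0], master.get(i)[1], null});
            i++;
        }
        while (j < slave.size()) {
            out.add(new String[]{slave.get(j)[0], null, slave.get(j)[1]});
            j++;
        }
        return out;
    }
}
```

Because each input is read exactly once and the output comes out in key order, this avoids a shuffle entirely; the hard part in Spark is guaranteeing that matching key ranges of the two files end up in the same split/partition.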

  • I don't think there's anything wrong with this question. Regarding the answer, I am not sure if there can be a join (or shuffle) that preserves the order of the initial rdd (even the name "shuffle" gives the picture). I would look into custom partitioners, but I wouldn't give it many chances. For a reader that performs joins, I am way more pessimistic. Good luck! – vefthym Apr 24 '17 at 10:24
  • In my opinion, by the time you have implemented this custom reader, (which, if I understand correctly may be more complex than you imagine), you might just as well have used Apache Drill to do a full outer join, written the output to disk once, and use Spark to read/process that single joined file. – ImDarrenG Apr 24 '17 at 10:56
