I have a JavaPairRDD with 2 elements:

("TypeA", List<jsonTypeA>),

("TypeB", List<jsonTypeB>)

I need to combine the 2 pairs into 1 pair of type:

("TypeA_B", List<jsonCombinedAPlusB>)

I need to combine the 2 lists into 1 list, where each pair of JSON objects (one of type A and one of type B) shares a common field I can join on.

Note that the list of type A is significantly smaller than the other and the join should be an inner join, so the resulting list should be about as small as the list of type A.

What is the most efficient way to do that?

Yaniv Donenfeld

1 Answer

rdd.join(otherRdd) gives you an inner join of the two RDDs by key. To use it, you need to transform both RDDs into pair RDDs whose key is the common attribute you are joining on. Something like this (example, untested):

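// Re-key each JSON record by the common join attribute
val rddAKeyed = rddA.map { case (_, v) => (key(v), v) }
val rddBKeyed = rddB.map { case (_, v) => (key(v), v) }

// Inner join on that key, then merge each matching pair under the new key
val joined = rddAKeyed.join(rddBKeyed).map { case (_, (json1, json2)) => ("TypeA_B", merge(json1, json2)) }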

Here key(v) extracts the common join attribute from a JSON object, and merge(json1, json2) holds the specific business logic for combining the two JSON objects.
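For reference, here is a self-contained sketch of the same approach that compiles on its own. It is only an illustration, not tested against the OP's data. The case classes A, B and AB, the id join field, the merge logic and the joinTypes helper are all hypothetical stand-ins for the real JSON types and business logic. It also assumes the two lists have already been pulled out of the original pair RDD, and that Spark 1.3 or later is used (older versions need import org.apache.spark.SparkContext._ for join).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JoinSketch {
  // Hypothetical record types standing in for the two JSON shapes.
  case class A(id: String, a: String)
  case class B(id: String, b: String)
  case class AB(id: String, a: String, b: String)

  // Hypothetical merge logic: keep the shared id and one field from each side.
  def merge(x: A, y: B): AB = AB(x.id, x.a, y.b)

  // Keys both lists by the common field, inner-joins them and
  // returns the single ("TypeA_B", combined list) pair.
  def joinTypes(sc: SparkContext, as: Seq[A], bs: Seq[B]): (String, List[AB]) = {
    val rddA: RDD[(String, A)] = sc.parallelize(as).map(a => (a.id, a))
    val rddB: RDD[(String, B)] = sc.parallelize(bs).map(b => (b.id, b))

    // join is an inner join: only ids present on both sides survive.
    val joined: RDD[AB] = rddA.join(rddB).map { case (_, (a, b)) => merge(a, b) }

    // collect() brings the result to the driver; fine here because the
    // inner join result is expected to be small.
    ("TypeA_B", joined.collect().toList)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val result = joinTypes(
        sc,
        Seq(A("1", "a1")),
        Seq(B("1", "b1"), B("2", "b2")))
      println(result)
    } finally {
      sc.stop()
    }
  }
}
```

Collecting the joined records back into one list only makes sense when the result is small, as it should be here; otherwise it is usually better to keep working with the joined RDD directly.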

maasg
I think that the OP didn't ask about syntax but rather about performance, i.e. is a.join(b) more efficient than b.join(a)? – ihadanny Feb 24 '15 at 11:13