
Suppose I have

rdd1 = ( (a, 1), (a, 2), (b, 1) ),
rdd2 = ( (a, ?), (a, *), (c, .) ).

I want to generate

( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ).

Is there an easy way to do this? I think it is different from a cross join, but I can't find a good solution. My current solution is:

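# Each element of the cartesian product is a pair (x, y),
# with x taken from rdd1 and y taken from rdd2.
(rdd1
 .cartesian(rdd2)
 .filter(lambda kv: kv[0][0] == kv[1][0])  # keep pairs whose keys match
 .map(lambda kv: (kv[0][0], (kv[0][1], kv[1][1]))))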
surj
Peng Sun

1 Answer


You are just looking for a simple (inner) join on the key, e.g.

rdd = sc.parallelize([("red",20),("red",30),("blue", 100)])
rdd2 = sc.parallelize([("red",40),("red",50),("yellow", 10000)])
rdd.join(rdd2).collect()
# Gives [('red', (20, 40)), ('red', (20, 50)), ('red', (30, 40)), ('red', (30, 50))]
# (the order of the elements may vary)
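
To see why a join gives exactly the pairs asked for in the question, here is a rough plain-Python sketch of inner-join semantics applied to the question's data. It does not use Spark and is not how Spark implements join; inner_join_pairs is a hypothetical helper named only for this illustration, and the symbols ?, *, and . are written as strings.

```python
from collections import defaultdict

def inner_join_pairs(left, right):
    """Hypothetical helper: model an inner join on lists of (key, value) pairs."""
    # Index the right-hand values by key.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Emit one (key, (left_value, right_value)) for every matching combination;
    # keys that appear on only one side produce nothing.
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

rdd1 = [("a", 1), ("a", 2), ("b", 1)]
rdd2 = [("a", "?"), ("a", "*"), ("c", ".")]
joined = inner_join_pairs(rdd1, rdd2)
print(joined)
# [('a', (1, '?')), ('a', (1, '*')), ('a', (2, '?')), ('a', (2, '*'))]
```

Keys b and c are dropped because each appears on only one side, which is the same result the cartesian-plus-filter approach produces, without building the full cross product first.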
dpeacock