3

I have two RDDs that I want to merge into one RDD. This is my code:

val us1 = sc.parallelize(Array(("3L"), ("7L"),("5L"),("2L")))
val us2 = sc.parallelize(Array(("432L"), ("7123L"),("513L"),("1312L")))
Simon

2 Answers

11

Just use union:

val merged = us1.union(us2)

See the RDD API documentation for union.

The shortcut in Scala is:

val merged = us1 ++ us2
T. Gawęda
  • @Simon [Please upvote or accept answers rather than leaving thank you comments](http://stackoverflow.com/help/someone-answers) – evan.oman Dec 13 '16 at 18:08
5

You need RDD.union. It does not join on a key; it simply concatenates the two RDDs. Union does not move or shuffle any data itself, so it has low overhead. Note that the combined RDD will have all the partitions of the original RDDs, so you may want to coalesce after the union.

val x = sc.parallelize(Seq( (1, 3), (2, 4) ))
val y = sc.parallelize(Seq( (3, 5), (4, 7) ))
val z = x.union(y)
z.collect
res0: Array[(Int, Int)] = Array((1,3), (2,4), (3,5), (4,7))
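
To make the partition remark concrete, here is a minimal sketch of how the partition count grows after a union and how coalesce reduces it. The local SparkContext, the app name, and the partition counts (4 per input) are assumptions chosen for illustration; in spark-shell, `sc` already exists and the first three lines can be skipped.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local context for illustration only; spark-shell already provides `sc`.
val conf = new SparkConf().setAppName("union-partitions").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Each input is explicitly split into 4 partitions (illustrative value).
val x = sc.parallelize(Seq((1, 3), (2, 4)), numSlices = 4)
val y = sc.parallelize(Seq((3, 5), (4, 7)), numSlices = 4)

// union keeps every partition of both inputs (neither input has a partitioner here).
val z = x.union(y)
println(z.getNumPartitions)       // 4 + 4 = 8

// coalesce merges partitions without a shuffle by default.
val compact = z.coalesce(4)
println(compact.getNumPartitions) // 4
```

Whether the coalesce is worth it depends on how the input partitions are sized: if they are already well sized, merging them may produce partitions that are too large.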

API

def ++(other: RDD[T]): RDD[T]

Return the union of this RDD and another one.

def union(other: RDD[T]): RDD[T]

Return the union of this RDD and another one. Any identical elements will appear multiple times (use .distinct() to eliminate them).
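
As a small illustration of the duplicate behaviour described above, the following sketch unions two RDDs that share one element and then removes the repeat with distinct. The sample values are made up for the example, and it assumes an existing SparkContext named `sc`, as in spark-shell; the fallback line creates a local one otherwise.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the running context if there is one, else start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("union-distinct").setMaster("local[2]"))

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4))

// union keeps the shared element 3 from both inputs.
val all = a.union(b)
println(all.collect().mkString(","))   // 1,2,3,3,4

// distinct removes the repeat; its output order is not guaranteed, so sort for display.
val unique = all.distinct()
println(unique.collect().sorted.mkString(","))   // 1,2,3,4
```

Note that distinct requires a shuffle, so unlike union it is not free.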

Indrajit Swain
  • Why would you want to coalesce afterwards? If the two input RDDs are properly partitioned, then the union RDD will be too. – Tim Dec 13 '16 at 14:52
  • Just for performance and to adjust the number of partitions. It's not mandatory, but it can be done. It returns a new RDD that is reduced into numPartitions partitions. – Indrajit Swain Dec 13 '16 at 16:39
  • Right, I get what coalesce does. But if your partitions are correctly sized in both input RDDs, performing a coalesce will produce partitions that are too large (especially if you use the shuffle = false option). – Tim Dec 13 '16 at 16:43
  • If the partitions are already correctly sized, then it's all good. Your code is good to go :) – Indrajit Swain Dec 13 '16 at 17:04