How to merge two RDDs into one RDD

Question

Help, I have two RDDs and I want to merge them into one RDD. This is my code:

val us1 = sc.parallelize(Array("3L", "7L", "5L", "2L"))
val us2 = sc.parallelize(Array("432L", "7123L", "513L", "1312L"))

What is your expected output and what have you tried? – mtoto Dec 13 '16 at 11:46
3L 7L 5L 2L 432L 7123L 513L 1312L – Simon Dec 13 '16 at 11:49
I want this RDD, meaning the two RDDs merged into one RDD. – Simon Dec 13 '16 at 11:49

score 11 · Answer 1 · answered Dec 13 '16 at 11:52

Just use union:

val merged = us1.union(us2)

Documentation is here

The shortcut in Scala is:

val merged = us1 ++ us2
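
For completeness, here is a minimal self-contained sketch of the same thing using the data from the question. The object name `MergeRdds`, the app name, and the `local[2]` master are assumptions for illustration only; in spark-shell you would use the existing `sc` instead of creating one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MergeRdds {
  // Returns the elements of us1 followed by the elements of us2.
  def merge(sc: SparkContext): Array[String] = {
    val us1 = sc.parallelize(Seq("3L", "7L", "5L", "2L"))
    val us2 = sc.parallelize(Seq("432L", "7123L", "513L", "1312L"))
    val merged = us1 ++ us2 // equivalent to us1.union(us2)
    merged.collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders, not values from the question.
    val sc = new SparkContext(new SparkConf().setAppName("merge-rdds").setMaster("local[2]"))
    try println(merge(sc).mkString(" ")) // 3L 7L 5L 2L 432L 7123L 513L 1312L
    finally sc.stop()
  }
}
```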

T. Gawęda

@Simon [Please upvote or accept answers rather than leaving thank you comments](http://stackoverflow.com/help/someone-answers) – evan.oman Dec 13 '16 at 18:08

score 5 · Answer 2 · answered Dec 13 '16 at 12:12

You need RDD.union. A union does not join on a key; it simply concatenates the two RDDs. Union doesn't really do anything itself (there is no shuffle), so it is low overhead. Note that the combined RDD will have all the partitions of the original RDDs, so you may want to coalesce after the union.

val x = sc.parallelize(Seq( (1, 3), (2, 4) ))
val y = sc.parallelize(Seq( (3, 5), (4, 7) ))
val z = x.union(y)
z.collect
res0: Array[(Int, Int)] = Array((1,3), (2,4), (3,5), (4,7))
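
To make the partition remark concrete, here is a small sketch that checks the partition counts before and after a coalesce. The object name `UnionPartitions`, the explicit `numSlices` values, and the local master are assumptions chosen for the illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionPartitions {
  // Returns (partitions after union, partitions after coalesce).
  def counts(sc: SparkContext): (Int, Int) = {
    val x = sc.parallelize(Seq((1, 3), (2, 4)), numSlices = 2)
    val y = sc.parallelize(Seq((3, 5), (4, 7)), numSlices = 3)
    val z = x.union(y)          // keeps the 2 + 3 input partitions as they are
    val compact = z.coalesce(2) // merges them down to 2; shuffle = false by default
    (z.getNumPartitions, compact.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for the example.
    val sc = new SparkContext(new SparkConf().setAppName("union-partitions").setMaster("local[2]"))
    try println(counts(sc)) // (5,2)
    finally sc.stop()
  }
}
```

Whether the coalesce is worth it depends on how large the input partitions already are.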

API

def ++(other: RDD[T]): RDD[T]

Return the union of this RDD and another one.

def union(other: RDD[T]): RDD[T]

Return the union of this RDD and another one. Any identical elements will appear multiple times (use .distinct() to eliminate them).
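
A quick sketch of that duplicate behaviour, assuming two small integer RDDs that share the value 3. The object name `UnionDistinct` and the local master are made up for the example. The result of `distinct()` is sorted here only because its output order is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionDistinct {
  // Returns (union as-is, union after distinct, sorted for a stable result).
  def run(sc: SparkContext): (Seq[Int], Seq[Int]) = {
    val a = sc.parallelize(Seq(1, 2, 3))
    val b = sc.parallelize(Seq(3, 4))
    val all = a.union(b)                    // duplicates are kept
    val unique = all.distinct()             // triggers a shuffle to drop duplicates
    (all.collect().toSeq, unique.collect().sorted.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for the example.
    val sc = new SparkContext(new SparkConf().setAppName("union-distinct").setMaster("local[2]"))
    try {
      val (all, unique) = run(sc)
      println(all.mkString(","))    // 1,2,3,3,4
      println(unique.mkString(",")) // 1,2,3,4
    } finally sc.stop()
  }
}
```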

Indrajit Swain

Why would you want to coalesce afterwards? If the two input RDDs are properly partitioned, then the union RDD will be too. – Tim Dec 13 '16 at 14:52
Just for performance and to adjust the partitioning. It's not mandatory, but it can be done. It returns a new RDD that is reduced into numPartitions partitions. – Indrajit Swain Dec 13 '16 at 16:39
Right, I get what coalesce does. But if your partitions are correctly sized in both input RDDs, performing a coalesce will produce partitions that are too large (especially if you use the shuffle = false option) – Tim Dec 13 '16 at 16:43
Then if the partitioning is already done correctly, it's all good. Your code is good to go :) – Indrajit Swain Dec 13 '16 at 17:04
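
One case related to the discussion above, sketched under assumptions: as far as I know, when both pair RDDs already share the same partitioner, Spark's union keeps that partitioner and the partition count instead of concatenating partitions, so no coalesce is needed. The object name `PartitionedUnion`, the choice of `HashPartitioner(4)`, and the local master are illustrative, not from the thread.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionedUnion {
  // Returns (partition count of the union, whether the partitioner survived).
  def run(sc: SparkContext): (Int, Boolean) = {
    val p = new HashPartitioner(4) // assumed partitioner for the example
    val x = sc.parallelize(Seq((1, 3), (2, 4))).partitionBy(p)
    val y = sc.parallelize(Seq((3, 5), (4, 7))).partitionBy(p)
    val z = x.union(y)
    // With a shared partitioner the union is expected to stay at 4 partitions, not 8.
    (z.getNumPartitions, z.partitioner == Some(p))
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for the example.
    val sc = new SparkContext(new SparkConf().setAppName("partitioned-union").setMaster("local[2]"))
    try println(run(sc))
    finally sc.stop()
  }
}
```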
