
Is there a way to concatenate the datasets of two different RDDs in Spark?

The requirement is: I create two intermediate RDDs using Scala that have the same column names, and I need to combine the results of both RDDs and cache the result so it can be accessed from the UI. How do I combine the datasets here?

The RDDs are of type `spark.sql.SchemaRDD`.
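
To make this concrete, here is a rough sketch of the setup (assuming Spark 1.x with a SQLContext; the table names are made up):

val part1 = sqlContext.sql("SELECT name, total FROM aug_data") // SchemaRDD
val part2 = sqlContext.sql("SELECT name, total FROM sep_data") // SchemaRDD

// Goal: one result holding the rows of both, cached (e.g. with
// combined.cache() and registerTempTable) so the UI can query it.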

  • Can't you just use `++`? – lmm Dec 10 '14 at 08:30
  • @lmm No, it will add columns to the RDD. I need to add rows to the RDD. I have two RDDs with the same columns whose records need to be merged into a single RDD. – Atom Dec 10 '14 at 08:59
  • No it won't; I just tried it to be sure. `++` creates a union RDD with the results from both. – lmm Dec 10 '14 at 09:43

2 Answers


I think you are looking for `RDD.union`:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in the Spark shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30),(1, "Sep", 31),(2, "Aug", 15),(2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10),(1, "Nov", 12),(2, "Oct", 5),(2, "Nov", 15)))
rdd1.union(rdd2).collect

res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
  • `rddPart1.union(rddPart2)` will add columns of rddPart2 to rddPart1. I need to add rows of rddPart2 to rddPart1. FYI, both RDDs in this case have the same column names and types. – Atom Dec 10 '14 at 10:54
  • It is more like inserting records into an already existing RDD, not creating new columns in the RDD. – Atom Dec 10 '14 at 11:02
  • @Atom I added an example. There are no new columns in a union RDD. – maasg Dec 10 '14 at 11:12
  • While the example makes it look like concatenation takes place (rdd1 is followed by rdd2 in the output), I don't believe `union` makes any guarantees about the ordering of the data. They could get mixed up with each other. Real concatenation is not so easy, because it implies an order dependency in your data, which works against the distributed nature of Spark. – jwd Jul 28 '17 at 23:14
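
On the ordering point in the last comment: one way to make the concatenation order explicit is to tag each element with its source and position before the union, then sort. A rough sketch (an illustration, not part of the answer; note that `zipWithIndex` runs an extra Spark job and `sortByKey` shuffles):

val tagged1 = rdd1.zipWithIndex.map { case (row, i) => ((0, i), row) }
val tagged2 = rdd2.zipWithIndex.map { case (row, i) => ((1, i), row) }

// Keys are (sourceId, positionWithinSource), so sorting restores
// "all of rdd1 first, then all of rdd2, each in original order".
val ordered = tagged1.union(tagged2).sortByKey().values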

I had the same problem. To combine row by row instead of column-wise, use `unionAll`:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.unionAll(rddPart2)

I found it after reading the method summary for DataFrame. More information at: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html
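
Note that in Spark 2.x the DataFrame `unionAll` is deprecated in favor of `union` (and, from 2.3, `unionByName`, which matches columns by name rather than by position). A minimal sketch against the newer API, with made-up column names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("union-demo").getOrCreate()

// Two DataFrames with identical schemas
val df1 = spark.createDataFrame(Seq((1, "Aug", 30), (2, "Aug", 15))).toDF("id", "month", "total")
val df2 = spark.createDataFrame(Seq((1, "Oct", 10), (2, "Oct", 5))).toDF("id", "month", "total")

val all = df1.union(df2) // row-wise; columns are matched by position
all.cache()              // keep the combined result in memory for the UI
all.show()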