
Is there a way to concatenate the datasets of two different RDDs in Spark?

The requirement is: I create two intermediate RDDs using Scala that have the same column names, and I need to combine the results of both RDDs and cache the result so it can be accessed from the UI. How do I combine the datasets here?

The RDDs are of type `spark.sql.SchemaRDD`.
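
To make this concrete, here is a rough sketch of the setup (assuming Spark 1.x with a SQLContext; the table names are made up):

val part1 = sqlContext.sql("SELECT name, total FROM aug_data") // SchemaRDD
val part2 = sqlContext.sql("SELECT name, total FROM sep_data") // SchemaRDD

// Goal: one result holding the rows of both, cached (e.g. with
// combined.cache() and registerTempTable) so the UI can query it.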

  • Can't you just use `++`? – lmm Dec 10 '14 at 08:30
  • @lmm No, it will add columns to the RDD. I need to add rows to the RDD. I have two RDDs with the same columns whose records need to be merged into a single RDD. – Atom Dec 10 '14 at 08:59
  • No it won't; I just tried it to be sure. `++` creates a union RDD with the results from both. – lmm Dec 10 '14 at 09:43

2 Answers


I think you are looking for `RDD.union`:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in the Spark shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30),(1, "Sep", 31),(2, "Aug", 15),(2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10),(1, "Nov", 12),(2, "Oct", 5),(2, "Nov", 15)))
rdd1.union(rdd2).collect

res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
  • `rddPart1.union(rddPart2)` will add columns of rddPart2 to rddPart1. I need to add rows of rddPart2 to rddPart1. FYI, both RDDs in this case have the same column names and types. – Atom Dec 10 '14 at 10:54
  • It is more like inserting records into an already existing RDD, not creating new columns in the RDD. – Atom Dec 10 '14 at 11:02
  • @Atom I added an example. There are no new columns in a union RDD. – maasg Dec 10 '14 at 11:12
  • While the example makes it look like concatenation takes place (rdd1 is followed by rdd2 in the output), I don't believe `union` makes any guarantees about the ordering of the data. They could get mixed up with each other. Real concatenation is not so easy, because it implies an order dependency in your data, which works against the distributed nature of Spark. – jwd Jul 28 '17 at 23:14
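
On the ordering point in the last comment: one way to make the concatenation order explicit is to tag each element with its source and position before the union, then sort. A rough sketch (an illustration, not part of the answer; note that `zipWithIndex` runs an extra Spark job and `sortByKey` shuffles):

val tagged1 = rdd1.zipWithIndex.map { case (row, i) => ((0, i), row) }
val tagged2 = rdd2.zipWithIndex.map { case (row, i) => ((1, i), row) }

// Keys are (sourceId, positionWithinSource), so sorting restores
// "all of rdd1 first, then all of rdd2, each in original order".
val ordered = tagged1.union(tagged2).sortByKey().values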

I had the same problem. To combine row by row instead of column-wise, use `unionAll`:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.unionAll(rddPart2)

I found it after reading the method summary for DataFrame. More information at: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html
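
Note that in Spark 2.x the DataFrame `unionAll` is deprecated in favor of `union` (and, from 2.3, `unionByName`, which matches columns by name rather than by position). A minimal sketch against the newer API, with made-up column names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("union-demo").getOrCreate()

// Two DataFrames with identical schemas
val df1 = spark.createDataFrame(Seq((1, "Aug", 30), (2, "Aug", 15))).toDF("id", "month", "total")
val df2 = spark.createDataFrame(Seq((1, "Oct", 10), (2, "Oct", 5))).toDF("id", "month", "total")

val all = df1.union(df2) // row-wise; columns are matched by position
all.cache()              // keep the combined result in memory for the UI
all.show()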