
In my Pig code I do this:

all_combined = UNION relation1, relation2,
    relation3, relation4, relation5, relation6;

I want to do the same with Spark. Unfortunately, I see that I have to keep doing it pairwise:

first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
# .... and so on

Is there a union operator that will let me operate on multiple RDDs at a time:

e.g. union(rdd1, rdd2,rdd3, rdd4, rdd5, rdd6)

It is a matter of convenience.

user3803714

3 Answers


If these are RDDs, you can use the SparkContext.union method:

rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
rdd3 = sc.parallelize([7, 8, 9])

rdd = sc.union([rdd1, rdd2, rdd3])
rdd.collect()

## [1, 2, 3, 4, 5, 6, 7, 8, 9]

There is no DataFrame equivalent but it is just a matter of a simple one-liner:

from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))

unionAll(df1, df2, df3).show()

## +---+----+
## |  k|   v|
## +---+----+
## |  1|foo1|
## |  2|bar1|
## |  3|foo2|
## |  4|bar2|
## |  5|foo3|
## |  6|bar3|
## +---+----+

If the number of DataFrames is large, using SparkContext.union on the underlying RDDs and recreating the DataFrame may be a better choice to avoid the cost of preparing a large execution plan:

def unionAll(*dfs):
    # The first DataFrame is only needed for its SQLContext and schema
    first, *_ = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )
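A quick usage sketch: assuming the same df1, df2 and df3 defined above (matching schemas, same SQLContext), this version is called in exactly the same way:

unionAll(df1, df2, df3).show()  # produces the same table as the reduce-based version above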
zero323
  • What is the purpose of *rest here? It is not used anywhere. – Sivaprasanna Sethuraman Jul 14 '17 at 08:38
  • I want to perform about 3000 unions between one-row DFs. Using the first option, it gets exponentially slower after the 100th iteration (I am testing this with tqdm). Using the second option, it starts really slow from the beginning and keeps slowing down linearly. Is there any better way of doing this? – drkostas Jun 08 '18 at 16:14
  • @drkostas It may not be the best way, but I solved that by saving off an RDD, then loading it and continuing the loop. This kills the history of the RDD; you're slowing down because each new loop reruns everything in the RDD's history before it. Spark doesn't like looping. – Gramatik May 30 '19 at 23:04
  • @Gramatik Yes, I solved it the same way too, by saving every dataframe to a parquet with the option `append` and then loading the parquet into a new dataframe. – drkostas Jun 01 '19 at 14:17
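A minimal sketch of the save-and-reload workaround described in the last two comments, assuming an iterable of same-schema DataFrames called small_dfs and a writable path (both hypothetical names):

checkpoint_path = "/tmp/union_checkpoint"  # hypothetical writable location

for df in small_dfs:
    # Appending each piece to Parquet breaks the ever-growing union lineage
    df.write.mode("append").parquet(checkpoint_path)

# Read everything back as a single DataFrame with a flat execution plan
result = sqlContext.read.parquet(checkpoint_path)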

You can also use the addition operator for a UNION between RDDs:

rdd = sc.parallelize([1, 1, 2, 3])
(rdd + rdd).collect()
## [1, 1, 2, 3, 1, 1, 2, 3]
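In PySpark the + operator on RDDs is just a shorthand for union, so it can also be folded over an arbitrary list of RDDs with functools.reduce (a small sketch, not a dedicated API):

from functools import reduce

rdds = [sc.parallelize([i]) for i in range(4)]  # hypothetical list of RDDs
reduce(lambda a, b: a + b, rdds).collect()
## [0, 1, 2, 3]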
Ryuuk

Unfortunately, that's the only way to UNION tables in Spark. However, instead of

first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
...

you can perform it in a slightly cleaner way, like this:

result = rdd1.union(rdd2).union(rdd3).union(rdd4)
Nhor