I'm trying to learn functional programming constructs like reduce, and I'm trying to grok how to use it to union multiple DataFrames together. I was able to accomplish it with a simple for loop. You can see the commented-out expr, which was my attempt; the problem I'm running into is that reduce is a Python function, so I'm interleaving Python and Spark code in the same expression, which doesn't make the interpreter happy.

Here is my code:

df1 = sqlContext.createDataFrame(
    [
        ('1', '2', '3'),
    ],
    ['a', 'b', 'c']
)

df2 = sqlContext.createDataFrame(
    [
        ('4', '5', '6'),
    ],
    ['a', 'b', 'c']
)

df3 = sqlContext.createDataFrame(
    [
        ('7', '8', '9'),
    ],
    ['a', 'b', 'c']
)

l = [df2, df3]

# expr = reduce(lambda acc, b: acc.unionAll(b), l, '')
for df in l:
    df1 = df1.unionAll(df)

df1.select('*').show()
flybonzai

1 Answer

You provide an incorrect initial value for reduce, which leads to a situation where you call

''.unionAll(b)

and it should be obvious that this doesn't make sense. Either drop the initial value:

reduce(lambda acc, b: acc.unionAll(b), l) if l else None

or replace '' with a DataFrame with a valid schema:

first, *rest = l
reduce(lambda acc, b: acc.unionAll(b), rest, first)
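Either way, applied to the three DataFrames from the question, this folds them into one and replaces the explicit for loop, for example:

combined = reduce(lambda acc, b: acc.unionAll(b), [df1, df2, df3])
combined.show()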

Also, there is no need for a lambda expression:

from pyspark.sql import DataFrame

reduce(DataFrame.unionAll, rest, first)
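(In Spark 2.0 and later, unionAll is deprecated in favor of union, so the equivalent there would be reduce(DataFrame.union, rest, first).)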

If you're in an adventurous mood you can even monkey-patch DataFrame:

DataFrame.__add__ = DataFrame.unionAll
sum(rest, first)

On a side note, iterative unions without truncating lineage are not the best idea in Spark: each union grows the execution plan, which can slow down or even break the job for long chains.
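If you do need a long chain of unions, here is a minimal sketch of one workaround, assuming a periodic round-trip through the RDD API to cut the accumulated plan (the helper name and batch size are illustrative, and sqlContext is the one from the question):

def union_all_truncated(dfs, batch_size=10):
    # Fold the DataFrames together one at a time.
    result = dfs[0]
    for i, df in enumerate(dfs[1:], start=1):
        result = result.unionAll(df)
        # Every batch_size unions, rebuild the DataFrame from its RDD
        # so the execution plan (lineage) does not keep growing.
        # NOTE: sketch only; batch_size=10 is an arbitrary choice.
        if i % batch_size == 0:
            result = sqlContext.createDataFrame(result.rdd, result.schema)
    return result

union_all_truncated([df1, df2, df3]).show()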

zero323