I'm trying to learn functional programming constructs like reduce, and I'm trying to grok how to use it to union multiple DataFrames together. I was able to accomplish it with a simple for loop. You can see the commented-out expr, which was my attempt; the problem I'm running into is that reduce is a Python function, so I'm interleaving Python and Spark code in the same expression, which doesn't make the interpreter happy.

Here is my code:

df1 = sqlContext.createDataFrame(
    [
        ('1', '2', '3'),
    ],
    ['a', 'b', 'c']
)

df2 = sqlContext.createDataFrame(
    [
        ('4', '5', '6'),
    ],
    ['a', 'b', 'c']
)

df3 = sqlContext.createDataFrame(
    [
        ('7', '8', '9'),
    ],
    ['a', 'b', 'c']
)

l = [df2, df3]

# expr = reduce(lambda acc, b: acc.unionAll(b), l, '')
for df in l:
    df1 = df1.unionAll(df)

df1.select('*').show()
flybonzai

1 Answer

You provide an incorrect initial value for reduce, which leads to a situation where you call

''.unionAll(b)

and it should be obvious that this doesn't make sense. Either drop the initial value:

reduce(lambda acc, b: acc.unionAll(b), l) if l else None

or replace '' with a DataFrame with a valid schema:

first, *rest = l
reduce(lambda acc, b: acc.unionAll(b), rest, first)
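Either way, applied to the three DataFrames from the question, this folds them into one and replaces the explicit for loop, for example:

combined = reduce(lambda acc, b: acc.unionAll(b), [df1, df2, df3])
combined.show()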

Also, there is no need for a lambda expression:

from pyspark.sql import DataFrame

reduce(DataFrame.unionAll, rest, first)
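(In Spark 2.0 and later, unionAll is deprecated in favor of union, so the equivalent there would be reduce(DataFrame.union, rest, first).)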

If you're in an adventurous mood you can even monkey-patch DataFrame:

DataFrame.__add__ = DataFrame.unionAll
sum(rest, first)

On a side note, iterative unions without truncating lineage are not the best idea in Spark: each union grows the execution plan, which can slow down or even break the job for long chains.
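If you do need a long chain of unions, here is a minimal sketch of one workaround, assuming a periodic round-trip through the RDD API to cut the accumulated plan (the helper name and batch size are illustrative, and sqlContext is the one from the question):

def union_all_truncated(dfs, batch_size=10):
    # Fold the DataFrames together one at a time.
    result = dfs[0]
    for i, df in enumerate(dfs[1:], start=1):
        result = result.unionAll(df)
        # Every batch_size unions, rebuild the DataFrame from its RDD
        # so the execution plan (lineage) does not keep growing.
        # NOTE: sketch only; batch_size=10 is an arbitrary choice.
        if i % batch_size == 0:
            result = sqlContext.createDataFrame(result.rdd, result.schema)
    return result

union_all_truncated([df1, df2, df3]).show()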

zero323