Spark UDF that takes in unknown number of columns

Question

I have a list of spark dataframes with different schemas. Example:

list_df = [df1, df2, df3, df4]
# df1.columns = ['a', 'b']
# df2.columns = ['a', 'b', 'c']
# df3.columns = ['a', 'b', 'c', 'd']
# df4.columns = ['a', 'b', 'c', 'd', 'e']

Now, I want to write a single UDF that can operate on every dataframe in this list, even though each has a different number of columns.

There is a previous post on how to do it in Scala: Spark UDF with varargs, where the UDF takes in an array of columns.

But it seems that this approach does not work in Python. Any suggestions?

Thanks.

zero323 · Accepted Answer · 2016-08-08T10:47:42.413

Actually this approach works just fine in Python:

from pyspark.sql.functions import array, udf

df = sc.parallelize([("a", "b", "c", "d")]).toDF()

f = udf(lambda xs: "+".join(xs))

df.select(f("_1")).show()
## +------------+
## |<lambda>(_1)|
## +------------+
## |           a|
## +------------+

df.select(f(array("_1", "_2"))).show()
## +-----------------------+
## |<lambda>(array(_1, _2))|
## +-----------------------+
## |                    a+b|
## +-----------------------+

df.select(f(array("_1", "_2", "_3"))).show()
## +---------------------------+
## |<lambda>(array(_1, _2, _3))|
## +---------------------------+
## |                      a+b+c|
## +---------------------------+

Since Python UDFs are not the same type of entity as their Scala counterparts, they are not constrained by the types and number of the input arguments, so you can also use *args:

g = udf(lambda *xs: "+".join(xs))

df.select(g("_1", "_2", "_3", "_4")).show()
## +------------------------+
## |<lambda>(_1, _2, _3, _4)|
## +------------------------+
## |                 a+b+c+d|
## +------------------------+

This avoids wrapping the input with array.
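To tie this back to the list of dataframes in the question, here is a minimal sketch of how one varargs UDF might be applied to each of them. The names join_values, apply_to_all and the "joined" alias are made up for illustration, and skipping None values is an assumption about how NULLs should be handled, not something the question specifies.

```python
def join_values(*xs):
    """Join any number of column values with '+', skipping NULLs (None)."""
    # str() is used because, unlike array(), varargs may mix column types.
    return "+".join(str(x) for x in xs if x is not None)


def apply_to_all(list_df):
    """Apply the same varargs UDF to every DataFrame, whatever its columns."""
    # Imported here so join_values can be tried without a Spark installation.
    from pyspark.sql.functions import udf

    join_udf = udf(join_values)  # return type defaults to StringType
    # Unpack each DataFrame's own column list into the UDF call.
    return [
        df.select(join_udf(*df.columns).alias("joined"))
        for df in list_df
    ]
```

With the question's list, apply_to_all(list_df) should give back one single-column dataframe per input, since df.columns is unpacked separately for each one.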

You can also use struct as an alternative wrapper to get access to the column names:

from pyspark.sql.functions import struct

h = udf(lambda row: "+".join(row.asDict().keys()))

df.select(h(struct("_1", "_2", "_3"))).show()
## +----------------------------+
## |<lambda>(struct(_1, _2, _3))|
## +----------------------------+
## |                    _1+_3+_2|
## +----------------------------+
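Because the struct arrives in the UDF as a Row, values can also be looked up by field name rather than by position. The following is only a sketch under assumptions: WANTED, join_wanted and with_joined are hypothetical names, and the choice of fields "a" and "c" is purely illustrative.

```python
WANTED = ("a", "c")  # hypothetical field names of interest


def join_wanted(fields, wanted=WANTED):
    """Join the values of the wanted fields that are present and not None."""
    return "+".join(
        str(fields[name])
        for name in wanted
        if fields.get(name) is not None
    )


def with_joined(df):
    """Wrap all columns of df in a struct and pick values by name in the UDF."""
    # Imported here so join_wanted can be tried without a Spark installation.
    from pyspark.sql.functions import struct, udf

    # row.asDict() maps field names to values, so lookup is by name.
    by_name = udf(lambda row: join_wanted(row.asDict()))
    return df.select(by_name(struct(*df.columns)).alias("joined"))
```

Iterating over WANTED rather than over the dictionary keeps the output order fixed, so the result does not depend on the key order of asDict().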

A related question: is there a way to access the column names inside udf so that I am able to take values from the correct fields? Thanks. — Yiliang, Aug 08 '16 at 02:37
