
I want to sum different columns in a Spark DataFrame.

Code

from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Why aren't approaches #2 and #3 working? I am on Spark 2.2.

Alper t. Turker
GeorgeOfTheRF

3 Answers


Because,

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here you are using Python's built-in sum function, which takes an iterable as input, so it works: https://docs.python.org/2/library/functions.html#sum

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here you are using the PySpark sum function, which is an aggregate that takes a single column as input, but you are passing it a list of columns and trying to sum at row level: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
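For contrast, here is a minimal sketch (using the DataFrame from the question) of how F.sum is meant to be used: aggregating a single column over all rows.

from pyspark.sql import functions as F

# F.sum aggregates one column across rows; with the question's data
# (1, 4, 12 in A.p1) this yields a single-row result of 17.
df.select(F.sum("`A.p1`")).show()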

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Here, df.select() returns a DataFrame, and you are trying to sum over a DataFrame. In this case, I think, you would have to iterate row-wise and apply the sum over it.
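A minimal check of this point (again assuming the question's DataFrame):

picked = df.select(["`A.p1`", "`B.p1`"])
# select() gives back a DataFrame, not a Column, which is why
# passing it to sum() does not build a column expression.
print(type(picked))  # <class 'pyspark.sql.dataframe.DataFrame'>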

Suresh
  • Thanks. Using native Python sum() is not benefiting from Spark optimization, so what's the Spark way of doing it? – GeorgeOfTheRF Dec 07 '17 at 11:06
  • If you are using only two columns as mentioned, you can sum them straightaway: df.withColumn('sum1', df['`A.p1`'] + df['`B.p1`']). But if there are many columns, you can use a UDF. – Suresh Dec 07 '17 at 11:21
  • I want to pass n columns in a list and sum them; my list of columns will change based on user input. – GeorgeOfTheRF Dec 07 '17 at 11:23
  • btw, why is native Python sum() not benefiting? – Suresh Dec 07 '17 at 11:41
  • It's not a PySpark function, so it won't really be completely benefiting from Spark, right? – GeorgeOfTheRF Dec 07 '17 at 11:44
  • You are trying to do row-wise computation, summing each row separately. In this case, I believe the native function is good enough. Also, almost all PySpark functions work at column level, not at row level. – Suresh Dec 07 '17 at 11:54
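Following up on the n-columns request in the comments, a minimal sketch (assuming the column list comes from user input) that builds a single row-wise sum expression without a UDF:

from functools import reduce
from operator import add

# Any number of columns; backticks are needed because the names contain dots.
cols_to_sum = ["`A.p1`", "`B.p1`"]
df = df.withColumn("sum1", reduce(add, [df[c] for c in cols_to_sum]))

This is the same composition the built-in sum performs (minus the initial 0), so it stays a single column expression no matter how many columns the user supplies.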

TL;DR builtins.sum is just fine.


Following your comments:

Using native Python sum() is not benefiting from Spark optimization, so what's the Spark way of doing it

and

it's not a PySpark function, so it won't really be completely benefiting from Spark, right.

I can see you are making incorrect assumptions.

Let's decompose the problem:

[df[col] for col in ["`A.p1`","`B.p1`"]]

creates a list of Columns:

[Column<b'A.p1'>, Column<b'B.p1'>]

Let's call it iterable.

sum reduces the list by taking its elements and calling the __add__ method (+). The imperative equivalent is:

accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element

This gives a Column:

Column<b'(A.p1 + B.p1)'>

which is the same as calling

df["`A.p1`"] + df["`B.p1`"]

No data has been touched, and when evaluated it benefits from all Spark optimizations.
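A quick sanity check, as a minimal sketch on the question's DataFrame: the reduction happens on the driver before any job runs, and the plan that comes out is plain column arithmetic.

summed = sum([df[c] for c in ["`A.p1`", "`B.p1`"]])
df.withColumn("sum1", summed).explain()
# The physical plan shows an ordinary Project and no BatchEvalPython
# (Python UDF) stage, so Catalyst optimizations apply as usual.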

Alper t. Turker

Addition of multiple columns from a list into one column

I tried a lot of methods and the following are my observations:

  1. PySpark's sum function doesn't support column addition (PySpark version 2.3.1)
  2. Python's built-in sum function works for some folks but gives an error for others (might be because of a conflict in names)

In your 3rd approach, the expression (inside Python's sum function) returns a PySpark DataFrame.

So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.

from pyspark.sql.functions import expr

cols_list = ['a', 'b', 'c']

# Creating an addition expression using `join`
expression = '+'.join(cols_list)

df = df.withColumn('sum_cols', expr(expression))

This gives us the desired sum of columns. We can also use any other complex expression to get other output.
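For instance, a short usage sketch with a hypothetical DataFrame holding columns a, b, c:

df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ['a', 'b', 'c'])

# expr('a+b+c') is parsed as a SQL expression and evaluated per row.
df = df.withColumn('sum_cols', expr('+'.join(['a', 'b', 'c'])))
df.show()  # sum_cols is 6 for the first row and 15 for the second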

Vivek Payasi