59

I have a PySpark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.

df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])

I do the following to sum the column.

df.groupBy().sum()

But I get a dataframe back.

+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+

I would like 130 returned as an int, stored in a variable to be used elsewhere in the program:

result = 130
Bryce Ramgovind

9 Answers

50

I think the simplest way:

df.groupBy().sum().collect()

will return a list. In your example:

In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
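So, to store it in a variable as the question asks (a minimal sketch using the example df):

result = df.groupBy().sum().collect()[0][0]  # 130, a plain Python int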
39

The simplest way, really:

df.groupBy().sum().collect()

But it is a very slow operation. Avoid groupByKey; use the RDD API and reduceByKey instead:

df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda x, y: x + y).collect()[0][1]

I tried it on a bigger dataset and measured the processing times:

RDD and reduceByKey: 2.23 s

groupByKey: 30.5 s
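For reference, a minimal sketch of how such a comparison might be timed (the helper time_sum is just an illustration, not part of the original measurement; df is assumed to be a large DataFrame whose numbers sit in the second column):

import time

def time_sum(fn, label):
    # illustrative timing helper: run the given summation callable
    # and report its result and wall-clock time
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {result} in {time.perf_counter() - start:.2f} s")

time_sum(lambda: df.groupBy().sum().collect()[0][0], "groupBy")
time_sum(lambda: df.rdd.map(lambda x: (1, x[1]))
                       .reduceByKey(lambda a, b: a + b)
                       .collect()[0][1], "reduceByKey")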

Aron Asztalos
    Great! It worked! But what if I need to sum all columns? I tried df.groupBy().sum().collect()[0].asDict(), but my data doesn't fit in memory, so I'm trying to use your tip as a workaround. For example, list(map(lambda x: df.select(list(x)).groupBy().sum().collect()[0].asDict(), np.array_split(PossibleNulls, 10))) worked, but it's too slow ;( – magavo Jul 11 '21 at 14:39
39

If you want a specific column:

import pyspark.sql.functions as F     

df.agg(F.sum("my_column")).collect()[0][0]
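If you would rather access the value by name than by position, you could alias the aggregate (the name total here is just an illustration):

total = df.agg(F.sum("my_column").alias("total")).collect()[0]["total"]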
LaSul
17

This is another way you can do it, using agg and collect:

sum_number = df.agg({"Number":"sum"}).collect()[0]

result = sum_number["sum(Number)"]
Ali AzG
2

Similar to the other answers, but without the use of groupBy or agg. I just select the column in question, sum it, collect it, and then index into the first row and first column to get back an int. The only reason I chose this over the accepted answer is that I am new to PySpark and was confused that the 'Number' column was not explicitly summed there. If I had to come back after some time and try to understand what was happening, syntax like the below would be easier for me to follow.

import pyspark.sql.functions as f   

df.select(f.sum('Number')).collect()[0][0]
ChrisDanger
    Answers to this question are confusingly similar to each other. Can you add a note explaining how this improves on other answers here? – joanis Sep 09 '21 at 17:08
0

You can also try using the first() function. It returns the first row of the dataframe, and you can access the values of the respective columns by index.

df.groupBy().sum().first()[0]

In your case, the result is a dataframe with a single row and column, so the above snippet works.
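Since first() returns a Row, you can also access the value by column name; the name sum(Number) below assumes Spark's default naming for the unaliased aggregate:

df.groupBy().sum().first()["sum(Number)"]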

Akshat Chaturvedi
0

Select the column as an RDD and abuse keys() to pull the value out of each Row (or use .map(lambda x: x[0])), then use the RDD's sum:

df.select("Number").rdd.keys().sum()

SQL sum using selectExpr:

df.selectExpr("sum(Number)").first()[0]
qwr
-2

The following should work:

df.groupBy().sum().rdd.map(lambda x: x[0]).collect()
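Note that collect() returns a list here ([130] in the question's example), so to get the int itself you would still index the first element:

result = df.groupBy().sum().rdd.map(lambda x: x[0]).collect()[0]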
ags29
-3

Sometimes when you read a CSV file into a PySpark DataFrame, a numeric column comes in as the string type (e.g. '23'). In that case you should use pyspark.sql.functions.sum to get the result as an int, not Python's built-in sum():

import pyspark.sql.functions as F                                                    
df.groupBy().agg(F.sum('Number')).show()
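If the column really has come in as strings, a minimal sketch of casting it back before aggregating (assuming the column is named Number as in the question):

import pyspark.sql.functions as F

# cast the string column to an integer type, then aggregate to a Python int
df = df.withColumn("Number", F.col("Number").cast("int"))
result = df.agg(F.sum("Number")).collect()[0][0]  # e.g. 130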
seasee my