
I am working in PySpark and I have a DataFrame with the following columns.

Q1 = spark.read.csv("Q1final.csv", header=True, inferSchema=True)
Q1.printSchema()

root
|-- index_date: integer (nullable = true)
|-- item_id: integer (nullable = true)
|-- item_COICOP_CLASSIFICATION: integer (nullable = true)
|-- item_desc: string (nullable = true)
|-- index_algorithm: integer (nullable = true)
|-- stratum_ind: integer (nullable = true)
|-- item_index: double (nullable = true)
|-- all_gm_index: double (nullable = true)
|-- gm_ra_index: double (nullable = true)
|-- coicop_weight: double (nullable = true)
|-- item_weight: double (nullable = true)
|-- cpih_coicop_weight: double (nullable = true)

I need the sum of all the elements in the last column (cpih_coicop_weight) to use as a Double in other parts of my program. How can I do it? Thank you very much in advance!

Lauren

3 Answers


If you want just a double or an int returned, the following function will work:

from pyspark.sql import functions as F

def sum_col(df, col):
    return df.select(F.sum(col)).collect()[0][0]

Then

sum_col(Q1, 'cpih_coicop_weight')

will return the sum. I am new to PySpark, so I am not sure why such a simple method is not available on the column object in the library.
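
For anyone new to PySpark, here is a minimal, self-contained sketch of the function in use. The toy DataFrame and its values are invented for illustration; only the column name comes from the question.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def sum_col(df, col):
    # select(F.sum(col)) yields a one-row, one-column DataFrame;
    # collect()[0][0] extracts that single cell as a plain Python float
    return df.select(F.sum(col)).collect()[0][0]

# Hypothetical stand-in for Q1, reusing the column name from the question
df = spark.createDataFrame([(1.5,), (2.5,), (4.0,)], ["cpih_coicop_weight"])

print(sum_col(df, "cpih_coicop_weight"))  # 8.0, usable as a Double elsewhere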

Louis Yang
    I totally agree with this statement. Why is it necessary to call an empty groupby to get the sum of a column? This function should be the accepted answer (and probably in the library) – seth127 Dec 03 '18 at 21:24

Try this:

from pyspark.sql import functions as F
total = Q1.groupBy().agg(F.sum("cpih_coicop_weight")).collect()

The variable total now holds your result as a one-element list of Rows.
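
Since the question asks for the bare double, note that you still need to index into that list. A small follow-up sketch, assuming total from the snippet above:

# total looks like [Row(sum(cpih_coicop_weight)=...)]
total_value = total[0][0]  # the sum as a plain Python float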

Steven

This can also be done without the explicit groupBy():

from pyspark.sql import functions as F
total = Q1.agg(F.sum("cpih_coicop_weight")).collect()
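
As in the other answers, collect() hands back a list of Rows. If you just want the double, first() is a slightly terser route; this sketch assumes the same F import as above:

# first() returns the single aggregated Row; [0] extracts the double
total = Q1.agg(F.sum("cpih_coicop_weight")).first()[0]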
Athar