
I have a dataframe which consists of 3 rows and more than 20 columns (dates):

+----+-----+-----+         
|Cat |01/02|02/02|......
+----+-----+-----+
| a  | 20  |   7 |......
| b  | 30  |  12 |......
+----+-----+-----+

and I want to get the sum of each column and add it as an extra row to the dataframe. In other words, I expect it to look like this:

+----+-----+-----+
|Cat |01/02|02/02|......
+----+-----+-----+
| a  | 20  |   7 |......
| b  | 30  |  12 |......
| All| 50  |  19 |......
+----+-----+-----+

I am coding in PySpark and my script is the following:

    from pyspark.sql import functions as F

    for col_name in fs.columns:
        print(col_name)

        sf = df.unionAll(
            df.select([
                F.lit('Total').alias('Cat'),
                F.sum(fs.col_name).alias("{}").format(col_name)
            ])
        )

Unfortunately I am getting the error `AttributeError: 'DataFrame' object has no attribute 'col_name'`. Any ideas what I am doing wrong? Thank you in advance!

Gerasimos
  • remove the `fs.` - you can't use the dot accessor for a column with a string variable. Try: `F.sum(F.col(col_name)).alias(col_name)` – pault Mar 05 '19 at 21:43
  • Also I don't understand your loop - are you doing each column one at a time? You can probably achieve the same with something like: `df.union(df.select(F.lit("Total").alias("Cat"), *[F.sum(F.col(c)).alias(c) for c in df.columns if c != 'Cat']))` – pault Mar 05 '19 at 21:46
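
For reference, a minimal runnable sketch along the lines of pault's suggestion above; the toy dataframe, the SparkSession setup, and the "All" label are assumptions added for illustration, not code from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy data mirroring the layout in the question.
    df = spark.createDataFrame(
        [("a", 20, 7), ("b", 30, 12)],
        ["Cat", "01/02", "02/02"],
    )

    # Sum every column except "Cat" in a single select, label the row "All",
    # and append it with union (column order matches df).
    totals = df.select(
        F.lit("All").alias("Cat"),
        *[F.sum(F.col(c)).alias(c) for c in df.columns if c != "Cat"],
    )

    result = df.union(totals)
    result.show()
    # +---+-----+-----+
    # |Cat|01/02|02/02|
    # +---+-----+-----+
    # |  a|   20|    7|
    # |  b|   30|   12|
    # |All|   50|   19|
    # +---+-----+-----+

The key point is that `F.col(col_name)` looks a column up by its string name, while `fs.col_name` asks the dataframe for an attribute literally named `col_name`, which is exactly what raises the AttributeError.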
