
I'm working in PySpark and I have a dataset like this:

[image: sample input DataFrame]

I want to create a new df like this, with the corresponding sums:

[image: desired output DataFrame with the summed columns]

So I tried this code:


    df = df.withColumnRenamed("month_actual_january", "monthjanuary")
    fin=df.groupBy(["column1","column2"]).sum()

The problem is that I get the following error:

Attribute sum(column3) contains an invalid character among ,;{}()\n\t=. Please use an alias to rename it

Do you know how to fix this error? Thanks!

Nabs335
    Does this answer your question? [Pyspark dataframe: Summing over a column while grouping over another](https://stackoverflow.com/questions/33961899/pyspark-dataframe-summing-over-a-column-while-grouping-over-another) – samkart Jul 18 '22 at 16:53

1 Answer


Let's aggregate all the remaining columns with a list comprehension, unpack it with `*`, and alias each sum so the generated names contain no invalid characters:

    # use Spark's sum, not Python's builtin
    from pyspark.sql.functions import sum

    df.groupBy(["column1", "column2"]).agg(
        *[sum(x).alias(f"sum_{x}") for x in df.drop("column1", "column2").columns]
    ).show()
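The `alias` call renames each aggregate from e.g. `sum(column3)` to `sum_column3`, so the result columns no longer contain the parentheses that triggered the invalid-character error.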
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • Hi, thanks for your answer. It solves the previous problem with the column names, but I have a new error: unsupported operation between types: `+` is not a supported operation for types int and str. So I checked my column types and they are all integers except column1 and column2, but your code excludes those, so I don't understand why I get this error... – Nabs335 Jul 19 '22 at 12:45
  • These are my column types: column1 (string), column2 (string), column3 (integer), column4 (integer), column5 (integer). And I checked the values of columns 3, 4 and 5; all cells contain integers – Nabs335 Jul 19 '22 at 12:53
  • @Nabs335 see if you have imported `sum()` from pyspark sql functions – samkart Jul 19 '22 at 14:10
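For reference, a minimal sketch of the pitfall samkart is pointing at, using made-up data (the column names and values below are placeholders, not the asker's real dataset): Python's builtin `sum` tries to add the characters of the column-name string to `0` and raises the int/str error, while `pyspark.sql.functions.sum` builds the intended aggregate expression.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Placeholder data, only to illustrate the import issue
    df = spark.createDataFrame(
        [("a", "x", 1, 10), ("a", "x", 2, 20), ("b", "y", 3, 30)],
        ["column1", "column2", "column3", "column4"],
    )

    # Python's builtin sum fails on a column-name string:
    #   sum("column3")  ->  TypeError: unsupported operand type(s) for +: 'int' and 'str'

    # Spark's sum returns a Column expression, so the aggregation works:
    df.groupBy("column1", "column2").agg(
        *[F.sum(c).alias(f"sum_{c}") for c in df.drop("column1", "column2").columns]
    ).show()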