
I am trying to add a new "grand_total" column to my table, with the same value repeated on every row.

E.g.:

first_name  Order_id  price
John        1         2.5
Ali         2         2
Abdul       3         3.5

What I want is:

first_name  Order_id  price  grand_total
John        1         2.5    8
Ali         2         2      8
Abdul       3         3.5    8

My code:

import pyspark.sql.functions as F

new_df = new_df.withColumn("grand_total", F.sum(F.col("price")).over())

The error I receive is:

TypeError: over() missing 1 required positional argument: 'window'

I am confused because I come from a SQL background, where SUM(column_name) OVER () is valid without having to define a window inside OVER ().
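For context, the bare OVER () does work in Spark itself when written as SQL text; it is only the DataFrame API's over() that errors. A minimal sketch, assuming a SparkSession named spark and using "orders" as a made-up view name:

from pyspark.sql import functions as F

# Registering a temp view lets plain Spark SQL run the empty-window form.
new_df.createOrReplaceTempView("orders")
spark.sql("SELECT *, SUM(price) OVER () AS grand_total FROM orders").show()

# The same SQL fragment also works inline via expr():
new_df.withColumn("grand_total", F.expr("sum(price) over ()")).show()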

2 Answers


You can do this by aggregating the price column with a sum, collecting the total back as a scalar, and then adding it to every row as a grand_total column. Try this:

from pyspark.sql.functions import sum, col, lit

# Aggregate to a single total and pull it back to the driver as a Python value.
total_sum_price = new_df.agg(sum(col('price'))).collect()[0][0]
# Attach that scalar to every row as a literal column.
new_df = new_df.withColumn('grand_total', lit(total_sum_price))
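One design note on this approach: collect() triggers a Spark job immediately to compute the total, and lit() then embeds the value as a constant in the plan; a window-based version stays lazy until an action runs. For a single grand total, either works.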
  • Appreciate this workaround; however, I am looking to make my code snippet work, or at least to understand why this works in SQL and not in PySpark – Hassaan Anwar Jul 28 '22 at 19:32
  • Well, over() in PySpark requires a Window argument; that is simply how the API is defined. See https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.over.html?highlight=over#pyspark.sql.Column.over – Pedro Crespo Jul 28 '22 at 19:47

Try this:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Window.partitionBy() with no columns defines a single global window,
# the DataFrame-API equivalent of SQL's OVER ().
new_df = new_df.withColumn("grand_total", F.sum(F.col("price")).over(Window.partitionBy()))
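Worth noting: partitionBy() with no columns puts every row into one window partition, so Spark will typically log a warning about moving all data to a single partition. That is harmless for a small table like the example, but it can be slow at scale.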
– ARCrow