0

I have a DataFrame in PySpark with columns like Quantity1, Quantity2, ..., QuantityN. I want to sum up the previous 5 Quantity field values for each of these Quantity fields. Essentially I need a lead/lag-style operation, but applied across columns rather than rows, and I haven't found a way to do that. If anyone has an idea or an alternate way of doing this in PySpark or SQL, please suggest it.

Example: input dataset [image in original post], Bucket Size = 2

Output dataset: [image in original post]

Explanation: Bucket1 = Qty1+Qty2

Bucket2 = Qty2+Qty3

Bucket3 = Qty3+Qty4

BucketN = QtyN+Qty(N+1)

Asif Khan
    Please post an example input and expected result. Currently, it is not clear what you have in mind. – ZygD Jun 18 '22 at 09:53

1 Answer

0

try this:

import pyspark.sql.functions as f

# Bucket_i = Qty_i + Qty_{i+1}; assumes every column in df is one of Qty1 ... QtyN
df = (
    df
    .select(*[(f.col(f'Qty{i}') + f.col(f'Qty{i+1}')).alias(f'Bucket{i}')
              for i in range(1, len(df.columns))])
)
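For illustration, here is a minimal, self-contained sketch of what that select produces on a toy DataFrame (the values below are made up; only the Qty1...QtyN naming and the bucket size of 2 come from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Toy input: four Qty columns, bucket size 2
df = spark.createDataFrame(
    [(10, 20, 30, 40), (1, 2, 3, 4)],
    ['Qty1', 'Qty2', 'Qty3', 'Qty4'],
)

df.select(*[(f.col(f'Qty{i}') + f.col(f'Qty{i+1}')).alias(f'Bucket{i}')
            for i in range(1, len(df.columns))]).show()
# +-------+-------+-------+
# |Bucket1|Bucket2|Bucket3|
# +-------+-------+-------+
# |     30|     50|     70|
# |      3|      5|      7|
# +-------+-------+-------+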
ARCrow
  • Thanks @ARCrow for your support. This solution works well when the data size is small, but if my data size is in GBs it will cause optimization issues. That's why I am looking for a built-in solution, if one is available. Or can we use internal PySpark functions to accomplish this? – Asif Khan Jun 19 '22 at 11:16
  • From an optimization perspective, if you're referring to the DataFrame being an immutable object and withColumn creating a new dataset each time it is called (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame), then we can use a select statement instead. I have updated my response. – ARCrow Jun 19 '22 at 14:59
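Regarding the large-data concern above: the whole computation is a single select, i.e. one narrow projection with no shuffle, so it should scale with the data size. As a hedged sketch (the helper name add_buckets, the qty_cols list, and the bucket_size parameter are illustrative and not part of the original answer), the same idea generalizes to an arbitrary bucket size, such as the 5 mentioned in the question, by summing each window of consecutive columns with plain Column arithmetic:

from functools import reduce
import pyspark.sql.functions as f

def add_buckets(df, qty_cols, bucket_size):
    # BucketK = qty_cols[K-1] + ... + qty_cols[K-1 + bucket_size - 1]
    n_buckets = len(qty_cols) - bucket_size + 1
    bucket_cols = [
        reduce(lambda a, b: a + b, [f.col(c) for c in qty_cols[k:k + bucket_size]])
        .alias(f'Bucket{k + 1}')
        for k in range(n_buckets)
    ]
    return df.select(*bucket_cols)

# e.g. bucket size 5, as mentioned in the question:
# result = add_buckets(df, [c for c in df.columns if c.startswith('Qty')], 5)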