0

I have a DataFrame in PySpark with columns like Quantity1, Quantity2, ..., QuantityN. I want to sum up the previous 5 Quantity field values for each of these Quantity fields. Essentially I need a lead/lag-style operation, but applied across columns rather than rows, and I haven't found a way to do that. If anyone has an idea or an alternate way of doing this in PySpark or SQL, please suggest it.

Example: input dataset [image in original post], Bucket Size = 2

Output dataset: [image in original post]

Explanation: Bucket1 = Qty1+Qty2

Bucket2 = Qty2+Qty3

Bucket3 = Qty3+Qty4

BucketN = QtyN+Qty(N+1)

Asif Khan
    Please post an example input and expected result. Currently, it is not clear what you have in mind. – ZygD Jun 18 '22 at 09:53

1 Answer

0

try this:

import pyspark.sql.functions as f

# Bucket_i = Qty_i + Qty_{i+1}; assumes every column in df is one of Qty1 ... QtyN
df = (
    df
    .select(*[(f.col(f'Qty{i}') + f.col(f'Qty{i+1}')).alias(f'Bucket{i}')
              for i in range(1, len(df.columns))])
)
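For illustration, here is a minimal, self-contained sketch of what that select produces on a toy DataFrame (the values below are made up; only the Qty1...QtyN naming and the bucket size of 2 come from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Toy input: four Qty columns, bucket size 2
df = spark.createDataFrame(
    [(10, 20, 30, 40), (1, 2, 3, 4)],
    ['Qty1', 'Qty2', 'Qty3', 'Qty4'],
)

df.select(*[(f.col(f'Qty{i}') + f.col(f'Qty{i+1}')).alias(f'Bucket{i}')
            for i in range(1, len(df.columns))]).show()
# +-------+-------+-------+
# |Bucket1|Bucket2|Bucket3|
# +-------+-------+-------+
# |     30|     50|     70|
# |      3|      5|      7|
# +-------+-------+-------+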
ARCrow
  • Thanks @ARCrow for your support. This solution works well when the data size is small, but if my data size is in GBs it will cause optimization issues. That's why I am looking for a built-in solution, if one is available. Or can we use internal PySpark functions to accomplish this? – Asif Khan Jun 19 '22 at 11:16
  • From an optimization perspective, if you're referring to the DataFrame being an immutable object and withColumn creating a new dataset each time it is called (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame), then we can use a select statement instead. I have updated my response. – ARCrow Jun 19 '22 at 14:59
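Regarding the large-data concern above: the whole computation is a single select, i.e. one narrow projection with no shuffle, so it should scale with the data size. As a hedged sketch (the helper name add_buckets, the qty_cols list, and the bucket_size parameter are illustrative and not part of the original answer), the same idea generalizes to an arbitrary bucket size, such as the 5 mentioned in the question, by summing each window of consecutive columns with plain Column arithmetic:

from functools import reduce
import pyspark.sql.functions as f

def add_buckets(df, qty_cols, bucket_size):
    # BucketK = qty_cols[K-1] + ... + qty_cols[K-1 + bucket_size - 1]
    n_buckets = len(qty_cols) - bucket_size + 1
    bucket_cols = [
        reduce(lambda a, b: a + b, [f.col(c) for c in qty_cols[k:k + bucket_size]])
        .alias(f'Bucket{k + 1}')
        for k in range(n_buckets)
    ]
    return df.select(*bucket_cols)

# e.g. bucket size 5, as mentioned in the question:
# result = add_buckets(df, [c for c in df.columns if c.startswith('Qty')], 5)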