
I have two columns in my dataframe: "Cust_ID" and "Fill_days". I need to add a column for "Adjusted Days", like this:

(image: example table with "Cust_ID", "Fill_days" and the desired "Adjusted Days" values)

The idea: for the first row of every "Cust_ID", the "Adjusted Days" value should be 0. For the following rows: if the previous row's "Fill_days" + "Adjusted Days" is < 0, then 0; otherwise the previous row's "Fill_days" + "Adjusted Days". In Excel, the following formulas would be used:

C2=0
C3=IF(B2+C2<0,0,B2+C2)
C4=IF(B3+C3<0,0,B3+C3)

I was able to write this using pandas; however, the code is slow.
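For reference, here is a minimal sketch of the kind of row-by-row pandas logic I mean (an illustration of the approach, not my exact code):

import pandas as pd

df = pd.DataFrame({
    'Cust_ID':   [1, 1, 1, 1, 1, 1, 5, 5, 5],
    'Fill_days': [5, 2, 1, -9, -2, 9, -2, 1, -1],
})

# Each row depends on the previous row's Fill_days + Adjusted_Days,
# clamped at 0 (a Python-level loop per customer, hence the slowness).
def adjust(s):
    out = [0]
    for prev_fill in s.iloc[:-1]:
        out.append(max(0, prev_fill + out[-1]))
    return out

df['Adjusted_Days'] = (
    df.groupby('Cust_ID')['Fill_days']
      .transform(lambda s: pd.Series(adjust(s), index=s.index))
)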

How can I do it in Spark? I am using Spark 3.2.1.


1 Answer

First of all, you need a column that defines the order. Spark does not know the order of rows: they can end up at any position unless you have a column which tells the exact order. I have added the column "order".
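If your data has no such column at all, there is no fully reliable way to recover the original order. As a last resort you could generate one, e.g. with monotonically_increasing_id (a sketch; the result is only meaningful if the existing partitioning already reflects the desired order):

from pyspark.sql import functions as F

# Assumption: rows are already physically stored in the desired order.
# The generated ids are increasing but not consecutive.
df = df.withColumn('order', F.monotonically_increasing_id())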

Secondly, referencing the column that is being computed is not possible per se (a window function cannot read values of the column it is currently creating), so you need a workaround. The following will collect the data into a list for every "Cust_ID" and do the operations only within this list. When it's done, explode the result using inline.
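As a minimal illustration of that last step, inline turns an array of structs into rows, one column per struct field:

spark.sql("SELECT inline(array(struct(1, 'a'), struct(2, 'b')))").show()
# +----+----+
# |col1|col2|
# +----+----+
# |   1|   a|
# |   2|   b|
# +----+----+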

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 1, 5),
     (1, 2, 2),
     (1, 3, 1),
     (1, 4, -9),
     (1, 5, -2),
     (1, 6, 9),
     (5, 1, -2),
     (5, 2, 1),
     (5, 3, -1)],
    ['Cust_ID', 'order', 'Fill_days'])

Script:

df = df.groupBy('Cust_ID').agg(
    F.aggregate(
        # Collect each customer's rows and sort them by "order",
        # so the fold sees the rows in the correct sequence
        F.array_sort(F.collect_list(F.struct('order', 'Fill_days'))),
        # Seed: a dummy struct with Fill_days = 0 and Adjusted_Days = 0
        F.expr("array(struct(bigint(null) order, 0L Fill_days, 0L Adjusted_Days))"),
        # Append each row, deriving Adjusted_Days from the last element so far;
        # array_union also dedupes, which is safe here because "order" is unique
        lambda acc, x: F.array_union(
            acc,
            F.array(x.withField(
                'Adjusted_Days',
                F.greatest(F.lit(0), F.element_at(acc, -1)['Fill_days'] + F.element_at(acc, -1)['Adjusted_Days'])
            ))
        )
    ).alias('a')
)
# Drop the seed (slice from position 2) and explode the structs into columns
df = df.selectExpr("Cust_ID", "inline(slice(a, 2, size(a)))")

df.show()
# +-------+-----+---------+-------------+
# |Cust_ID|order|Fill_days|Adjusted_Days|
# +-------+-----+---------+-------------+
# |1      |1    |5        |0            |
# |1      |2    |2        |5            |
# |1      |3    |1        |7            |
# |1      |4    |-9       |8            |
# |1      |5    |-2       |0            |
# |1      |6    |9        |0            |
# |5      |1    |-2       |0            |
# |5      |2    |1        |0            |
# |5      |3    |-1       |1            |
# +-------+-----+---------+-------------+

To understand the technique in depth, please analyze this answer, as it is not an easy task to explain it a second time.
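To help follow the fold, this is how the accumulator evolves for Cust_ID = 5 (structs shown as (order, Fill_days, Adjusted_Days); the first struct is the seed, which slice(a, 2, size(a)) drops at the end):

# seed:            [(null, 0, 0)]
# after (1, -2):   append greatest(0, 0 + 0)  = 0  ->  (1, -2, 0)
# after (2, 1):    append greatest(0, -2 + 0) = 0  ->  (2, 1, 0)
# after (3, -1):   append greatest(0, 1 + 0)  = 1  ->  (3, -1, 1)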

  • Py4JError: An error occurred while calling None.org.apache.spark.sql.catalyst.expressions.UnresolvedNamedLambdaVariable. Trace: py4j.security.Py4JSecurityException: Constructor public org.apache.spark.sql.catalyst.expressions.UnresolvedNamedLambdaVariable(scala.collection.Seq) is not whitelisted. at py4j.security.WhitelistingPy4JSecurityManager.checkConstructor(WhitelistingPy4JSecurityManager.java:451) – Rishabh Oct 25 '22 at 12:27
    Reading some threads on "is not whitelisted" leads me to think that this is the problem of your cluster configurations. – ZygD Oct 25 '22 at 12:34
    It’s working, thank you so much ❤️. Had cluster issue – Rishabh Oct 25 '22 at 20:22