I have a dataframe that looks like this:
+--------+-------------------------------------+-----------+
| Worker | Schedule | Overtime |
+--------+-------------------------------------+-----------+
| 1 | 23344--23344--23344--23344--23344-- | 3 |
+--------+-------------------------------------+-----------+
| 2 | 34455--34455--34455--34455--34455-- | 2 |
+--------+-------------------------------------+-----------+
| 3 | 466554-466554-466554-466554-466554- | 1 |
+--------+-------------------------------------+-----------+
Each digit in the 35-character Schedule string is the number of hours a worker works on the corresponding day of a 35-day window.
Here is how to read each row:
- Worker #1 works 2hr on Monday, 3hr on Tuesday, 3hr on Wednesday, 4hr on Thursday, 4hr on Friday, then off on Saturday and Sunday... (same for the following weeks in that 35-day window)
- Worker #3 works 4hr on Monday, 6hr on Tuesday, 6hr on Wednesday, 5hr on Thursday, 5hr on Friday, 4hr on Saturday, then off on Sunday... (same for the following weeks in that 35-day window)
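For reference, the sample DataFrame can be reproduced like this (a minimal sketch; spark is assumed to be an existing SparkSession):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [
        (1, "23344--23344--23344--23344--23344--", 3),
        (2, "34455--34455--34455--34455--34455--", 2),
        (3, "466554-466554-466554-466554-466554-", 1),
    ],
    ["Worker", "Schedule", "Overtime"],
)
```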
I would like to implement the following operation:
- For each day of a worker's schedule, if the hours worked that day plus Overtime are <= 6, add the overtime hours to that day. Days off (marked with -) are left unchanged.
For example:
Worker #1's updated schedule would look like:
56644--56644--56644--56644--56644--
- 2+3 <= 6 -> add 3 hrs
- 3+3 <= 6 -> add 3 hrs
- 4+3 > 6 -> no edit
- -- -> days off, no edit
Using same logic, Worker #2's updated schedule would look like:
56655--56655--56655--56655--56655--
Worker #3's updated schedule would look like:
566665-566665-566665-566665-566665-
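To make the rule fully concrete, this is the per-day logic in plain Python (add_overtime is just an illustrative name); I am looking for the PySpark equivalent:

```python
def add_overtime(schedule: str, overtime: int) -> str:
    # Apply the rule character by character.
    out = []
    for c in schedule:
        if c == "-":                      # day off: keep as-is
            out.append(c)
        elif int(c) + overtime <= 6:      # overtime fits under the 6-hour cap
            out.append(str(int(c) + overtime))
        else:                             # would exceed 6 hours: no change
            out.append(c)
    return "".join(out)

assert add_overtime("23344--23344--23344--23344--23344--", 3) \
    == "56644--56644--56644--56644--56644--"
```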
How do I perform this operation in PySpark?
Any help is much appreciated!