I have a dataframe that looks like this:
+--------+-------------------------------------+-----------+
| Worker | Schedule | Overtime |
+--------+-------------------------------------+-----------+
| 1 | 23344--23344--23344--23344--23344-- | 3 |
+--------+-------------------------------------+-----------+
| 2 | 34455--34455--34455--34455--34455-- | 2 |
+--------+-------------------------------------+-----------+
| 3 | 466554-466554-466554-466554-466554- | 1 |
+--------+-------------------------------------+-----------+
Each digit in the 35-character Schedule string is the number of hours a worker works on the corresponding day of a 35-day window.
Here is how to read each row:
- Worker #1 works 2hr on Monday, 3hr on Tuesday, 3hr on Wednesday, 4hr on Thursday, 4hr on Friday, then off on Saturday and Sunday... (same for the following weeks in that 35-day window)
- Worker #3 works 4hr on Monday, 6hr on Tuesday, 6hr on Wednesday, 5hr on Thursday, 5hr on Friday, 4hr on Saturday, then off on Sunday... (same for the following weeks in that 35-day window)
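For reference, the sample DataFrame can be reproduced like this (a minimal sketch; spark is assumed to be an existing SparkSession):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [
        (1, "23344--23344--23344--23344--23344--", 3),
        (2, "34455--34455--34455--34455--34455--", 2),
        (3, "466554-466554-466554-466554-466554-", 1),
    ],
    ["Worker", "Schedule", "Overtime"],
)
```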
I would like to implement the following operation:
- For each day of a worker's schedule, if the hours worked that day plus Overtime are <= 6, add the overtime hours to that day. Days off (marked with -) are left unchanged.
For example:
Worker #1's updated schedule would look like:
56644--56644--56644--56644--56644--
- 2+3 <= 6 -> add 3 hrs
- 3+3 <= 6 -> add 3 hrs
- 4+3 > 6 -> no edit
- -- -> days off, no edit
Using same logic, Worker #2's updated schedule would look like:
56655--56655--56655--56655--56655--
Worker #3's updated schedule would look like:
566665-566665-566665-566665-566665-
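To make the rule fully concrete, this is the per-day logic in plain Python (add_overtime is just an illustrative name); I am looking for the PySpark equivalent:

```python
def add_overtime(schedule: str, overtime: int) -> str:
    # Apply the rule character by character.
    out = []
    for c in schedule:
        if c == "-":                      # day off: keep as-is
            out.append(c)
        elif int(c) + overtime <= 6:      # overtime fits under the 6-hour cap
            out.append(str(int(c) + overtime))
        else:                             # would exceed 6 hours: no change
            out.append(c)
    return "".join(out)

assert add_overtime("23344--23344--23344--23344--23344--", 3) \
    == "56644--56644--56644--56644--56644--"
```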
How do I perform this operation in PySpark?
Any help is much appreciated!