Disclaimer: This might be possible duplicate but I cannot find the exact solution. Please feel free to mark this question as duplicate and provide link to duplicate question in comments.
I am still learning python dataframe operations and this possibly has a very simple solution which I am not able to figure out.
I have a spark dataframe around 12 million rows with few columns. Now I want to derive a column from based on the greatest value from two other columns and the derived value of previous row.
Input-
Column A | Column B | Column C | Column D | Column E | Column F | Column G |
---|---|---|---|---|---|---|
1 | Apr | 2022 | 3 | 2 | 3 | 0 |
1 | May | 2022 | 3 | 2 | 3 | 0 |
1 | Jun | 2022 | 3 | 1 | 1 | 0 |
1 | Jul | 2022 | 3 | 2 | 4 | 0 |
1 | Aug | 2022 | 3 | 2 | -1 | 0 |
1 | Sep | 2022 | 3 | 8 | 3 | 0 |
1 | Oct | 2022 | 3 | 2 | 3 | 0 |
1 | Nov | 2022 | 3 | 2 | 0 | 0 |
1 | Dec | 2022 | 3 | 2 | 0 | 0 |
1 | Jan | 2023 | 3 | 2 | 6 | 0 |
1 | Feb | 2023 | 3 | 2 | 5 | 0 |
1 | Mar | 2023 | 3 | 2 | 6 | 0 |
1 | Apr | 2023 | 3 | 2 | 7 | 0 |
1 | May | 2023 | 0 | 0 | 0 | 0 |
2 | Apr | 2017 | 3 | 2 | 3 | 0 |
2 | Apr | 2017 | 3 | 2 | 3 | 0 |
2 | Apr | 2017 | 3 | 1 | 1 | 0 |
2 | Apr | 2017 | 3 | 2 | 4 | 0 |
2 | Apr | 2017 | 3 | 2 | -1 | 0 |
2 | Apr | 2017 | 3 | 7 | 3 | 0 |
Output-
Column A | Column B | Column C | Column D | Column E | Column F | Column G | Column H | Logic |
---|---|---|---|---|---|---|---|---|
1 | Apr | 2022 | 3 | 2 | 3 | 0 | 3 | Max of Column E and F |
1 | May | 2022 | 3 | 2 | 3 | 0 | 3 | Max of Column E and F and output of Column H previous row |
1 | Jun | 2022 | 3 | 1 | 1 | 0 | 3 | Max of Column E and F and output of Column H previous row |
1 | Jul | 2022 | 3 | 2 | 4 | 0 | 4 | Max of Column E and F and output of Column H previous row |
1 | Aug | 2022 | 3 | 2 | -1 | 0 | 4 | Max of Column E and F and output of Column H previous row |
1 | Sep | 2022 | 3 | 8 | 3 | 0 | 8 | Max of Column E and F and output of Column H previous row |
1 | Oct | 2022 | 0 | 2 | 3 | 0 | 8 | Max of Column E and F and output of Column H previous row |
1 | Nov | 2022 | 0 | 2 | 0 | 0 | 8 | Max of Column E and F and output of Column H previous row |
1 | Dec | 2022 | 0 | 0 | 0 | 0 | 8 | Max of Column E and F and output of Column H previous row |
1 | Jan | 2023 | 0 | 0 | 6 | -3 | 6 | If Column G < 0 Max of Column E and F and (output of Column H previous row + Column G) |
1 | Feb | 2023 | 0 | 2 | 5 | 0 | 6 | Max of Column E and F and output of Column H previous row |
1 | Mar | 2023 | 0 | 2 | 6 | 0 | 6 | Max of Column E and F and output of Column H previous row |
1 | Apr | 2023 | 0 | 0 | 7 | 0 | 7 | Max of Column E and F and output of Column H previous row |
1 | May | 2023 | 0 | 0 | 0 | 0 | 7 | Max of Column E and F and output of Column H previous row |
2 | Apr | 2017 | 3 | 2 | 3 | 0 | 3 | Max of Column E and F |
2 | Apr | 2017 | 3 | 2 | 3 | 0 | 3 | Max of Column E and F and output of Column H previous row |
2 | Apr | 2017 | 3 | 1 | 1 | 0 | 3 | Max of Column E and F and output of Column H previous row |
2 | Apr | 2017 | 3 | 2 | 4 | 0 | 4 | Max of Column E and F and output of Column H previous row |
2 | Apr | 2017 | 3 | 2 | -1 | 0 | 4 | Max of Column E and F and output of Column H previous row |
2 | Apr | 2017 | 3 | 7 | 3 | 0 | 7 | Max of Column E and F and output of Column H previous row |
New Column H needs to be created with following logic
- Dataframe is populated with distinct values in column A spread across diff month and year and aligned in ascending order.
- First row of each distinct value in Column A must be maximum of Column E and F
- Subsequent row value for Column H is maximum of Column E and F and value derived on previous row for column H.(when Column G >=0)
- If Column G < 0, Column H is maximum of (Column E and F and (column G + previous row value for column H))