How to populate rows of Spark dataframe column based with previous row based on a multiple conditions?

Question

Disclaimer: This might be possible duplicate but I cannot find the exact solution. Please feel free to mark this question as duplicate and provide link to duplicate question in comments.

I am still learning python dataframe operations and this possibly has a very simple solution which I am not able to figure out.

I have a spark dataframe around 12 million rows with few columns. Now I want to derive a column from based on the greatest value from two other columns and the derived value of previous row.

Input-

Column A	Column B	Column C	Column D	Column E	Column F
1	Apr	2022	3	2	3
1	May	2022	3	2	3
1	Jun	2022	3	1	1
1	Jul	2022	3	2	4
1	Aug	2022	3	2	-1
1	Sep	2022	3	8	3
1	Oct	2022	3	2	3
1	Nov	2022	3	2	0
1	Dec	2022	3	2	0
1	Jan	2023	3	2	6
1	Feb	2023	3	2	5
1	Mar	2023	3	2	6
1	Apr	2023	3	2	7
1	May	2023	0	0	0
2	Apr	2017	3	2	3
2	Apr	2017	3	2	3
2	Apr	2017	3	1	1
2	Apr	2017	3	2	4
2	Apr	2017	3	2	-1
2	Apr	2017	3	7	3

Output-

Column A	Column B	Column C	Column D	Column E	Column F	Column G	Column H	Logic
1	Apr	2022	3	2	3	0	3	Max of Column E and F
1	May	2022	3	2	3	0	3	Max of Column E and F and output of Column H previous row
1	Jun	2022	3	1	1	0	3	Max of Column E and F and output of Column H previous row
1	Jul	2022	3	2	4	0	4	Max of Column E and F and output of Column H previous row
1	Aug	2022	3	2	-1	0	4	Max of Column E and F and output of Column H previous row
1	Sep	2022	3	8	3	0	8	Max of Column E and F and output of Column H previous row
1	Oct	2022	0	2	3	0	8	Max of Column E and F and output of Column H previous row
1	Nov	2022	0	2	0	0	8	Max of Column E and F and output of Column H previous row
1	Dec	2022	0	0	0	0	8	Max of Column E and F and output of Column H previous row
1	Jan	2023	0	0	6	-3	6	If Column G < 0 Max of Column E and F and (output of Column H previous row + Column G)
1	Feb	2023	0	2	5	0	6	Max of Column E and F and output of Column H previous row
1	Mar	2023	0	2	6	0	6	Max of Column E and F and output of Column H previous row
1	Apr	2023	0	0	7	0	7	Max of Column E and F and output of Column H previous row
1	May	2023	0	0	0	0	7	Max of Column E and F and output of Column H previous row
2	Apr	2017	3	2	3	0	3	Max of Column E and F
2	Apr	2017	3	2	3	0	3	Max of Column E and F and output of Column H previous row
2	Apr	2017	3	1	1	0	3	Max of Column E and F and output of Column H previous row
2	Apr	2017	3	2	4	0	4	Max of Column E and F and output of Column H previous row
2	Apr	2017	3	2	-1	0	4	Max of Column E and F and output of Column H previous row
2	Apr	2017	3	7	3	0	7	Max of Column E and F and output of Column H previous row

New Column H needs to be created with following logic

Dataframe is populated with distinct values in column A spread across diff month and year and aligned in ascending order.
First row of each distinct value in Column A must be maximum of Column E and F
Subsequent row value for Column H is maximum of Column E and F and value derived on previous row for column H.(when Column G >=0)
If Column G < 0, Column H is maximum of (Column E and F and (column G + previous row value for column H))

In PySpark, this task is difficult. This is a close example: https://stackoverflow.com/questions/74146685/sum-two-values-on-the-previous-row?noredirect=1&lq=1 — ZygD, Nov 02 '22 at 08:17
Also, your input data does not match your output. Please be more careful, as it causes a lot of confusion! Use [Edit](https://stackoverflow.com/posts/74285763/edit) button to fix mistakes. — ZygD, Nov 02 '22 at 08:17

score 0 · Answer 1 · answered Nov 02 '22 at 08:15

understand, there are may tons of similar question but not exakt what your looking for. I hope this helps:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO('''
Column A    Column B    Column C    Column D    Column E    Column F    Column G
1   Apr 2022    3   2   3   0
1   May 2022    3   2   3   0
1   Jun 2022    3   1   1   0
1   Jul 2022    3   2   4   0
1   Aug 2022    3   2   -1  0
1   Sep 2022    3   8   3   0
'''), sep="\t")

# Max of Column E and F
df['Column H'] = df.apply(lambda row: max(row.loc['Column E'], row.loc['Column F']), axis=1)

# Helper column (Column H previous row)
df['tmp'] = df['Column H'].shift(1)

# Max of Column E and F and output of Column H previous row
df['Logic'] = df.apply(lambda row: max(row.loc['Column E'], row.loc['Column F'], row.loc['tmp']), axis=1)

# Delete helper
df.drop("tmp", axis=1)

index	Column A	Column B	Column C	Column D	Column E	Column F	Column H	Logic
0	1	Apr	2022	3	2	3	3	3.0
1	1	May	2022	3	2	3	3	3.0
2	1	Jun	2022	3	1	1	1	3.0
3	1	Jul	2022	3	2	4	4	4.0
4	1	Aug	2022	3	2	-1	2	4.0
5	1	Sep	2022	3	8	3	8	8.0

In case it does not exactly what you want, it should give you at least some ideas on how to achieve what you want.

Disclaimer: Some lines in my solution may have a simpler solution. Feel free to add a comment if so — tturbo, Nov 02 '22 at 08:18
I tried this, however works only if this needs to be applied across the dataframe. In my problem, Solution needs to be updated for different values for Column A and is not unique. In that case, how can this be twisted. Column A will have ( value 1 with 10 rows, value 2 with 5 rows). Solution has to be applied for each distinct value of Column A. Any leads. — Raja, Nov 02 '22 at 08:26
Can you please provide a simplified example data (only relevant columns, not to many) and the desired output with values? — tturbo, Nov 02 '22 at 11:19

How to populate rows of Spark dataframe column based with previous row based on a multiple conditions?

1 Answers1