1

Disclaimer: This might be possible duplicate but I cannot find the exact solution. Please feel free to mark this question as duplicate and provide link to duplicate question in comments.

I am still learning python dataframe operations and this possibly has a very simple solution which I am not able to figure out.

I have a spark dataframe around 12 million rows with few columns. Now I want to derive a column from based on the greatest value from two other columns and the derived value of previous row.

Input-

Column A Column B Column C Column D Column E Column F Column G
1 Apr 2022 3 2 3 0
1 May 2022 3 2 3 0
1 Jun 2022 3 1 1 0
1 Jul 2022 3 2 4 0
1 Aug 2022 3 2 -1 0
1 Sep 2022 3 8 3 0
1 Oct 2022 3 2 3 0
1 Nov 2022 3 2 0 0
1 Dec 2022 3 2 0 0
1 Jan 2023 3 2 6 0
1 Feb 2023 3 2 5 0
1 Mar 2023 3 2 6 0
1 Apr 2023 3 2 7 0
1 May 2023 0 0 0 0
2 Apr 2017 3 2 3 0
2 Apr 2017 3 2 3 0
2 Apr 2017 3 1 1 0
2 Apr 2017 3 2 4 0
2 Apr 2017 3 2 -1 0
2 Apr 2017 3 7 3 0

Output-

Column A Column B Column C Column D Column E Column F Column G Column H Logic
1 Apr 2022 3 2 3 0 3 Max of Column E and F
1 May 2022 3 2 3 0 3 Max of Column E and F and output of Column H previous row
1 Jun 2022 3 1 1 0 3 Max of Column E and F and output of Column H previous row
1 Jul 2022 3 2 4 0 4 Max of Column E and F and output of Column H previous row
1 Aug 2022 3 2 -1 0 4 Max of Column E and F and output of Column H previous row
1 Sep 2022 3 8 3 0 8 Max of Column E and F and output of Column H previous row
1 Oct 2022 0 2 3 0 8 Max of Column E and F and output of Column H previous row
1 Nov 2022 0 2 0 0 8 Max of Column E and F and output of Column H previous row
1 Dec 2022 0 0 0 0 8 Max of Column E and F and output of Column H previous row
1 Jan 2023 0 0 6 -3 6 If Column G < 0 Max of Column E and F and (output of Column H previous row + Column G)
1 Feb 2023 0 2 5 0 6 Max of Column E and F and output of Column H previous row
1 Mar 2023 0 2 6 0 6 Max of Column E and F and output of Column H previous row
1 Apr 2023 0 0 7 0 7 Max of Column E and F and output of Column H previous row
1 May 2023 0 0 0 0 7 Max of Column E and F and output of Column H previous row
2 Apr 2017 3 2 3 0 3 Max of Column E and F
2 Apr 2017 3 2 3 0 3 Max of Column E and F and output of Column H previous row
2 Apr 2017 3 1 1 0 3 Max of Column E and F and output of Column H previous row
2 Apr 2017 3 2 4 0 4 Max of Column E and F and output of Column H previous row
2 Apr 2017 3 2 -1 0 4 Max of Column E and F and output of Column H previous row
2 Apr 2017 3 7 3 0 7 Max of Column E and F and output of Column H previous row

New Column H needs to be created with following logic

  1. Dataframe is populated with distinct values in column A spread across diff month and year and aligned in ascending order.
  2. First row of each distinct value in Column A must be maximum of Column E and F
  3. Subsequent row value for Column H is maximum of Column E and F and value derived on previous row for column H.(when Column G >=0)
  4. If Column G < 0, Column H is maximum of (Column E and F and (column G + previous row value for column H))
Raja
  • 11
  • 1
  • In PySpark, this task is difficult. This is a close example: https://stackoverflow.com/questions/74146685/sum-two-values-on-the-previous-row?noredirect=1&lq=1 – ZygD Nov 02 '22 at 08:17
  • Also, your input data does not match your output. Please be more careful, as it causes a lot of confusion! Use [Edit](https://stackoverflow.com/posts/74285763/edit) button to fix mistakes. – ZygD Nov 02 '22 at 08:17

1 Answers1

0

understand, there are may tons of similar question but not exakt what your looking for. I hope this helps:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO('''
Column A    Column B    Column C    Column D    Column E    Column F    Column G
1   Apr 2022    3   2   3   0
1   May 2022    3   2   3   0
1   Jun 2022    3   1   1   0
1   Jul 2022    3   2   4   0
1   Aug 2022    3   2   -1  0
1   Sep 2022    3   8   3   0
'''), sep="\t")

# Max of Column E and F
df['Column H'] = df.apply(lambda row: max(row.loc['Column E'], row.loc['Column F']), axis=1)

# Helper column (Column H previous row)
df['tmp'] = df['Column H'].shift(1)

# Max of Column E and F and output of Column H previous row
df['Logic'] = df.apply(lambda row: max(row.loc['Column E'], row.loc['Column F'], row.loc['tmp']), axis=1)

# Delete helper
df.drop("tmp", axis=1)
index Column A Column B Column C Column D Column E Column F Column G Column H Logic
0 1 Apr 2022 3 2 3 0 3 3.0
1 1 May 2022 3 2 3 0 3 3.0
2 1 Jun 2022 3 1 1 0 1 3.0
3 1 Jul 2022 3 2 4 0 4 4.0
4 1 Aug 2022 3 2 -1 0 2 4.0
5 1 Sep 2022 3 8 3 0 8 8.0

In case it does not exactly what you want, it should give you at least some ideas on how to achieve what you want.

tturbo
  • 680
  • 5
  • 16
  • Disclaimer: Some lines in my solution may have a simpler solution. Feel free to add a comment if so – tturbo Nov 02 '22 at 08:18
  • I tried this, however works only if this needs to be applied across the dataframe. In my problem, Solution needs to be updated for different values for Column A and is not unique. In that case, how can this be twisted. Column A will have ( value 1 with 10 rows, value 2 with 5 rows). Solution has to be applied for each distinct value of Column A. Any leads. – Raja Nov 02 '22 at 08:26
  • Can you please provide a simplified example data (only relevant columns, not to many) and the desired output with values? – tturbo Nov 02 '22 at 11:19