Calculate the average of sections of a column with condition met to create new dataframe

Question

I have the following data:

import pandas as pd

A = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
B = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df = pd.DataFrame({'A':A, 'B':B})

I'd like to calculate the average of column A over each run of consecutive rows where column B equals 1. All rows where column B equals 0 should be ignored, and the averages should go into a new dataframe.

Thanks for your help!

score 1 · Answer 1 · answered Jul 01 '21 at 06:15

Try:

m = (df.B != df.B.shift(1)).cumsum() * df.B
df_out = df.groupby(m[m > 0])["A"].mean().reset_index(drop=True).to_frame()
df_out["B"] = 1
print(df_out)

Prints:

     A  B
0  2.0  1
1  3.0  1
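
To see why the mask m works, it can help to look at its intermediate values. The following is a minimal sketch on the question's data; the names starts and run_id are illustrative only and do not appear in the answer.

```python
import pandas as pd

A = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
B = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df = pd.DataFrame({"A": A, "B": B})

# True wherever B differs from the previous row.
# The first row is compared against NaN, so it is also True.
starts = df.B != df.B.shift(1)

# Running count of run starts: every consecutive run gets its own label.
run_id = starts.cumsum()

# Multiplying by B sets the label to 0 for every row where B == 0.
m = run_id * df.B

print(m.tolist())  # [0, 2, 2, 2, 0, 0, 4, 4, 4, 0, 0]
```

Grouping by m[m > 0] then keeps only the rows labelled 2 and 4, which are the two runs where B is 1.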

answered Jul 01 '21 at 06:15

Andrej Kesely

Utsav · Answer 2 · 2021-07-01T06:23:03.180

1

df1 = df.groupby((df['B'].shift() != df['B']).cumsum()).mean().reset_index(drop=True)
df1 = df1[df1['B'] == 1].astype(int).reset_index(drop=True)
df1

Output

    A   B
0   2   1
1   3   1

Explanation

We check whether each row's value of B differs from the previous row's value using Series.shift. The cumulative sum of that comparison gives every run of identical consecutive values its own label, so we group by it, calculate the mean of each group, and assign the result to a new dataframe df1.

This gives the mean of every run of consecutive 0s and of consecutive 1s, so we then keep only the groups where B == 1.

edited Jul 01 '21 at 06:23

answered Jul 01 '21 at 06:17

Utsav

This approach works, but if I change mean to std, all values in column B become 0, so I can no longer filter on it. Any other thoughts? – bgeaibreyi Jul 01 '21 at 19:57
I ended up creating a data frame that has all the statistics of column A and inserting column B from the groupby mean to make this work. – bgeaibreyi Jul 01 '21 at 22:35
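
One possible way around the std problem raised in the comments is to filter on B before grouping, so that B never has to be aggregated. This is only a sketch, not something proposed in the thread; the names run_id, ones and stats are illustrative.

```python
import pandas as pd

A = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
B = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df = pd.DataFrame({"A": A, "B": B})

# Label each run of identical consecutive B values.
run_id = (df["B"].shift() != df["B"]).cumsum()

# Keep only the rows where B == 1, then group them by their run label.
ones = df[df["B"] == 1]
stats = (
    ones.groupby(run_id.loc[ones.index])["A"]
    .agg(["mean", "std"])      # any list of statistics can go here
    .reset_index(drop=True)
)
print(stats)
```

For the sample data this gives means of 2.0 and 3.0 with sample standard deviations of 1.0 and 2.0. Because the filter uses the original B column rather than an aggregated one, it does not depend on which statistic is computed.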

mohamadmansourx · Accepted Answer · 2021-07-01T07:11:56.680

Keywords: groupby, shift, mean

Code:

df_result=df.groupby((df['B'].shift(1,fill_value=0)!= df['B']).cumsum()).mean()
df_result=df_result[df_result['B']!=0]

df_result
     A    B
1  2.0  1.0
3  3.0  1.0

As you might have noticed, you first need to identify the blocks of consecutive rows that share the same value. One way to do so is to shift B by one row and compare the result with the original B.

df['B_shifted']=df['B'].shift(1,fill_value=0) # fill_value=0 keeps the dtype int and puts 0 instead of NaN in the first row

df['A']                     =[2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
df['B']                     =[0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df['B_shifted']             =[0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
(df['B_shifted'] != df['B'])=[F, T, F, F, T, F, T, F, F, T, F]
                                [↑      ][↑   ][↑      ][↑   ]
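
The step from the True/False comparison to group labels can be sketched as follows. The cumulative sum increases by one at every True, so each block gets its own integer. The names b_shifted, changed and group_id are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5],
    "B": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0],
})

# Previous row's B, with 0 filled in for the first row.
b_shifted = df["B"].shift(1, fill_value=0)

# True at the first row of every new block.
changed = b_shifted != df["B"]

# True counts as 1, so the running sum labels the blocks 0, 1, 2, ...
group_id = changed.cumsum()

print(group_id.tolist())  # [0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4]
```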

Now we can use the groupby pandas method as follows:

df_grouped=df.groupby((df['B_shifted'] != df['B']).cumsum())

Now if we loop over the DataFrameGroupBy object df_grouped, we'll see the following tuples:

(0,    A  B  B_shifted
0  2  0          0)
(1,    A  B  B_shifted
1  3  1          0
2  1  1          1
3  2  1          1)
(2,    A  B  B_shifted
4  4  0          1
5  1  0          0)
(3,    A  B  B_shifted
6  5  1          0
7  3  1          1
8  1  1          1)
(4,     A  B  B_shifted
9   7  0          1
10  5  0          0)

Now we can simply calculate the mean of each group and filter out the groups where B is 0, as follows:

df_result=df_grouped.mean()
df_result=df_result[df_result['B']!=0][['A','B']]
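
If the same operation is needed on other data, the steps could be wrapped in a small function that does not add a temporary column to df. The helper mean_of_flagged_runs and its parameter names are hypothetical and only serve as a sketch of the idea.

```python
import pandas as pd


def mean_of_flagged_runs(df, value_col="A", flag_col="B"):
    """Hypothetical helper: mean of value_col over each consecutive
    run of rows where flag_col equals 1."""
    flags = df[flag_col]
    # Label the runs of identical consecutive flag values.
    run_id = (flags.shift(1, fill_value=0) != flags).cumsum()
    # Keep only the flagged rows and average them per run.
    keep = flags == 1
    means = df.loc[keep, value_col].groupby(run_id[keep]).mean()
    # Rebuild a frame with a fresh index and a constant flag column.
    return means.reset_index(drop=True).to_frame().assign(**{flag_col: 1})


df = pd.DataFrame({
    "A": [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5],
    "B": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0],
})
result = mean_of_flagged_runs(df)
print(result)
```

The function leaves df unchanged, and the means stay floats (2.0 and 3.0 for the sample data).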
