I have a data frame and I'd like to add a new column based the value in two columns, where one uses the group max. For example, here's an example DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({"col1": ['a'] * 2 + ['b'] * 4,
"col2": [1, 2, 1, 2, 3, 4],
"col3": ['yep', 'no', 'no', 'no', 'yep', 'yep']})
df
col1 col2 col3
0 a 1 yep
1 a 2 no
2 b 1 no
3 b 2 no
4 b 3 yep
5 b 4 yep
Now, what I'd like is a new column that said "yes" where col2
is the group max, grouping on col1
and col3
is equal to "yep".
col1 col2 col3 col4
0 a 1 yep no
1 a 2 no no
2 b 1 no no
3 b 2 no no
4 b 3 yep no
5 b 4 yep yes
I've tried the code below, but I'm getting a slight error on the return value:
# define function
def max_bal(df):
max_row = df['col2'].max()
label = np.where((df['col3'] == 'yep') &
(df['col2'] == max_row),
'Yes',
'No')
return label
Now call function in a group by statement
df.groupby('col1').apply(max_bal)
col1
a [No, No]
b [No, No, No, Yes]
dtype: object
These are the correct values but not returned in a Series
as I would expect. I'd need one Series
/array
/list
because I'm trying to assign the these values to a new column in the df
above (see `col4'). Any pointers would be greatly appreciated.