3

I have a data frame and I'd like to add a new column based the value in two columns, where one uses the group max. For example, here's an example DataFrame

import pandas as pd
import numpy as np


df = pd.DataFrame({"col1": ['a'] * 2 + ['b'] * 4,
                   "col2": [1, 2, 1, 2, 3, 4],
                   "col3": ['yep', 'no', 'no', 'no', 'yep', 'yep']})


df
  col1  col2 col3
0    a     1  yep
1    a     2   no
2    b     1   no
3    b     2   no
4    b     3  yep
5    b     4  yep

Now, what I'd like is a new column that said "yes" where col2 is the group max, grouping on col1 and col3 is equal to "yep".

  col1  col2 col3 col4
0    a     1  yep  no
1    a     2   no  no
2    b     1   no  no
3    b     2   no  no
4    b     3  yep  no
5    b     4  yep  yes

I've tried the code below, but I'm getting a slight error on the return value:

# define function
def max_bal(df):
    max_row = df['col2'].max()
    label = np.where((df['col3'] == 'yep') & 
                     (df['col2'] == max_row),
                     'Yes', 
                     'No')
    return label

Now call function in a group by statement

df.groupby('col1').apply(max_bal)

col1
a             [No, No]
b    [No, No, No, Yes]
dtype: object

These are the correct values but not returned in a Series as I would expect. I'd need one Series/array/list because I'm trying to assign the these values to a new column in the df above (see `col4'). Any pointers would be greatly appreciated.

Colin Basnett
  • 4,052
  • 2
  • 30
  • 49
Ryan Erwin
  • 807
  • 1
  • 11
  • 30

1 Answers1

4

You can return labels as a DataFrame with the same index as the group the function is being applied on.

def max_bal(df):
    max_row = df['col2'].max()
    labels = np.where((df['col3'] == 'yep') & 
                     (df['col2'] == max_row),
                     'Yes', 
                     'No')
    return pd.DataFrame(labels, index=df.index)

df['col4'] = df.groupby('col1').apply(max_bal)
root
  • 32,715
  • 6
  • 74
  • 87
  • This works, but why? I'm learning Python/pandas, and I'm trying to bridge the gap between my initial attempt and the DataFrame solution with the original DataFrame index specified. – Ryan Erwin May 10 '16 at 22:51