
Following on from this question, I have this dataframe:

    ChildID   MotherID   preWeight
0     20      455        3500
1     20      455        4040
2     13      102        NaN
3     702     946        5000
4     82      571        2000
5     82      571        3500
6     82      571        3800
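For reference, a minimal sketch reconstructing this frame (values taken from the table above):

```python
import numpy as np
import pandas as pd

# Minimal reconstruction of the frame shown above
df = pd.DataFrame({
    "ChildID":  [20, 20, 13, 702, 82, 82, 82],
    "MotherID": [455, 455, 102, 946, 571, 571, 571],
    "preWeight": [3500, 4040, np.nan, 5000, 2000, 3500, 3800],
})
print(df)
```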

where I transformed feature 'preWeight' that has multiple observations per MotherID to feature 'preMacro' with a single observation per MotherID, based on the following rules:

  • if preWeight > 4000 for any observation of a particular MotherID, I assigned preMacro the value "Yes", regardless of the remaining observations
  • otherwise, I assigned preMacro the value "No"

Using this line of code:

    df.groupby(['ChildID', 'MotherID']).agg(
        lambda x: 'Yes' if (x > 4000).any() else 'No'
    ).reset_index().rename(columns={"preWeight": "preMacro"})
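The reason the NaN group silently becomes "No" is that any comparison with NaN evaluates to False, so (x > 4000).any() is False for an all-NaN group and the else branch fires. A quick check:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan])   # the only preWeight for MotherID 102
print((s > 4000).any())   # False: NaN > 4000 is False, so the lambda falls through to 'No'
```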

However, I realised that this way I am not preserving the NaN values in the dataset, which ideally should be imputed rather than just assigning them "No" values. So I tried changing the above line to:

    df=df.groupby(['MotherID', 'ChildID'])['preWeight'].agg(
        lambda x: 'Yes' if (x>4000).any() else (np.NaN if 'no_value' in x.values.all() else 'No')).reset_index().rename(
        columns={"preWeight": "preMacro"})

I wanted this line to transform the above dataframe to this:

    ChildID   MotherID   preMacro
0   20        455        Yes
1   13        102        NaN
2   702       946        Yes
3   82        571        No

However I got this error when running it:

    TypeError: argument of type 'float' is not iterable

I understand that, for non-missing values, x.values.all() returns a float, which is not iterable, but I am not sure how else to code this. Any ideas?

Thanks.

sums22

1 Answer

For performance, don't test in a custom function per group. It is faster to aggregate helper boolean columns with GroupBy.any and GroupBy.all via GroupBy.agg, and then set the preMacro column with numpy.select:

import numpy as np

# Helper boolean columns: any weight above 4000, and no missing weights
df = (df.assign(testconst = df['preWeight'] > 4000,
                testna = df['preWeight'].notna())
        .groupby(['ChildID','MotherID'], sort=False)
        .agg({'testconst':'any', 'testna':'all'}))

# np.select checks the masks in order, so 'Yes' takes priority over 'No';
# groups matching neither mask (only missing values) fall through to None
masks = [df['testconst'], df['testna']]
df['preMacro'] = np.select(masks, ['Yes','No'], default=None)
df = df.drop(['testconst','testna'], axis=1).reset_index()
print(df)
   ChildID  MotherID preMacro
0       20       455      Yes
1       13       102     None    <- None is used instead of np.nan to avoid it being converted to the string 'nan'
2      702       946      Yes
3       82       571       No
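As a side note, np.select resolves the masks in order: a group matching the first mask gets 'Yes' even if it would also match the second, and anything matching neither falls through to the default. A standalone sketch:

```python
import numpy as np

has_macro = np.array([True,  False, False])   # any preWeight > 4000
all_known = np.array([True,  True,  False])   # no missing preWeight

# First matching mask wins; neither mask -> default None
out = np.select([has_macro, all_known], ['Yes', 'No'], default=None)
print(out)  # ['Yes' 'No' None]
```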

If the DataFrame is small or performance is not important:

f = lambda x: 'Yes' if (x > 4000).any() else ('No' if x.notna().all() else np.nan)
df1 = (df.groupby(['ChildID','MotherID'], sort=False)['preWeight']
         .agg(f)
         .reset_index(name='preMacro'))
print(df1)
   ChildID  MotherID preMacro
0       20       455      Yes
1       13       102      NaN
2      702       946      Yes
3       82       571       No
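The lambda's three branches can be sanity-checked on small Series directly (the values here mirror the sample groups):

```python
import numpy as np
import pandas as pd

f = lambda x: 'Yes' if (x > 4000).any() else ('No' if x.notna().all() else np.nan)

print(f(pd.Series([3500.0, 4040.0])))   # Yes  (one value above 4000)
print(f(pd.Series([2000.0, 3500.0])))   # No   (all present, none above 4000)
print(f(pd.Series([np.nan])))           # nan  (only missing values)
```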
jezrael
    Thanks for the detailed answer. I tried the second option and it works for me, but I understand what you are saying about its performance issues. I am quite new to Python and have not used boolean masks on dataframes before but will look into this method. – sums22 Aug 06 '20 at 13:08