Fill missing column values based on the available values

Question

How can I fill missing values for apple variety from the same column, given that there are 1-4 varieties per farm but no two varieties on the same farm can have the same ripening index? Assume the column already contains every possible farm/ripening/variety combination.

For instance, in the sample below, 'Empire' and 'Honeycrisp' have the same ripening index, but they come from different farms.

A sample df (a part of a larger dataframe):

import numpy as np
import pandas as pd

df = pd.DataFrame(
        {'farm':     [419, 382, 382, 382, 411, 411, 411],
         'variety':  ['Gala', 'Gala', 'Empire', '', 'Honeycrisp', '', 'Fuji'],
         'ripening': [2, 2, 3, 3, 3, 3, 6],
         'D': np.random.randn(7)*10,
         'E': list('abcdefg')
         }
     )

df
Out[223]: 
   farm      variety  ripening          D  E
0    419        Gala         2  12.921246  a
1    382        Gala         2  -2.776150  b
2    382      Empire         3   3.551226  c
3    382                     3   2.715187  d
4    411  Honeycrisp         3 -13.557640  e
5    411                     3 -11.525100  f
6    411        Fuji         6  -3.660661  g

My desired output:

   farm      variety  ripening          D  E
0    419        Gala         2  12.921246  a
1    382        Gala         2  -2.776150  b
2    382      Empire         3   3.551226  c
3    382      Empire         3   2.715187  d
4    411  Honeycrisp         3 -13.557640  e
5    411  Honeycrisp         3 -11.525100  f
6    411        Fuji         6  -3.660661  g

Does [this](https://stackoverflow.com/questions/66406662/randomly-filling-nan-values-of-a-column-with-non-null-string-values) answer your question? — The Singularity, Oct 06 '21 at 05:33
@Luke , it does but i like `df.update` below a lot more. it also feels more "mainstream" comparing to `np.random.choice`, but obviously a matter of preference. Thanks for sharing. — gregV, Oct 06 '21 at 16:06

jezrael · Accepted Answer · 2021-10-06T06:03:17.063

Use:

# convert empty strings to NaN so they count as missing
df['variety'] = df['variety'].replace('', np.nan)

# True where the (farm, ripening) group has exactly one unique variety
m = df.groupby(['farm','ripening'])['variety'].transform('nunique').eq(1)

# forward fill within each group, only for the filtered rows, then write back
df.update(df[m].groupby(['farm','ripening'])['variety'].ffill())
print(df)
   farm     variety  ripening          D  E
0   419        Gala         2 -12.571434  a
1   382        Gala         2   1.839992  b
2   382      Empire         3  18.946881  c
3   382      Empire         3   6.552552  d
4   411  Honeycrisp         3  11.755782  e
5   411  Honeycrisp         3  11.272973  f
6   411        Fuji         6   7.416918  g
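
One thing to keep in mind: `ffill` only propagates values downward, so a blank row that appears before its filled counterpart in the same group would stay empty. If that can happen in the larger dataframe, a possible variation is to take the first non-null variety per group with `transform('first')` and write it into the missing cells. The sketch below is only an illustration under that assumption, not part of the original answer. The names `unambiguous` and `known` are made up here, and rows 2 and 3 of the sample are deliberately swapped so the blank comes first.

```python
import numpy as np
import pandas as pd

# Sample from the question, with the blank row for farm 382 moved
# ahead of 'Empire' (hypothetical reordering, for illustration only).
df = pd.DataFrame({
    'farm':     [419, 382, 382, 382, 411, 411, 411],
    'variety':  ['Gala', 'Gala', '', 'Empire', 'Honeycrisp', '', 'Fuji'],
    'ripening': [2, 2, 3, 3, 3, 3, 6],
})

# Treat empty strings as missing values.
df['variety'] = df['variety'].replace('', np.nan)

grouped = df.groupby(['farm', 'ripening'])['variety']

# True where the (farm, ripening) group has exactly one known variety.
unambiguous = grouped.transform('nunique').eq(1)

# First non-null variety of each group, broadcast to every row of the group.
known = grouped.transform('first')

# Overwrite only cells that are missing and belong to an unambiguous group.
df['variety'] = df['variety'].mask(df['variety'].isna() & unambiguous, known)
print(df)
```

Groups with more than one known variety are left untouched, the same safeguard as the `nunique` mask above.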
