0

I'm working on filling missing values in a pandas dataframe that I have. The simplest case would be to use df.fillna() on the entire column with missing values. This works, but I would like to extend this by adding a condition.

Namely, I calculate the median based on the category and if a missing value has Label = 'A' then it is filled with the median of all values that have Label = 'A'

Likewise, if a value has Label = 'B', then it is filled with the median of all values that have Label = 'B'

Below is workable code.

df = pd.DataFrame([[20,'A'],[24,'B'],[23,'A'],[22,'B'],[np.nan,'A'],[24,'B'],[np.nan,'B']],columns=['Value','Label'])

A_median = np.nanmedian(df[df['Label']=='A']['Value'])
B_median = np.nanmedian(df[df['Label']=='B']['Value'])

Currently the dataframe is:

    Value   Label
0   20.0    A
1   24.0    B
2   23.0    A
3   22.0    B
4   NaN     A
5   24.0    B
6   NaN     B

The desired output is:

    Value   Label
0   20.0    A
1   24.0    B
2   23.0    A
3   22.0    B
4   21.5    A
5   24.0    B
6   24.0    B

I tried the following but it causes all other values to be NaN:

df['Value'] = df[df['Label'] == 'A']['Value'].fillna(A_median)

Thank you in advance for your help! Please let me know if you need further information :)

Edit: I know similar questions have been asked but I mainly am trying to understand how to set the value without causing the other category to be null.

Nicklovn
  • 279
  • 4
  • 14
  • 1
    use `df['Value'].fillna(df.groupby('Label')['Value'].transform('mean'),inplace=True)` – ansev Sep 08 '19 at 21:41
  • 2
    Use `df['Value'] = df['Value'].fillna(df.groupby('Label')['Value'].transform('median'))` – Erfan Sep 08 '19 at 21:42
  • Thank you guys. I searched for a while and never found that question but it solves the issue perfectly. – Nicklovn Sep 08 '19 at 21:43

0 Answers0