I'm working on filling missing values in a pandas dataframe that I have. The simplest case would be to use df.fillna() on the entire column with missing values. This works, but I would like to extend this by adding a condition.
Namely, I calculate the median based on the category and if a missing value has Label = 'A' then it is filled with the median of all values that have Label = 'A'
Likewise, if a value has Label = 'B', then it is filled with the median of all values that have Label = 'B'
Below is workable code.
df = pd.DataFrame([[20,'A'],[24,'B'],[23,'A'],[22,'B'],[np.nan,'A'],[24,'B'],[np.nan,'B']],columns=['Value','Label'])
A_median = np.nanmedian(df[df['Label']=='A']['Value'])
B_median = np.nanmedian(df[df['Label']=='B']['Value'])
Currently the dataframe is:
Value Label
0 20.0 A
1 24.0 B
2 23.0 A
3 22.0 B
4 NaN A
5 24.0 B
6 NaN B
The desired output is:
Value Label
0 20.0 A
1 24.0 B
2 23.0 A
3 22.0 B
4 21.5 A
5 24.0 B
6 24.0 B
I tried the following but it causes all other values to be NaN:
df['Value'] = df[df['Label'] == 'A']['Value'].fillna(A_median)
Thank you in advance for your help! Please let me know if you need further information :)
Edit: I know similar questions have been asked but I mainly am trying to understand how to set the value without causing the other category to be null.