How do I replace 0 values of features in a dataset with the median value corresponding to the label?

Question

For my exploratory data analysis project, the dataset looks as follows:

The features of my dataset are

Pregnancies
Glucose
BloodPressure
SkinThickness
Insulin
BMI
DiabetesPedigreeFunction
Age

I want to perform data cleaning on the numeric features.

The column Outcome is our label,
where the value 0 means the patient is NOT diagnosed with diabetes and the value 1 means the patient is diagnosed with diabetes.

We observe that features like Glucose, BloodPressure, SkinThickness, Insulin and BMI can never be 0.
Hence I want to fill the 0 values of these features with the median value corresponding to their outcome.

Patient records with Outcome = 0 will have their 0 values (the values that need to be treated) replaced by the median of the respective column, computed over records where Outcome = 0.

Patient records with Outcome = 1 will have their 0 values (the values that need to be treated) replaced by the median of the respective column, computed over records where Outcome = 1.

Essentially, I want a way to group the columns according to the Outcome label and fill the 0 values within each group.

import numpy as np

def replace_zeros_with_mean(df, col_name):
    # Treat zeros as missing, then fill them with the column mean.
    col = df[col_name].replace(0, np.nan)
    df[col_name] = col.fillna(col.mean())

replace_zeros_with_mean(new_df, "Glucose")

Initially I used the mean, but then thought the median would be a better fit.

glucose_group = df["Glucose"].groupby(df["Outcome"])
glucose_group

and it resulted in something like this:

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fed9b59df70>

I don't know how to proceed. How do I fill the np.nan values of records with Outcome = 0
with the median of records with Outcome = 0, and do the same for records with Outcome = 1?

score 0 · Answer 1 · answered Apr 10 '23 at 08:54

You will need to change your approach in this case.
Go with this flow:

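Replace the 0 values with np.nan in the columns where 0 is not a valid measurement (0 is a legitimate value for Pregnancies), and check the result using df.info().

zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[zero_cols] = df[zero_cols].replace(0, np.nan)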

Filter the df based on the outcome.

df_outcome_0 = df[df['Outcome'] == 0].copy()
df_outcome_1 = df[df['Outcome'] == 1].copy()

Now replace the NaN values with the median (or mean) of each column in both DataFrames.

df_outcome_0[df.columns[:-1]] = df_outcome_0[df.columns[:-1]].fillna(df_outcome_0[df.columns[:-1]].median())
df_outcome_1[df.columns[:-1]] = df_outcome_1[df.columns[:-1]].fillna(df_outcome_1[df.columns[:-1]].median())

Finally, concatenate both DataFrames and sort by index to restore the original row order.

df_processed = pd.concat([df_outcome_0, df_outcome_1]).sort_index()
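
As an alternative to splitting and concatenating, the same per-label fill can probably be done in one pass with groupby(...).transform("median"), which returns each row's group median aligned to the original index. The following is only a minimal sketch: the toy data and the name zero_cols are made up for illustration, and it assumes 0 means "missing" only in the listed columns.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only (not the real dataset).
df = pd.DataFrame({
    "Glucose": [0, 100, 120, 0, 150, 170],
    "BMI": [22.0, 0.0, 26.0, 30.0, 0.0, 34.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Hypothetical list of columns where 0 is assumed to mean "missing".
zero_cols = ["Glucose", "BMI"]

# Mark zeros as missing so they are excluded from the medians.
df[zero_cols] = df[zero_cols].replace(0, np.nan)

# For every row, the median of that row's Outcome group (same index as df).
group_medians = df.groupby("Outcome")[zero_cols].transform("median")

# Fill each NaN with the median of its own Outcome group.
df[zero_cols] = df[zero_cols].fillna(group_medians)

print(df)
```

On this toy data, the missing Glucose value in the Outcome = 0 group becomes 110.0, the median of 100 and 120. Because transform keeps the original index, no concat or sort_index step is needed.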
