For my exploratory data analysis project, the dataset looks as follows:

An Image of Dataset for Reference

Link to GitHub Repository for Dataset

The features of my dataset are:

  • Pregnancies

  • Glucose

  • BloodPressure

  • SkinThickness

  • Insulin

  • BMI

  • DiabetesPedigreeFunction

  • Age

I want to perform data cleaning on the numeric features.

The column Outcome is our label, where value 0 states that the patient is NOT diagnosed with diabetes and value 1 states that the patient is diagnosed with diabetes.

We observe that features like Glucose, BloodPressure, SkinThickness, Insulin and BMI can never be 0.
Hence I want to fill the 0 values of these features with the median value corresponding to their outcome.

Patient records with Outcome = 0 will have their 0 values (the values that need to be treated) replaced by the median of the respective column, computed over records where Outcome = 0.

Patient records with Outcome = 1 will have their 0 values (the values that need to be treated) replaced by the median of the respective column, computed over records where Outcome = 1.

Essentially, I want a way to group the columns according to the Outcome label and fill the 0 values accordingly.

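import numpy as np

def replace_zeros_with_mean(df, col_name):
    # Treat 0 as missing, then fill with the column mean.
    # Assign back instead of using inplace=True on a selected column,
    # which newer pandas versions may not propagate to df.
    df[col_name] = df[col_name].replace(0, np.nan)
    df[col_name] = df[col_name].fillna(df[col_name].mean())

replace_zeros_with_mean(new_df, "Glucose")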

Initially I filled with the mean, but I think the median would be a better fit.

glucose_group = df["Glucose"].groupby(df["Outcome"])
glucose_group

and it resulted in something like this:

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fed9b59df70>

I don't know how to proceed from here: how do I fill the np.nan values of records with Outcome = 0
with the median of records with Outcome = 0, and do the same for records with Outcome = 1?

1 Answer

You will need to change your approach in this case.
Go with this flow:

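  1. Replace the 0 values in the feature columns with np.nan and check the result using df.info(). Note that df.columns[:-1] assumes Outcome is the last column and selects every other column, including Pregnancies, where 0 is a valid value; pass an explicit list of columns instead if only some features should be treated.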

    df[df.columns[:-1]] = df[df.columns[:-1]].replace(0, np.nan)
    
  2. Filter the df based on the outcome.

    df_outcome_0 = df[df['Outcome'] == 0].copy()
    df_outcome_1 = df[df['Outcome'] == 1].copy()
    
  3. Now replace the NaN values in each of the two DataFrames with that DataFrame's own column medians (or means).

    df_outcome_0[df.columns[:-1]] = df_outcome_0[df.columns[:-1]].fillna(df_outcome_0[df.columns[:-1]].median())
    df_outcome_1[df.columns[:-1]] = df_outcome_1[df.columns[:-1]].fillna(df_outcome_1[df.columns[:-1]].median())
    
  4. Finally, concatenate both DataFrames and restore the original row order.

    df_processed = pd.concat([df_outcome_0, df_outcome_1]).sort_index()
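
Since the question asks specifically about groupby, the steps above can probably also be done in one pass without splitting the frame. The following is only a sketch under a few assumptions: the helper name fill_zeros_with_group_median and the constant ZERO_AS_MISSING are made up for illustration, the column list is the one the question says can never be 0, and the small demo frame is invented data, not the real dataset.

```python
import numpy as np
import pandas as pd

# Columns where 0 cannot be a real measurement (list taken from the question).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def fill_zeros_with_group_median(df, cols=ZERO_AS_MISSING, label="Outcome"):
    """Return a copy of df where zeros in `cols` are replaced by the median
    of the non-zero values that share the same `label` value."""
    cols = list(cols)
    out = df.copy()
    # Mark zeros as missing so they do not influence the medians.
    out[cols] = out[cols].replace(0, np.nan)
    # transform("median") returns a frame aligned with `out` that holds, for
    # every row, the median of that row's group; NaN values are skipped.
    group_medians = out.groupby(label)[cols].transform("median")
    # fillna only touches the cells that are missing.
    out[cols] = out[cols].fillna(group_medians)
    return out


# Tiny invented frame, for illustration only.
demo = pd.DataFrame({
    "Glucose": [100, 0, 120, 150, 0, 170],
    "BloodPressure": [70, 72, 0, 80, 90, 0],
    "SkinThickness": [20, 0, 30, 35, 0, 45],
    "Insulin": [0, 80, 100, 0, 200, 220],
    "BMI": [25.0, 0.0, 27.0, 33.0, 35.0, 0.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})
cleaned = fill_zeros_with_group_median(demo)
print(cleaned)
```

Compared with the split-and-concat flow, this keeps the original row order automatically and leaves columns such as Pregnancies untouched, because only the listed columns are treated. If a group has no non-zero values at all in some column, its median is NaN and those cells stay missing, so it is worth checking cleaned.isna().sum() afterwards.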