For my exploratory data analysis project, the dataset looks as follows:
An Image of Dataset for Reference
Link to GitHub Repository for Dataset
The features of my dataset are:
Pregnancies
Glucose
BloodPressure
SkinThickness
Insulin
BMI
DiabetesPedigreeFunction
Age
I want to perform data cleaning on the numeric features.
The column Outcome is our label, where the value 0 means the patient is NOT diagnosed with diabetes and the value 1 means the patient is diagnosed with diabetes.
We observe that features like Glucose, BloodPressure, SkinThickness, Insulin and BMI can never be 0.
Hence I want to fill the 0 values of these features with the median value of the column for the corresponding Outcome.
Patient records with Outcome = 0 will have their 0 values (the values that need to be treated) replaced by the median of the respective column, computed over records where Outcome = 0.
Patient records with Outcome = 1 will have their 0 values (the values that need to be treated) replaced by the median of the respective column, computed over records where Outcome = 1.
Essentially I want a way to group the columns according to the Outcome label and fill the 0 values with the median of their own group.
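To make the intended rule concrete, here is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not rows from the real dataset). It spells the rule out with explicit boolean masks, one Outcome group at a time, and it assumes the median should be taken over the non-zero readings only.

```python
import pandas as pd

# Made-up toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Glucose": [100.0, 0.0, 120.0, 150.0, 0.0, 170.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

for outcome in (0, 1):
    in_group = toy["Outcome"] == outcome
    is_zero = toy["Glucose"] == 0
    # Median of the valid (non-zero) readings within this Outcome group.
    group_median = toy.loc[in_group & ~is_zero, "Glucose"].median()
    # Overwrite only the zero readings that belong to this group.
    toy.loc[in_group & is_zero, "Glucose"] = group_median

print(toy)
```

In this toy frame the zero in the Outcome = 0 group becomes 110 (the median of 100 and 120) and the zero in the Outcome = 1 group becomes 160 (the median of 150 and 170).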
import numpy as np

def replace_zeros_with_mean(df, col_name):
    # Treat 0 as a missing value, then fill it with the column mean.
    df[col_name] = df[col_name].replace(0, np.nan)
    mean_value = df[col_name].mean()
    df[col_name] = df[col_name].fillna(mean_value)

replace_zeros_with_mean(new_df, "Glucose")
Initially I filled with the mean, but then thought the median would be a better fit.
glucose_group = df["Glucose"].groupby(df["Outcome"])
glucose_group
and it resulted in something like this:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fed9b59df70>
I don't know how to proceed from here. How do I fill the np.nan values of records with Outcome = 0 with the median of the records with Outcome = 0, and do the same for records with Outcome = 1?
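One possible way forward, offered as a sketch rather than a definitive answer: the SeriesGroupBy object is only a lazy grouping, so nothing is computed until an aggregation or transform is applied to it. Calling `transform("median")` on the grouped column should return one value per row (the median of that row's Outcome group), which can then be passed to `fillna`. The helper name `fill_zeros_with_group_median` and the toy data below are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd

def fill_zeros_with_group_median(df, col_name, group_col="Outcome"):
    """Return a copy of df where 0s in col_name become the group median."""
    out = df.copy()
    # Treat 0 as missing so it does not pull the median down.
    values = out[col_name].replace(0, np.nan)
    # One value per row: the median of the row's own group (NaN is skipped).
    group_medians = values.groupby(out[group_col]).transform("median")
    out[col_name] = values.fillna(group_medians)
    return out

# Made-up toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Glucose": [100, 0, 120, 150, 0, 170],
    "Outcome": [0, 0, 0, 1, 1, 1],
})
cleaned = fill_zeros_with_group_median(toy, "Glucose")
print(cleaned)
```

The function works on a copy, so the original frame is left untouched, and it can be called once per column (Glucose, BloodPressure, SkinThickness, Insulin, BMI) in a loop.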