I have a data frame and I am trying to clean the data before my analysis.
I am posting sample data here because my real data is a bit complex.
A B C D
30 24 13 41
30 25 14 45
30 27 15 44
30 28 16 43
31 21 12 4
31 2 17 99
3 89 99 45
78 24 0 43
35 252 12 45
36 23 13 44
I am trying to deal with outliers. I want to calculate the modified Z-score (the median-based one) and the IQR so that I can filter out the outliers and keep good-quality data for further analysis.
I want to calculate the IQR and then the Z-score for each column, and filter out the outliers for each column in the data frame.
I have tried a few things so far:
IQR:
for col in df2.columns:
    col = np.array([col])
    q1_a = np.percentile(col, 25)
    q3_a = np.percentile(col, 75)
    iqr1 = q3_a - q1_a
    print(iqr1)
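One thing that stands out in the loop above is that `np.array([col])` wraps the column label, not the column's values. Below is a minimal sketch of a per-column IQR filter on the sample data. The 1.5 multiplier (Tukey's fences) and the names `inside`, `masked` and `filtered` are assumptions for illustration, not something fixed by the problem.

```python
import pandas as pd

# Sample data from the question; df2 is assumed to hold only numeric columns.
df2 = pd.DataFrame({
    "A": [30, 30, 30, 30, 31, 31, 3, 78, 35, 36],
    "B": [24, 25, 27, 28, 21, 2, 89, 24, 252, 23],
    "C": [13, 14, 15, 16, 12, 17, 99, 0, 12, 13],
    "D": [41, 45, 44, 43, 4, 99, 45, 43, 45, 44],
})

# Quartiles computed on the column values; each result is a Series
# indexed by column name, so there is one value per column.
q1 = df2.quantile(0.25)
q3 = df2.quantile(0.75)
iqr = q3 - q1

# Assumed rule: Tukey's fences with k = 1.5.
k = 1.5
lower = q1 - k * iqr
upper = q3 + k * iqr

# True where a value lies inside the fences of its own column
# (the Series are aligned with the DataFrame's columns).
inside = (df2 >= lower) & (df2 <= upper)

# Option 1: keep the shape and replace outliers with NaN, column by column.
masked = df2.where(inside)

# Option 2: drop every row that has an outlier in any column.
filtered = df2[inside.all(axis=1)]

print(iqr)
print(filtered)
```

Whether to mask single cells (`masked`) or drop whole rows (`filtered`) depends on the later analysis; dropping rows removes valid values in the other columns of the same row.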
Modified Z score:
for col in df2.columns:
    threshold = 3.5
    col_zscore = col +'_zscore'
    median_y = df[col].median()
    print(median_y)
    median_absolute_deviation_y = (np.abs(df2[col] - median_y)).median()
    print(median_absolute_deviation_y)
    modified_z_scores = 0.7413 *((df2[col] - median_y)/median_absolute_deviation_y)
    print(modified_z_scores)
    df2[col_zscore] = np.abs(modified_z_scores)
df2 = df2[(np.abs(df2[col_zscore]) < 3.5).all(axis=1)]
print(df2)
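For comparison, here is a minimal sketch of the modified Z-score computed for all columns at once, without adding `_zscore` columns to the frame being filtered. It assumes the usual Iglewicz-Hoaglin definition, which uses the constant 0.6745 rather than the 0.7413 in the attempt above, and it keeps the 3.5 threshold. The names `mad`, `modified_z`, `inside` and `filtered` are illustrative.

```python
import pandas as pd

# Sample data from the question; df2 is assumed to hold only numeric columns.
df2 = pd.DataFrame({
    "A": [30, 30, 30, 30, 31, 31, 3, 78, 35, 36],
    "B": [24, 25, 27, 28, 21, 2, 89, 24, 252, 23],
    "C": [13, 14, 15, 16, 12, 17, 99, 0, 12, 13],
    "D": [41, 45, 44, 43, 4, 99, 45, 43, 45, 44],
})

threshold = 3.5  # same cut-off as in the attempt above

# Per-column median and median absolute deviation (MAD).
median = df2.median()
mad = (df2 - median).abs().median()

# Modified Z-score with the assumed constant 0.6745.
# Note: a column whose MAD is 0 would produce inf/NaN here and
# would need separate handling; that does not occur in this sample.
modified_z = 0.6745 * (df2 - median) / mad

# True where a value is within the threshold for its own column.
inside = modified_z.abs() < threshold

# Keep only the rows with no outlier in any column.
filtered = df2[inside.all(axis=1)]

print(modified_z)
print(filtered)
```

Keeping the scores in a separate frame (`modified_z`) means the original columns stay untouched, and the row filter is applied once after all columns have been scored.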
But I am not getting the right answer. The code is not applied to each column, and it does not produce the data frame I intend at the end. Please help. Thanks.