
I have a data frame and I am trying to clean the data before my analysis.

I am posting sample data for trial, as my actual data is a bit complex.

  A    B   C   D
 30   24  13  41
 30   25  14  45
 30   27  15  44
 30   28  16  43
 31   21  12   4
 31    2  17  99
  3   89  99  45
 78   24   0  43
 35  252  12  45
 36   23  13  44

I am trying to deal with the outliers by calculating the modified Z-score (the median-based one) and the IQR, and filtering the outliers out of the data so that I have quality data for further analysis.

I want to calculate the IQR and the modified Z-score for each column, and then filter the outliers out of each column in the data frame.

I have tried a few things so far, like:

IQR:


for col in df2.columns:
    col = np.array([col])
    q1_a = np.percentile(col, 25)
    q3_a = np.percentile(col, 75)
    iqr1 = q3_a - q1_a
    print(iqr1)

Modified Z score:


for col in df2.columns:
    threshold = 3.5
    col_zscore = col +'_zscore'
    median_y = df[col].median()
    print(median_y)
    median_absolute_deviation_y = (np.abs(df2[col] - median_y)).median()
    print(median_absolute_deviation_y)
    modified_z_scores = 0.7413 *((df2[col] - median_y)/median_absolute_deviation_y)
    print(modified_z_scores)
    df2[col_zscore] = np.abs(modified_z_scores)

df2 = df2[(np.abs(df2[col_zscore]) < 3.5).all(axis=1)]
print(df2)

But I am not getting the right answer: the function does not apply to each column, and it does not produce the data frame I intend at the end. Please help. Thanks.

zsh_18
  • Hey, I'm wondering what's with 0.7413? I thought modified z score is simply changing mean and std in z score, i.e. `(x-median)/median_absolute_deviation)` – Darren Christopher Feb 08 '21 at 04:37
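For context on that constant (an editorial note, not from the original thread): the usual modified z-score scales the MAD by 0.6745 ≈ Φ⁻¹(0.75), while 0.7413 ≈ 1/1.349 is the corresponding normalizer for the IQR of a normal distribution. A quick check with scipy:

```python
from scipy.stats import norm

# 0.6745 is the 75th percentile of the standard normal, Phi^{-1}(0.75);
# the IQR of a standard normal is 2 * 0.6745 ≈ 1.349, whose reciprocal is ≈ 0.7413
c = norm.ppf(0.75)
print(round(c, 4))            # 0.6745
print(round(1 / (2 * c), 4))  # 0.7413
```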

2 Answers


Use scipy:

Data:

  A    B   C   D
 30   24  13  41
 30   25  14  45
 30   27  15  44
 30   28  16  43
 31   21  12   4
 31    2  17  99
  3   89  99  45
 78   24   0  43
 35  252  12  45
 36   23  13  44

Code:

import pandas as pd
import scipy.stats

# df is the sample data frame shown above
df.apply(scipy.stats.iqr)

# output
A    4.0
B    4.5
C    3.5
D    2.0
dtype: float64

df.apply(scipy.stats.zscore)

# output
        A         B         C         D
-0.196943 -0.392191 -0.307452 -0.200009
-0.196943 -0.377930 -0.269495 -0.013954
-0.196943 -0.349407 -0.231538 -0.060468
-0.196943 -0.335145 -0.193581 -0.106981
-0.139019 -0.434976 -0.345409 -1.921013
-0.139019 -0.705944 -0.155624  2.497782
-1.760907  0.534806  2.956852 -0.013954
 2.583435 -0.392191 -0.800893 -0.106981
 0.092679  2.859432 -0.345409 -0.013954
 0.150604 -0.406453 -0.307452 -0.060468

Verify iqr:

With a visual inspection of the df.describe() output, it's easy to confirm the results of scipy.stats.iqr:

df.describe().loc[['25%', '75%']]

# Output
        A      B      C     D
25%  30.0  23.25  12.25  43.0
75%  34.0  27.75  15.75  45.0

df transform (modified z-score):

  • A z-score is the number of standard deviations a data point is from the mean. More technically, it's a measure of how many standard deviations below or above the population mean a raw score is. A z-score is also known as a standard score, and it can be placed on a normal distribution curve.
import numpy as np

def mod_z(col: pd.Series, thresh: float = 3.5) -> pd.Series:
    # median and median absolute deviation (MAD) of the column
    med_col = col.median()
    med_abs_dev = (np.abs(col - med_col)).median()
    # modified z-score; values at or beyond the threshold are dropped
    mod_z = 0.7413 * ((col - med_col) / med_abs_dev)
    mod_z = mod_z[np.abs(mod_z) < thresh]
    return np.abs(mod_z)

df_mod_z = df.apply(mod_z)

# Output
      A        B       C       D
 0.7413  0.12355  0.2471  2.2239
 0.7413  0.12355  0.2471  0.7413
 0.7413  0.61775  0.7413  0.0000
 0.7413  0.86485  1.2355  0.7413
 0.7413  0.86485  0.7413     NaN
 0.7413      NaN  1.7297     NaN
    NaN      NaN     NaN  0.7413
    NaN  0.12355     NaN  0.7413
    NaN      NaN  0.7413  0.7413
    NaN  0.37065  0.2471  0.0000

Filtered original dataframe:

df_filtered = df[df_mod_z >= 0]
print(df_filtered)

Output

    A     B     C     D
 30.0  24.0  13.0  41.0
 30.0  25.0  14.0  45.0
 30.0  27.0  15.0  44.0
 30.0  28.0  16.0  43.0
 31.0  21.0  12.0   NaN
 31.0   NaN  17.0   NaN
  NaN   NaN   NaN  45.0
  NaN  24.0   NaN  43.0
  NaN   NaN  12.0  45.0
  NaN  23.0  13.0  44.0
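If the goal is to drop every row that fails in at least one column (rather than masking individual cells to NaN), one option is to keep only the rows where every column's score survived the threshold. A minimal sketch, with small hypothetical frames standing in for df and df_mod_z above:

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for df and df_mod_z from the answer above
df = pd.DataFrame({"A": [30, 30, 3], "B": [24, 25, 89]})
df_mod_z = pd.DataFrame({"A": [0.74, 0.74, np.nan], "B": [0.12, 0.12, np.nan]})

# keep only rows where no column's score was filtered out (i.e. no NaN)
df_all_pass = df[df_mod_z.notna().all(axis=1)]
print(df_all_pass)  # rows 0 and 1 only
```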
Trenton McKinney

For the IQR problem:

  Weight    Age
0   40      20
1   62      21
2   35      19
3   29      18

If your data frame is like this, you can calculate the IQR using the code below.

for col in df2.columns:
    # use the column's values, not the column name string
    col_values = df2[col].to_numpy()
    q1_a = np.percentile(col_values, 25)
    q3_a = np.percentile(col_values, 75)
    iqr1 = q3_a - q1_a
    print(iqr1)

If all data columns contain numerical data, the IQR can be calculated with the code above; otherwise you need to encode the categorical columns as numbers first. If you are expecting a specific IQR value, look at the interpolation parameter of numpy.percentile.
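To illustrate with the small Weight/Age sample above (note the percentile keyword is called method in NumPy ≥ 1.22 and interpolation in older versions), and to show that pandas can also compute the per-column IQR without an explicit loop:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"Weight": [40, 62, 35, 29], "Age": [20, 21, 19, 18]})

# default linear interpolation between data points
q1 = np.percentile(df2["Weight"], 25)                        # 33.5
# 'lower' picks an actual data point instead of interpolating
q1_lower = np.percentile(df2["Weight"], 25, method="lower")  # 29

# per-column IQR without a loop, via pandas quantiles
iqr = df2.quantile(0.75) - df2.quantile(0.25)
print(iqr)  # Weight 12.0, Age 1.5
```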

For the Z_score problem:

df3 = df2
for col in df2.columns:
    threshold = 3.5
    col_zscore = col +'_zscore'
    median_y = df2[col].median()
    print(median_y)
    median_absolute_deviation_y = (np.abs(df2[col]-median_y)).median()
    print(median_absolute_deviation_y)
    modified_z_scores = 0.7413 *((df2[col] - median_y)/median_absolute_deviation_y)
    print(modified_z_scores)
    df2[col_zscore] = np.abs(modified_z_scores)

df2 = df3[(np.abs(df3[col_zscore]) < 3.5)]
print(df2)

Try the code above.

Ayesh Weerasinghe
  • Thanks, the IQR worked great with your answer. I am trying to calculate the modified Z-score, which takes the median into account in place of the mean, and then keep only the values of the dataframe that are within the z-score threshold. Thanks – zsh_18 Sep 27 '19 at 05:33