How to detect and remove outliers in a dataframe

Question

I have a dataset like this:

{'SYMBOL': {0: 'BAF180', 1: 'ACTL6A', 2: 'DMAP1', 3: 'C1orf149', 4: 'YEATS4'}, 'Gene Name(s)': {0: ';PB1;BAF180;MGC156155;MGC156156;PBRM1;', 1: ';ACTL6A;ACTL6;BAF53A;MGC5382;', 2: ';DMAP1;DKFZp686L09142;DNMAP1;DNMTAP1;FLJ11543;KIAA1425;EAF2;SWC4;', 3: ';FLJ11730;CDABP0189;C1orf149;NY-SAR-91;RP3-423B22.2;Eaf6;', 4: ';YEATS4;4930573H17Rik;B230215M10Rik;GAS41;NUBI-1;YAF9;'}, 'Description': {0: 'polybromo 1', 1: 'BAF complex 53 kDa subunit|BAF53|BRG1-associated factor|actin-related protein|hArpN beta; actin-like 6A', 2: 'DNA methyltransferase 1 associated protein 1; DNMT1 associated protein 1', 3: 'hypothetical protein LOC64769|sarcoma antigen NY-SAR-91; chromosome 1 open reading frame 149', 4: 'NuMA binding protein 1|glioma-amplified sequence-41; YEATS domain containing 4'}, 'G.O. PROCESS': {0: 'Transcription', 1: 'Transcription', 2: 'Transcription', 3: 'Transcription', 4: 'Transcription'}, 'TurboSEQUESTScore': {0: 70.29, 1: 80.29, 2: 34.18, 3: 30.32, 4: 40.18}, 'Coverage %': {0: 6.7, 1: 28.0, 2: 10.7, 3: 24.2, 4: 21.1}, 'KD': {0: 183572.3, 1: 47430.4, 2: 52959.9, 3: 21501.9, 4: 26482.7}, 'Genebank Accession no': {0: 30794372, 1: 4757718, 2: 13123776, 3: 29164895, 4: 5729838}, 'MS/MS Peptide no.': {0: '9 (9 0 0 0 0)', 1: '9 (9 0 0 0 0)', 2: '4 (3 0 0 1 0)', 3: '3 (3 0 0 0 0)', 4: '4 (4 0 0 0 0)'}}

I want to detect and remove outliers in the TurboSEQUESTScore column, using 3 standard deviations as the threshold. How can I go about it? This is what I have tried.

The dataframe is named rename_df.

z_scores = stats.zscore(rename_df['TurboSEQUESTScore'])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=None)

I can't seem to get this to work properly.

Please paste a sample of your dataframe as text, not an image. — , Nov 25 '21 at 01:56
Use `print(df.head().to_dict())`. That will display copy/pasteable JSON. — , Nov 25 '21 at 02:05
Also, what are you having trouble with: getting the stddev or pruning the outliers, or both? — , Nov 25 '21 at 02:06
Does this help? https://stackoverflow.com/a/62802518/4975981 — Ehsan, Nov 25 '21 at 02:56

iamakhilverma · Answer 1 · 2021-11-25T02:11:52.237

You were on the right track; you just need to pass the boolean mask abs_z_scores < 3 to your dataframe, i.e., rename_df[abs_z_scores < 3], and store the result in a variable of your choice. Calling .all(axis=None) on the mask reduces it to a single True/False value, which is why filtered_entries could not be used to select rows.

This does the job in one line and is more readable:

import numpy as np
from scipy import stats
filtered_rename_df = rename_df[(np.abs(stats.zscore(rename_df["TurboSEQUESTScore"])) < 3)]

You'll get a new dataframe named filtered_rename_df containing only the rows whose absolute z-score is below 3.
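
If you also want to see which rows were dropped, the same idea can be sketched with pandas alone. This is a minimal sketch, not the only way to do it: the helper name split_outliers and the synthetic demo_df are made up for illustration, and ddof=0 is chosen to match the default of scipy.stats.zscore.

```python
import pandas as pd


def split_outliers(df, column, threshold=3.0):
    """Hypothetical helper: return (kept, outliers) based on |z-score| of `column`."""
    values = df[column]
    # ddof=0 (population std) mirrors the default of scipy.stats.zscore.
    z = (values - values.mean()) / values.std(ddof=0)
    # Boolean mask, one entry per row; NaN scores compare False and end up in `outliers`.
    mask = z.abs() < threshold
    return df[mask], df[~mask]


# Synthetic data (not from the question): twenty ordinary scores and one extreme value.
demo_df = pd.DataFrame({"TurboSEQUESTScore": [50.0] * 20 + [500.0]})
kept, outliers = split_outliers(demo_df, "TurboSEQUESTScore")
print(len(kept), len(outliers))
```

Note that with only a handful of rows, such as the five-row sample in the question, no value can reach |z| = 3: with ddof=0 the largest possible absolute z-score is sqrt(n - 1), which is 2 for n = 5. On such a small sample the filter will therefore return every row.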
