Pandas: replace outliers in all columns with nan

Question

I have a data frame with 3 columns, for example:

c1,c2,c3 
10000,1,2 
1,3,4 
2,5,6 
3,1,122 
4,3,4 
5,5,6 
6,155,6

I want to replace the outliers in every column, i.e. the values that lie more than 2 sigma from the column mean. Using the code below, I can create a dataframe with the rows containing outliers removed.

df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 2).all(axis=1)]


c1,c2,c3 
1,3,4 
2,5,6 
4,3,4 
5,5,6

I could find the outliers for each column separately and replace them with "nan", but that does not scale, because the amount of code grows with the number of columns. There must be a better way of doing this. Maybe take the boolean output of the above command and then replace "TRUE" with "nan".

Any suggestions? Many thanks.

Can you explain exactly what your problem is? Btw you can get rid of the apply - `df[df.sub(df.mean()).abs().div(df.std()).lt(2).all(1)]` — cs95, Oct 05 '17 at 00:35
I want to replace all the outliers (outside the 2 sigma range) in a dataframe with "nan". Without the apply function, this has to be done for each column separately. — Sridhar, Oct 05 '17 at 00:38

score 10 · Answer 1 · answered Oct 05 '17 at 00:57

pandas
Use pd.DataFrame.mask

df.mask(df.sub(df.mean()).div(df.std()).abs().gt(2))

    c1   c2  c3 
0  NaN  1.0  2.0
1  1.0  3.0  4.0
2  2.0  5.0  6.0
3  3.0  1.0  NaN
4  4.0  3.0  4.0
5  5.0  5.0  6.0
6  6.0  NaN  6.0

numpy

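v = df.values
# Column-wise z-scores. Note: ndarray.std defaults to ddof=0, while pandas' std uses ddof=1.
mask = np.abs((v - v.mean(0)) / v.std(0)) > 2
# Put NaN where the mask is True and rebuild the frame with the original index and columns.
pd.DataFrame(np.where(mask, np.nan, v), df.index, df.columns)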

    c1   c2  c3 
0  NaN  1.0  2.0
1  1.0  3.0  4.0
2  2.0  5.0  6.0
3  3.0  1.0  NaN
4  4.0  3.0  4.0
5  5.0  5.0  6.0
6  6.0  NaN  6.0
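
The pandas and numpy versions above do not compute exactly the same standard deviation: `DataFrame.std` defaults to `ddof=1` and `ndarray.std` to `ddof=0`. On this sample both flag the same three cells, but near the threshold they could disagree. A minimal sketch of the difference, assuming the sample data from the question (the variable names are illustrative, not from the answer):

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "c1": [10000, 1, 2, 3, 4, 5, 6],
    "c2": [1, 3, 5, 1, 3, 5, 155],
    "c3": [2, 4, 6, 122, 4, 6, 6],
})

v = df.to_numpy(dtype=float)

pandas_std = df.std().to_numpy()   # sample standard deviation (ddof=1)
numpy_default = v.std(0)           # population standard deviation (ddof=0)
numpy_sample = v.std(0, ddof=1)    # matches pandas once ddof=1 is passed

# Same threshold test as the numpy answer, but with pandas-compatible ddof
mask = np.abs((v - v.mean(0)) / numpy_sample) > 2
result = pd.DataFrame(np.where(mask, np.nan, v), df.index, df.columns)
print(result)
```

Passing `ddof=1` is only needed if the numpy result has to match the pandas one exactly.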

Can you explain what is happening inside the mask? — VnC, Mar 20 '19 at 11:24
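
One way to read the `df.mask(...)` one-liner from the pandas answer is to split it into its intermediate steps. This is a sketch using the sample data from the question; the names `centered`, `z`, `is_outlier` and `result` are introduced here for illustration only:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "c1": [10000, 1, 2, 3, 4, 5, 6],
    "c2": [1, 3, 5, 1, 3, 5, 155],
    "c3": [2, 4, 6, 122, 4, 6, 6],
})

# df.mean() and df.std() are Series indexed by column name, so sub/div
# align them with the frame's columns and apply them column by column.
centered = df.sub(df.mean())    # distance of each value from its column mean
z = centered.div(df.std())      # expressed in units of the column's std (ddof=1)
is_outlier = z.abs().gt(2)      # boolean frame: True where |z| > 2

# mask() keeps values where the condition is False and puts NaN where it is True
result = df.mask(is_outlier)

print(is_outlier)
print(result)
```

`is_outlier` is the element-wise boolean frame the question asks about, so it can also be inspected or counted (for example with `is_outlier.sum()`) before anything is replaced.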

score 3 · Answer 2 · Vividha · 2020-07-22T08:07:18.190

lb = df.quantile(0.01)
ub = df.quantile(0.99)
df_new = df[(df < ub) & (df > lb)]
df_new

I am using a quantile-based method to detect outliers (strictly speaking a percentile cut at the 1st and 99th percentiles, not the interquartile range). First it calculates the lower bound and upper bound of each column of the df using the quantile function. Then, keeping only the values that lie strictly between the lower bound and the upper bound, it returns a new df with the outlier values replaced by NaN.
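
The snippet above cuts at the 1st and 99th percentiles; on the seven-row sample from the question that also blanks the smallest value(s) in each column, because they fall below the interpolated 1% bound. For comparison, here is a minimal sketch of the interquartile range rule itself, assuming the conventional 1.5 × IQR fences. The multiplier and the names `lower`, `upper` and `df_iqr` are assumptions, not part of the answer:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "c1": [10000, 1, 2, 3, 4, 5, 6],
    "c2": [1, 3, 5, 1, 3, 5, 155],
    "c3": [2, 4, 6, 122, 4, 6, 6],
})

q1 = df.quantile(0.25)          # first quartile of each column
q3 = df.quantile(0.75)          # third quartile of each column
iqr = q3 - q1                   # interquartile range per column

lower = q1 - 1.5 * iqr          # lower fence (1.5 is a common convention)
upper = q3 + 1.5 * iqr          # upper fence

# where() keeps values that satisfy the condition and puts NaN elsewhere
df_iqr = df.where(df.ge(lower) & df.le(upper))
print(df_iqr)
```

Because quartiles are less affected by extreme values than the mean and standard deviation, this rule tends to behave more predictably on small samples, though the 1.5 factor is a choice to tune for the data at hand.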

edited Jul 22 '20 at 08:07

answered Jul 21 '20 at 15:57
