Replace given columns' outliers with mean of before and after rows' values in Pandas

Question

How can I replace outliers in score column from the following dataframe with the before and after values?

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10   8.590178   ---> outlier
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04   3.051802   ---> outlier
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944

The expected output is:

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944

Many thanks.

Please include what you have tried so far in your question. See [How do I ask a good question](https://stackoverflow.com/help/how-to-ask). Also how do you define an outlier? — LeoE, Jan 10 '20 at 09:25
detect outliers using stats,zscore and remove the rows. perform fillna — Mohamed Thasin ah, Jan 10 '20 at 09:27
Maybe this will help :- https://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining — Echo, Jan 10 '20 at 09:28
So many wonderful answers, I hesitate which one I should choose to accept. ;) — ah bon, Jan 10 '20 at 10:00

ansev · Answer 1 · 2020-01-10T09:47:12.133

It depends on what you consider as atypical, you could adopt many criteria. You could use Series.mask to replace atypical values with NaN for example with a boolean series created with Series.pct_change and then fill in with Series.interpolate

n = 2
df['score'] = (df['score'].mask(df['score'].pct_change()
                                           .shift(-1)
                                           .abs()
                                           .gt(n))
                           .interpolate() )
print(df)

We could also use Series.mean with Series.std and check with Series.between:

import numpy as np
score_mean = df['score'].mean()
score_std  = df['score'].std()
n = np.sqrt(3)
valid = df['score'].between(score_mean-n*score_std,score_mean+n*score_std)
df['score'] = df['score'].where(valid).interpolate()

Output

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944

Thank you. If possible, I think it will be better to create mask with `zscore` or `std`. — ah bon, Jan 10 '20 at 09:40
I chose `n = 2` for the case with `Series.pct_change`, simply by way of example. I chose `n = sqrt(3)`for the second example with `Series.std` and `Series.mean`. — ansev, Jan 10 '20 at 10:13

score 2 · Answer 2 · answered Jan 10 '20 at 10:50

In stats, the basic formula for z-score is (x - mean) / std. Therefore, if you want, you may just define it yourself and compare it against the threshold you choose.

s = df.score
m = ((s - s.mean()) / s.std()).abs() > 1.5
df['score'] =  df['score'].mask(m).interpolate()

Out[701]:
       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  68.683273
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944

score 1 · Answer 3 · edited Jan 10 '20 at 09:57

1

Use pd.DataFrame.interpolate:

df.loc[df['score'].lt(10), 'score'] = np.nan
print(df.interpolate())

Output:

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944

edited Jan 10 '20 at 09:57

jezrael

822,522
95
1,334
1,252

answered Jan 10 '20 at 09:31

U13-Forward

69,221
14
89
114

Could you please implement `scipy.stats.zscore` or others instead of set a definitive value? – ah bon Jan 10 '20 at 09:42
@ahbon What does `stats zscore` mean? – U13-Forward Jan 10 '20 at 09:43
Sorry, `scipy.stats.zscore` I modified. – ah bon Jan 10 '20 at 09:44
1

@ahbon check now – U13-Forward Jan 10 '20 at 09:56

score 1 · Answer 4 · answered Jan 10 '20 at 09:44

Solution using iqr outliers:

import numpy as np
from scipy.stats import iqr

def outliers(df, factor=1.5):
    limit1 = np.quantile(df, 0.25) - factor * iqr(df)
    limit2 = np.quantile(df, 0.75) + factor * iqr(df)
    outliers = df[(df < limit1) | (df > limit2)]
    return outliers

outlier = outliers(df['score'])

df['score'] = df['score'].replace(outlier, np.nan).interpolate()

Output:

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944

score 1 · Accepted Answer · answered Jan 10 '20 at 09:47

1

Solution with zscore, only changed common used 3 value to 1.5:

from scipy import stats
mask = (np.abs(stats.zscore(df['score'])) > 1.5)
df.score = df.score.mask(mask).interpolate()
print (df)
       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  68.683273
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944

answered Jan 10 '20 at 09:47

jezrael

822,522
95
1,334
1,252

Thank you, so I need to tune zscore in order to get good performance on outlier filter, right? – ah bon Jan 10 '20 at 10:12
1

@ahbon - You are welcome, honestly, it should be fast, but not 100% sure, but I can test it. – jezrael Jan 10 '20 at 10:14
2

@ahbon - Tested for 150k rows by `df = pd.concat([df] * 10000, ignore_index=True) `, luigigi answer `17.6 ms ± 814 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)`, second ansev `10.7 ms ± 89.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)`, my `21.4 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Jan 10 '20 at 10:24
@ahbon - U10 not tested, because not not removed outliers – jezrael Jan 10 '20 at 10:25

Replace given columns' outliers with mean of before and after rows' values in Pandas

5 Answers5