0

How can I replace outliers in score column from the following dataframe with the before and after values?

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10   8.590178   ---> outlier
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04   3.051802   ---> outlier
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944

The expected output is:

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944

Many thanks.

ah bon
  • 9,293
  • 12
  • 65
  • 148
  • 1
    Please include what you have tried so far in your question. See [How do I ask a good question](https://stackoverflow.com/help/how-to-ask). Also how do you define an outlier? – LeoE Jan 10 '20 at 09:25
  • 2
    detect outliers using stats,zscore and remove the rows. perform fillna – Mohamed Thasin ah Jan 10 '20 at 09:27
  • Maybe this will help :- https://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining – Echo Jan 10 '20 at 09:28
  • So many wonderful answers, I hesitate which one I should choose to accept. ;) – ah bon Jan 10 '20 at 10:00

5 Answers5

2

It depends on what you consider as atypical, you could adopt many criteria. You could use Series.mask to replace atypical values ​​with NaN for example with a boolean series created with Series.pct_change and then fill in with Series.interpolate

n = 2
df['score'] = (df['score'].mask(df['score'].pct_change()
                                           .shift(-1)
                                           .abs()
                                           .gt(n))
                           .interpolate() )
print(df)

We could also use Series.mean with Series.std and check with Series.between:

import numpy as np
score_mean = df['score'].mean()
score_std  = df['score'].std()
n = np.sqrt(3)
valid = df['score'].between(score_mean-n*score_std,score_mean+n*score_std)
df['score'] = df['score'].where(valid).interpolate()

Output

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944
ansev
  • 30,322
  • 5
  • 17
  • 31
2

In stats, the basic formula for z-score is (x - mean) / std. Therefore, if you want, you may just define it yourself and compare it against the threshold you choose.

s = df.score
m = ((s - s.mean()) / s.std()).abs() > 1.5
df['score'] =  df['score'].mask(m).interpolate()

Out[701]:
       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  68.683273
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944
Andy L.
  • 24,909
  • 4
  • 17
  • 29
1

Use pd.DataFrame.interpolate:

df.loc[df['score'].lt(10), 'score'] = np.nan
print(df.interpolate())

Output:

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
U13-Forward
  • 69,221
  • 14
  • 89
  • 114
1

Solution using iqr outliers:

import numpy as np
from scipy.stats import iqr

def outliers(df, factor=1.5):
    limit1 = np.quantile(df, 0.25) - factor * iqr(df)
    limit2 = np.quantile(df, 0.75) + factor * iqr(df)
    outliers = df[(df < limit1) | (df > limit2)]
    return outliers

outlier = outliers(df['score'])

df['score'] = df['score'].replace(outlier, np.nan).interpolate()

Output:

       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  95.101470
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944
luigigi
  • 4,146
  • 1
  • 13
  • 30
1

Solution with zscore, only changed common used 3 value to 1.5:

from scipy import stats
mask = (np.abs(stats.zscore(df['score'])) > 1.5)
df.score = df.score.mask(mask).interpolate()
print (df)
       date      score
0   2018-07  51.964556
1   2018-08  63.497871
2   2018-09  85.304209
3   2018-10  69.840105
4   2018-11  54.376001
5   2018-12  65.844745
6   2019-01  53.050123
7   2019-02  39.915868
8   2019-04  48.701537
9   2019-05  57.487205
10  2019-06  68.683273
11  2019-07  79.879340
12  2019-08  77.007162
13  2019-09  54.567136
14  2019-10  63.899944
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you, so I need to tune zscore in order to get good performance on outlier filter, right? – ah bon Jan 10 '20 at 10:12
  • 1
    @ahbon - You are welcome, honestly, it should be fast, but not 100% sure, but I can test it. – jezrael Jan 10 '20 at 10:14
  • 2
    @ahbon - Tested for 150k rows by `df = pd.concat([df] * 10000, ignore_index=True) `, luigigi answer `17.6 ms ± 814 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)`, second ansev `10.7 ms ± 89.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)`, my `21.4 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Jan 10 '20 at 10:24
  • @ahbon - U10 not tested, because not not removed outliers – jezrael Jan 10 '20 at 10:25