Removal of outliers using numpy.argwhere

Question

Hey guys, this question might be more about logic than code; hopefully someone can shed some light on it.
I have a data list that contains some outliers, and I want to remove them by taking the difference between consecutive items and identifying where that difference is far too big.
In the example below, I want to remove indexes [2, 3, 4] from the data list. What is the best way to do it?
I have tried np.argwhere() to find the indexes, but I am stuck on how to use its result to slice an np.array.

import numpy as np

data=[4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5]
data=np.array(data)
d = data[:-1] - data[1:]   # difference between each item and the next one
print(np.mean(d))

In this example, printing the difference (d) gives:

print(d) # returns:[ -0.5 -18.  -18.   18.   19.    0.5  -0.5  -1.    1.    1. ]

That is good. Now, the logic I applied was to find where d has a value higher than the mean of the original data.

x = np.argwhere(d>np.mean(data))
print(x)        # returns: array([3], dtype=int64), array([4], dtype=int64)
indices_to_extract = [x[0]-1,x[-1]]
print(indices_to_extract)      # returns: [array([2], dtype=int64), array([[4]], dtype=int64)]
a1 = np.delete(data,indices_to_extract,axis=0)
print(a1)       #returns: [ 4.   4.5 40.5  3.5  3.   3.5  4.5  3.5  2.5]


# Desired return:
[ 4.   4.5 3.5  3.  3.5  4.5  3.5  2.5]

The main question is: how can I turn the result of np.argwhere() into a range of indices that can be used for slicing?

Is this a question of how to remove outliers with this method or what is a best way to remove outliers? Because there are better ways to detect outliers than simple distance. — Ehsan, Jun 23 '20 at 11:00

score 1 · Accepted Answer · answered Jun 23 '20 at 09:59

The problem with taking the difference between consecutive items is that, for instance, the value at index 1 (4.5) will be considered an outlier, because its difference to the next value is large. You also get both positive and negative values when taking the difference, so if you want to go that way you should take the absolute value (abs) of the differences.
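If you do want to stay with the difference-based idea, here is a minimal sketch of how it could work. It assumes there is a single contiguous run of outliers and reuses the mean of the data as the threshold, as in the question; the names `step`, `jumps` and `outlier_idx` are only illustrative.

```python
import numpy as np

data = np.array([4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5])

# Absolute step between neighbours: step[i] is |data[i+1] - data[i]|.
step = np.abs(np.diff(data))

# Positions of the big jumps. The threshold (mean of the data) is taken
# from the question and is an assumption, not a general rule.
jumps = np.flatnonzero(step > np.mean(data))

# Assumes one contiguous run of outliers: it starts right after the first
# big jump and ends at the last big jump.
if jumps.size:
    outlier_idx = np.arange(jumps[0] + 1, jumps[-1] + 1)
else:
    outlier_idx = np.array([], dtype=int)

cleaned = np.delete(data, outlier_idx)
print(outlier_idx)   # [2 3 4]
print(cleaned)       # [4.  4.5 3.5 3.  3.5 4.5 3.5 2.5]
```

np.flatnonzero returns a flat 1-D index array, so its first and last entries can be passed straight to np.arange to build the range of indices to delete. With several separate outlier runs this simple first/last logic would also remove the good values in between.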

One way to spot outliers is the following:

Compute the z-score:

d = (data - np.mean(data)) / np.std(data)

Select every value from data except for the outliers (above the 75% quantile):

data[np.where( ~(d > np.quantile(d, 0.75)))]

Output:

array([4. , 4.5, 3.5, 3. , 3.5, 4.5, 3.5, 2.5])

Quick question, what does ~ do to the code? Never saw that before?? — Angel Lira, Jun 23 '20 at 10:52
It's the negation, so all the true becomes false and vice versa — DavideBrex, Jun 23 '20 at 11:18
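To illustrate the ~ from the comments above, here is a small sketch using the same data as the question. The variable names `is_outlier` and `keep` are only illustrative.

```python
import numpy as np

data = np.array([4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5])
d = (data - np.mean(data)) / np.std(data)

is_outlier = d > np.quantile(d, 0.75)   # boolean mask, True for outliers
keep = ~is_outlier                      # element-wise NOT on the mask

# A boolean mask can index the array directly; np.where is optional here.
print(data[keep])   # [4.  4.5 3.5 3.  3.5 4.5 3.5 2.5]
print(np.array_equal(keep, np.logical_not(is_outlier)))   # True
```

On a NumPy boolean array ~ does the same as np.logical_not. For a plain Python bool, use `not` instead.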

score 1 · Answer 2 · answered Jun 23 '20 at 11:08

I would advise using normalized distances to the median, which is more robust:

d = np.abs(data - np.median(data))
mdev = np.median(d)
s = d / (mdev if mdev else 1.)
print(data[s < 4])

You can change the threshold (4 in the last line) to get the accuracy you want.

output:

[4.  4.5 3.5 3.  3.5 4.5 3.5 2.5]
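To try different thresholds, the same steps could be wrapped in a small helper. This is only a sketch: `filter_by_mad` and its `threshold` parameter are names made up for illustration, not part of NumPy.

```python
import numpy as np

def filter_by_mad(values, threshold=4.0):
    """Keep values whose scaled distance to the median is below threshold.

    Hypothetical helper; the name and signature are illustrative only.
    """
    values = np.asarray(values, dtype=float)
    dist = np.abs(values - np.median(values))
    mad = np.median(dist)   # median absolute deviation
    # mad can be 0 (e.g. when more than half of the values are identical);
    # fall back to 1 as in the snippet above to avoid dividing by zero.
    scaled = dist / (mad if mad else 1.0)
    return values[scaled < threshold]

data = [4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5]
print(filter_by_mad(data))        # [4.  4.5 3.5 3.  3.5 4.5 3.5 2.5]
print(filter_by_mad(data, 2.5))   # [4.  4.5 3.5 3.  3.5 4.5 3.5]
```

With the stricter threshold of 2.5 the last value (2.5) is dropped as well, because its scaled distance to the median is 3.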

score 0 · Answer 3 · edited Jun 23 '20 at 15:04

To use np.argwhere() for a range of numbers, say between 3 and 20 (exclusive) in your case, you can use:

x = np.argwhere((data<20) & (data>3))

To return the elements less than or greater than a number (say, data below 20), you can simply use:

data[np.where(data<20)]

and for a range of numbers, say between 3 and 20 (exclusive):

data[np.where((data<20)&(data>3))]
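As a sketch of why the result of np.argwhere() is awkward to use as an index: it returns one row per match, so for a 1-D array the result has shape (n, 1) and needs to be flattened first. The variable names below are only illustrative.

```python
import numpy as np

data = np.array([4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5])
mask = (data < 20) & (data > 3)   # strict bounds, so 3.0 itself is excluded

idx_2d = np.argwhere(mask)        # shape (n, 1): one row per matching element
idx_1d = np.flatnonzero(mask)     # shape (n,): ready to use as an index

print(idx_2d.shape, idx_1d.shape)                  # (6, 1) (6,)
print(data[idx_1d])                                # [4.  4.5 3.5 3.5 4.5 3.5]
print(np.array_equal(idx_2d.ravel(), idx_1d))      # True
print(np.array_equal(data[idx_1d], data[mask]))    # True
```

So either flatten the argwhere result with .ravel(), use np.flatnonzero, or index with the boolean mask directly.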

answered Jun 23 '20 at 10:39 by Ahana Kk · edited Jun 23 '20 at 15:04 by ipinak
