4

I have a list with an arbitrary number of integers and/or floats. What I'm trying to do is find the exceptions among my numbers (I hope I'm using the right words to explain this). For example:

list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
  • 90 to 99% of my integer values are between 1 and 20
  • sometimes there are values that are much higher, say around 100, 1,000 or even more

My problem is that these values can be different all the time. Maybe the regular range is somewhere between 1,000 and 1,200 and the exceptions are in the range of half a million.

Is there a function to filter out these special numbers?

Ehsan
finethen
  • Something like calculating [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)? – DeepSpace Jul 08 '20 at 19:16
  • You are looking for "outliers". The hard part is determining how you would define an outlier. If most of your numbers fit into a distribution, such as a normal distribution, you can fit your data to a distribution and find the points that are statistically unlikely to come from the distribution. – James Jul 08 '20 at 19:16
  • Does this answer your question? https://stackoverflow.com/questions/57161413/is-there-function-that-can-remove-the-outliers – Ronald Jul 08 '20 at 19:18
  • @James Thanks! Even knowing that they are called 'outliers' really helps me in my search. – finethen Jul 08 '20 at 19:37

3 Answers

7

Assuming your list is l:

  • If you know you want to filter a certain percentile/quantile, you can use:

    This removes the bottom 10% and the top 10% of the values (everything at or below the 10th percentile and at or above the 90th percentile). You can change either cut-off as needed; in your example, you could drop the lower filter and only remove values above the 90th percentile:

    import numpy as np
    l = np.array(l)
    # Keep only the values strictly between the 10th and 90th percentiles
    l = l[(l>np.quantile(l,0.1)) & (l<np.quantile(l,0.9))].tolist()
    

    output:

    [3, 2, 14, 2, 8, 4, 3, 5]
    
  • If you are not sure of the percentile cut-off and are looking to remove outliers:

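    You can adjust the cut-off by changing the argument m in the function call. The larger m is, the fewer values are removed as outliers. The function measures each value's distance from the median in units of the median absolute deviation (MAD) and keeps the values whose scaled distance is below m. Because the median is barely affected by extreme values, this tends to be more robust to various types of outliers than techniques based on the mean and standard deviation.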

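    import numpy as np
    l = np.array(l)
    def reject_outliers(data, m=6.):
        # Distance of each value from the median of the data
        d = np.abs(data - np.median(data))
        # Median absolute deviation (MAD): the typical distance from the median
        mdev = np.median(d)
        # Distances in units of MAD; divide by 1 if MAD is 0 to avoid dividing by zero
        s = d / (mdev if mdev else 1.)
        # Keep the values that lie closer than m MADs to the median
        return data[s < m].tolist()
    print(reject_outliers(l))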
    

    output:

    [1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
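
    The steps of the median-based function can also be written out one at a time, which may make the intermediate values easier to follow. This is only a sketch on the question's example list; the names `center`, `kept` and `removed` are introduced here for illustration and are not part of the function above.

```python
import numpy as np

data = np.array([1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5])

# Step 1: the median is a "typical" value that outliers barely move.
center = np.median(data)          # 3.0 for this list

# Step 2: absolute distance of every value from that median.
d = np.abs(data - center)

# Step 3: the median of those distances (MAD) is the typical spread.
mdev = np.median(d)               # 2.0 for this list

# Step 4: express each distance in units of MAD.
# If MAD is 0 (more than half the values are identical), divide by 1 instead.
s = d / (mdev if mdev else 1.0)

# Step 5: keep the values closer than m MADs to the median.
m = 6.0
kept = data[s < m].tolist()
removed = data[s >= m].tolist()

print(kept)     # [1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
print(removed)  # [108, 97]
```

    The value 14 is 5.5 MADs away from the median and is kept with m = 6; lowering m to 5 would remove it as well.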
    
Ehsan
  • Hello! Can you explain the steps of your 2nd function? For instance, why are you finding the median twice (first of data and then again of d)? And what is the purpose of "mdev if mdev else 1."? Lastly, what does "s" represent and why is the cut-off s < m? Thank you! Sorry, my stats background isn't too great. – srv_77 Aug 31 '21 at 17:52
-1

You can use the built-in filter() function:

lst1 = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]

lst2 = list(filter(lambda x: x > 5,lst1))

print(lst2)

Output:

[14, 108, 8, 97]
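
The same filter() call can take its threshold from the data instead of a fixed number. The sketch below assumes that everything above the 90th percentile counts as an outlier; the helper name `filter_below_quantile` and the parameter `q` are hypothetical, chosen only for this illustration.

```python
import numpy as np

def filter_below_quantile(values, q=0.9):
    """Keep the values that are not larger than the q-quantile of the data."""
    # The threshold is computed from the data, so nothing is hard-coded.
    threshold = np.quantile(values, q)
    return list(filter(lambda x: x <= threshold, values))

lst1 = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
print(filter_below_quantile(lst1))
# [1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]

# A different scale: regular values around 1,000 to 1,200, one huge exception.
lst3 = [1000, 1100, 1200, 1050, 500000, 1150, 1020, 1180, 1090, 1130]
print(filter_below_quantile(lst3))
# [1000, 1100, 1200, 1050, 1150, 1020, 1180, 1090, 1130]
```

Note that a quantile cut-off removes roughly the top 10% of the values even when the list contains no real outliers, so `q` still has to be chosen with the data in mind.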
Red
  • From OP: "My problem is that these values can be different all the time. Maybe the regular range is somewhere between 1,000 and 1,200 and the exceptions are in the range of half a million." I think the idea here is to not have `5` or any other value hard-coded – DeepSpace Jul 08 '20 at 19:21
  • @DeepSpace What do you mean `regular range`? – Red Jul 08 '20 at 19:22
  • OP means the range of numbers that is considered acceptable – DeepSpace Jul 08 '20 at 19:24
-3

Here is a way to filter out those outliers:

import math
_list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]

def consts(_list):
    # Mean and (population) standard deviation of the data
    mu = sum(_list) / len(_list)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in _list) / len(_list))
    return sigma, mu

def frequence(x, sigma, mu):
    # Normal probability density at x for the given mean and standard deviation
    return (1/(sigma*math.sqrt(2*math.pi)))*math.exp(-(1/2)*math.pow(((x-mu)/sigma),2))

sigma, mu = consts(_list)

# Keep only the values whose density exceeds the cut-off
new_list = []
for x in _list:
    if frequence(x, sigma, mu) > 0.01:
        new_list.append(x)  # append the value itself, not its index
print(new_list)
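
The density cut-off of 0.01 is tied to the scale of this particular list. A related formulation that does not depend on the scale is to limit the distance from the mean in units of the standard deviation (a z-score). The following is a minimal sketch of that variant, not the code above; `filter_by_zscore` and `z_max` are hypothetical names, and the default of 2.0 is an assumption.

```python
import statistics

def filter_by_zscore(values, z_max=2.0):
    """Keep values closer than z_max population standard deviations to the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # All values are identical, so there is nothing to reject.
        return list(values)
    return [x for x in values if abs(x - mu) / sigma < z_max]

data = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
print(filter_by_zscore(data))
# [1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
```

Keep in mind that the mean and the standard deviation are themselves pulled upwards by the outliers, so on other data this can miss outliers or reject regular values.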
mama
  • From OP: "My problem is that these values can be different all the time. Maybe the regular range is somewhere between 1,000 and 1,200 and the exceptions are in the range of half a million." I think the idea here is to not have `20` or any other value hard-coded. Also, you are removing elements from the list while iterating, which is never a good idea (**even the code you posted causes an `IndexError`**) – DeepSpace Jul 08 '20 at 19:23
  • OK, you don't like it. Then I'll create a function that fits a normal distribution and deletes the values that are not accepted. – mama Jul 08 '20 at 19:30
  • @DeepSpace You're right! The idea really is to not have a hard coded value like in this case the 20. – finethen Jul 08 '20 at 19:39
  • Yes, so if you create a normal distribution and remove the less normal values, then you get the max and min values (outliers) and then you just pop them. :) – mama Jul 08 '20 at 19:41
  • @mama Tried it, but unfortunately it doesn't work that well. The function pops out some values that should be in the accepted range. But anyway, a big thank you for your help! – finethen Jul 09 '20 at 18:28
  • It was my pleasure :) – mama Jul 09 '20 at 18:32