My problem is this: when computing a rolling average with pandas, how can I reject a window of 10 rows if one or more of its rows contain an outlier? I need help with the conditional logic for the scenarios below.

The conditions on outliers in a window are:

  • The upper bound for a valid value is 15 and the lower bound is 0; anything outside this range is an outlier.

  • If the frequency of outliers in a window is greater than 10%, we reject that window and move to the next one.

  • If the frequency of outliers in a window is less than 10%, we accept the window with the following change: replace each outlier with the average of the non-outlier values (i.e. the remaining 9 rows), then average the same window again before moving to the next one.

Here is my code so far:

_filter = lambda x: float("inf") if x > 15 or x < 0 else x

#Apply the mean over window with inf to result those values in  
result = df_list["speed"].apply(_filter).rolling(10).mean().dropna()

#Print Max rolling average
print("The max rolling average is:")

result.max()

1 Answer

Use rolling with a custom aggregation function:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(100), "speed": np.random.randint(0, 17, 100)})

MAX = 15
MIN = 0

def my_mean(s):
    # Count the values in the window that fall outside [MIN, MAX]
    outlier_count = ((s < MIN) | (s > MAX)).sum()
    if outlier_count > 2:  # defined 2 as the threshold - can put any other number here
        return np.nan  # too many outliers: reject the window
    # Mean of the non-outlier values only
    return s[(s <= MAX) & (s >= MIN)].mean()

df["roll"] = df.speed.rolling(10).apply(my_mean)

This results, in one example, in:

    ...
    35  35  8   9.444444
    36  36  14  9.666667
    37  37  11  9.888889
    38  38  16  10.250000
    39  39  16  NaN
    40  40  15  NaN
    41  41  6   NaN
    42  42  9   11.375000
    43  43  2   10.000000
    44  44  8   9.125000
    ...

What happens here is as follows:

  • We create a rolling window of size 10 (df.speed.rolling(10))
  • For each window, which is a series of 10 numbers, we apply the function my_mean.
  • my_mean first counts the outliers by summing the number of elements in the series s that are smaller than the minimum or larger than the maximum.
  • If the count of outliers is too large, we just say that there's no mean and return not-a-number.
  • Otherwise, we filter out outliers and calculate the mean of the other numbers (s[(s <= MAX) & (s >= MIN)].mean()).
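
The steps above can be adapted to the 10% rule from the question. The following is a minimal sketch, not part of the original answer: MAX_OUTLIER_FRACTION and windowed_mean are hypothetical names, and it assumes that a window with exactly 10% outliers (one row in ten) is still accepted, which the question implies but does not state.

```python
import numpy as np
import pandas as pd

MAX = 15
MIN = 0
MAX_OUTLIER_FRACTION = 0.10  # assumption: exactly 10% is still accepted


def windowed_mean(s):
    # Boolean mask of the values outside [MIN, MAX]
    is_outlier = (s < MIN) | (s > MAX)
    # Reject the window when the share of outliers exceeds the limit
    if is_outlier.mean() > MAX_OUTLIER_FRACTION:
        return np.nan
    # Otherwise average only the non-outlier values
    return s[~is_outlier].mean()


# Small deterministic example: 16 and 20 are the outliers
speed = pd.Series([5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 4, 20])
roll = speed.rolling(10).apply(windowed_mean)
print(roll)
```

Expressing the threshold as a fraction rather than a count keeps the rule the same if the window size changes.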
Roy2012
  • Sorry, I am new to programming. Could you please explain how your code generates results based on the conditions, and how I can implement it alongside my existing code? – karan vir singh bajaj May 11 '20 at 06:41
  • @karanvirsinghbajaj - first, is this what you're trying to achieve? Second, I'll edit the post to explain the rationale in more detail. – Roy2012 May 11 '20 at 08:23
  • Thank you for following up and offering your support. I highly appreciate it. The goal is to calculate the rolling average from a dataset that contains the speed of a person. However, it contains outliers. My problem is how to reject or accept a window of 10 rows, while computing the rolling average with pandas, when one or more of its rows contain an outlier. The conditions for rejecting or accepting a window are specified above in the post, and they apply to each window as the rolling average is computed. – karan vir singh bajaj May 11 '20 at 08:36
  • If you can't have any outliers in the window, just change the outlier threshold to 0: change "if outlier_count > 2:" to "if outlier_count > 0:". That would solve the issue for you (and you can accept my post as an answer if that's the case). – Roy2012 May 11 '20 at 09:00
  • The updated post is very helpful. There is just one place where I still need your help. Regarding the last point in your description: what we need to do is calculate the mean of the other numbers, replace the outlier with that mean, and then recalculate the rolling average over the 10-row window, which now contains the replacement value and the other 9 rows. That's the last part. Thank you so much for your help. Sorry for the hassle – karan vir singh bajaj May 11 '20 at 18:39
  • The code above already calculates the mean without taking outliers into account. As to replacing outliers - it's fairly easy. Give it a try. – Roy2012 May 11 '20 at 19:41
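
On the last request in the comments, a short sketch (not from the original answer; replace_then_average is a hypothetical helper) suggests that the extra step does not change the result. If a window of n values has k non-outliers with mean m, replacing each outlier with m and averaging all n values again gives (k*m + (n - k)*m) / n = m, which is what my_mean already returns.

```python
import pandas as pd

MAX = 15
MIN = 0


def replace_then_average(s):
    # Hypothetical helper: substitute each outlier with the mean of the
    # non-outlier values, then average the full window again
    is_outlier = (s < MIN) | (s > MAX)
    clean_mean = s[~is_outlier].mean()
    replaced = s.where(~is_outlier, clean_mean)
    return replaced.mean()


# One window of 10 rows with a single outlier (16)
window = pd.Series([5, 6, 7, 8, 9, 10, 11, 12, 13, 16], dtype=float)

clean = window[(window >= MIN) & (window <= MAX)].mean()  # mean of the 9 valid rows
recomputed = replace_then_average(window)                 # mean after replacement
print(clean, recomputed)
```

Both values agree for this window, so filtering out the outliers before taking the mean is enough for the accepted windows.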