Suppose I need to remove the outlier, which is (40, 10) in this case (see the plot below), using the IQR rule. How do I do that?

Compared to its neighbouring points, (40, 10) is clearly an outlier. However, for the y values:
Q1 = 11.25
Q3 = 35.75
1.5 * IQR = 1.5 * (Q3 - Q1) = 36.75
so only points with a y value lower than 11.25 - 36.75 = -25.5 or greater than 35.75 + 36.75 = 72.5 are considered outliers.
How do I find and remove (40, 10) if I must use the IQR rule?

Here's my code:

import pandas as pd
import matplotlib.pyplot as plt

test = pd.DataFrame({'x': range(50), 'y': [i if i != 40 else 10 for i in range(50)]})

FIGURE = {'figsize': (8, 6)}  # assumed figure settings; FIGURE is not defined in the question
plt.figure(**FIGURE)
plt.scatter(test['x'], test['y'], marker='x')
plt.show()

Here's the plot generated from the above code.

[plot: scatter of y against x]

JohanC
Ci Leong
    You are using a 1D test for a 2D problem. You could create a regression line and use the distance to the regression line to identify outliers. See e.g. [Can scipy.stats identify and mask obvious outliers?](https://stackoverflow.com/questions/10231206/can-scipy-stats-identify-and-mask-obvious-outliers) – JohanC Sep 26 '20 at 14:19
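One way to read the comment above while still using the IQR rule is to apply the rule to the residuals of a fitted line instead of to the raw y values. The following is a minimal sketch under that assumption: it fits an ordinary least-squares line with `np.polyfit` and uses the signed vertical residual as the "distance" to the line. The names `residuals`, `mask`, `outliers` and `cleaned` are illustrative only and do not come from the question.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'x': range(50), 'y': [i if i != 40 else 10 for i in range(50)]})

# Fit a straight line y ~ slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(test['x'], test['y'], deg=1)

# Signed vertical distance of every point from the fitted line.
residuals = test['y'] - (slope * test['x'] + intercept)

# Apply the usual 1.5 * IQR fences to the residuals rather than to y itself.
q1, q3 = residuals.quantile([0.25, 0.75])
iqr = q3 - q1
mask = (residuals < q1 - 1.5 * iqr) | (residuals > q3 + 1.5 * iqr)

outliers = test[mask]    # rows flagged by the rule
cleaned = test[~mask]    # data with the flagged rows removed
print(outliers)
```

This works here because the data are close to linear. The outlier also pulls the fitted line slightly towards itself, so for noisier data a robust fit might be a better choice.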

1 Answer

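The way you are using the IQR rule considers only one coordinate at a time (in your calculation, the y values). Along a single axis, the point at (40, 10) is not an outlier: y = 10 lies well inside the fences of -25.5 and 72.5. It only stands out when x and y are considered together.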
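To make this concrete, here is a small sketch that applies the 1.5 * IQR rule to each column separately, using the DataFrame from the question. The helper name `iqr_fences` is made up for this illustration.

```python
import pandas as pd

test = pd.DataFrame({'x': range(50), 'y': [i if i != 40 else 10 for i in range(50)]})

def iqr_fences(values):
    """Return (lower, upper) fences of the 1.5 * IQR rule for a Series."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = {}
for col in ['x', 'y']:
    lower, upper = iqr_fences(test[col])
    # Rows whose value in this single column falls outside the fences.
    flagged[col] = test[(test[col] < lower) | (test[col] > upper)]
    print(col, lower, upper, len(flagged[col]))
```

Neither column flags any row, because x = 40 and y = 10 are each unremarkable on their own.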

You should use a method that considers both dimensions together, such as Local Outlier Factor.
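A minimal sketch of that idea, assuming scikit-learn is available and using its `LocalOutlierFactor` estimator. The value `n_neighbors=5` is an assumption picked for this small data set, not a general recommendation, and the default `contamination='auto'` threshold is left as is.

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

test = pd.DataFrame({'x': range(50), 'y': [i if i != 40 else 10 for i in range(50)]})

# n_neighbors=5 is an assumed setting for this example.
lof = LocalOutlierFactor(n_neighbors=5)

# fit_predict returns -1 for points judged to be outliers and 1 for inliers.
labels = lof.fit_predict(test[['x', 'y']])

outliers = test[labels == -1]
cleaned = test[labels == 1]

# More negative scores mean a point is more isolated from its neighbours.
scores = pd.Series(lof.negative_outlier_factor_, index=test.index)
print(outliers)
print(scores.idxmin())
```

Note that this is a density-based method rather than the IQR rule, so it only applies if the IQR requirement can be relaxed. On other data, `n_neighbors` and `contamination` may need tuning.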

Galo Castillo