Highlight the outliers area in CDF plot

Question

I am trying to highlight the area of the CDF in which the "outliers" fall in my visualization (perhaps a light red shading to differentiate the area).

Can you assist with shading the area where the "outlier" points fall, as per the outliers_iqr definition in the code below? For some reason, when I try to inspect what the outlier function returns, I get an empty output, whether I use print(outliers_iqr(days)) or print(str(outliers_iqr(days))[1:-1]). It just prints array([], dtype=int64).

This is my current code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

a = [389, 350, 130, 344, 392, 92, 51, 28, 309, 357, 64, 380, 332, 109, 284, 105, 
 50, 66, 156, 116, 75, 315, 155, 34, 155, 241, 320, 50, 97, 41, 274, 99, 133, 
 95, 306, 62, 187, 56, 110, 338, 102, 285, 386, 231, 238, 145, 216, 148, 105, 
 368, 176, 155, 106, 107, 36, 16, 28, 6, 322, 95, 122, 82, 64, 35, 72, 214, 
 192, 91, 117, 277, 101, 159, 96, 325, 79, 154, 314, 142, 147, 138, 48, 50, 
 178, 146, 224, 282, 141, 75, 151, 93, 135, 82, 125, 111, 49, 113, 165, 19, 
 118, 105, 92, 133, 77, 54, 72, 34]

#create CDF definition
def ecdf(data):
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n+1) / n
    return x, y

#Using +-1.5x IQR method for defining outliers
def outliers_iqr(ys):
    quartile_1, quartile_3 = np.percentile(ys, [25, 75])
    iqr = quartile_3 - quartile_1
    lower_bound = quartile_1 - (iqr * 1.5)
    upper_bound = quartile_3 + (iqr * 1.5)
    return np.where((ys > upper_bound) | (ys < lower_bound))

days = pd.DataFrame({"days" : a})

x, y = ecdf(days)

plt.plot(x, y, marker='.', linestyle='none') 
plt.axvline(x.mean(), color='gray', linestyle='dashed', linewidth=2) #Add mean

x_m = int(x.mean())
y_m = stats.percentileofscore(days.values.ravel(), x.mean())/100.0  # as_matrix() was removed in pandas 1.0

ax=plt.gca()
ax.annotate('(%s,%s)' % (x_m,int(y_m*100)) , xy=(x_m,y_m), 
            xytext=(10,-5), textcoords='offset points')

outliers= outliers_iqr(days) 
print(outliers_iqr(days)) #print outliers- doesn't print   
print(str(outliers_iqr(days))[1:-1]) #same

#highlight the outliers area in the CDF plot
ax.fill_between(?, ?, ?, where=?, facecolor='red', alpha=0.3) #between 0 and 1st quartile
ax.fill_between(?, ?, ?, where=?, facecolor='red', alpha=0.3) #between 3rd quartile and 1

percentiles= np.array([25,50,75])
x_p = np.percentile(days, percentiles)
y_p = percentiles/100.0

plt.plot(x_p, y_p, marker='D', color='red', linestyle='none') # Overlay quartiles

for x,y in zip(x_p, y_p):                                        
    ax.annotate('%s' % int(x), xy=(x,y), xytext=(10,-5), textcoords='offset points')

plt.xlabel('Days')
plt.ylabel('ECDF')
plt.legend(('Days', "Mean", 'Quartiles'), loc='lower right')

plt.show()

What criteria would define the *'"better method" for defining "outliers"'*? Without a clear problem description, this question is not useful. — ImportanceOfBeingErnest, Feb 15 '18 at 21:46
@ImportanceOfBeingErnest I guess I am looking for more "widely used" rather than "statistically sound" — user8834780, Feb 15 '18 at 21:49
@ImportanceOfBeingErnest dropped that from the question since it is pretty insignificant to what I am asking :) — user8834780, Feb 15 '18 at 21:53
Did you look at any examples or similar questions? Did you try to use `fill_between`? What problem do you encounter? — ImportanceOfBeingErnest, Feb 15 '18 at 21:59
@ImportanceOfBeingErnest not exactly, when I try to look what the outlier definition did, I get an empty output, whether it is `print(outliers_iqr(days))` or `print(str(outliers_iqr(days)[1:-1])`. It just prints `array([], dtype=int64),` — user8834780, Feb 15 '18 at 23:12
Ok, so you have a problem with the code. That would require to provide a [mcve], which lets people reproduce your problem. — ImportanceOfBeingErnest, Feb 15 '18 at 23:35
@ImportanceOfBeingErnest ok updated. Both my data set functions and full code are listed — user8834780, Feb 15 '18 at 23:43
I edited your question such that the code is runnable. However, the code does not include any call to the function you apparently have a problem with, so what is the data you call that function with? Just integrate this in the [mcve] as well. — ImportanceOfBeingErnest, Feb 15 '18 at 23:58
In your `outliers_iqr()` function, `print(quartile_1, quartile_3, iqr, lower_bound, upper_bound)` gives `77.5 215.5 138.0 -129.5 422.5`, so your upper and lower bounds are outside of the data limits -- that's why your results are empty. Also, I think you want `fill_betweenx` rather than `fill_between`. — Thomas Kühn, Feb 16 '18 at 10:11
@ThomasKühn I would like to include the condition even though there are no outliers in this case.Here is what I did: `outliers = outliers_iqr(days) ax.fill_betweenx(days, outliers, x_p[0], where= outliersx_p[2], facecolor='red', alpha=0.3)` but I am getting an error `ValueError: Input passed into argument "u'x1'"is not 1-dimensional` and I can't seem to find anything about this error. Any suggestions? — user8834780, Feb 18 '18 at 07:23
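
The empty result discussed in the comments above can be checked directly. The snippet below is a minimal sketch, not part of the original thread: it recomputes the 1.5x IQR fences on the question's data as a plain 1-D array and uses a boolean mask (the names `is_outlier` and `outlier_values` are only illustrative) to show that no point lies outside the fences. The comment above reports the fences as -129.5 and 422.5.

```python
import numpy as np

# Data copied from the question.
a = [389, 350, 130, 344, 392, 92, 51, 28, 309, 357, 64, 380, 332, 109, 284, 105,
     50, 66, 156, 116, 75, 315, 155, 34, 155, 241, 320, 50, 97, 41, 274, 99, 133,
     95, 306, 62, 187, 56, 110, 338, 102, 285, 386, 231, 238, 145, 216, 148, 105,
     368, 176, 155, 106, 107, 36, 16, 28, 6, 322, 95, 122, 82, 64, 35, 72, 214,
     192, 91, 117, 277, 101, 159, 96, 325, 79, 154, 314, 142, 147, 138, 48, 50,
     178, 146, 224, 282, 141, 75, 151, 93, 135, 82, 125, 111, 49, 113, 165, 19,
     118, 105, 92, 133, 77, 54, 72, 34]

ys = np.asarray(a)  # 1-D array, so indexing with a mask returns values directly

quartile_1, quartile_3 = np.percentile(ys, [25, 75])
iqr = quartile_3 - quartile_1
lower_bound = quartile_1 - 1.5 * iqr
upper_bound = quartile_3 + 1.5 * iqr

# True for every point outside the fences (illustrative variable names).
is_outlier = (ys < lower_bound) | (ys > upper_bound)
outlier_values = ys[is_outlier]

print("fences:", lower_bound, upper_bound)
print("data range:", ys.min(), ys.max())
print("outliers:", outlier_values)
```

Because both fences lie outside the data range, the mask is all False and any shading code has to cope with an empty selection.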

score 2 · Accepted Answer · answered Feb 19 '18 at 11:06

If your array of outliers can sometimes be empty, you have to handle that case with an if statement. Also, since you just want to shade regions of your plot, you can use Axes.axvspan for that. Here is an example, somewhat modified from the original: all the plotting commands are moved into a function, and a second subplot is added with data that actually has outliers.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

a = [389, 350, 130, 344, 392, 92, 51, 28, 309, 357, 64, 380, 332, 109, 284, 105, 
 50, 66, 156, 116, 75, 315, 155, 34, 155, 241, 320, 50, 97, 41, 274, 99, 133, 
 95, 306, 62, 187, 56, 110, 338, 102, 285, 386, 231, 238, 145, 216, 148, 105, 
 368, 176, 155, 106, 107, 36, 16, 28, 6, 322, 95, 122, 82, 64, 35, 72, 214, 
 192, 91, 117, 277, 101, 159, 96, 325, 79, 154, 314, 142, 147, 138, 48, 50, 
 178, 146, 224, 282, 141, 75, 151, 93, 135, 82, 125, 111, 49, 113, 165, 19, 
 118, 105, 92, 133, 77, 54, 72, 34]


#create CDF definition
def ecdf(data):
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n+1) / n
    return x, y

#Using +-1.5x IQR method for defining outliers
def outliers_iqr(ys):
    quartile_1, quartile_3 = np.percentile(ys, [25, 75])
    iqr = quartile_3 - quartile_1
    lower_bound = quartile_1 - (iqr * 1.5)
    upper_bound = quartile_3 + (iqr * 1.5)

    return  np.where((ys < lower_bound)), np.where((ys > upper_bound))



def generate_plot(ax, df):

    x, y = ecdf(df)

    ax.plot(x, y, marker='.', linestyle='none') 
    ax.axvline(x.mean(), color='gray', linestyle='dashed', linewidth=2) #Add mean

    x_m = int(x.mean())
    y_m = stats.percentileofscore(df.values.ravel(), x.mean())/100.0  # as_matrix() was removed in pandas 1.0

    ax.annotate('(%s,%s)' % (x_m,int(y_m*100)) , xy=(x_m,y_m), 
                xytext=(10,-5), textcoords='offset points')

    outliers= outliers_iqr(df.values) 

    #highlight the outliers area in the CDF plot
    for outl in outliers:
        vals = df.values[outl]
        if vals.size>0:
            ax.axvspan(np.min(vals),np.max(vals),alpha=0.5,color='red')


    percentiles= np.array([25,50,75])
    x_p = np.percentile(df, percentiles)
    y_p = percentiles/100.0

    ax.plot(x_p, y_p, marker='D', color='red', linestyle='none') # Overlay quartiles

    for x,y in zip(x_p, y_p):                                        
        ax.annotate('%s' % int(x), xy=(x,y), xytext=(10,-5), textcoords='offset points')

    ax.set_xlabel('Days')
    ax.set_ylabel('ECDF')
    ax.legend(('Days', "Mean", 'Quartiles'), loc='lower right')


fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize=(10,5))

##original data
days = pd.DataFrame({"days" : a})
generate_plot(axes[0],days)

##fake data with outliers
b = np.concatenate([
    np.random.normal(200,50,300),
    np.random.normal(25,10,20),
    np.random.normal(375,10,20),
])
np.random.shuffle(b)
generate_plot(axes[1],pd.DataFrame({"days" : b}))

##naming the subplots
axes[0].set_title('original data')
axes[1].set_title('fake data with outliers')

plt.show()

The result looks like this:

Hope this helps.

Really appreciate the help! Note that by doing `days = pd.DataFrame({"days" : a})` you plotted a scatter plot rather than a cdf- and that indeed requires a `ax.axvspan`. It is easy to change to a cdf ie. `days = df['days'].dropna()`- but that requires an `ax.axhspan` which is easy to adjust. Therefore I adjusted to have a `generate_cdf_plot()` definition with `ax.axhspan()` and a `generate_scatter_plot()` definition with `ax.axvspan()`, and made left graph a CDF, while right graph is a scatter plot. Thank you so much! — user8834780, Feb 20 '18 at 20:23
One point of confusion- in your graphs (and my scatter plot graphing version), what is y? it is no longer cumulative P(x) and can't be labeled ECDF.. — user8834780, Feb 20 '18 at 21:17
@user8834780 I have to admit that I didn't really think about the meaning of the data. I was just trying to generate data that has outliers after your definition, so that the shading of the `axvspan` can be visualised. If it is needed, I can still give it some thought -- just let me know. — Thomas Kühn, Feb 21 '18 at 06:54
Would be great if I could understand what is it that is being plotted in the non CDF version. Since this is essentially a 1D array, all y values seem arbitrary to me unless I am misunderstanding. Coming up with what y is in this context would be great thank you! — user8834780, Feb 21 '18 at 12:17
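
On the question of what y is in the non-CDF version: as far as I can tell, the likely cause is that `np.sort` sorts along the last axis by default, and a one-column table converts to an array of shape (n, 1), so every length-1 row is "sorted" and the order of the values never changes. In that case y is just the position of each value in the original order, divided by n. The sketch below is an assumption-labeled illustration, not code from the thread: it uses a NumPy column vector in place of the DataFrame and a variant of `ecdf` that flattens its input first.

```python
import numpy as np

def ecdf(data):
    """Return sorted values and cumulative proportions for the flattened data."""
    x = np.sort(np.asarray(data).ravel())  # flatten first so the sort spans all values
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# First few values from the question, enough to show the effect.
values = np.array([389, 350, 130, 344, 392, 92, 51, 28])

# Stand-in for a one-column DataFrame: shape (n, 1).
column = values.reshape(-1, 1)

# Default axis=-1 sorts within each row of length 1, so nothing moves.
unsorted = np.sort(column)
print(np.array_equal(unsorted.ravel(), values))

# With the flattened version, x is increasing and y is a proper ECDF.
x, y = ecdf(column)
print(x)
print(y)
```

Passing the column itself (for example `df['days']`, as suggested in the comment above) or flattening inside `ecdf` both give a monotonic curve whose y axis can be labeled ECDF again.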
