Obtaining values used in boxplot, using python and matplotlib

Question

I can draw a boxplot from data:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(100)
plt.boxplot(data)

Then the box spans the 25th to the 75th percentile, and the whiskers extend to the smallest and largest data values that lie within [25th percentile - 1.5*IQR, 75th percentile + 1.5*IQR], where IQR denotes the interquartile range. (The factor 1.5 is customizable.)

Now I want to know the values used in the boxplot, i.e. the median, upper and lower quartile, the upper whisker end point and the lower whisker end point. While the former three are easy to obtain by using np.median() and np.percentile(), the end point of the whiskers will require some verbose coding:

median = np.median(data)
upper_quartile = np.percentile(data, 75)
lower_quartile = np.percentile(data, 25)

iqr = upper_quartile - lower_quartile
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()

While this is acceptable, is there a neater way to do it? It seems the values should be ready to pull out of the boxplot, since it has already been drawn.

Why do we need to use max() and min() for the upper and lower whiskers? Can't we use (upper_quartile + 1.5*iqr) and (lower_quartile - 1.5*iqr) directly as the whiskers? – Harshit Mehta, Jul 19 '18 at 17:15
@HarshitMehta Because the whisker is an actual data point in the set, which may not fall right on the `upper_quartile + 1.5 * iqr` value. – Yuxiang Wang, Jul 25 '18 at 00:39
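
The distinction made in the comment above can be seen on a small sample. This is only a sketch: the array below is a hypothetical dataset chosen so that the limit and the whisker end clearly differ, and the name `upper_fence` is an illustrative label for the `upper_quartile + 1.5 * iqr` limit, not a matplotlib term.

```python
import numpy as np

# Hypothetical sample: nine evenly spaced values plus one far-away point.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# The limit itself is a computed number and usually not an observation.
upper_fence = q3 + 1.5 * iqr

# The whisker ends at the largest observation that is still inside the limit.
upper_whisker = data[data <= upper_fence].max()

print("limit:", upper_fence, "whisker end:", upper_whisker)
```

Here the limit falls between 9 and 30, so the whisker stops at 9 and the value 30 would be drawn as an outlier.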

CT Zhu · Accepted Answer · 2020-01-15T06:05:10.290

31

Why do you want to do so? What you are doing is already pretty direct.

That said, if you want to fetch the values from a plot that has already been drawn, use the get_ydata() method on the artists that plt.boxplot() returns.

B = plt.boxplot(data)
[item.get_ydata() for item in B['whiskers']]

It returns an array of shape (2,) for each whisker; the second element is the value we want:

[item.get_ydata()[1] for item in B['whiskers']]
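
As a quick check, the values read back from the plot can be compared with the manual computation from the question. The following is a minimal sketch, assuming a single dataset, the default `whis=1.5`, and a seeded generator (the seed and the `Agg` backend are choices made here so the snippet runs reproducibly without a display).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100)

B = plt.boxplot(data)

# For one dataset, B['whiskers'] holds the lower whisker, then the upper one.
# Index 0 of each line is the box edge, index 1 is the whisker end.
lower_whisker, upper_whisker = [w.get_ydata()[1] for w in B['whiskers']]

# The median line is horizontal, so both of its y-values are the median.
median_from_plot = B['medians'][0].get_ydata()[0]

# Manual computation from the question, for comparison.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
manual_upper = data[data <= q3 + 1.5 * iqr].max()
manual_lower = data[data >= q1 - 1.5 * iqr].min()

print(np.isclose(lower_whisker, manual_lower),
      np.isclose(upper_whisker, manual_upper),
      np.isclose(median_from_plot, np.median(data)))
```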

edited Jan 15 '20 at 06:05

answered May 04 '14 at 23:18


Thank you so much! I have actually realized that I am asking a "greedy" question once I have written the example code snippet - not as verbose as I thought. But still, it is great to know that the get_ydata() can do the same thing! – Yuxiang Wang May 05 '14 at 01:16
I want to mention that in the last line of your code the index should be 1, not 0. I almost got confused. `[item.get_ydata()[1] for item in B['whiskers']]` – avijit Jan 11 '20 at 21:42
Might be easiest to use the same function that matplotlib uses: https://matplotlib.org/3.1.1/api/cbook_api.html#matplotlib.cbook.boxplot_stats – Paul H Jun 11 '20 at 19:52
@PaulH, thank you, great comment. Should be on the top – timanix Jun 25 '21 at 08:21
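
The function Paul H mentions, `matplotlib.cbook.boxplot_stats`, can be called without drawing anything. A minimal sketch, assuming one 1-D dataset and the default whisker factor; the seeded sample and the `summary` dictionary are illustrative choices made here:

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)
data = rng.normal(size=200)

# boxplot_stats returns a list with one dict per dataset.
stats = cbook.boxplot_stats(data, whis=1.5)[0]

# Pick out the five values asked about in the question.
summary = {key: stats[key] for key in ('whislo', 'q1', 'med', 'q3', 'whishi')}

# Points beyond the whiskers are returned under the 'fliers' key.
n_outliers = len(stats['fliers'])

print(summary)
print("outliers:", n_outliers)
```

Because this is the routine the boxplot itself relies on, the numbers should agree with what get_ydata() returns from a drawn plot with the same `whis` setting.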

score 17 · Answer 2 · answered May 18 '20 at 13:27

I ran into this recently and wrote a function that extracts the boxplot values from the boxplot as a pandas DataFrame.

The function is:

def get_box_plot_data(labels, bp):
    rows_list = []

    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)

    return pd.DataFrame(rows_list)

It is called by passing a list of labels (the ones that you would pass to the boxplot plotting function) and the dictionary returned by the boxplot function itself.

For example:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def get_box_plot_data(labels, bp):
    rows_list = []

    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)

    return pd.DataFrame(rows_list)

data1 = np.random.normal(loc = 0, scale = 1, size = 1000)
data2 = np.random.normal(loc = 5, scale = 1, size = 1000)
data3 = np.random.normal(loc = 10, scale = 1, size = 1000)

labels = ['data1', 'data2', 'data3']
# Note: Matplotlib 3.9 renamed the labels= parameter to tick_labels=.
bp = plt.boxplot([data1, data2, data3], labels=labels)
print(get_box_plot_data(labels, bp))
plt.show()

Outputs the following from get_box_plot_data:

   label  lower_whisker  lower_quartile    median  upper_quartile  upper_whisker
0  data1      -2.491652       -0.587869  0.047543        0.696750       2.559301
1  data2       2.351567        4.310068  4.984103        5.665910       7.489808
2  data3       7.227794        9.278931  9.947674       10.661581      12.733275

And produces a plot with three boxes, one each for data1, data2 and data3 (figure not included).

That is really awesome. I was about to write the same function. I would suggest that you add how many points are considered outliers (above the upper whisker cap and below the lower whisker). — O. Mohsen, Jan 04 '21 at 04:08
Thx for this answer. It would be even more useful if your code had seeded the random values it creates with a function such as `np.random.seed(0)`. — Alper, Feb 20 '23 at 20:34
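
Both suggestions in the comments above can be sketched briefly: seeding the generator makes the numbers reproducible, and the outliers can be read from the `'fliers'` entry of the dictionary returned by plt.boxplot(). This is an illustration only; the list name `outlier_counts` and the use of two datasets are choices made here, not part of the answer's function.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # fixed seed so repeated runs give the same values
data1 = np.random.normal(loc=0, scale=1, size=1000)
data2 = np.random.normal(loc=5, scale=1, size=1000)

bp = plt.boxplot([data1, data2])

# bp['fliers'] holds one line per box; its y-data are the outlier values.
outlier_counts = []
for i, fliers in enumerate(bp['fliers']):
    y = fliers.get_ydata()
    lower_whisker = bp['whiskers'][2 * i].get_ydata()[1]
    upper_whisker = bp['whiskers'][2 * i + 1].get_ydata()[1]
    n_low = int((y < lower_whisker).sum())    # points below the lower whisker
    n_high = int((y > upper_whisker).sum())   # points above the upper whisker
    outlier_counts.append((n_low, n_high))

print(outlier_counts)
```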

Jupiter · Answer 3 · 2023-04-28T16:23:26.910

You can compute the values directly from the DataFrame series, for instance to show the median as an annotation in the plot.

For example, assume a dataframe with two series col1 (categorical) and col2 (continuous). We want to boxplot col2 as a function of the values of col1:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

d = {'col1': ['B','A','A','B','B','A'], 'col2':[1,20,30,40,60,70]}
df = pd.DataFrame(data=d)
df['col1']= df['col1'].astype("category")

fig, axes = plt.subplots(figsize=(10, 10),nrows=1, ncols=1, sharey=True)
i='col2'
j='col1'
df.boxplot(ax=axes,column=[i], by=j, grid=True)

for value,cat in enumerate(df[j].cat.categories):
    series=df[df[j]==cat]
    median=series[i].describe()['50%']
    median=np.round(median,1)
    axes.annotate(median,(value+1+0.25,median),fontsize=24, color='blue')

plt.show()
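
The same per-category approach extends from the median to the quartiles. A minimal sketch using the same toy DataFrame, assuming the default linear interpolation of pandas (which matches the default of np.percentile); the column names assigned to the result are illustrative. Whisker ends are not covered here and would still need the IQR rule from the question.

```python
import pandas as pd

d = {'col1': ['B', 'A', 'A', 'B', 'B', 'A'], 'col2': [1, 20, 30, 40, 60, 70]}
df = pd.DataFrame(data=d)

# Quartiles of col2 for each value of col1, one row per category.
quartiles = df.groupby('col1')['col2'].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ['lower_quartile', 'median', 'upper_quartile']

print(quartiles)
```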

score -1 · Answer 4 · answered Jan 06 '22 at 20:13

upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()

equal to

upper_whisker = data.max()
lower_whisker = data.min()

if you just want to get the real data points in the dataset. But statistically speaking, the whisker values are upper_quantile+1.5IQR and lower_quantile-1.5IQR

Mandy Yan

Neither your two expressions for upper_whisker nor your two for lower_whisker are equal. In addition, the upper and lower whisker values, i.e. the values the respective whiskers extend to, are not UQ+1.5×IQR and LQ-1.5×IQR. These are outlier limits. The values the whiskers extend to are the maximum data value not above the upper limit and the minimum data value not below the lower limit (your 1st set of equations). Furthermore, the question is about getting the values used in a boxplot, and the outlier limits can be based on something other than 1.5×IQR using the `whis=` option. – Alper Feb 20 '23 at 20:26
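
The effect of `whis=` described in this comment can be illustrated with `matplotlib.cbook.boxplot_stats`. This is a sketch on a hypothetical sample with one far-away point; the variable names are illustrative, and passing a pair of percentiles as `whis` assumes a reasonably recent Matplotlib.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical sample: nine evenly spaced values plus one far-away point.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

default = cbook.boxplot_stats(data)[0]               # whis=1.5
wide = cbook.boxplot_stats(data, whis=5.0)[0]        # wider outlier limits
full = cbook.boxplot_stats(data, whis=(0, 100))[0]   # limits given as percentiles

for name, s in [('default', default), ('wide', wide), ('full', full)]:
    print(name, s['whislo'], s['whishi'], list(s['fliers']))
```

With the default factor the upper whisker stops at 9 and 30 is reported as an outlier, so the whisker only coincides with data.max() when the limits are wide enough to contain every point.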
