How is a boxen plot different from a box plot?

Question

I want to know why the seaborn library has a boxen plot when it already has a box plot.
I know that the boxen plot is an optimised way to represent data, especially for large datasets, but I don't know why. Other than that, I don't have any good reason to use a boxen plot.

Which functions are you talking about specifically? Remember that most plotting libraries such as `seaborn` operate on top of `matplotlib` and use `matplotlib` functions internally. — timgeb, Sep 19 '18 at 10:13
I am talking about the seaborn catplot function: `sns.catplot(x="day", y="total_bill", hue="sex", kind="violin", data=tips)` — Kakarot, Sep 19 '18 at 10:15
If you don't have good reasons to use it don't use it. Problem solved. — user2699, Sep 19 '18 at 12:01
[As per the documentation](https://seaborn.pydata.org/generated/seaborn.boxenplot.html#seaborn.boxenplot), the use of boxen plot has been extensively discussed in [the paper introducing the concept](https://vita.had.co.nz/papers/letter-value-plot.html) — Diziet Asahi, Sep 19 '18 at 12:04
The plot is also called a "letter-value" plot according to the original paper, FYI. — wordsforthewise, Jan 26 '21 at 05:15

wordsforthewise · Answer 1 · 2021-07-08T23:08:49.973

The box plot shows the median (50th percentile) as the centerline and the 25th and 75th percentiles (Q1 and Q3) as the box boundaries. The IQR method is then used to calculate outlier boundaries: with IQR = Q3 - Q1, the upper boundary is Q3 + 1.5 * IQR and the lower boundary is Q1 - 1.5 * IQR. Q3 is the 3rd quartile, or 75th percentile, of the data (75% of the data is below this value). The whiskers extend to the most extreme points inside these boundaries, and points beyond them are shown individually as outliers.
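The IQR rule described above can be sketched in a few lines of NumPy. This is only an illustration: `boxplot_stats` is a hypothetical helper name, and the quartiles use NumPy's default linear interpolation, which may differ slightly from what a given plotting library computes.

```python
import numpy as np


def boxplot_stats(values, whis=1.5):
    """Return the quartiles, IQR fences, and outliers of a 1-D sample.

    Hypothetical helper for illustration; not part of seaborn or matplotlib.
    """
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr  # points below this are outliers
    upper_fence = q3 + whis * iqr  # points above this are outliers
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats["upper_fence"], stats["outliers"])
```

In this small sample only the value 50 lies beyond the upper fence, so it is the single point a box plot would draw separately.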

Boxenplots (actually called letter-value plots in the original paper and in the lvplot R package) show the distribution differently and are better suited to bigger datasets. With large samples, classic boxplots can flag too many points as outliers, and they show little about the shape of the distribution beyond the quartiles. Letter-value plots (boxenplots) start with the median (Q2, 50th percentile) as the centerline. Each successive level outward contains half of the remaining data. So the first two sections out from the centerline together contain 50% of the data. After that, the next two sections together contain 25% of the data. This continues until we reach the outlier level. Each level out is shaded lighter. There are 4 methods for deciding how many levels to draw, and therefore which points are left as outliers (described in the paper and available in seaborn through the `k_depth` parameter). The default is to end up with around 5-8 outliers in each tail.
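The halving scheme described above can be sketched as follows. `letter_values` is a hypothetical helper, not seaborn's implementation, and the default depth of `log2(n) - 3` levels is an assumption modelled on the "tukey" rule.

```python
import numpy as np


def letter_values(values, k_depth=None):
    """Return the median and the (tail, lower, upper) bounds of each level.

    Hypothetical helper for illustration; seaborn's own code may differ.
    """
    values = np.asarray(values, dtype=float)
    if k_depth is None:
        # Assumption: a "tukey"-style depth of log2(n) - 3 levels.
        k_depth = max(int(np.log2(values.size)) - 3, 1)
    levels = []
    for i in range(1, k_depth + 1):
        tail = 0.5 ** (i + 1)  # 1/4, 1/8, 1/16, ... of the data in each tail
        lower, upper = np.quantile(values, [tail, 1 - tail])
        levels.append((tail, lower, upper))
    return np.median(values), levels


rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)
median, levels = letter_values(sample)
for tail, lower, upper in levels[:3]:
    inside = np.mean((sample >= lower) & (sample <= upper))
    print(f"tail={tail:.4f}  box=[{lower:.3f}, {upper:.3f}]  share inside={inside:.3f}")
```

The first level is the ordinary quartile box (about 50% of the data), the second extends to the eighths (about 75% in total), and so on. Points beyond the last level are the ones drawn as outliers.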

I illustrated with Diziet's nice figure:

The original paper (from authors including Hadley Wickham) is linked in the seaborn boxenplot docs.

score 23 · Accepted Answer · edited Jul 08 '21 at 23:10


From my understanding of the paper describing the concept of the "boxenplot" (or "letter-value plot", as the authors named it), the goal is to represent the distribution of the data better than a boxplot does, especially when many outliers are present, without having to choose specific parameters. A violinplot, for example, depends on the parameters of its KDE function, which can distort the appearance of the distribution if chosen poorly.

Here is a comparison of the three types of plots:

answered Sep 19 '18 at 15:14 by Diziet Asahi · edited Jul 08 '21 at 23:10 by wordsforthewise

What is the boxen plot showing by replacing outliers with a thin line or arrow? – Kakarot Sep 19 '18 at 18:56
Outliers are still there, plotted as diamonds. The default with seaborn's boxenplot is to show something like 5-8 outliers on each side, I believe. – wordsforthewise Jan 25 '21 at 23:44
