
I have the following data:

x <- data.frame('myvar'=c(10,10,9,9,8,8, runif(100)), 'mygroup' = c(rep('a', 26), rep('b', 80)))

I want to describe the data using a box-and-whiskers plot in ggplot2. I have also included the mean using a stat_summary.

library(ggplot2)
ggplot(x, aes(x=myvar, y=mygroup)) + 
geom_boxplot() +
stat_summary(fun=mean, geom='point', shape=20, color='red', fill='red') 


This is fine, but for some of my graphs the outliers are so large that it's hard to make sense of the overall distribution. In these cases, I have cut the x axis:

ggplot(x, aes(x=myvar, y=mygroup)) + 
geom_boxplot() +
stat_summary(fun=mean, geom='point', shape=20, color='red', fill='red')  +
scale_x_continuous(limits=c(0,5))


Note that the means (and medians?) are now calculated using only the subset of the data that is visible on the graph. Is there a ggplot way to include the outlier observations in the calculation but drop them from the visualisation?

My desired output would be a graph with x limits at c(0,5) and a red dot at 2.48 for group mygroup='a'.
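
The effect described above can be checked numerically without ggplot2. This is a minimal sketch, assuming a fixed seed (`set.seed(1)`, which is not in the original data) and mimicking the axis limits by subsetting; the names `visible`, `full_means`, `cut_means` and so on are illustrative only:

```r
# Reproducible version of the data; set.seed(1) is an added assumption
set.seed(1)
x <- data.frame(
  myvar   = c(10, 10, 9, 9, 8, 8, runif(100)),
  mygroup = c(rep("a", 26), rep("b", 80))
)

# Per-group summaries on the full data
full_means   <- tapply(x$myvar, x$mygroup, mean)
full_medians <- tapply(x$myvar, x$mygroup, median)

# Mimic x limits of c(0, 5): values outside the range are dropped first
visible     <- x[x$myvar >= 0 & x$myvar <= 5, ]
cut_means   <- tapply(visible$myvar, visible$mygroup, mean)
cut_medians <- tapply(visible$myvar, visible$mygroup, median)

# Compare the summaries with and without the out-of-range values
print(rbind(full_means, cut_means))
print(rbind(full_medians, cut_medians))
```

With the six large values in group a, the full-data mean sits above 2, while the mean of the visible subset falls below 1. Group b is unchanged because all of its values lie within the limits.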

jpsmith
Otto Kässi
  • Try `library(ggplot2); library(ggbreak); ggplot(x, aes(x=myvar, y=mygroup)) + geom_boxplot() + stat_summary(fun=mean, geom='point', shape=20, color='red', fill='red') + scale_x_break(c(1.5, 7.5))` – G. Grothendieck Jan 12 '23 at 15:04

1 Answer


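Setting limits in scale_x_continuous removes the points that fall outside those limits before any statistics are computed, so the mean is calculated from the remaining data only. Use coord_cartesian instead to "zoom in" without removing any data: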

ggplot(x, aes(x=myvar, y=mygroup)) + 
  geom_boxplot() +
  stat_summary(fun=mean, geom='point', shape=20, color='red', fill='red')  +
  coord_cartesian(xlim = c(0, 5))
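
If the limits need to stay on the scale itself, a possible alternative is to change how the scale treats out-of-bounds values. By default they are censored (replaced with NA); `scales::oob_keep` leaves them untouched, which should give the same picture as zooming with coord_cartesian. The sketch below is not part of the original answer: `set.seed(1)` and the `layer_data()` check are additions for illustration.

```r
library(ggplot2)

# Reproducible version of the question's data; the seed is an added assumption
set.seed(1)
x <- data.frame(
  myvar   = c(10, 10, 9, 9, 8, 8, runif(100)),
  mygroup = c(rep("a", 26), rep("b", 80))
)

# Keep out-of-range values for the statistics, but only show the 0-5 range
p <- ggplot(x, aes(x = myvar, y = mygroup)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, color = "red") +
  scale_x_continuous(limits = c(0, 5), oob = scales::oob_keep)

# Inspect the values computed by the stat_summary layer (the second layer)
summary_layer <- layer_data(p, 2)
print(summary_layer)
```

Printing `p` draws the plot; the `x` column of `summary_layer` holds the per-group means that the red points are drawn at, so it can be compared against means computed directly from the data.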


jpsmith
  • 1
    Is this generally considered good or bad form? Excluding data from the plot then including in the summary seems off to me, but I'm not confident about it. – Paul Stafford Allen Jan 12 '23 at 14:38
  • 2
    Not generally bad form (unless you use it to misrepresent something, of course). For instance, if I plot simulated infectious disease outbreaks, most of their trajectories would be a few cases over a few months, though due to stochasticity some may be super large and long (i.e., a quick example [here](https://i.stack.imgur.com/InWNl.png)), so for practical purposes it would make more sense to explore the relevant data by "zooming in" but keeping the summary stats the same (example [here](https://i.stack.imgur.com/C1tf3.png)). So kind of a "yes and no" answer here :) – jpsmith Jan 12 '23 at 14:46
  • @PaulStaffordAllen Also, as good practice I usually indicate that there are values out of range. For another example, see [here](https://i.stack.imgur.com/0zYYn.png), where the values drawn on the x axis lie below the plotted range (their actual values are irrelevant): they are so extreme that accommodating them in the figure would make all the points indistinguishable. – jpsmith Jan 12 '23 at 15:45