
When plotting an ellipse with ggplot, is it possible to constrain the ellipse to values that are actually possible?

For example, the following reproducible code and data plot Ele vs. Var for two species. Var is a positive variable and cannot be negative. Nonetheless, negative values are included in the resulting ellipses. Is it possible to bound the ellipses at 0 on the x-axis (using ggplot)?

More specifically, I am picturing a flat edge, with the ellipses truncated at 0 on the x-axis.

library(ggplot2)
set.seed(123)
df <- data.frame(Species = rep(c("BHS", "MTG"), each = 100),
                 Ele = c(sample(1500:3000, 100), sample(2500:3500, 100)),
                 Var = abs(rnorm(200)))

ggplot(df, aes(Var, Ele, color = Species)) +
  geom_point() +
  stat_ellipse(aes(fill = Species), geom="polygon",level=0.95,alpha=0.2) 


Mike Wise
B. Davis
  • Given that your data basically goes right up to 0, are you imagining an ellipse with a flat edge? Or shifting the ellipses to the right even though they'll miss points near 0? Easiest might be to use a log-transformed x axis and ellipses on the log scale... I guess the real question is "what purpose are the ellipses serving"? – Gregor Thomas Aug 18 '17 at 16:54
  • I am picturing a flat edge so that the ellipsoids are truncated at 0 – B. Davis Aug 18 '17 at 16:56
  • I am using the ellipses to characterize the distribution of the observed values of `Ele` and `Var` for each species. I think the above plot does that well, but it would be better (IMO) if the ellipses contained only possible combinations of each variable. – B. Davis Aug 18 '17 at 17:10
  • I think @Gregor is right about needing to transform. These ellipses **don't** show the distributions particularly well. If your real data resemble a folded-normal like your simulation, would a log-transform or a sqrt-transform be theoretically reasonable? The y-variable you simulated is uniformly distributed, so the ellipses imply something else there too. Ellipses are really only appropriate for multivariate *t* or normal distributions. – Brian Aug 18 '17 at 18:21

2 Answers


You could modify the default stat to clip points at a particular value. Here we change the basic stat so that x values less than 0 are set to 0:

StatClipEllipse <- ggproto("StatClipEllipse", Stat,
    required_aes = c("x", "y"),
    compute_group = function(data, scales, type = "t", level = 0.95,
       segments = 51, na.rm = FALSE) {
           # calculate_ellipse() is an unexported ggplot2 internal, hence `:::`
           xx <- ggplot2:::calculate_ellipse(data = data, vars = c("x", "y"), type = type,
               level = level, segments = segments)
           # clamp negative x values to 0 (base R, so dplyr does not need to be loaded)
           xx$x <- pmax(xx$x, 0)
           xx
      }
)

Then we have to wrap it in a ggplot stat function that is identical to stat_ellipse except that it uses our custom Stat object:

stat_clip_ellipse <- function(mapping = NULL, data = NULL,
                         geom = "path", position = "identity",
                         ...,
                         type = "t",
                         level = 0.95,
                         segments = 51,
                         na.rm = FALSE,
                         show.legend = NA,
                         inherit.aes = TRUE) {
  layer(
    data = data,
    mapping = mapping,
    stat = StatClipEllipse,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      type = type,
      level = level,
      segments = segments,
      na.rm = na.rm,
      ...
    )
  )
}

Then you can use it to make your plot:

ggplot(df, aes(Var, Ele, color = Species)) +
  geom_point() +
  stat_clip_ellipse(aes(fill = Species), geom="polygon",level=0.95,alpha=0.2) 


This was inspired by the source code for stat_ellipse.
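To see what the clamping step does without going through ggplot2 internals, here is a minimal base-R sketch. The helper `ellipse_points()` is hypothetical (it is not part of ggplot2), and it draws a normal-theory ellipse from the sample mean and covariance, whereas stat_ellipse defaults to a multivariate t, so the outline will differ slightly from the plot above.

```r
# Hypothetical helper (not a ggplot2 function): vertices of a normal-theory
# confidence ellipse for a given centre and 2x2 covariance matrix.
ellipse_points <- function(center, cov_mat, level = 0.95, segments = 51) {
  radius <- sqrt(qchisq(level, df = 2))
  angles <- seq(0, 2 * pi, length.out = segments + 1)
  unit_circle <- cbind(cos(angles), sin(angles))
  # Multiplying the unit-circle rows by chol(cov_mat) gives a curve whose
  # shape matches the covariance; sweep() then shifts it to the centre.
  pts <- sweep(radius * unit_circle %*% chol(cov_mat), 2, center, "+")
  data.frame(x = pts[, 1], y = pts[, 2])
}

set.seed(123)
var_x <- abs(rnorm(100))          # positive, folded-normal variable
ele_y <- sample(1500:3000, 100)   # elevation-like variable

ell <- ellipse_points(c(mean(var_x), mean(ele_y)), cov(cbind(var_x, ele_y)))

# Same clamping as in StatClipEllipse: negative x values are set to 0
clipped <- transform(ell, x = pmax(x, 0))

range(ell$x)      # unclipped outline extends below 0
range(clipped$x)  # clipped outline starts at 0
```

Because only the existing vertices are moved, the flat edge runs between the outermost clamped vertices rather than the exact points where the ellipse crosses x = 0; increasing `segments` makes that difference smaller.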

MrFlick

Based on my comment above, I created a less-misleading option for visualization. This is ignoring the problem with y being uniformly distributed, since that's a somewhat less egregious problem than the heavily skewed x variable.

Option 2 uses power_trans() from the ggforce package, which is an extension of ggplot2 (Option 1 only needs ggplot2 itself). Just in case, I've also included the source for that function, commented out.

library(ggforce)
library(scales)


# power_trans <- function (n) 
# {
#     scales::trans_new(name = paste0("power of ", MASS::fractions(n)), transform = function(x) {
#         x^n
#     }, inverse = function(x) {
#         x^(1/n)
#     }, breaks = scales::extended_breaks(), format = scales::format_format(), 
#         domain = c(0, Inf))
# }

Option 1:

ggplot(df, aes(Var, Ele, color = Species)) +
  geom_point() + 
  stat_ellipse(aes(fill = Species), geom="polygon",level=0.95,alpha=0.2) +
  scale_x_sqrt(limits = c(-0.1,3.5), 
               breaks = c(0.0001,1:4), 
               labels = 0:4,
               expand = c(0.00,0))


This option stretches the x-axis along a square-root transform, spreading out the points clustered near zero. Then it computes an ellipse over this new space.

  • Advantage: looks like an ellipse still.
  • Disadvantage: in order to get it to play nice and label the Var=0 point on the x axis, you have to use expand = c(0,0), which clips the limits exactly, and so requires a bit more fiddling with manual limits/breaks/labels, including choosing a very small value (0.0001) to be represented as 0.
  • Disadvantage: the x values aren't linearly distributed along the axis, which requires a bit more cognitive load when reading the figure.
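As a rough check on whether the square-root scale is a sensible choice, one could compare the sample skewness of the variable before and after the transform. This is a minimal sketch on a folded-normal sample like the question's Var (not the exact same draws); `sample_skewness()` is a hypothetical helper written for this illustration, not a function from any package used above.

```r
# Hypothetical helper: moment-based sample skewness
# (close to 0 for roughly symmetric data, positive for a long right tail).
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(123)
var_raw <- abs(rnorm(200))   # folded-normal sample, as in the question's simulation

skew_raw  <- sample_skewness(var_raw)
skew_sqrt <- sample_skewness(sqrt(var_raw))

# Named vector for a side-by-side comparison
c(raw = skew_raw, sqrt = skew_sqrt)
```

If the square-root value is much closer to 0 than the raw one, an ellipse computed on the transformed scale is a less misleading summary than one computed on the raw scale.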

Option 2:

ggplot(df, aes(sqrt(Var), Ele, color = Species)) +
  geom_point() + 
  stat_ellipse() +
  coord_trans(x = ggforce::power_trans(2)) + 
  scale_x_continuous(breaks = sqrt(0:4), labels = 0:4,
                     name = "Var")


This option plots the pre-transformed sqrt(Var) (notice the aes(...)). It then calculates the ellipses based on this new, approximately normal value. Finally, it stretches the x-axis so that the values of Var are once again linearly spaced, which distorts the ellipse by the same transformation.

  • Advantage: looks cool.
  • Advantage: values of Var are easy to interpret on the x-axis.
  • Advantage: you can see the density near Var=0 with the points and the wide flat end of the "egg" easily.
  • Advantage: the pointy end shows you how low the density is at those values.
  • Disadvantage: looks unfamiliar and requires explanation and additional cognitive load to interpret.
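The "egg" shape in Option 2 can be reproduced by hand, which may make it easier to explain to readers. The sketch below is base R only and rests on assumptions: it uses a normal-theory ellipse (stat_ellipse defaults to a multivariate t), simulated data like the question's, and illustrative variable names (`root`, `egg`) that do not come from the answer.

```r
set.seed(123)
var_raw <- abs(rnorm(100))        # positive, skewed variable
ele     <- sample(1500:3000, 100)
root    <- sqrt(var_raw)          # work on the square-root scale

# Normal-theory 95% ellipse in (sqrt(Var), Ele) space
angles <- seq(0, 2 * pi, length.out = 52)
radius <- sqrt(qchisq(0.95, df = 2))
circle <- cbind(cos(angles), sin(angles))
pts <- sweep(radius * circle %*% chol(cov(cbind(root, ele))),
             2, c(mean(root), mean(ele)), "+")

# Squaring x maps the outline back to the original units of Var.
# Any vertex with a negative x on the sqrt scale is folded to a
# positive value by the squaring, so the result never goes below 0.
egg <- data.frame(Var = pts[, 1]^2, Ele = pts[, 2])

range(egg$Var)
```

The outline is stretched more on the high-Var side than on the low-Var side, which is why the back-transformed shape has a wide flat end near 0 and a pointy end at large values.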
Brian
  • greatly appreciate the additional answer that specifically addresses non-normality, which is certainly applicable to some of my real variables. – B. Davis Aug 19 '17 at 02:50