I'm trying to understand how ggproto works to write my own geoms.

I wrote geom_myerrorbarh (analogous to geom_errorbarh, but taking only the x, y and xwidth aesthetics). The figure below shows that everything works correctly on a linear scale. With a log10 scale, however, the result differs from geom_errorbarh.

I noticed that with scale_x_log10(), x is first converted to log10(x), and only then are xmin = x - xwidth and xmax = x + xwidth computed (see the setup_data function). But it should be xmin = log10(x - xwidth); xmax = log10(x + xwidth).

How can I solve this problem?

library(grid)
library(ggplot2)
library(patchwork)
theme_set(theme_minimal())
GeomMyerrorbarh <- ggproto("GeomMyerrorbarh", Geom,
                         required_aes = c("x", "y", "xwidth"),
                         draw_key = draw_key_path,
                         setup_data = function(data, params){
                           transform(data, xmin = x - xwidth, xmax = x + xwidth)
                         },
                         draw_group = function(data, panel_scales, coord) {
                           ## Transform the data first
                           coords <- coord$transform(data, panel_scales)

                           ## Construct a grid grob
                           grid::segmentsGrob(
                             x0 = coords$xmin,
                             x1 = coords$xmax,
                             y0 = coords$y,
                             y1 = coords$y,
                             gp = gpar(lwd = coords$size, 
                                       col = coords$colour,
                                       alpha = coords$alpha))
                           
                         })

geom_myerrorbarh <- function(mapping = NULL, data = NULL, stat = "identity",
                           position = "identity", na.rm = FALSE, 
                           show.legend = NA, inherit.aes = TRUE, ...) {
  
  ggplot2::layer(
    geom = GeomMyerrorbarh, mapping = mapping,  
    data = data, stat = stat, position = position, 
    show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

df <- data.frame(x = c(1, 2), 
                 y = c(1, 2),
                 xerr = c(0.1, 0.2))

p1 <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_errorbarh(aes(xmin = x - xerr, xmax = x + xerr), 
                 height=0, size=4, alpha=0.2, color='red') +
  geom_myerrorbarh(aes(xwidth = xerr)) + 
  labs(subtitle = 'Linear scale x')

p2 <- p1 + 
  scale_x_log10() + 
  labs(subtitle = 'Log10 scale x')

# Plot:
# Red transparent region - geom_errorbarh
# Black line - geom_myerrorbarh
p1 | p2

[Plot: p1 (linear x scale) next to p2 (log10 x scale)]

1 Answer

Oh... this is a fun one to dissect.

To pinpoint when various changes take place, I ran debugonce(ggplot_build), to see what goes on underneath the hood when a ggplot object is being printed, and the following steps show up when I print p2:

# earlier steps omitted

  data <- by_layer(function(l, d) l$layer_data(plot$data), 
    layers, data, "computing layer data")
  data <- by_layer(function(l, d) l$setup_layer(d, plot), 
    layers, data, "setting up layer")
  layout <- create_layout(plot$facet, plot$coordinates)
  data <- layout$setup(data, plot$data, plot$plot_env)
  data <- by_layer(function(l, d) l$compute_aesthetics(d, 
    plot), layers, data, "computing aesthetics")
  data <- lapply(data, scales_transform_df, scales = scales)

# later steps omitted

Let's run through the steps sequentially, and print out our data object in the console to see what has happened to it after every step.

Step 1: Computing layer data

[[1]]
  x y xerr
1 1 1  0.1
2 2 2  0.2

[[2]]
  x y xerr
1 1 1  0.1
2 2 2  0.2

[[3]]
  x y xerr
1 1 1  0.1
2 2 2  0.2

Nothing interesting to see here. The inputted data frame df has simply been replicated for each layer of the ggplot object.

Step 2: Setting up layer

# no change from above

Moving on.

Step 3: Creating layout & running layout$setup on data

[[1]]
  x y xerr PANEL
1 1 1  0.1     1
2 2 2  0.2     1

[[2]]
  x y xerr PANEL
1 1 1  0.1     1
2 2 2  0.2     1

[[3]]
  x y xerr PANEL
1 1 1  0.1     1
2 2 2  0.2     1

Panel column added. Irrelevant for our investigation since we aren't messing around with facets (i.e. PANEL = 1 throughout).

Step 4: Computing aesthetics

[[1]]
  x y PANEL group
1 1 1     1    -1
2 2 2     1    -1

[[2]]
  xmin xmax x y PANEL group
1  0.9  1.1 1 1     1    -1
2  1.8  2.2 2 2     1    -1

[[3]]
  xwidth x y PANEL group
1    0.1 1 1     1    -1
2    0.2 2 2     1    -1

Finally, the different data layers are starting to distinguish themselves from one another. For each layer, new columns are added based on its specific aesthetic mappings, and unused columns from the original dataset are stripped away. A group column has also been added at the back.

Step 5: Scales transformation

[[1]]
        x y PANEL group
1 0.00000 1     1    -1
2 0.30103 2     1    -1

[[2]]
        x        xmin       xmax y PANEL group
1 0.00000 -0.04575749 0.04139269 1     1    -1
2 0.30103  0.25527251 0.34242268 2     1    -1

[[3]]
        x xwidth y PANEL group
1 0.00000    0.1 1     1    -1
2 0.30103    0.2 2     1    -1

Here, the 2nd and 3rd data layers have become truly different from one another. In layer 2, the scales transformation log10(.) has been applied directly to x, xmin and xmax, while layer 3 only received the same transformation for x.

There are two issues here. I have a workaround for one issue, but it's useless because the second issue remains.

Issue 1: No transformation on xwidth.

If we dig into scales_transform_df to see how it works, we'll find an exhaustive list of column names that scale_x_log10 will consider when performing transformations. This can be accessed at the surface debugging level with scales$scales[[1]]$aesthetics, and corresponds to ggplot_global$x_aes:

 [1] "x"          "xmin"       "xmax"       "xend"       "xintercept" "xmin_final"
 [7] "xmax_final" "xlower"     "xmiddle"    "xupper"     "x0"  
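
The same list can also be checked outside the debugger. A minimal sketch, assuming only that a scale object stores the aesthetics it applies to in its aesthetics field (the variable name x_aes is just for illustration):

```r
library(ggplot2)

# Aesthetics that an x position scale will transform
x_aes <- scale_x_log10()$aesthetics
print(x_aes)

# "xmin" and "xmax" are covered, the custom "xwidth" is not
c(xmin = "xmin" %in% x_aes, xmax = "xmax" %in% x_aes, xwidth = "xwidth" %in% x_aes)
```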

Okay, we can rename "xwidth" to one of the above, no biggie. Call it xmiddle, for example, and we'll go from xmin = log10(x) - xwidth; xmax = log10(x) + xwidth (OP's original situation) to xmin = log10(x) - log10(xwidth); xmax = log10(x) + log10(xwidth). That's closer, but still not good enough, which brings us to...
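
To put numbers on this, here is a small base-R sketch using the question's values (x = 1, 2 and xwidth = 0.1, 0.2). It only redoes the arithmetic for the three variants by hand; the variable names are illustrative and nothing here calls ggplot2.

```r
x      <- c(1, 2)
xwidth <- c(0.1, 0.2)

# What geom_errorbarh receives: limits computed first, then log10-transformed
wanted_xmin <- log10(x - xwidth)
wanted_xmax <- log10(x + xwidth)

# OP's situation: only x is transformed, the raw xwidth is applied afterwards
op_xmin <- log10(x) - xwidth
op_xmax <- log10(x) + xwidth

# After renaming xwidth to an x aesthetic: both are transformed, then combined
renamed_xmin <- log10(x) - log10(xwidth)  # same as log10(x / xwidth)
renamed_xmax <- log10(x) + log10(xwidth)  # same as log10(x * xwidth)

round(rbind(wanted_xmin, op_xmin, renamed_xmin,
            wanted_xmax, op_xmax, renamed_xmax), 4)
```

Neither of the last two variants reproduces the wanted limits, because the log of a difference is not a difference of logs.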

Issue 2: The data transformation defined in GeomMyerrorbarh's setup_data function happens much, much later.

In my copy of the ggplot_build.ggplot function, the scales transformation happens in line 18, and the calculation for xmin = x - xwidth / xmax = x + xwidth defined in setup_data is called by compute_geom_1() in line 28. If we want the log10(.) transformation applied to the calculated xmin / xmax values, these calculations have to happen before the scales transformation.

Is it worth the trouble to address this within ggplot_build?

I'm leaning towards no, because I think it's not a Geom's core job to perform data transformations.

I'm not familiar with the thinking behind the function's design, but I imagine a change such as bringing up the geom's setup_data (or, equivalently, shoving down scales_transform_df) will be a non-trivial one, potentially breaking other things along the way.

This use case sounds like it can be more easily served with a wrapper function around one or more existing geom_*() functions that accept the final xmin / xmax values, and perform data transformations within the wrapper.
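
One possible shape for such a wrapper, as a rough sketch rather than a polished API: geom_errorbarh_xwidth is a hypothetical helper (not part of ggplot2) that takes the data frame plus column names as strings, computes xmin / xmax on the original data scale, and hands them to geom_errorbarh. It assumes ggplot2 3.4.0 as used in this answer; newer releases may emit deprecation messages for geom_errorbarh.

```r
library(ggplot2)

# Hypothetical helper, not part of ggplot2: the limits are computed on the
# original data scale, so scale_x_log10() later transforms the finished
# xmin / xmax values instead of x alone.
geom_errorbarh_xwidth <- function(data, x, y, xwidth, height = 0, ...) {
  bars <- data.frame(
    y    = data[[y]],
    xmin = data[[x]] - data[[xwidth]],
    xmax = data[[x]] + data[[xwidth]]
  )
  geom_errorbarh(
    data = bars,
    mapping = aes(y = y, xmin = xmin, xmax = xmax),
    inherit.aes = FALSE, height = height, ...
  )
}

df <- data.frame(x = c(1, 2), y = c(1, 2), xerr = c(0.1, 0.2))

p <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_errorbarh_xwidth(df, x = "x", y = "y", xwidth = "xerr") +
  scale_x_log10()
```

Because xmin and xmax exist as ordinary mapped aesthetics before the scales transformation runs, they are handled exactly like the geom_errorbarh layer in the question. The trade-off is that the wrapper needs the data passed in explicitly instead of inheriting it from ggplot().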

Has it occurred for other Geoms?

Somewhat surprisingly (to me at least), yes.

This exact problem shows up in the ggplot2 package's own geom_tile function, as its underlying GeomTile performs the data transformation in setup_data too. Here's a simple illustration to trigger it, using my current version (ggplot2 3.4.0):

library(ggplot2)

df <- data.frame(
  x = rep(c(3, 6, 8, 10, 13), 2),
  y = rep(c(1, 2), each = 5),
  z = factor(rep(1:5, each = 2)),
  w = rep(diff(c(0, 4, 6, 8, 10, 14)), 2)
)

p1 <- ggplot(df, aes(fill = z)) +
  geom_rect(aes(xmin = x - w/2, xmax = x + w/2,
                ymin = y - 0.5, ymax = y + 0.5), 
            colour = "grey50", alpha = 0.5, linewidth = 1) +
  geom_tile(aes(x = x, y = y, width = w, height = 1),
            colour = "grey50", alpha = 0.5, linewidth = 1) +
  ggtitle("linear scales") +
  theme_void() +
  theme(legend.position = "none")

p2 <- p1 + scale_x_log10() + ggtitle("transformed x scale")
p3 <- p1 + scale_y_log10() + ggtitle("transformed y scale")
p4 <- p1 + scale_x_log10() + scale_y_log10() + ggtitle("transformed both scales")

library(patchwork)
(p1 | p2) / (p3 | p4)

The geom_rect layer accepts the aesthetic mappings c(xmin, xmax, ymin, ymax), while geom_tile accepts c(x, y, width, height). They look identical when default linear scales are used, but go out of sync once transformed scales are introduced in either direction. The geom_tile version even overlaps with itself!

4 plots patched together for comparison
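
The overlap can be reproduced with plain arithmetic. The sketch below assumes that GeomTile's setup_data computes xmin = x - width / 2 and xmax = x + width / 2 on the already-transformed x, with width left untransformed, and compares that with the geom_rect limits for the five distinct x values in df (variable names are illustrative):

```r
x <- c(3, 6, 8, 10, 13)
w <- diff(c(0, 4, 6, 8, 10, 14))  # widths 4, 2, 2, 2, 4

# geom_rect: limits are computed on the data scale, then log10-transformed
rect_xmin <- log10(x - w / 2)
rect_xmax <- log10(x + w / 2)

# geom_tile: x is log10-transformed first, the raw width is applied afterwards
tile_xmin <- log10(x) - w / 2
tile_xmax <- log10(x) + w / 2

# Each rectangle ends where the next begins
all.equal(rect_xmax[-5], rect_xmin[-1])

# Each tile ends to the right of the next tile's left edge, i.e. they overlap
tile_xmax[-5] > tile_xmin[-1]
```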

That said, transforming scales while drawing tiles (for a heatmap?) seems like a rather niche use case, and I haven't seen this issue brought up elsewhere before. Perhaps a cautionary note in the help files, warning against using scale transformations, would suffice, unless the community has more pressing arguments for its usefulness.

Z.Lin