I have a function to generate scatter plots from data, where an argument is provided to select which column to use for coloring the points. Here is a simplified version:

library(ggplot2)

plot_gene <- function (df, gene) {
   ggplot(df, aes(x, y)) + 
     geom_point(aes_string(col = gene)) +
     scale_color_gradient()
}

where df is a data.frame with columns x, y, and then a bunch of gene names. This works fine for most gene names; however, some have dashes and these fail:

print(plot_gene(df, "Gapdh")) # great!
print(plot_gene(df, "H2-Aa")) # Error: object "H2" not found

It appears the gene variable is getting parsed ("H2-Aa" becomes H2 - Aa). How can I get around this? Is there a way to indicate that a string should not go through eval in aes_string?

Reproducible Input

If you need some input to play with, this fails like my data:

df <- data.frame(c(1,2), c(2,1), c(1,2), c(2,1))
colnames(df) <- c("x", "y", "Gapdh", "H2-Aa")

For my real data, I am using read.table(..., header=TRUE) and get column names with dashes because the raw data files have them.

merv

1 Answer

Normally R tries very hard to make sure the column names in your data.frame are valid variable names. Non-standard column names (those that are not valid variable names) cause problems with functions that rely on non-standard evaluation. When forced to use such names, you often have to wrap them in backticks. In the normal case

ggplot(df, aes(x, y)) + 
  geom_point(aes(col = H2-Aa)) +
  scale_color_gradient()
# Error in FUN(X[[i]], ...) : object 'H2' not found

would return an error but

ggplot(df, aes(x, y)) + 
  geom_point(aes(col = `H2-Aa`)) +
  scale_color_gradient()

would work.

You can paste in backticks if you really want

geom_point(aes_string(col = paste0("`", gene, "`")))

or you could treat it as a symbol from the get-go and use aes_q instead

geom_point(aes_q(col = as.name(gene)))

The latest release of ggplot2 supports unquoting with !!, rather than using aes_string or aes_q, so you could do

geom_point(aes(col = !!rlang::sym(gene)))
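
For completeness, here is a minimal sketch of the question's plot_gene function rewritten with the !! approach, using the reproducible input from the question. It assumes a ggplot2 version recent enough to support tidy evaluation inside aes() and that the rlang package is installed; the variable p is only there for illustration.

```r
library(ggplot2)

# Reproducible input from the question: one column name contains a dash
df <- data.frame(c(1, 2), c(2, 1), c(1, 2), c(2, 1))
colnames(df) <- c("x", "y", "Gapdh", "H2-Aa")

plot_gene <- function(df, gene) {
  ggplot(df, aes(x, y)) +
    # sym() turns the string into a symbol without parsing it as an
    # expression, so "H2-Aa" is never read as H2 - Aa
    geom_point(aes(col = !!rlang::sym(gene))) +
    scale_color_gradient()
}

p <- plot_gene(df, "H2-Aa")
print(p)
```

Another option that should behave the same way in recent ggplot2 versions is aes(col = .data[[gene]]), which looks the column up by its string name instead of building a symbol.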
MrFlick
  • Even a command like `colnames(test) <- c("x", "y", "z", "H2 - O")` allows setting non-standard column names. Good to know about `as.name`. – MKR Feb 06 '18 at 20:40
  • Sure, you can set them to whatever you want, but when importing data through `read.table`, they are properly renamed (unless you disable that). One should carefully consider whether non-standard column names are really worth all the extra trouble. – MrFlick Feb 06 '18 at 20:42
  • I literally used `read.table` with no other arguments than `header = TRUE`. It did not clean my column names as you suggest. – merv Feb 06 '18 at 20:50
  • Wow. I find that very surprising. Are you sure you didn't set `check.names = FALSE` or rename them some other way? If I run `read.table(text="a-b,c\n1,2", header=T, sep=",")` I see that the `-` is turned into a `.` – MrFlick Feb 06 '18 at 20:52
  • Here's what I did to import my data: `nspCounts <- do.call(cbind, lapply(naiveSpleenFiles, function(x) { read.table(gzfile(x), header=TRUE) }))` where `naiveSpleenFiles` is a vector of the zipped files to import. – merv Feb 06 '18 at 20:57
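
The renaming behaviour discussed in these comments can be checked with a small sketch in base R. It only illustrates the `check.names` argument of `read.table` and the `make.names` function it relies on; it does not attempt to explain why the column names in the question kept their dashes. The names default_df and raw_df are made up for this example.

```r
# Default (check.names = TRUE): headers are passed through make.names()
default_df <- read.table(text = "a-b,c\n1,2", header = TRUE, sep = ",")
names(default_df)  # "a.b" "c"

# check.names = FALSE keeps the header exactly as written in the file
raw_df <- read.table(text = "a-b,c\n1,2", header = TRUE, sep = ",",
                     check.names = FALSE)
names(raw_df)      # "a-b" "c"

# make.names() is the function that produces the syntactically valid name
make.names("H2-Aa")  # "H2.Aa"
```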