I'm trying to build a function that receives a data frame (data), one or two variables to group by (groupby), and the name of a dependent variable (var). The function should then create a plot of the means of var, separated by the group(s) in groupby. As a nice-to-have, it would also run an ANOVA at the end.

I'll start with the end: my problem is how to use (string) values in further manipulations inside a user-defined function.

Unfortunately I have problems parsing groupby, which I couldn't solve after a couple of days of trying. I tried !!!rlang::parse_exprs, strsplit, etc., but with no success. Currently it looks something like this (a simplified version with fewer aesthetics):

grp_comp <- function(data, groupby, var){
  data %>%
    filter(!is.na(var)) %>%
    group_by(!!!rlang::parse_exprs(groupby)) %>%
    summarize(n = n(),
              mean = mean(!!!rlang::parse_expr(var)),
              sd = sd(!!!rlang::parse_expr(var)),
              se = sd / sqrt(n)) -> ddata
  gg <- unlist(rlang::parse_exprs(groupby))
    if(length(as.vector(rlang::parse_exprs(groupby))) == 1){
    g <- ggplot(ddata, aes(x = as.character(gg[1]), 
                            y = mean)) + 
      geom_point()}
  else{ 
    g <- ggplot(ddata, aes(x = as.character(gg[1]), 
                          y = mean, 
                          shape = as.character(gg[2]), 
                          color= as.character(gg[2])),
                group = as.character(gg[2]))}
  form <- unlist(strsplit(groupby, ';', fixed = T)) 
  form <- paste(form, collapse = " + ")
  form <- paste(var, " ~ ", form)
  form
    data%>%
    filter(!is.na(var)) %>%
    aov(formula = form) -> anova
  summary(anova) -> anova
  l <- list(ddata, g, anova)
  l
  }

My problems are:

a. groupby could contain one or two variables, and I can't manage to use groupby as the grouping inside aes() in the ggplots. With x = gg[1] I get: Error: Discrete value supplied to continuous scale. With x = as.factor(gg[1]) or x = as.character(gg[1]) I get a plot in which x carries only the single label "BPL" instead of being grouped by the factor.

b. When I try to use two groupby factors instead of one, things get even worse and the plot is completely empty.

c. I manage to create the right formula for the ANOVA, but when I try to actually calculate it I get: Error: $ operator is invalid for atomic vectors. Any ideas why?

d. Not critical, but any ideas for using the second, optional group as color and shape in aes() when the argument contains two groups, without using the if?

Many many thanks in advance!

Guy

guy

1 Answer

It's not clear how you want to call this function, but you could do something like:

library(tidyverse)

grp_comp <- function(data, groupby, var){
  # Per-group n, mean, sd and standard error of `var` (passed as a bare column name)
  ddata <- data %>%
    filter(!is.na({{var}})) %>%
    group_by(!!!rlang::parse_exprs(groupby)) %>%
    summarize(n = n(),
              mean = mean({{var}}),
              sd = sd({{var}}),
              se = sd / sqrt(n))

  # Grouping variables as a list of symbols, ready to be injected into aes()
  gg <- rlang::parse_exprs(groupby)

  # One grouping variable: plain points; two: the second is mapped to shape, color and group
  g <- if(length(gg) == 1) 
         ggplot(ddata, aes(x = !!gg[[1]], y = mean)) + geom_point()
       else {
         ggplot(ddata, aes(x = !!gg[[1]], y = mean, shape = factor(!!gg[[2]]), 
                           color= !!gg[[2]], group = !!gg[[2]])) + geom_point()
       }

  # Build the formula as a string, e.g. "mpg ~ gear + cyl"
  form <- unlist(strsplit(groupby, ';', fixed = TRUE))
  form <- paste(form, collapse = " + ")
  form <- paste(deparse(substitute(var)), "~", form)

  # aov() needs a formula object, so convert the string with as.formula()
  anova <- data %>%
    filter(!is.na({{var}})) %>%
    aov(formula = as.formula(form)) %>%
    summary()

  list(ddata, g, anova)
}

This allows:

grp_comp(iris, "Species", Sepal.Length)
#> [[1]]
#> # A tibble: 3 x 5
#>   Species        n  mean    sd     se
#>   <fct>      <int> <dbl> <dbl>  <dbl>
#> 1 setosa        50  5.01 0.352 0.0498
#> 2 versicolor    50  5.94 0.516 0.0730
#> 3 virginica     50  6.59 0.636 0.0899
#> 
#> [[2]]
#> 
#> [[3]]
#>              Df Sum Sq Mean Sq F value Pr(>F)    
#> Species       2  63.21  31.606   119.3 <2e-16 ***
#> Residuals   147  38.96   0.265                   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And

grp_comp(mtcars, c("gear", "cyl"), mpg)
#> `summarise()` has grouped output by 'gear'. You can override using the
#> `.groups` argument.
#> [[1]]
#> # A tibble: 8 x 6
#> # Groups:   gear [3]
#>    gear   cyl     n  mean     sd     se
#>   <dbl> <dbl> <int> <dbl>  <dbl>  <dbl>
#> 1     3     4     1  21.5 NA     NA    
#> 2     3     6     2  19.8  2.33   1.65 
#> 3     3     8    12  15.0  2.77   0.801
#> 4     4     4     8  26.9  4.81   1.70 
#> 5     4     6     4  19.8  1.55   0.776
#> 6     5     4     2  28.2  3.11   2.2  
#> 7     5     6     1  19.7 NA     NA    
#> 8     5     8     2  15.4  0.566  0.400
#> 
#> [[2]]
#> 
#> [[3]]
#>             Df Sum Sq Mean Sq F value   Pr(>F)    
#> gear         1  259.7   259.7   24.87 2.63e-05 ***
#> cyl          1  563.4   563.4   53.94 4.32e-08 ***
#> Residuals   29  302.9    10.4                     
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Created on 2022-08-27 with reprex v2.0.2
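
On point (c) of the question: aov() expects a formula object, and the error most likely comes from handing it the character string built with paste(). A minimal base-R sketch, assuming mtcars as stand-in data, that builds the formula with reformulate() rather than pasting a string and converting it:

```r
# Grouping variables and response as plain strings (mtcars is only stand-in data)
groupby <- c("gear", "cyl")
var <- "mpg"

# reformulate() assembles a formula object directly from character vectors
form <- reformulate(groupby, response = var)
print(form)

# The formula object can be passed straight to aov()
fit <- aov(form, data = mtcars)
summary(fit)
```

This takes var as a string rather than a bare column name, so it is an alternative to the deparse(substitute(var)) route above, not a drop-in replacement.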

Allan Cameron
  • Hi! Thank you so much, this works perfectly! I see you're using several different ways to call, use and parse expressions: a. !!!rlang::parse_exprs(groupby) (groupby is a string); b. rlang::parse_exprs(groupby), i.e. without !!!; c. {{var}} (var is not a string); d. !!gg[[1]] (gg is the parsed, unlisted groupby); e. deparse(substitute(var)). Some of these are clear to me, others aren't, mainly why and when you use each. Would it be possible to briefly explain the differences? – guy Aug 28 '22 at 13:18
  • @guy the `deparse(substitute(var))` just converts the name of `var` to a character string to build the formula. The `{{var}}` notation just allows you to use `var` directly as if it were a normal column name. The results of `parse_exprs` are stored as symbols until they need to be injected (unquoted) into a call with the `!!` operator. – Allan Cameron Aug 28 '22 at 15:53
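
The five forms asked about in the comments can be compared side by side. A minimal sketch, assuming mtcars as stand-in data; col_mean and arg_name are hypothetical helper names used only for illustration:

```r
library(dplyr)
library(rlang)

groupby <- c("gear", "cyl")

# parse_exprs() turns each string into an unevaluated expression (here, a symbol)
gg <- parse_exprs(groupby)
class(gg[[1]])  # "name"

# !!! splices the whole list into the call: equivalent to group_by(gear, cyl)
by_both <- mtcars %>% group_by(!!!gg) %>% summarize(n = n(), .groups = "drop")

# !! injects a single expression: equivalent to group_by(gear)
by_gear <- mtcars %>% group_by(!!gg[[1]]) %>% summarize(n = n())

# {{ }} forwards a bare column name received as a function argument
col_mean <- function(data, var) summarize(data, mean = mean({{ var }}))
col_mean(mtcars, mpg)

# deparse(substitute()) recovers the argument's name as a string
arg_name <- function(var) deparse(substitute(var))
arg_name(mpg)  # "mpg"
```

Roughly: strings need parsing first and are then injected with !! or !!!, while bare column names are forwarded with {{ }} and only turned into a string when text is needed, as for the formula.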