
After a lot of trial and error and consultation with previous answers such as "How to detect if bare variable or string", I think I have gotten most of what I need done myself. But I'm eager to understand whether I'm making some bad assumptions or approaching the problem foolishly before I carry my "solution" into production.

Consider the following data:

library(dplyr)
library(purrr)
library(tidyselect)

set.seed(1111)
dat1 <- data.frame(Region = rep(c("r1","r2"), each = 100),
                   State = rep(c("NY","MA","FL","GA"), each = 10),
                   Loc = rep(c("a","b","c","d","e","f","g","h"),each = 5),
                   ID = rep(c(1:10), each = 2),
                   var1 = rnorm(200),
                   var2 = rnorm(200),
                   var3 = rnorm(200),
                   var4 = rnorm(200),
                   var5 = rnorm(200))

I want to write a function that does quite a few things, but I'll start with a minimal reproducible example. I want to get tidied aov results back either for a single case, var1 ~ State, or for a pair of matched lists using map2, with one list containing "outcomes" and the other "predictors". The lists are never identical from use to use, and the variables, unlike in my example, rarely lend themselves to easy solutions like starts_with.

Two specific issues and a generic question.

Issue #1 - I've given up on allowing end users (including me) to pass in bare variable names; it always gets me in trouble later. In accordance with the reference above, is something like my code the fastest, most reliable way to catch them and tell the user? (I put a comment in the code to indicate where I'm talking about.)

Issue #2 - Through basically trial and error, I think I solved my other problem, which is generating some text for use later as a label. I found lots of solutions for when I'm not using the function with map2, but only this one seems to work with map2. It seems so convoluted that I can't believe it's a good choice... (again, comments in the code show where)

Generic question: I've added the recommended tidyselect::all_of because these might be ambiguous lists. Why am I still having to guard against .x and .y being seen as calls, as opposed to just markers for iteration?

MyFunction <- function(data,
                 groupvar,
                 var) {
  # Issue #1: is this the best way to warn/stop the user?
  lst <- as.list(match.call())

  if (is.symbol(lst$groupvar) || is.symbol(lst$var)) {
    stop("Please quote all variables")
  }

  # Issue #2: I want the group label, but without this `if` check
  # it errors with "Error: Can't convert a call to a string"
  # when I run it with purrr::map2
  if (!is.call(groupvar)) {
     grouplabel <- rlang::as_name(rlang::enquo(groupvar))
  }

  data <-
    dplyr::select(
      .data = data,
      var = {{ var }},
      groupvar = {{ groupvar }}
    )

  aov_object <- aov(var ~ groupvar, data = data)
  aov_results <- broom::tidy(aov_object) %>%
    mutate(term = if_else(term != "Residuals", grouplabel, term))
  return(aov_results)
}

# Expected output

MyFunction(data = dat1, groupvar = "State", var = "var1") # works
#> # A tibble: 2 x 6
#>   term         df  sumsq meansq statistic p.value
#>   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
#> 1 State         3   1.75  0.582     0.485   0.693
#> 2 Residuals   196 235.    1.20     NA      NA

MyFunction(data = dat1, groupvar = State, var = var1) # warns appropriately
#> Error in MyFunction(data = dat1, groupvar = State, var = var1): Please quote all variables

# Quick test of `map2`
grouping_vars <- names(dat1[,1:3])
names(grouping_vars) <- names(dat1[,1:3])

outcome_vars <- names(dat1[,5:7])
names(outcome_vars) <- names(dat1[,5:7])

names(outcome_vars) <- paste(outcome_vars, "~", grouping_vars)

# Get multiple results. This is where issue #2 comes in, but this is what I want the output to look like.

map2(.x = outcome_vars,
     .y = grouping_vars,
     .f = ~ MyFunction(dat = dat1,
                 var = tidyselect::all_of(.x),
                 groupvar = tidyselect::all_of(.y)))
#> $`var1 ~ Region`
#> # A tibble: 2 x 6
#>   term         df    sumsq meansq statistic p.value
#>   <chr>     <dbl>    <dbl>  <dbl>     <dbl>   <dbl>
#> 1 Region        1   0.0512 0.0512    0.0427   0.836
#> 2 Residuals   198 237.     1.20     NA       NA    
#> 
#> $`var2 ~ State`
#> # A tibble: 2 x 6
#>   term         df  sumsq meansq statistic p.value
#>   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
#> 1 State         3   5.05  1.68       2.07   0.106
#> 2 Residuals   196 159.    0.814     NA     NA    
#> 
#> $`var3 ~ Loc`
#> # A tibble: 2 x 6
#>   term         df  sumsq meansq statistic p.value
#>   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
#> 1 Loc           7   5.09  0.727     0.772   0.612
#> 2 Residuals   192 181.    0.943    NA      NA
Chuck P

2 Answers


It seems to me that, since you are insistent on passing strings as variable names, it would be simpler and more efficient to build the formula from those names using as.formula rather than renaming columns in the data. This also saves you from having to separately name the grouping variable inside the function.

The following function is shorter and about twice as fast in benchmarking as the original, but the behaviour remains unchanged:

MyFunctionNew <- function(data, groupvar, var) 
{  
  lst <- as.list(match.call())
  if (is.symbol(lst$groupvar) || is.symbol(lst$var)) 
    stop("Please quote all variables")

  broom::tidy(aov(as.formula(paste(var, "~", groupvar)), data = data)) %>%
    mutate(term = if_else(term != "Residuals", groupvar, term))
}

You can see that it still works inside map2:

map2(.x = outcome_vars,
     .y = grouping_vars,
     .f = ~ MyFunctionNew(dat = dat1,
                       var = tidyselect::all_of(.x),
                       groupvar = tidyselect::all_of(.y)))
#> $`var1 ~ Region`
#> # A tibble: 2 x 6
#>   term         df    sumsq meansq statistic p.value
#>   <chr>     <dbl>    <dbl>  <dbl>     <dbl>   <dbl>
#> 1 Region        1   0.0512 0.0512    0.0427   0.836
#> 2 Residuals   198 237.     1.20     NA       NA    
#> 
#> $`var2 ~ State`
#> # A tibble: 2 x 6
#>   term         df  sumsq meansq statistic p.value
#>   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
#> 1 State         3   5.05  1.68       2.07   0.106
#> 2 Residuals   196 159.    0.814     NA     NA    
#> 
#> $`var3 ~ Loc`
#> # A tibble: 2 x 6
#>   term         df  sumsq meansq statistic p.value
#>   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
#> 1 Loc           7   5.09  0.727     0.772   0.612
#> 2 Residuals   192 181.    0.943    NA      NA    
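
As a side note on the as.formula(paste(...)) step used above, base R also provides stats::reformulate(), which builds the same kind of formula from character strings. The following is only a minimal sketch, assuming base R and the built-in iris data; the helper name fit_oneway is hypothetical and not part of the original code:

```r
# Hypothetical helper (not from the original post); base R only.
fit_oneway <- function(data, groupvar, var) {
  # reformulate() builds the formula `var ~ groupvar` from character strings
  f <- reformulate(termlabels = groupvar, response = var)
  aov(f, data = data)
}

# iris stands in for dat1 so the sketch is self-contained
fit <- fit_oneway(iris, groupvar = "Species", var = "Sepal.Length")
all.vars(formula(fit))  # variable names recovered from the fitted formula
```

Whether this reads better than paste() plus as.formula() is largely a matter of taste; the two build the same model here.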

In terms of screening variables to ensure they are character strings, I don't think this is idiomatic R usage, and could cause some confusion to casual users of your function. In other words, it violates the principle of least astonishment.

For example, as a naive user, I would expect to be able to specify the grouping variable programmatically like this:

MyVar <- "State"
MyFunction(data = dat1, groupvar = MyVar, var = "var1")

However, I get an error telling me that all variables should be quoted.

This also means that your function won't work within base R loops and *apply functions:

lapply(c("State", "Region", "ID"), function(x) MyFunction(dat1, x, "var1"))
#> Error in MyFunction(dat1, x, "var1") : Please quote all variables 
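
The reason for both errors is that match.call() records the expression the caller typed, not the value it evaluates to. A small sketch in base R makes this visible; the function name show_captured is hypothetical and exists only for this illustration:

```r
# Hypothetical function that only reports what match.call() recorded
show_captured <- function(groupvar) {
  captured <- as.list(match.call())$groupvar
  c(is_string = is.character(captured),
    is_symbol = is.symbol(captured),
    is_call   = is.call(captured))
}

show_captured("State")           # a literal string is stored as a string
my_var <- "State"
show_captured(my_var)            # a variable holding a string is stored as a symbol
show_captured(toupper("state"))  # any function call is stored as a call
```

So a check based on is.symbol() cannot tell an unquoted column name apart from an ordinary variable that holds a perfectly valid string.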

I think this is far more confusing and limiting than just allowing an error to be thrown when an unquoted column name is used. Therefore, I think your production function should be something like:

MyFunction <- function(data, groupvar, var) 
{  
  broom::tidy(aov(as.formula(paste(var, "~", groupvar)), data = data)) %>%
    mutate(term = if_else(term != "Residuals", groupvar, term))
}

Which performs like this:

MyFunction(data = dat1, groupvar = "State", var = "var1") 
#> # A tibble: 2 x 6
#>   term         df  sumsq meansq statistic p.value
#>   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
#> 1 State         3   1.75  0.582     0.485   0.693
#> 2 Residuals   196 235.    1.20     NA      NA    

MyFunction(data = dat1, groupvar = MyVar, var = "var1")
#> # A tibble: 2 x 6
#>   term         df  sumsq meansq statistic p.value
#>   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
#> 1 State         3   1.75  0.582     0.485   0.693
#> 2 Residuals   196 235.    1.20     NA      NA    

MyFunction(data = dat1, groupvar = State, var = var1) 
#>  Error in paste(var, "~", groupvar) : object 'State' not found 

I think most R users would realise why they were getting this last error, since it is pretty clear. It is also an error that regular R users will have seen many times. If you have less faith in your users, perhaps you could try wrapping the function body in a tryCatch that converts an "object not found" error into a "please use quotes" error.
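
A rough sketch of that tryCatch idea follows, using base R only and the built-in iris data. The name fit_quoted is hypothetical, and matching on the text "not found" is an assumption that holds for R's English error messages but not for translated ones:

```r
# Hypothetical wrapper (not from the original post); base R only.
fit_quoted <- function(data, groupvar, var) {
  tryCatch(
    aov(as.formula(paste(var, "~", groupvar)), data = data),
    error = function(e) {
      # Assumption: the message contains "not found" (English locale)
      if (grepl("not found", conditionMessage(e), fixed = TRUE)) {
        stop("Please quote all variables", call. = FALSE)
      }
      stop(e)  # re-signal any other error unchanged
    }
  )
}

fit <- fit_quoted(iris, groupvar = "Species", var = "Sepal.Length")  # fits as usual

# An unquoted name fails when paste() forces the argument; capture the new message
msg <- tryCatch(
  fit_quoted(iris, groupvar = Species, var = "Sepal.Length"),
  error = function(e) conditionMessage(e)
)
msg
```

One caveat with this sketch: a misspelled but correctly quoted column name also produces an "object not found" error inside aov, so it would be reported as a quoting problem too.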

Ultimately, it may be best to write the function so that it takes naked symbols, but I get the impression you are keen to avoid that and so I won't labour the point here.
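
For completeness, here is one possible shape for a function that accepts either a bare name or a string, sketched with base R's substitute() rather than tidy evaluation. The name fit_flexible is hypothetical, and the sketch is an illustration of the trade-off rather than a recommendation:

```r
# Hypothetical function (not from the original post); base R only.
fit_flexible <- function(data, groupvar, var) {
  # substitute() returns the unevaluated argument:
  # a symbol for Species, a character string for "Species"
  g <- substitute(groupvar)
  v <- substitute(var)
  ok <- function(x) is.symbol(x) || (is.character(x) && length(x) == 1)
  if (!ok(g) || !ok(v)) {
    stop("Supply each variable as a bare name or a single string")
  }
  aov(as.formula(paste(as.character(v), "~", as.character(g))), data = data)
}

fit_bare   <- fit_flexible(iris, groupvar = Species, var = Sepal.Length)
fit_string <- fit_flexible(iris, groupvar = "Species", var = "Sepal.Length")
all.equal(coef(fit_bare), coef(fit_string))
```

The cost is the one discussed above: a call such as fit_flexible(iris, groupvar = MyVar, var = "Sepal.Length") is read as a request for a column literally named MyVar, and arguments written as calls are rejected, so this style does not combine well with loops or map2.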

Allan Cameron
  • Thank you @Allan-cameron. Going through this as my coffee kicks in. To be clear, I'm not "against" a function that accepts bare **or** quoted variables. I've just never found a way to do both well, and if I have to choose I'd go quoted, since I've found "bares" inside `map2` or `pmap` calls especially tricky. – Chuck P Jun 02 '20 at 11:56
  • @ChuckP I think you always have to compromise in some way. I think taking symbols leads to lots of other complexities and compromises, though if done well it feels more "professional". Taking quoted variable names is a perfectly respectable design decision as long as it is handled consistently. – Allan Cameron Jun 02 '20 at 12:08
  • Thanks Allan. Ironically, I always used to spend all sorts of time making sure I could "take" bares because I'm lazy and prefer them, but they **ALWAYS** seem to land me in trouble later. And I have been bitten by so many packages I love that take bare variables in the singular case and then become a nightmare when I want to run them through `purrr` workflows. – Chuck P Jun 02 '20 at 12:11
  • Thank you Allan. I left it open to see if there were any more takers but your answer is certainly elegant and thorough. – Chuck P Jun 07 '20 at 14:42

I have resolved issue #1. With the version below, your function works whether the variable names are quoted or not.

library(rlang)  # provides expr() and eval_tidy() used below

MyFunction <- function(data,
                       groupvar,
                       var) {
  # Issue #1 is this best way to warn/stop user?
  lst <- as.list(match.call())

  if (is.symbol(lst$groupvar)) {
    q <- paste0("groupvar")
    varname <- expr('$'(lst,!!q))
    gval <- eval_tidy(varname)
    groupvarc <- as.character(gval)
  }else{groupvarc <- eval_tidy(lst$groupvar)}

  if (is.symbol(lst$var)) {
    v <- paste0("var")
    varnam <- expr('$'(lst,!!v))
    vval <- eval_tidy(varnam)
    varc <- as.character(vval)
  }else{varc <- eval_tidy(lst$var)}

  grouplabel <- groupvarc[1] 

  data <- dplyr::select(.data = data,
                        var = varc[[1]],
                        groupvar = groupvarc[[1]] )

  aov_object <- aov(var ~ groupvar, data = data)
  aov_results <- broom::tidy(aov_object)  %>%
     mutate(term = if_else(term != "Residuals", grouplabel, term))
  return(aov_results)
}

MyFunction(data = dat1, groupvar = "State", var = "var1") # works

MyFunction(data = dat1, groupvar = State, var = var1) # Also works

For multiple variables, you will need to wrap the call in a function and cycle it through lapply. Moving the repeated block into a helper function would also tidy up the code I duplicated for issue #1. I hope this helps you to move forward.

YBS
  • No sorry issue #2 is by far the most important. The function must work with `purrr` `map` series calls. I'm not against `lapply` but I never use it. Sorry I thought the title and the wording of the question made the requirements for the solution clear. – Chuck P Jun 02 '20 at 11:59
  • I understand. Issue #1 might be a problem for some others. I know I could not find a quick solution in that situation. This solution might help others in that case. – YBS Jun 02 '20 at 14:16