After a lot of trial and error and consultation with previous answers such as How to detect if bare variable or string I think I have gotten most of what I need done myself. But I'm eager to understand if I'm making some bad assumptions or approaching the problem foolishly before I carry my "solution" into production.
Consider the following data:
library(dplyr)
library(purrr)
library(tidyselect)
set.seed(1111)
dat1 <- data.frame(Region = rep(c("r1","r2"), each = 100),
State = rep(c("NY","MA","FL","GA"), each = 10),
Loc = rep(c("a","b","c","d","e","f","g","h"),each = 5),
ID = rep(c(1:10), each = 2),
var1 = rnorm(200),
var2 = rnorm(200),
var3 = rnorm(200),
var4 = rnorm(200),
var5 = rnorm(200))
I want to write a function that does quite a few things but I'll start with a minimum reproducible example. I want to get tidied
aov
results back either for a singular case var1 ~ State
or for a pair of matched lists using map2
with one list containing "outcomes" the other "predictors". They're never identical from use to use and the variables, unlike my example, rarely lend themselves to easy solutions like starts_with
.
Two specific issues and a generic question.
Issue #1 - I've given up on allowing end users (including me) to pass in bare variable names always gets me in trouble later. In accordance with the reference above is something like my code the fastest most reliable way to catch them and tell the user? (I put a comment in the code to indicate where I'm talking about.
Issue #2 - Through basically trail and error I think I solved my other problem which is in generating some text for use later as a label. I found lots of solutions when I'm not using the function with map2
but only this one seems to work with map2. It seems so convoluted I can't believe it's a good choice... (again comments in code to show where)
Generic question: I've added the recommended tidyselect::all_of
because these might be ambiguous lists, why am I still having to guard against the .x
and .y
being seen as calls as opposed to just markers for iteration?
MyFunction <- function(data,
groupvar,
var) {
# Issue #1 is this best way to warn/stop user?
lst <- as.list(match.call())
if (is.symbol(lst$groupvar) || is.symbol(lst$var)) {
stop("Please quote all variables")
}
# Issue #2 I want the group label but if I don't include
# this if logic it errors with " Error: Can't convert a call to a string"
# when I run it with purrr::map2
if (!is.call(groupvar)) {
grouplabel <- rlang::as_name(rlang::enquo(groupvar))
}
data <-
dplyr::select(
.data = data,
var = {{ var }},
groupvar = {{ groupvar }}
)
aov_object <- aov(var ~ groupvar, data = data)
aov_results <- broom::tidy(aov_object) %>%
mutate(term = if_else(term != "Residuals", grouplabel, term))
return(aov_results)
}
# Expected output
MyFunction(data = dat1, groupvar = "State", var = "var1") # works
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 State 3 1.75 0.582 0.485 0.693
#> 2 Residuals 196 235. 1.20 NA NA
MyFunction(data = dat1, groupvar = State, var = var1) # warns appropriately
#> Error in MyFunction(data = dat1, groupvar = State, var = var1): Please quote all variables
# Quick test of `map2`
grouping_vars <- names(dat1[,1:3])
names(grouping_vars) <- names(dat1[,1:3])
outcome_vars <- names(dat1[,5:7])
names(outcome_vars) <- names(dat1[,5:7])
names(outcome_vars) <- paste(outcome_vars, "~", grouping_vars)
# get multiple results this is where issue #2 comes in but this is what I want it to look like.
map2(.x = outcome_vars,
.y = grouping_vars,
.f = ~ MyFunction(dat = dat1,
var = tidyselect::all_of(.x),
groupvar = tidyselect::all_of(.y)))
#> $`var1 ~ Region`
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Region 1 0.0512 0.0512 0.0427 0.836
#> 2 Residuals 198 237. 1.20 NA NA
#>
#> $`var2 ~ State`
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 State 3 5.05 1.68 2.07 0.106
#> 2 Residuals 196 159. 0.814 NA NA
#>
#> $`var3 ~ Loc`
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Loc 7 5.09 0.727 0.772 0.612
#> 2 Residuals 192 181. 0.943 NA NA