0

I am trying to combine flexible modelling functions (using tidyeval) and then mapping over data in a nested dataframe (and attempting to learn tidy evaluation along the way). I am running into the problems of inlining expressions with the captured call (I think). Any suggestion, examples, tips, or best practices for writing wrappers to simplify repetitive modelling tasks and then using them with purrr::map etc?

The example below is based on the section wrapping modelling functions from 20 Evaluation | Advanced R using the mtcars data.

library(rlang)
library(tidyverse)

lm_wrap <- function(data, traits, resp, env = caller_env(), ...) {
  
  traits <- enexpr(traits)
  resp <- enexpr(resp)
  data <- enexpr(data)
  dots <- enexprs(...)

  lm_call <- inject(lm(!!resp ~ !!traits, data = !!data, !!!dots),  env)
  
  return(lm_call)
}

The wrapper function works for single cases

lm_wrap(traits = hp, data = mtcars, resp = mpg)

#Call:
#lm(formula = mpg ~ hp, data = mtcars)

#Coefficients:
#(Intercept)           hp  
# 30.09886     -0.06823

But looks like it runs into the problems of inlining expressions, at least as per this somewhat related example 20 Evaluation | Advanced R

mt_nested <- mtcars %>% group_by(cyl) %>% nest() %>%
  mutate(model = map(data, lm_wrap, resp = mpg, traits = hp))

mt_nested$model[[1]]$call

#lm(formula = mpg ~ hp, data = list(mpg = c(21, 21, 21.4, 18.1, 
#19.2, 17.8, 19.7), disp = c(160, 160, 258, 225, 167.6, 167.6, 
#145), hp = c(110, 110, 110, 105, 123, 123, 175), drat = c(3.9, 
#3.9, 3.08, 2.76, 3.92, 3.92, 3.62), wt = c(2.62, 2.875, 3.215, 
#3.46, 3.44, 3.44, 2.77), qsec = c(16.46, 17.02, 19.44, 20.22, 
#18.3, 18.9, 15.5), vs = c(0, 0, 1, 1, 1, 1, 0), am = c(1, 1, 
#0, 0, 0, 0, 1), gear = c(4, 4, 3, 3, 4, 4, 5), carb = c(4, 4, 
#1, 1, 4, 4, 6)))

Thanks in advance,

M.

Matt
  • 1
  • 1
  • Hello Matt. What are specifically the problems you see? – Ric Nov 08 '22 at 23:43
  • Hi Ric, the issue here is that dataset itself is being unquoted into the call in call (I think I have the termonolgoy right here), which means the captured call is suboptimal, particularly with a large dataset. – Matt Nov 09 '22 at 01:23

3 Answers3

1

The problem is that you are trying to mix different environments. The caller's, where data in the formula might be defined, and your function's, where data has been passed to.

One solution is to create the formula separately in env with expressions injected, then call lm() in the local environment. Also note that enexprs(...) is going to be broken in various unobvious ways. Instead I just passed the dots to lm().

lm_wrap <- function(data, traits, resp, ..., env = caller_env()) {
  traits <- enexpr(traits)
  resp <- enexpr(resp)

  # First create the formula in the right environment.
  # Formulas keep track of the env they've been created in.
  f <- inject(!!resp ~ !!traits,  env)
  
  # Now inject the formula inside a local call
  inject(lm(!!f, data = data, ...))
}

The second round of injection makes sure that the formula itself is recorded in the call rather than the symbol f.

Lionel Henry
  • 6,652
  • 27
  • 33
  • I think this runs into the same problem as my solution, in that the `call` member of the resultant `lm` object retains the local name `data` rather than the symbol of the passed data frame, which I believe is what the OP wants – Allan Cameron Nov 09 '22 at 08:58
  • Thanks Lionel. As Allan mentioned this function will work with map. But if used on single dataset, with the wrong call, subsequent use of something like update would fail ("'data' must be a data.frame" etc). The passing of the dots directly does not appear to work with something like "subset = am == 1", where as via enexprs(...) it does. – Matt Nov 09 '22 at 22:42
  • `enexprs(...)` will be broken as soon as you start wrapping your function in another function that is used in a package namespace and pass dots across levels. This is because it flattens the hierarchy of scopes to the current evaluation environment. – Lionel Henry Nov 10 '22 at 07:32
  • Given the state of R metaprogramming tools I think the set of requirements you're expecting from this function makes it almost impossible to get 100% correct. – Lionel Henry Nov 10 '22 at 07:35
  • Thanks Lionel. Part of this exercise has been to learn what is possible or not. Passing the dots is secondary in reality. The main point was to see if a common function with formula would work via nested datasets and map and with normal non-nested datasets, and maintain the correct dataset call, without resorting to the solutions past/substitute solutions. – Matt Nov 11 '22 at 01:43
0

You can build the call with data being quoted:

library(rlang)
library(tidyverse)

lm_wrap <- function(data, traits, resp, env = caller_env(), ...) {
  
  traits <- enexpr(traits)
  resp <- enexpr(resp)
  dots <- enexprs(...)
  
  formula <- inject(formula(!!resp ~ !!traits,  env = env), env)
  
  do.call("lm", c(formula = formula, data = quote(data), inject(!!dots, env)))
}


mt_nested <- mtcars %>% 
  group_by(cyl) %>% 
  nest() %>%
  mutate(model = map(data, lm_wrap, resp = mpg, traits = hp))

mt_nested$model[[1]]$call
#> lm(formula = mpg ~ hp, data = data)

If you want the call to contain the substituted value of data rather than always saying data, you can do substitute(data) and evaluate the call in the parent frame. You could do this in base R as follows:

lm_wrap <- function(data, traits, resp, ...) {
  
  f <- paste(deparse(substitute(resp)), deparse(substitute(traits)), sep = "~")
  f <- as.formula(f)
  do.call("lm", c(f, substitute(data), ...), envir = parent.frame())
}

Testing this, we get the value of data in the call object being .x[[i]], which is how the data chunk is referred to inside the body of map:

mt_nested <- mtcars %>% 
  group_by(cyl) %>% 
  nest() %>%
  mutate(model = map(data, lm_wrap, resp = mpg, traits = hp))

mt_nested$model[[1]]$call
#> lm(formula = mpg ~ hp, data = .x[[i]])

And if we call the function directly we get the expected mtcars in the call

lm_wrap(mtcars, mpg, hp)$call
#> lm(formula = hp ~ mpg, data = mtcars)

Created on 2022-11-09 with reprex v2.0.2

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • Thanks Allan. This solves the issue for using map. But if I then use this lm_wrap on a single dataset (i.e., lm_wrap(mtcars, disp, mpg)) the call will be lm(formula = mpg ~ disp, data = data) rather than lm(formula = mpg ~ disp, data = mtcars) which messes up downstream options/predictions. In optimising this code, I would like to have both options (normal use and via map) working optimally for a single function. – Matt Nov 09 '22 at 01:23
  • This solution is on the right track (using different evaluation environments for the formula and for `lm()`). However it's very hard to create `do.call()` calls with a mix of injected and quoted data. I posted another solution with simplified formula creation, dots management, and `lm()` call. – Lionel Henry Nov 09 '22 at 07:16
0

I guess you want something like that:

library(rlang)
library(tidyverse)


lm_wrap <- function(data, traits, resp, env = caller_env(), ...) {
  traits <- enexpr(traits)
  resp <- enexpr(resp)
  data <- enexpr(data)
  dots <- enexprs(...)
  lm_call <- inject(lm(!!resp ~ !!traits, data = !!data, !!!dots),  env)
  return(lm_call)
}

mt_nested <- mtcars %>% group_by(cyl) %>% 
  group_modify( ~ tibble(
    data = list(.x), 
    model = list(lm_wrap(mtcars %>% filter(cyl==!!.y$cyl), resp = mpg, traits = hp))))

mt_nested$model[[1]]$call

#> lm(formula = mpg ~ hp, data = mtcars %>% filter(cyl == 4))
Ric
  • 5,362
  • 1
  • 10
  • 23