How does filter() in dplyr evaluate what's inside the () in a customized function?

Question

I am trying to write a function that takes two column names and an upper and/or lower boundary for each column, so that I can subset the data with the column names and boundaries of my choice.

Using mtcars as an example, if I want only the rows that have cyl > 4 and mpg > 15, my function would take two column names, cyl and mpg, and a lower boundary for each, 4 and 15. The function should also let me assign an upper boundary, to keep each column (variable) within a certain range.

So I came up with the function below, which takes two variable names of your choice and upper and/or lower boundaries for each variable.

If I give only an upper or only a lower boundary for a variable, the function should return the rows below or above that boundary; if I give both an upper and a lower boundary, it should return the rows that fall within the range.

comb_function<-function(df,var1,var2,var1_lower=NULL,var1_upper=NULL,var2_upper=NULL,var2_lower=NULL){
   var1<-enexpr(var1)
   var2<-enexpr(var2)
 #####for var2,if upper boundary are given by user,do this#####{
    filter1<-expr(`$`(df,!!var2))<=var2_upper
    #for var1, if upper boundary are given by user,do this# {
      filter2<-expr(`$`(df,!!var1))<=var1_upper}
    #for var 1,if lower boundary are given by user, do this#{
      filter2<-expr(`$`(df,!!var1))>=var1_lower}
    #for var1, if both are given by user, do this#{
      filter2<-expr(`$`(df,!!var1))>=var1_lower&expr(`$`(df,!!var1))<=var1_upper}
  }
  #####for var2,if lower boundary are given by user,do this#####{
    filter1<-expr(`$`(df,!!var2))>=var2_lower 
    #for var1,if upper boundary are given by user,do this#{
      filter2<-expr(`$`(df,!!var1))<=var1_upper}
    #for var1,if lower boundary are given by user,do this#{
      filter2<-expr(`$`(df,!!var1))>=var1_lower}
    #if both are given by the user,do this{
      filter2<-expr(`$`(df,!!var1))>=var1_lower&expr(`$`(df,!!var1))<=var1_upper}
  }
  #####for var2,if both are given by user,do this#####{
    filter1<-expr(`$`(df,!!var2))<=var2_upper&expr(`$`(df,!!var2))>=var2_lower
    #for var1,if upper boundary are given by user,do this#{
      filter2<-expr(`$`(df,!!var1))<=var1_upper}
    #for var1,if lower boundary are given by user,do this#{
      filter2<-expr(`$`(df,!!var1))>=var1_lower}
    #if both are given by user, do this#{
      filter2<-expr(`$`(df,!!var1))>=var1_lower&expr(`$`(df,!!var1))<=var1_upper}
  }
   output<-df%>%filter(filter1,filter2)%>%summarise(count=n(),avgcyl=mean(cyl,na.rm=TRUE))
    return(output)
}

When I call this function using mtcars as an example

final1<-comb_function(df=mtcars,var1=mpg,var2=cyl,var1_lower =15,var2_lower=4,var2_upper=6)

I got a count of 0 and NaN for avgcyl in final1. So when filter() evaluates what's inside the (), it only gets FALSE, never TRUE, which I think is why no rows are returned.

I have a theory for why this is happening. If I do this:

x<-expr(cyl);eval(expr(expr(`$`(mtcars,!!x))<=6))

It returns:

[1] FALSE

which is clearly not what I expected. If I do this:

eval(expr(`$`(mtcars,!!x)))<=6

It returns

[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
[23] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

which is what I want for the filter() call inside my function. So I am guessing that when filter() evaluates what's inside the (), it automatically puts parentheses around the whole expression, just like

eval(expr(expr(`$`(mtcars,!!x))<=6))

did, which gives back only a single FALSE. If this really is the reason, how do I let filter() know that what I want is for it to evaluate like this:

eval(filter1<-expr(`$`(df,!!var2)))<=var2_upper

not this:

eval(filter1<-expr(`$`(df,!!var2))<=var2_upper)

If my guess is not what's going on, please help me understand that as well.

I would recommend simplifying your question as much as possible. The first snippet of code in particular is very complex and hard to read. — Lionel Henry, Nov 09 '19 at 10:37

score 3 · Answer 1 · answered Nov 09 '19 at 11:09

In general, I would highly recommend staying away from all this quoting and evaluating. The tidy eval framework provides alternative tools that are much easier to work with.

Using mtcars as an example, if I want to subset the data by saying I only want rows that has cyl > 4 and mpg > 15

A typical wrapper function would look like this:

library(dplyr)

filter2 <- function(data, var1, var2, lower1, lower2) {
  data %>%
    filter(
      {{ var1 }} > .env$lower1,
      {{ var2 }} > .env$lower2
    )
}

With the {{ operator, we're interpolating the input expressions inside the data context. This means you can supply R code that refers to column names directly.
With .env$, we are asking for the lower variables inside the function environment. This means that if the data frame contains columns lower1 and lower2, these won't interfere. Another way of forcing evaluation in the environment is to use !!.

mtcars %>% filter2(cyl, mpg, 4, 15) %>% head()
#>   mpg cyl disp  hp drat  wt qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.6   16  0  1    4    4
#> 2  21   6  160 110  3.9 2.9   17  0  1    4    4
#> 3  21   6  258 110  3.1 3.2   19  1  0    3    1
#> 4  19   8  360 175  3.1 3.4   17  0  0    3    2
#> 5  18   6  225 105  2.8 3.5   20  1  0    3    1
#> 6  19   6  168 123  3.9 3.4   18  1  0    4    4
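The `!!` alternative mentioned above could look roughly like the sketch below. The name filter2_inject and the toy data frame are made up for illustration and are not part of the original answer. The data frame deliberately has a column called lower1 to show that the function argument, not the column, is used in the comparison.

```r
library(dplyr)

# Hypothetical variant of filter2() that injects the bounds with !!
# instead of looking them up through .env$
filter2_inject <- function(data, var1, var2, lower1, lower2) {
  data %>%
    filter(
      {{ var1 }} > !!lower1,
      {{ var2 }} > !!lower2
    )
}

# Toy data (an assumption for this example): the column `lower1`
# has the same name as one of the function arguments
df <- data.frame(x = c(1, 5, 10), y = c(1, 5, 10), lower1 = c(100, 100, 100))

# Keeps rows where x > 2 and y > 6; the `lower1` column is ignored
res <- filter2_inject(df, x, y, 2, 6)
res
```

With `!!`, the value of the argument is inserted into the expression before filter() evaluates it, so there is nothing left to look up in the data.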

================================

The rest of this answer tries to unpack some of the puzzles you brought up. This might be useful to get a better sense of the evaluation model in R, but again you're better off finding simpler approaches to solving your issues.

Let's take:

x<-expr(cyl);eval(expr(expr(`$`(mtcars,!!x))<=6))
#> [1] FALSE

Reformatting a bit:

x <- expr(cyl)
eval(expr(expr(`$`(mtcars,!!x)) <= 6))

Removing the unnecessary complexity:

eval(expr(expr(mtcars$cyl) <= 6))

Let's look at the intermediate result:

expr(expr(mtcars$cyl) <= 6)
#> expr(mtcars$cyl) <= 6

The outer expr() returns an expression instructing R to:

Create a new expression (with the inner expr())
Compare that expression to 6
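One way to see these two instructions is to take the outer expression apart. This is only an exploratory sketch; the variable names outer_expr, inner and inner_value are made up for the example.

```r
library(rlang)

outer_expr <- expr(expr(mtcars$cyl) <= 6)

# A call behaves like a list: function first, then the arguments
outer_expr[[1]]           # the comparison function, `<=`
inner <- outer_expr[[2]]  # the still unevaluated call expr(mtcars$cyl)
outer_expr[[3]]           # the number 6

# Running the inner expr() creates an expression (a call object),
# not the numeric column
inner_value <- eval(inner)
is.call(inner_value)
```

So by the time `<=` runs, its left-hand side is a piece of code rather than the values of mtcars$cyl.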

Unfortunately, R expressions are comparable even though it doesn't make any sense. In an ideal world this would be an error:

quote(foo) < 10
#> [1] FALSE

Probably what you'd like to do is to compute the column subsetting described in the expression first, and then compare with <=:

eval(expr(mtcars$cyl)) <= 6
#>  [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
#> [11]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#> [21]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
#> [31] FALSE  TRUE
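Another possibility, shown here only as a sketch, is to build the whole comparison as a single expression and evaluate it once, with the data frame as the evaluation context. The names comparison and keep are made up for the example, and the parentheses around `!!x` are only there to make the grouping explicit.

```r
library(rlang)

x <- expr(cyl)

# Inject the column name into the comparison: this builds (cyl) <= 6
comparison <- expr((!!x) <= 6)

# Evaluate the complete comparison, looking up `cyl` inside mtcars
keep <- eval(comparison, mtcars)
head(keep)
```

Here there is no nested expr(), so nothing is left defused when the comparison runs.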

Another note. You write:

eval(filter1<-expr(`$`(df,!!var2)))

Reformatting and simplifying:

eval(filter1 <- expr(mtcars$cyl))

Here's what's happening when you evaluate this:

eval() asks R to return its first argument, so it can evaluate it.
R sees that the argument to eval() is a <- call. It then starts to evaluate it.
The RHS is a defused expression describing how to subset mtcars. This RHS is assigned to the LHS filter1.
<- returns the RHS, invisibly. This is what eval() gets as argument.
eval() proceeds to compute the mtcars subsetting.
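The steps above can be checked at the console. This is a small sketch; the name result is made up for the example.

```r
library(rlang)

# The assignment runs first; eval() then evaluates the value that <- returned
result <- eval(filter1 <- expr(mtcars$cyl))

# filter1 still holds the defused expression, not the column
is.call(filter1)

# eval() computed the subsetting described by that expression
identical(result, mtcars$cyl)
```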

Thanks!They are very useful. I also have a followup question for you which I have to post by giving an answer to my own question. I @ your username in the answer so you know that was a response to your answer. Thanks again!! — xiahfyj, Nov 10 '19 at 23:08

score 1 · Answer 2 · answered Nov 11 '19 at 09:22

In https://stackoverflow.com/a/58793418/1725177 xiahfyj asked how to compute the filters in a separate step from filter(). In general, separate computations can be performed with transmute(). This function takes inputs and returns a data frame containing one column per input. The inputs are computed within the data frame, and within groups if there are any.

filter3 <- function(data, var1, var2, lower1, lower2) {
  filters <- data %>% transmute(
    filter_a = {{ var1 }} > .env$lower1,
    filter_b = {{ var2 }} > .env$lower2
  )

  data %>%
    filter(!!!unname(filters))
}

The data frame of evaluated filter columns can then be spliced into filter(). The force-splicing operator !!! transforms its argument into multiple inputs in the surrounding call (here, a call to filter()).

In the case of filter(), the data frame of inputs must be unnamed because there's a special check in filter() that throws an error for named inputs, in order to catch a common typo where the user writes a = foo instead of a == foo.

We are planning to support data frame inputs in the next major version of dplyr, and auto-splice them. In that case the last step will become as simple as:

  data %>%
    filter(filters)
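For the optional lower and upper bounds in the original question, a simpler route may be to skip the precomputed filters and apply one filter() call per bound that was actually supplied. The sketch below is not from the original answers: filter_range() is a hypothetical helper, and it assumes that a bound left as NULL means "no limit on that side".

```r
library(dplyr)

# Hypothetical helper: apply only the bounds that were supplied
filter_range <- function(data, var, lower = NULL, upper = NULL) {
  if (!is.null(lower)) {
    data <- data %>% filter({{ var }} >= .env$lower)
  }
  if (!is.null(upper)) {
    data <- data %>% filter({{ var }} <= .env$upper)
  }
  data
}

# Same bounds as the call in the question: mpg >= 15 and 4 <= cyl <= 6
res <- mtcars %>%
  filter_range(mpg, lower = 15) %>%
  filter_range(cyl, lower = 4, upper = 6)

res %>% summarise(count = n(), avgcyl = mean(cyl, na.rm = TRUE))
```

Because each variable is handled by its own call, the nested if statements for every combination of bounds are not needed.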

score 0 · Answer 3 · answered Nov 10 '19 at 22:11

@Lionel Henry Thanks! I do have a followup question on your example.

filter2 <- function(data, var1, var2, lower1, lower2) {
  data %>%
    filter(
      {{ var1 }} > .env$lower1,
      {{ var2 }} > .env$lower2
    )
}

What if I want something like the block below? Basically, I want to take what you have inside filter() out of it and store the conditions in variables beforehand. I know the function below doesn't work; I wonder how I should edit it to make it work.

filter2 <- function(data, var1, var2, lower1, lower2) {
filter_a<-{{ var1 }} > .env$lower1
filter_b<-{{ var2 }} > .env$lower2
  data %>%
    filter(filter_a,filter_b)
}

The reason I want this is that, for the purpose of my function, what's inside filter() will be dynamic. For example, I would need something like this:

###if both lower and upper boundary for var1 are given by the user,do below:
   filter_a<-{{ var1 }} > .env$lower1&{{ var1 }} < .env$upper1
###if only upper are given.do below:
   filter_a<-{{ var1 }} < .env$upper1
###if only lower are given, do below:
   filter_a<-{{ var1 }} > .env$lower1

This is also why I had so many if statements in my original long and "hard to read" question.
