reference another column with tidyeval in filter

Question

I have a tibble with a column foo whose values are the names of other columns in the same tibble. I'd like to filter each row based on the column that is named in foo:

mtcars %>%
  mutate(foo = c(rep("carb", 16), rep("gear", 16))) %>%
  filter(!!sym(foo) == 4)
#> Error in is_symbol(x): object 'foo' not found

It seems to be looking for foo in the global environment, so I think I need a way to specify that foo should be evaluated in the context of the tibble.

Desired result would be the same as running:

rbind(
  mtcars[1:16,] %>% mutate(foo = "carb") %>% filter(carb == 4),
  mtcars[17:32,] %>% mutate(foo = "gear") %>% filter(gear == 4)
)
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb  foo
#> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 carb
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 carb
#> 3  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 carb
#> 4  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 carb
#> 5  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 carb
#> 6  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 carb
#> 7  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 carb
#> 8  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 gear
#> 9  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 gear
#> 10 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 gear
#> 11 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 gear
#> 12 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 gear

Will it vary every row, or would it be in chunks? – akrun May 09 '19 at 03:57

akrun · Answer 1 · 2019-05-09T04:32:28.493

score 3

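If foo is a column that holds the same string in every row, e.g. "am", take its first element, convert it to a symbol with sym(), unquote it with !!, and keep the rows where am is 1:

library(dplyr)
library(rlang)
# Caveat: `!!` is resolved in the calling environment before the data frame
# is consulted, so `foo` here refers to an object in the workspace,
# not to the column created by mutate()
mtcars %>%
   mutate(foo = "am") %>%
   filter(!! sym(foo[1]) == 1)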
#     mpg cyl  disp  hp drat    wt  qsec vs am gear carb foo
#1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  am
#2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  am
#3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  am
#4  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1  am
#5  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2  am
#6  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1  am
#7  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1  am
#8  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2  am
#9  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2  am
#10 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4  am
#11 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6  am
#12 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8  am
#13 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2  am

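If the column name varies from row to row, an efficient option is row/column (matrix) indexing:

df1 <- mtcars %>%
           mutate(foo = c(rep("carb", 16), rep("gear", 16)))
# one (row, column) pair per row; the column position is looked up from foo
i1 <- cbind(seq_len(nrow(df1)), match(df1$foo, names(df1)))
# drop the character column foo (the last one) so the extracted values stay numeric
subset(df1, df1[-ncol(df1)][i1] == 4)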
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb  foo
#1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 carb
#2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 carb
#7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 carb
#10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 carb
#11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 carb
#15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 carb
#16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 carb
#18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 gear
#19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 gear
#20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 gear
#26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 gear
#32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 gear
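
To make the indexing step above easier to follow, here is a minimal sketch on a toy data frame. The names `toy`, `pick`, `idx` and `vals` are made up for illustration and are not part of the answer. A two-column matrix of (row, column) positions pulls exactly one cell per row.

```r
# Toy data (hypothetical): `pick` names the column to read in each row
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30),
                  pick = c("a", "b", "a"), stringsAsFactors = FALSE)

# One (row, column) pair per row; the column position comes from `pick`
idx <- cbind(seq_len(nrow(toy)), match(toy$pick, names(toy)))

# Restrict to the numeric columns so the matrix stays numeric,
# then extract one cell per row
vals <- as.matrix(toy[c("a", "b")])[idx]
vals
```

Here `vals` takes `a` from rows 1 and 3 and `b` from row 2. Restricting to the numeric columns first mirrors the `df1[-ncol(df1)]` step: a data frame that mixes numeric and character columns is converted to a character matrix before indexing.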

Another option is get() with rowwise():

df1 %>%
    rowwise() %>%
    filter(get(foo) == 4)

Or use the row/column indexing inside filter():

df1 %>%
    filter(.[cbind(row_number(), match(foo, names(.)))] == 4)

edited May 09 '19 at 04:32

answered May 09 '19 at 03:45

akrun

Closer to the second one, except `foo` needs to be a column, rather than a separate object. I'll make a harder example so this is clearer – lost May 09 '19 at 03:48
`foo` can vary, so `[1]` won't work. See improved example and test. – lost May 09 '19 at 03:54
I like the `get` solution, but `rowwise` has a big efficiency penalty. I know `eval_tidy` can be used to evaluate expressions in a given context, so I think it could work here but I can't get the syntax right. – lost May 09 '19 at 04:08
@lost it also works with `rowwise` `df1 %>% rowwise %>% filter(eval_tidy(sym(foo)) == 4)` – akrun May 09 '19 at 04:25
yeah, this works, but takes a while to run on my actual tibble. Is there an `eval_tidy` solution that doesn't require `rowwise`? – lost May 09 '19 at 04:28
@lost I doubt so, because it requires a row index as well, otherwise, it is not going to understand which row the value should be selected – akrun May 09 '19 at 04:28
@lost if you need a quicker option, try `df1 %>% filter(.[cbind(row_number(), match(foo, names(.)))] == 4)` – akrun May 09 '19 at 04:31
Yes, the point @akrun made about row ambiguity explains why the syntax `filter(!!sym(foo) == 4)` just wouldn't work. For each of the 32 rows in mtcars, you'd be telling dplyr to evaluate whether the 32 entries in carb or gear equal 4. – bschneidr May 09 '19 at 04:35
@bschneidr so are `tidyeval` functions not vectorized like other functions, then? c.f., if I do `mtcars %>% mutate(am1 = am + 1)` I don't have to tell `dplyr` to do this one row at a time. – lost May 09 '19 at 04:39
The problem isn't that the `tidyeval` function `sym` isn't vectorized. The problem is that, even if `sym` were vectorized, the code `!!sym(foo)` for each value of `foo` would return the entire length-32 vector found by "looking-up" the object in the data frame whose name is given by `foo`. So the result of vectorizing would be a list of 32 vectors, each of length 32. – bschneidr May 09 '19 at 05:02
I have added an answer that is vectorised. @akrun your first example doesn't work for me. Do you have an object `foo` in your workspace? – Lionel Henry May 09 '19 at 07:35
I think we can state the vectorisation issue as follows: an expression may represent a vectorised operation, but here you have constructed a problem where you need to create as many expressions as there are rows, because the expression depends on row values. – Lionel Henry May 09 '19 at 07:46

Lionel Henry · Answer 2 · 2019-05-09T12:19:55.547

score 2

I would avoid tidy eval here and work with values. First create the vector foo containing the relevant values from carb (rows 1 to 16) and gear (rows 17 to 32), then filter on it:

mtcars %>%
  mutate(foo = c(carb[1:16], gear[17:32])) %>%
  filter(foo == 4)

If the source column of the values varies from row to row:

library(dplyr)
library(purrr)

df <- mtcars[1:5, ]
cols <- c("cyl", "vs", "am", "gear", "carb")

# For row i, pick the value from the column named by cols[i]
assemble_from <- function(data, cols) {
  map2_dbl(seq_along(cols), cols, function(i, c) data[[i, c]])
}

df %>%
  mutate(foo = assemble_from(df, cols)) %>%
  filter(foo %in% 1:3)

# Or more simply
df %>%
  filter(assemble_from(df, cols) %in% 1:3)

edited May 09 '19 at 12:19

answered May 09 '19 at 07:34

Lionel Henry

Good point, I have added an example of how to reconstruct a vector rowwise. – Lionel Henry May 09 '19 at 12:20
Coming back to this now, is there still not a way to tidyeval a column's name? It's inconvenient having to create a bespoke function for this, and it forces me to interrupt a pipe or use `{}` for what should be a pretty simple operation. – lost May 27 '21 at 00:08
I read again the post but I'm a bit confused by the question. With `filter()` we generally compare two columns together using a vectorised predicate that compares all values row by row. Here you have a column of column names. So each row should be compared against a different column? This seems like a pretty complex operation to me. – Lionel Henry May 28 '21 at 07:05
I've added a new answer that hopefully better answers the question. – Lionel Henry May 28 '21 at 07:47

score 1 · Answer 3 · answered May 28 '21 at 07:46

Usually filter() expressions compare two columns with a vectorised predicate that looks at each value row by row. In this case we have a column of column names that determines which column to look at for each row. We can solve this problem using get() and rowwise().

Tidyeval tools are made for columns that are defined externally (e.g. in a function argument). Here the columns are defined inside the data frame. That's unusual because column names defined in a data frame necessarily vary row by row. In any case, since the columns are defined in the data frame, get() seems the best tool to use here.
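
For contrast, here is a minimal sketch of the case tidy eval is designed for, where the column name comes from outside the data frame. The helper `filter_by_col` is hypothetical and only illustrates the idea. It uses the `.data` pronoun from dplyr/rlang to look up a single column by name.

```r
library(dplyr)

# Hypothetical helper, not from the original answer: `col` is one column
# name supplied by the caller, so it is the same for every row.
filter_by_col <- function(data, col, value) {
  data %>% filter(.data[[col]] == value)
}

by_gear <- filter_by_col(mtcars, "gear", 4)
```

Because `col` is a single string, one vectorised comparison covers all rows. That is exactly what is missing when the name changes from row to row.
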
Since we have a row-by-row problem, we need to transform the data frame to a rowwise data frame. This way expressions inside mutate(), filter(), etc. are evaluated once per row. Note that rowwise() patterns have a performance cost because they are not vectorised, so they should be limited to a small part of the code.

mtcars %>%
  mutate(foo = c(rep("carb", 16), rep("gear", 16))) %>%
  rowwise() %>%
  filter(get(foo) == 4)
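
If the rowwise() cost matters and the set of names that can appear in foo is small and known in advance, one possible vectorised alternative is to spell out the lookup with case_when(). This is a sketch under that assumption and is not part of the original answer. The names `df1` and `res` are only for illustration.

```r
library(dplyr)

df1 <- mtcars %>%
  mutate(foo = c(rep("carb", 16), rep("gear", 16)))

# Vectorised lookup: one branch per column name that can appear in `foo`.
# Rows whose `foo` matches no branch give NA and are dropped by filter().
res <- df1 %>%
  filter(case_when(
    foo == "carb" ~ carb,
    foo == "gear" ~ gear
  ) == 4)

nrow(res)
```

This keeps the same 12 rows as the desired result in the question, but every new column name needs its own branch, so it scales poorly when foo can hold many different names.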
