Filter Dataframe with Tidy Evaluation

Question

I am handling a large dataset. First, for certain columns (X1, X2, ...), I am trying to identify a range of value (a, b) consists of repeated value (a > n, b > n). Next, I wish to filter row based on the condition which matches respective columns to result given in the previous step.

Here is a reproducible example simulating the scenario I am facing,

library(tidyverse)

set.seed(1122)

vecs <- lapply(X = 1:2, function(x) rep(c(1, 2, 3), times = 10) %>% sample() %>% head(10))
names(vecs) <- paste0("col_", 1:2)
dat <- vecs %>% as.data.frame()
dat

   col_1 col_2
1      3     2
2      1     1
3      1     1
4      1     2
5      1     2
6      3     3
7      3     3
8      2     1
9      1     3
10     2     2

I am able to identify the range by the following method,

# Which col has repeated value more than 3 appearances?
more_than_3 <- function(df, var){
    var <- rlang::sym(var)

    df %>% 
        group_by(!!var) %>% 
        summarise(n = n()) %>% 
        filter(n > 3) %>% 
        pull(!!var) %>% 
        range()
}
cols_name <- c("col_1", "col_2")
some_range <- purrr::map(cols_name, more_than_3, df = dat)
names(some_range) <- cols_name
some_range

$col_1
[1] 1 1

$col_2
[1] 2 2

However, to filter out values that fall outside the upper limit, this is what I do.

dat %>% 
    filter(col_1 <= some_range[["col_1"]][2], 
           col_2 <= some_range[["col_2"]][2])

  col_1 col_2
1     1     1
2     1     1
3     1     2
4     1     2

I believe there must be a more efficient and elegant way of filtering the result based on tidy evaluation. Can someone point me to the right direction?

Many thanks in advance.

Can you tell us what the expected output is for your test example? — Spacedman, Apr 27 '18 at 07:12
Hi @Spacedman The filtering result shown above is my expected result. I wish to replace it with `tidy eval` principle. — ThomasJc, Apr 27 '18 at 07:21
I can't reproduce the results of your filtering above. (Also tidy evaluation is not meant to be efficient, it's meant to create a certain style of interfaces.) — Lionel Henry, Apr 27 '18 at 07:54
Hi @lionel, I edited my code to make it reproducible. Also, thank you for your wonderful answer. — ThomasJc, Apr 27 '18 at 08:27

score 3 · Accepted Answer · answered Apr 27 '18 at 07:58

First let's try to create a small function that creates a single filter expression for one column. This function will take a symbol and then transform to string but it could be the other way around:

new_my_filter_call_upper <- function(sym, range) {
  col_name <- as.character(sym)

  col_range <- range[[col_name]]
  if (is.null(col_range)) {
    stop(sprintf("Can't find column `%s` to compute range", col_name), call. = FALSE)
  }

  expr(!!sym < !!col_range[[2]])
}

Let's try it:

new_my_filter_call_upper(quote(foobar), some_range)
#> Error: Can't find column `foobar` to compute range

# It works!
new_my_filter_call_upper(quote(col_1), some_range)
#> col_1 < 3

Now we're ready to create a pipeline verbs that will take a data frame and bare column names.

# Probably cleaner to pass range as argument. Prefix with dot to allow
# columns named `range`.
my_filter <- function(.data, ..., .range) {
  # ensyms() guarantees there won't be complex expressions
  syms <- rlang::ensyms(...)

  # Now let's map the function to create many calls:
  calls <- purrr::map(syms, new_my_filter_call_upper, range = .range)

  # And we're ready to filter with those expressions:
  dplyr::filter(.data, !!!calls)
}

Let's try it:

dat %>% my_filter(col_1, col_2, .range = some_range)
#>   col_1 col_2 NA.
#> 1     2     1   1
#> 2     2     2   1

score 0 · Answer 2 · answered Apr 27 '18 at 07:29

We could use map2

library(purrr)
map2(dat, some_range,  ~ .x < .y[2]) %>%
      reduce(`&`) %>%
      dat[.,]
#     col_1 col_2
#1     2     2
#2     1     1
#3     1     2
#6     1     1

Or with pmap

pmap(list(dat, some_range %>% 
                     map(2)), `<`) %>% 
      reduce(`&`) %>%
      dat[.,]

Filter Dataframe with Tidy Evaluation

2 Answers2