Dynamic Columns in Dplyr using NSE on the RHS

Question

I am attempting to reference existing columns in dplyr through a loop. Effectively, I would like to evaluate the operations from one table (evaluation in below example) to be performed to another table (dt in below example). I do not want to hardcode the column names on the RHS within mutate(). I would like to control the evaluations being performed from the evaluation table below. So I am trying to make the process dynamic.

Here is a sample dataframe:

dt = data.frame(
    A = c(1:20), 
    B = c(11:30), 
    C = c(21:40),
    AA = rep(1, 20), 
    BB = rep(2, 20)
)

Here is a table of sample operations to be performed:

evaluation = data.frame(
   New_Var = c("AA", "BB"), 
   Operation = c("(A*2) > B", "(B*2) <= C"), 
   Result = c("True", "False")
) %>% mutate_all(as.character)

What I am trying to do is the following:

for (i in 1:nrow(evaluation)) {

  var = evaluation$New_Var[i]

  dt = dt %>% 
    rowwise() %>% 
    mutate(!!var := ifelse(eval(parse(text = evaluation$Operation[i])), 
                           evaluation$Result[i], 
                           !!var))

}

my desired result would be something like this except for the "AA" in the AA column would be the original numeric values of the AA column of 1, 1, 1, 1, 1.

UPDATED:

I believe my syntax in the "False" part of the ifelse statement is incorrect. What is the correct syntax to specify "!!var" in the false portion of the ifelse statement?

enter image description here

I know there are other ways to do it using base R, but I would rather do it through dplyr as it is cleaner code to look at. I am leveraging "rowise()" to do it element by element.

It's not clear what you want different from your result. You want the "AA" in `dt$AA` to be the result of `A*2>B` which is `FALSE` or `0`, or you want the value of `A` there (1, 2, 3, 4, 5)? — Brian, Jul 10 '19 at 21:29
@Brian thanks for pointing that out, i just updated it above. I want the "AA" in dt$AA to equal the values of dt$AA in the false part of the ifelse statement. This would be effectively mean (1,1,1,1,1) for the highlighted text in the image above. Just to be on the same page -> ifelse(logic, true part , false part) The false part of the ifelse is not working properly, or I am not using correct syntax in the false part of the ifelse statement. — Jst2Wond3r, Jul 11 '19 at 14:46
What is the intended type of columns `AA` and `BB`? You're initializing them to be integers and then attempting to [conditionally] store strings there in your loop. — Artem Sokolov, Jul 11 '19 at 15:14
@ArtemSokolov this is just an example, the result can be character, it doesn't really matter in this case. I set AA and BB as character thinking maybe that was causing the issue, but it did not resolve it. — Jst2Wond3r, Jul 11 '19 at 15:26
It matters because attempting to overwrite an integer column with character throws an error. The solution I posted below works when the type is consistent. If you try running it on the original data, you will get ``Error: Column `AA` can't be converted from numeric to character`` — Artem Sokolov, Jul 11 '19 at 15:31

score 2 · Answer 1 · answered Jul 11 '19 at 01:37

Assuming that Felipe's answer was the functionality you desired, here's a more "tidyverse"/pipe-oriented/functional approach.

Data

library(rlang)
library(dplyr)
library(purrr)

operations <- tibble(
  old_var = exprs(A, B),
  new_var = exprs(AA, BB),
  test = exprs(2*A > B, 2*B <= C),
  result = exprs("True", "False")
)

original <- tibble(
  A = sample.int(30, 10), 
  B = sample.int(30, 10), 
  C = sample.int(30, 10)
)

original

# A tibble: 10 x 3
       A     B     C
   <int> <int> <int>
 1     4    20     5
 2    30    29    11
 3     1    27    14
 4     2    21     4
 5    17    19    24
 6    14    25     9
 7     5    22    22
 8     6    13     7
 9    25     4    21
10    12    11    12

Functions

# Here's your reusable functions
generic_mutate <- function(dat, new_var, test, result, old_var) {
    dat %>% mutate(!!new_var := ifelse(!!test, !!result, !!old_var))
}

generic_ops <- function(dat, ops) {
  pmap(ops, generic_mutate, dat = dat) %>% 
    reduce(full_join)
}

generic_mutate takes a single original dataframe, a single new_var, etc. It performs the test, adds the new column with the appropriate name and values.

generic_ops is the "vectorized" version. It takes the original dataframe as the first argument, and a dataframe of operations as the second. It then parallel maps over each column of new variable names, tests, etc, and calls generic_mutate on each one. That results in a list of dataframes, each with one added column. The reduce then combines them back all together with a sequential full_join.

Results

original %>%
  generic_ops(operations)

Joining, by = c("A", "B", "C")
# A tibble: 10 x 5
       A     B     C AA    BB   
   <int> <int> <int> <chr> <chr>
 1     4    20     5 4     20   
 2    30    29    11 True  29   
 3     1    27    14 1     27   
 4     2    21     4 2     21   
 5    17    19    24 True  19   
 6    14    25     9 True  25   
 7     5    22    22 5     22   
 8     6    13     7 6     13   
 9    25     4    21 True  False
10    12    11    12 True  11

The magic here is using exprs(...) so you can store NSE names and operations in a tibble without forcing their evaluation. I think this is a lot cleaner than storing names and operations in strings with quotation marks.

Artem Sokolov · Accepted Answer · 2019-07-11T16:54:22.937

Modified data to (a) enforce type consistency for columns AA and BB and (b) ensure that at least one row satisfies the second condition.

dt = tibble(
  A = c(1:20), 
  B = c(10:29),      ## Note the change
  C = c(21:40),
  AA = rep("a", 20), ## Note initialization with strings
  BB = rep("b", 20)  ## Ditto
)

To make your loop work, you need to convert your code strings into actual expressions. You can use rlang::sym() for variable names and rlang::parse_expr() for everything else.

for( i in 1:nrow(evaluation) )
{
  var <- rlang::sym(evaluation$New_Var[i])
  op <- rlang::parse_expr(evaluation$Operation[i])

  dt = dt %>% rowwise() %>% 
    mutate(!!var := ifelse(!!op, evaluation$Result[i],!!var))
}
# # A tibble: 20 x 5
#        A     B     C AA    BB   
#    <int> <int> <int> <chr> <chr>
#  1     1    10    21 a     False
#  2     2    11    22 a     False
#  3     3    12    23 a     b    
#  4     4    13    24 a     b    
#  5     5    14    25 a     b    
#  6     6    15    26 a     b    
#  7     7    16    27 a     b    
#  8     8    17    28 a     b    
#  9     9    18    29 a     b    
# 10    10    19    30 True  b    
# 11    11    20    31 True  b    
# 12    12    21    32 True  b    
# 13    13    22    33 True  b    
# 14    14    23    34 True  b    
# 15    15    24    35 True  b    
# 16    16    25    36 True  b    
# 17    17    26    37 True  b    
# 18    18    27    38 True  b    
# 19    19    28    39 True  b    
# 20    20    29    40 True  b

Felipe Gerard · Answer 3 · 2019-07-11T14:59:43.357

How's this:

evaluation = data.frame(
   Old_Var = c('A', 'B'),
   New_Var = c("AA", "BB"), 
   Operation = c("(A*2) > B", "(B*2) <= C"), 
   Result = c("True", "False")
) %>% mutate_all(as.character)

for (i in 1:nrow(evaluation)) {

  old <- sym(evaluation$Old_Var[i])
  new <- sym(evaluation$New_Var[i])
  op <- sym(evaluation$Operation[i])
  res <- sym(evaluation$Result[i])

  dt <- dt %>% 
    mutate(!!new := ifelse(!!op, !!res, !!old))

}

EDIT: My last answer doesn't work because rlang tries to find a variable named !!op (e.g. named (A*2) > B) instead of evaluating the expression. I got this to work using a mix of tidyselect and base R. You can of course follow @Brian's advice and use this solution with pmap. I honestly don't know how well this will perform though, as I think it will evaluate the ifelse once per row, and am not sure it's a vectorized operation...

dt <- tibble(
  A = c(1:20), 
  B = c(11:30), 
  C = c(21:40),
  AA = rep(1, 20), 
  BB = rep(2, 20)
)

evaluation = tibble(
  Old_Var = c('A', 'B'),
  New_Var = c("AA", "BB"), 
  Operation = c('(A*2) > B', '(B*2) <= C'), 
  Result = c("True", "False")
)

for (i in 1:nrow(evaluation)) {

  old <- evaluation$Old_Var[i]
  new <- evaluation$New_Var[i]
  op <- evaluation$Operation[i]
  res <- evaluation$Result[i]

  dt <- dt %>% 
    mutate(!!sym(new) := eval(parse(text = sprintf('ifelse(%s, "%s", %s)', op, res, old))))

}

I get the following errror when I using your code: Error in ifelse(`(A*2) > B`, True, A) : object '(A*2) > B' not found So, can I not reference the same column name with the same variable in the ifelse statement? Let's say you have column "Delta" and you would write ifelse statement in mutate such as: mutate(Delta = ifelse(Delta > 0, 5, Delta). In this case you can reference Delta again, but in your example it seems that when using NSE you need a different variable for LHS and RHS variables, thus the "!!new" and (!!op, !!res, !!old)? — Jst2Wond3r, Jul 11 '19 at 13:51

score 0 · Answer 4 · answered Jul 13 '19 at 14:27

One way is to rework the conditions first, then pass them to mutate :

conds <- parse(text=evaluation$Operation) %>%
  as.list() %>%
  setNames(evaluation$New_Var) %>%
  imap(~expr(ifelse(!!.,"True", !!sym(.y))))
conds  
#> $AA
#> ifelse((A * 2) > B, "True", AA)
#> 
#> $BB
#> ifelse((B * 2) <= C, "True", BB)

dt %>% mutate(!!!conds)
#>     A  B  C   AA BB
#> 1   1 11 21    1  2
#> 2   2 12 22    1  2
#> 3   3 13 23    1  2
#> 4   4 14 24    1  2
#> 5   5 15 25    1  2
#> 6   6 16 26    1  2
#> 7   7 17 27    1  2
#> 8   8 18 28    1  2
#> 9   9 19 29    1  2
#> 10 10 20 30    1  2
#> 11 11 21 31 True  2
#> 12 12 22 32 True  2
#> 13 13 23 33 True  2
#> 14 14 24 34 True  2
#> 15 15 25 35 True  2
#> 16 16 26 36 True  2
#> 17 17 27 37 True  2
#> 18 18 28 38 True  2
#> 19 19 29 39 True  2
#> 20 20 30 40 True  2

Dynamic Columns in Dplyr using NSE on the RHS

4 Answers4

Data

Functions

Results