
I would like to loop over a vector of variable names with purrr, then use the variables inside a function with dplyr, as with the following code:

library(dplyr)
library(purrr)

#creating index
index<-c('Sepal.Length', 'Sepal.Width')

#mapping over index with lambda function
map(index, ~iris %>% filter (.x > mean(.x)))

I was expecting to see a list of two data.frames, as in

list(Sepal.Length = iris %>% filter (Sepal.Length > mean(Sepal.Length)),
     Sepal.Width = iris %>% filter (Sepal.Width > mean(Sepal.Width)))

Is there a way to use the .x variables as column names within the data.frames in the lambda function?

I think it may have something to do with data masking and non-standard evaluation, and I suspect rlang may be helpful here, but I am not familiar with the subject. Thank you

GuedesBF
  • Very interesting, GuedesBF. Could you please elaborate on the situations in which we could use this procedure? What is the idea behind it? I really want to know. Thank you! – TarJae Aug 16 '21 at 22:55
  • Hi, @TarJae, thank you. I have a dataset with 300+ columns, including a `dummified` grouping variable spread over several columns, and several other `data` variables. I would like to `summarise(across(data_variables, ~something))` for every group defined by `dummy_x==1`, `dummy_y==1`, so a vector of `c("dummy_1", "dummy2"...)` could help determine the variables beforehand. The question aimed to understand the procedure as in the first of akrun's answers, which could make things easier. – GuedesBF Aug 16 '21 at 23:08
  • The actual procedure is a bit more complex, but I used a minimal reprex for the exact `vector of characters as variable names` issue – GuedesBF Aug 16 '21 at 23:16

4 Answers

3

Those are strings. We need to convert each one to a symbol with rlang::sym() and unquote it with !!

library(purrr)
library(dplyr)
out <- map(index, ~iris %>%
       filter (!! rlang::sym(.x) > mean(!! rlang::sym(.x))))
names(out) <- index

-output

> str(out)
List of 2
 $ Sepal.Length:'data.frame':   70 obs. of  5 variables:
  ..$ Sepal.Length: num [1:70] 7 6.4 6.9 6.5 6.3 6.6 5.9 6 6.1 6.7 ...
  ..$ Sepal.Width : num [1:70] 3.2 3.2 3.1 2.8 3.3 2.9 3 2.2 2.9 3.1 ...
  ..$ Petal.Length: num [1:70] 4.7 4.5 4.9 4.6 4.7 4.6 4.2 4 4.7 4.4 ...
  ..$ Petal.Width : num [1:70] 1.4 1.5 1.5 1.5 1.6 1.3 1.5 1 1.4 1.4 ...
  ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Sepal.Width :'data.frame':   67 obs. of  5 variables:
  ..$ Sepal.Length: num [1:67] 5.1 4.7 4.6 5 5.4 4.6 5 4.9 5.4 4.8 ...
  ..$ Sepal.Width : num [1:67] 3.5 3.2 3.1 3.6 3.9 3.4 3.4 3.1 3.7 3.4 ...
  ..$ Petal.Length: num [1:67] 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.5 1.5 1.6 ...
  ..$ Petal.Width : num [1:67] 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.1 0.2 0.2 ...
  ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

-testing with OP's expected

> expected <- list(Sepal.Length = iris %>% filter (Sepal.Length > mean(Sepal.Length)),
+      Sepal.Width = iris %>% filter (Sepal.Width > mean(Sepal.Width)))
> 
> identical(out, expected)
[1] TRUE

Or subset with cur_data()

map(index, ~ iris %>%
     filter(cur_data()[[.x]] > mean(cur_data()[[.x]])))

Or use across() or if_all(), which accept the string directly through all_of()

map(index, ~ iris %>%
           filter(across(all_of(.x), ~ . > mean(.))))
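
The text above mentions if_all() but only shows across(). A minimal sketch of the if_all() form follows, assuming a dplyr version that provides if_all() (1.0.4 or later, as far as I know). Recent dplyr releases warn about across() inside filter() and about cur_data(), so this form may age better. The name out_if_all is only illustrative, and the outer function is written with function(col) so that the inner ~ .x lambda is not confused with the outer one.

```r
library(dplyr)
library(purrr)

index <- c("Sepal.Length", "Sepal.Width")

# set_names(index) names the vector by its own values, so the result
# is a named list without a separate names<- step.
out_if_all <- map(
  set_names(index),
  # all_of() selects the column named by the string in `col`;
  # if_all() keeps rows where the predicate is TRUE for that column.
  function(col) filter(iris, if_all(all_of(col), ~ .x > mean(.x)))
)

names(out_if_all)
sapply(out_if_all, nrow)
```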
akrun
  • Awesome, thanks. Isn't it possible to use `{{ }}` instead of `!!`? – GuedesBF Aug 16 '21 at 22:14
  • @GuedesBF that is used with unquoted inputs in a function. Your inputs are strings – akrun Aug 16 '21 at 22:15
  • `{{ }}` does something similar to `enquo()` followed by `!!` – akrun Aug 16 '21 at 22:16
  • ok, thanks. I am new to it, and still very confused when it comes to `rlang` and non-standard evaluation – GuedesBF Aug 16 '21 at 22:17
  • @GuedesBF e.g., when you create a function `f1 <- function(dat, colnm) dat %>% summarise(Mean = mean({{colnm}})); f1(iris, Sepal.Length)`, the input column name is not a string, whereas in your case it is a character vector which we loop over in `map` – akrun Aug 16 '21 at 22:19
  • Thanks again for the thorough explanation. If you don't mind, I like to let the question sit around for a couple minutes before accepting an answer. – GuedesBF Aug 16 '21 at 22:21
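
The distinction drawn in the comments above (unquoted column names versus strings) can be sketched side by side. This is only an illustration: mean_bare and mean_chr are hypothetical helper names, and the string version uses the .data pronoun rather than sym() with !!, which is one of several options.

```r
library(dplyr)

# Unquoted input: the caller passes a bare column name, so the function
# embraces the argument with {{ }}.
mean_bare <- function(dat, colnm) summarise(dat, Mean = mean({{ colnm }}))

# String input: the caller passes a character value, so the column is
# looked up by name through the .data pronoun.
mean_chr <- function(dat, colnm) summarise(dat, Mean = mean(.data[[colnm]]))

res_bare <- mean_bare(iris, Sepal.Length)
res_chr  <- mean_chr(iris, "Sepal.Length")
```

Both calls should give the same one-row summary; only the way the column is named by the caller differs.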
3

A solution like

map(index, function(i, x) filter(x, x[[i]] > mean(x[[i]])), iris)

seems to keep the functionality of dplyr (e.g., doing sensible things with NA values) without the excessive encumbrances of non-standard evaluation, while also highlighting some useful R idioms, like the use of [[ to extract columns by name, that are likely to be useful when non-standard evaluation becomes just too cumbersome.

Personally I would use lapply() instead of map() and save myself from having to learn another package. If I wanted named list elements I'd do that 'up front' rather than adding after the fact

names(index) <- index
lapply(index, function(i, x) filter(x, x[[i]] > mean(x[[i]])), iris)

or maybe

lapply(setNames(nm=index), function(i, x) filter(x, x[[i]] > mean(x[[i]])), iris)

If this were a common scenario in my code (or even if this were a one-off) I might write a short helper function

filter_greater_than_column_mean <- function(i, x)
    dplyr::filter( x, x[[i]] > mean(x[[i]]) )

lapply(index, filter_greater_than_column_mean, iris)

If I were being a dilettante in my own way and trying to be more general, I might get overly complicated with

filter_by_column_mean <- function(i, x, op = `>`) {
    idx <- op(x[[i]], mean(x[[i]]))
    dplyr::filter(x, idx)
}
lapply(index, filter_by_column_mean, iris)
lapply(index, filter_by_column_mean, iris, `<=`)

or even

filter_by_column <- function(i, x, op = `>`, aggregate = mean) {
    idx <- op(x[[i]], aggregate(x[[i]]))
    dplyr::filter(x, idx)
}
lapply(index, filter_by_column, iris, op = `<=`)
lapply(index, filter_by_column, iris, `<=`, median)

Now that I'm not using non-standard evaluation, I might aim for base R's subset(), which also does sensible things with NA. So

filter_by_column <- function(i, x, aggregate = mean, op = `>`) {
    idx <- op(x[[i]], aggregate(x[[i]]))
    subset(x, idx)
}
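
A usage sketch for the subset() version, repeating the definition so the block runs on its own. Note that aggregate now comes before op in the argument list, so positional calls differ from the dplyr version above. The result name res is illustrative. One caveat worth labelling as my own observation: subset() evaluates its condition inside the data frame first, so a column literally named idx in x would shadow the local variable.

```r
index <- c("Sepal.Length", "Sepal.Width")

filter_by_column <- function(i, x, aggregate = mean, op = `>`) {
    # logical vector: compare the column to its aggregate value
    idx <- op(x[[i]], aggregate(x[[i]]))
    # subset() drops rows where idx is FALSE or NA
    subset(x, idx)
}

# rows at or below the column median, returned as a named list
res <- lapply(setNames(nm = index), filter_by_column, iris, median, `<=`)
sapply(res, nrow)
```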

I know this means I've learned a bunch of things about base R, and maybe I should instead have learned about !! versus !!! versus ..., but at any rate I've learned

  • Functions like mean are 'first class', I can assign the symbol representing the function (e.g., mean) to a variable (e.g., aggregate) and then use the variable as a function (aggregate(...)).
  • Operators like < are actually functions, and lhs < rhs can be written as `<`(lhs, rhs) (and to write that I had to learn how to write backticks in markdown!)

More prosaically

  • The FUN argument to lapply() takes arguments in addition to the argument being iterated over. These can be named or unnamed, with the usual rules of argument matching (match first by name, then by position) applying.
  • [[ can be used to subset by name, avoiding the need for seq_along() or other less robust operations that rely on numerical index.
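
The four points above can be checked with a few lines of base R. This is only a small sketch; the variable names (aggregate_fun, shifted, first_three) are made up for the illustration.

```r
# 1. Functions are first class: store one in a variable, then call it.
aggregate_fun <- median
agg_value <- aggregate_fun(c(1, 5, 9))

# 2. Operators are functions: lhs < rhs is the same call as `<`(lhs, rhs).
same_call <- identical(1 < 2, `<`(1, 2))

# 3. lapply() passes extra arguments on to FUN, matched by name or position.
shifted <- lapply(c(a = 1, b = 2), function(i, offset) i + offset, offset = 10)

# 4. [[ extracts a column by name, no numeric index needed.
first_three <- iris[["Sepal.Length"]][1:3]
```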
Martin Morgan
2

Base R:

index<-c('Sepal.Length', 'Sepal.Width')
df <- iris

setNames(
  lapply(
    seq_along(index),
    function(i){
       mu <- mean(df[,index[i]], na.rm = TRUE)
       df[df[,index[i],drop = TRUE] > mu, ]
    }
  ),
  index
)
hello_friend
  • subsetting the `index` by position inside the data.frame subsetting, with `[index[i]]` is really simple and useful. Thank you. – GuedesBF Aug 16 '21 at 23:13
  • No worries, can be simplified to: `setNames( lapply( index, function(x){ mu <- mean(df[,x], na.rm = TRUE); df[df[ , x, drop = TRUE] > mu,] } ), index )` – hello_friend Aug 16 '21 at 23:15
2

You can use the .data pronoun -

library(dplyr)
library(purrr)

index<-c('Sepal.Length', 'Sepal.Width')

map(index, ~iris %>% filter (.data[[.x]] > mean(.data[[.x]])))
Ronak Shah