A solution like
map(index, function(i, x) filter(x, x[[i]] > mean(x[[i]])), iris)
seems to keep the functionality of dplyr (e.g., doing sensible things with NA values) without the excessive encumbrances of non-standard evaluation, while also highlighting some useful R idioms, like using [[ to extract columns by name, that are likely to be useful when non-standard evaluation becomes just too cumbersome.
Personally I would use lapply() instead of map() and save myself from having to learn another package. If I wanted named list elements I'd do that 'up front' rather than adding them after the fact
names(index) <- index
lapply(index, function(i, x) filter(x, x[[i]] > mean(x[[i]])), iris)
or maybe
lapply(setNames(nm=index), function(i, x) filter(x, x[[i]] > mean(x[[i]])), iris)
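To see what naming 'up front' buys you, here is a minimal sketch. It assumes index holds the names of the numeric columns of iris (the definition of index isn't shown here, so that is a guess), and it counts matching rows with base R instead of calling filter(), purely so that it runs without dplyr.

```r
# Assumption: `index` is a character vector of column names; its real
# definition is not shown in this answer.
index <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# setNames(nm = index) returns `index` with each element named after itself
named_index <- setNames(nm = index)

# Hypothetical stand-in for the filter() call: count rows above the column mean
count_above_mean <- function(i, x) sum(x[[i]] > mean(x[[i]]))

unnamed <- lapply(index, count_above_mean, iris)
named <- lapply(named_index, count_above_mean, iris)

names(unnamed)  # NULL
names(named)    # the four column names
```

lapply() copies names from its input, so naming the input is enough to get a named result.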
If this were a common scenario in my code (or even if this were a one-off) I might write a short helper function
filter_greater_than_column_mean <- function(i, x)
    dplyr::filter(x, x[[i]] > mean(x[[i]]))
lapply(index, filter_greater_than_column_mean, iris)
If I were being a dilettante in my own way and trying to be more general, I might get overly complicated with
filter_by_column_mean <- function(i, x, op = `>`) {
    idx <- op(x[[i]], mean(x[[i]]))
    dplyr::filter(x, idx)
}
lapply(index, filter_by_column_mean, iris)
lapply(index, filter_by_column_mean, iris, `<=`)
or even
filter_by_column <- function(i, x, op = `>`, aggregate = mean) {
    idx <- op(x[[i]], aggregate(x[[i]]))
    dplyr::filter(x, idx)
}
lapply(index, filter_by_column, iris, op = `<=`)
lapply(index, filter_by_column, iris, `<=`, median)
Now that I'm not using non-standard evaluation, I might aim for base R's subset(), which also does sensible things with NA. So
filter_by_column <- function(i, x, aggregate = mean, op = `>`) {
    idx <- op(x[[i]], aggregate(x[[i]]))
    subset(x, idx)
}
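Here is a small sketch of what 'sensible things with NA' means for subset(), compared with plain [ indexing. The data frame scores and the variable keep are made up for illustration. One caveat worth flagging: subset() looks names up in the data frame first, so this pattern assumes the data frame has no column called keep (or idx, in the function above).

```r
# Made-up data with one missing value
scores <- data.frame(id = 1:4, value = c(1, NA, 3, 5))

# na.rm = TRUE keeps the mean itself from being NA; the comparison still
# yields NA for the row whose value is missing
keep <- scores$value > mean(scores$value, na.rm = TRUE)
keep  # FALSE NA FALSE TRUE

with_bracket <- scores[keep, ]        # NA in the index produces a row of NAs
with_subset <- subset(scores, keep)   # NA is treated as FALSE

nrow(with_bracket)  # 2
nrow(with_subset)   # 1
```

Without na.rm = TRUE, mean() returns NA for a column containing NA, every comparison is NA, and subset() returns zero rows.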
I know this means I've learned a bunch of things about base R, and maybe I should instead have learned about !! versus !!! versus ..., but at any rate I've learned
- Functions like mean are 'first class'; I can assign the symbol representing the function (e.g., mean) to a variable (e.g., aggregate) and then use the variable as a function (aggregate(...)).
- Operators like < are actually functions, and lhs < rhs can be written as `<`(lhs, rhs) (and to write that I had to learn how to write backticks in markdown!)
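Both points can be checked at the console. The names aggregate_fn, op and compare_to below are made up for this sketch.

```r
# A function is an ordinary value: bind it to another name and call it
aggregate_fn <- median          # hypothetical variable name
aggregate_fn(c(1, 2, 9))        # same as median(c(1, 2, 9)), i.e. 2

# Operators are functions too; backticks let you refer to them by name
`<`(1, 2)                       # same as 1 < 2, i.e. TRUE
op <- `>=`
op(3, 3)                        # TRUE

# Both can be passed as arguments, like any other value
compare_to <- function(values, op, aggregate_fn) {
    op(values, aggregate_fn(values))
}
compare_to(c(1, 2, 9), `>`, mean)   # FALSE FALSE TRUE
```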
More prosaically
- The function passed as the FUN argument to lapply() can take arguments in addition to the one being iterated over; lapply() forwards them through its ... argument. These can be named or unnamed, with the usual rules of argument matching (match first by name, then by position) applying.
- [[ can be used to subset by name, avoiding the need for seq_along() or other less robust operations that rely on numerical index.
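A rough sketch of those last two points, using a hypothetical helper summarise_column and an assumed pair of column names:

```r
# Hypothetical helper: summarise column `i` of data frame `x`
summarise_column <- function(i, x, fun = mean, digits = 1) {
    round(fun(x[[i]]), digits)   # [[ looks the column up by name
}

index <- c("Sepal.Length", "Petal.Length")   # assumed column names

# `iris` is unnamed, so it fills the next free formal argument, `x`
by_position <- lapply(index, summarise_column, iris)

# Named arguments are matched first, whatever order they appear in;
# `iris` still lands in `x` by position
by_name <- lapply(index, summarise_column, iris, digits = 2, fun = median)
```

In each call the element of index being iterated over becomes i, and everything after the function name is handed on to summarise_column unchanged.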