I have some R code that looks like this:
library(dplyr)
library(datasets)
iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()
Giving:
Source: local data frame [6 x 5]
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          4.3         3.0          1.1         0.1     setosa
2          4.6         3.6          1.0         0.2     setosa
3          5.0         2.3          3.3         1.0 versicolor
4          5.1         2.5          3.0         1.1 versicolor
5          4.9         2.5          4.5         1.7  virginica
6          6.0         3.0          4.8         1.8  virginica
This groups by species and, for each group, keeps only the two rows with the smallest Petal.Length. I have some duplication in my code, because I do this several times with different columns and different numbers of rows to keep. E.g.:
iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(Petal.Width, ties.method = 'random')<=3) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Width, ties.method = 'random')<=3) %.% ungroup()
I want to extract this into a function. The naive approach doesn't work:
keep_min_n_by_species <- function(expr, n) {
  iris %.% group_by(Species) %.% filter(rank(expr, ties.method = 'random') <= n) %.% ungroup()
}
keep_min_n_by_species(Petal.Width, 2)
Error in filter_impl(.data, dots(...), environment()) :
object 'Petal.Width' not found
As I understand it, the expression rank(Petal.Length, ties.method = 'random') <= 2 is evaluated in a different context, introduced by the filter function, which gives meaning to the name Petal.Length. I can't just swap in a variable for Petal.Length, because it will be evaluated in the wrong context. I've tried different combinations of substitute and eval, having read this page: Non-standard evaluation, but I can't figure out an appropriate combination. I think the problem might be that I don't just want to pass an expression from the caller (Petal.Length) through to filter for it to evaluate; I want to construct a new, bigger expression (rank(Petal.Length, ties.method = 'random') <= 2) and then pass that whole expression to filter for it to evaluate.
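To make "construct a new bigger expression" concrete, here is a minimal base-R sketch of the mechanics I have in mind. It does not use dplyr at all, so it sidesteps filter rather than solving my actual problem, and the split/lapply grouping is only my stand-in for group_by, not a claim about what dplyr does internally. The idea: substitute captures the caller's expression unevaluated, bquote splices it into the larger rank(...) <= n call, and eval evaluates that call with each group's columns in scope.

```r
# Base-R sketch only (assumption: no dplyr; split/lapply stands in for group_by).
keep_min_n_by_species <- function(expr, n) {
  # Capture what the caller typed (e.g. Petal.Width or -Petal.Width), unevaluated.
  col <- substitute(expr)

  # Build the bigger expression: rank(<col>, ties.method = "random") <= <n>
  cond <- bquote(rank(.(col), ties.method = "random") <= .(n))

  # Evaluate the condition once per species, with that group's columns visible.
  groups <- split(iris, iris$Species)
  kept <- lapply(groups, function(g) g[eval(cond, g), , drop = FALSE])

  # Stitch the per-group results back into one data frame.
  do.call(rbind, kept)
}

keep_min_n_by_species(Petal.Length, 2)   # two smallest Petal.Length per species
keep_min_n_by_species(-Petal.Width, 3)   # three largest Petal.Width per species
```

This works for bare column names and simple expressions of them, but I haven't checked how it behaves if the expression also refers to variables from the caller's environment.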
- How can I refactor this expression into a function?
- More generally, how should I go about extracting an R expression into a function?
- Even more generally, am I approaching this with the wrong mindset? In more mainstream languages I'm familiar with (e.g. Python, C++, C#), this is a relatively straightforward operation that I want to do all the time to remove duplication in my code. In R it seems (to me, at least) that non-standard evaluation can make it a very non-obvious operation. Should I be doing something else entirely?