45

I am trying to perform an inner join two tables using dplyr, and I think I'm getting tripped up by non-standard evaluation rules. When using the by=("a" = "b") argument, everything works as expected when "a" and "b" are actual strings. Here's a toy example that works:

library(dplyr)
data(iris)

inner_join(iris, iris, by=c("Sepal.Length" = "Sepal.Width"))

But let's say I was putting inner_join in a function:

library(dplyr)
data(iris)

myfn <- function(xname, yname) {
    data(iris)
    inner_join(iris, iris, by=c(xname = yname))
}

myfn("Sepal.Length", "Sepal.Width")

This returns the following error:

Error: cannot join on columns 'xname' x 'Sepal.Width': index out of bounds

I suspect there is some fancy expression, deparsing, quoting, or unquoting that I could do to make this work, but I'm a bit murky on those details.

Peter
  • 4,219
  • 4
  • 28
  • 40
  • Hadley calls this ["non-standard evaluation"](http://adv-r.had.co.nz/Computing-on-the-language.html) (NSE) – smci Jun 06 '16 at 04:25
  • Seems not so much NSE as delivering the "by.x" and "by.y" column names in a different manner. The `by`-argument becomes `c("Sepal.Length" = "Sepal.Width")` and so what would be the `by.x` argument to `merge` becomes a real R name. In fact it's almost the opposite of NSE as I view it. – IRTFM Jul 24 '17 at 17:33

4 Answers4

60

You can use

myfn <- function(xname, yname) {
    data(iris)
    inner_join(iris, iris, by=setNames(yname, xname))
}

The suggested syntax in the ?inner_join documentation of

by = c("a"="b")   # same as by = c(a="b")

is slightly misleading because both those values aren't proper character values. You're actually created a named character vector. To dynamically set the values to the left of the equals sign is different from those on the right. You can use setNames() to set the names of the vector dynamically.

MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • 14
    Note that by using `setNames`, the order of the arguments is inverted with respect to the original use in `inner_join`. To have the same order of the arguments, i.e. first `xname` then `yname`, you could use `by=setNames(nm=xname, yname)`. – fabern Nov 13 '17 at 15:38
  • Isn't there any way to include the `!!xname, !!yname` syntax here? In other dplyr verbs like rename you can access the values of character objects by using the two exclamation marks before the object. – Alberto Agudo Dominguez Sep 30 '22 at 11:17
  • 1
    @AlbertoAgudoDominguez the development version of dplyr has a `join_by` function that will allow you to use !! but as of now that is not part of the official dplyr release as far as I can tell. – MrFlick Sep 30 '22 at 13:02
3

I like MrFlick's answer and fber's addendum, but I prefer structure. For me setNames feels as something at the end of a pipe, not as an on-the-fly constructor. On another note, both setNames and structure enable the use of variables in the function call.

myfn <- function(xnames, ynames) {
  data(iris)
  inner_join(iris, iris, by = structure(names = xnames, .Data = ynames))
}

x <- "Sepal.Length"

myfn(x, "Sepal.Width")

A named vector argument would run into problems here:

myfn <- function(byvars) {
  data(iris)
  inner_join(iris, iris, by = byvars)
}

x <- "Sepal.Length"

myfn(c(x = "Sepal.Width"))

You could solve that, though, by using setNames or structure in the function call.

deSKase
  • 303
  • 5
  • 8
2

I know I'm late to the party, but how about:

myfn <- function(byvar) {
  data(iris)
  inner_join(iris, iris, by=byvar)
}

This way you can do what you want with:

myfn(c("Sepal.Length"="Sepal.Width"))
Felipe Gerard
  • 1,552
  • 13
  • 23
  • 1
    I don't that as substantially different than the MrFlick answer. – IRTFM Jul 24 '17 at 17:35
  • They are very similar. The only difference is when you are joining by multiple dimensions. I think it's clearer to do `myfn(c("a" = "b", "c" = "d"))` than `myfn(c("a", "c"), c("b", "d"))`, but it's a matter of taste I guess. – Felipe Gerard Jul 25 '17 at 23:00
  • Yeah. I can see your point. You are using redundant quotes though. You could call yours with `myfn(c(a = "b", c = "d"))`, and to my eyes that would be even clearer since it use the usual entry of the names, not to mention fewer characters that need a shift-key. – IRTFM Jul 26 '17 at 00:58
1

I faced a nearly identical challenge as @Peter, but needed to pass multiple different sets of by = join parameters at one time. I chose to use the map() function from the tidyverse package, purrr.

This is the subset of the tidyverse that I used.

library(magrittr)
library(dplyr)
library(rlang)
library(purrr)

First, I adapted myfn to use map() for the case posted by Peter. 42's comment and Felipe Gerard's answer made it clear that the by argument can take a named vector. map() requires a list over which to iterate.

    myfn_2 <- function(xname, yname) {
      by_names <- list(setNames(nm = xname, yname ))

      data(iris)

      # map() returns a single-element list. We index to retrieve dataframe.

      map( .x = by_names, 
           .f = ~inner_join(x = iris, 
                            y = iris, 
                            by = .x)) %>% 
        `[[`(1)
    }

myfn_2("Sepal.Length", "Sepal.Width")

I found that I didn't need quo_name() / !! in building the function.

Then, I adapted the function to take a list of by parameters. For each by_i in by_grps, we could extend x and y to add named values on which to join.

by_grps <- list(  by_1 = list(x = c("Sepal.Length"), y = c("Sepal.Width")), 
                  by_2 = list(x = c("Sepal.Width"), y = c("Petal.Width"))
                )

myfn_3 <- function(by_grps_list, nm_dataset) {
  by_named_vectors_list <- lapply(by_grps_list, 
                                  function(by_grp) setNames(object = by_grp$y,
                                                            nm = by_grp$x))
  map(.x = by_named_vectors_list, 
      .f = ~inner_join(nm_dataset, nm_dataset, by = .x))
}

myfn_3(by_grps, iris)
SoFarther
  • 524
  • 5
  • 10