How to join multiple data frames using dplyr?

Question

I want to left_join multiple data frames:

dfs <- list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
  df2 = data.frame(c = 4:6, b = c("a", "c", "d")),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)
Reduce(left_join, dfs)
#   a b  c  d
# 1 1 a  4 NA
# 2 2 b NA  7
# 3 3 c  5  8

This works because they all have the same b column, but Reduce doesn't let me specify additional arguments that I can pass to left_join. Is there a work around for something like this?

dfs <- list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
  df2 = data.frame(c = 4:6, d = c("a", "c", "d")),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)

Update

This kind of works: Reduce(function(...) left_join(..., by = c("b" = "d")), dfs) but when by is more than one element it gives this error: Error: cannot join on columns 'b' x 'd': index out of bounds

Can't you do `Reduce(function(...) left_join(..., other args here), dfs)`? — Rich Scriven, Dec 17 '15 at 21:22
I hadn't thought of that. Yes, I can! Feel free to write it as your answer — nachocab, Dec 17 '15 at 21:25
Hmm, actually, this breaks if `by` has more than one element. See update — nachocab, Dec 17 '15 at 21:52
How do you expect the function to know which columns to join on? This is not really what the *_join functions where designed to do. If all of the columns have a common ID column you can use then your update solves the issue. Otherwise you are going to have to specify by hand which columns to use for the join for each pair. — Bishops_Guest, Jul 15 '16 at 22:36

joel.wilson · Accepted Answer · 2016-12-05T06:28:48.283

It's been too late i know....today I got introduced to the unanswered questions section. Sorry to bother.

Using left_join()

dfs <- list(
              df1 = data.frame(b = c("a", "b", "c"), a = 1:3),
              df2 = data.frame(d = c("a", "c", "d"), c = 4:6),
              df3 = data.frame(b = c("b", "c", "e"), d = 7:9)
         )

func <- function(...){
  df1 = list(...)[[1]]
  df2 = list(...)[[2]]
  col1 = colnames(df1)[1]
  col2 = colnames(df2)[1]
  xxx = left_join(..., by = setNames(col2,col1))
  return(xxx)
}
Reduce( func, dfs)
#  b a  c  d
#1 a 1  4 NA
#2 b 2 NA  7
#3 c 3  5  8

Using merge() :

func <- function(...){
  df1 = list(...)[[1]]
  df2 = list(...)[[2]]
  col1 = colnames(df1)[1]
  col2 = colnames(df2)[1]
  xxx=merge(..., by.x = col1, by.y = col2, , all.x = T)
  return(xxx)
}

Reduce( func, dfs)
#  b a  c  d
#1 a 1  4 NA
#2 b 2 NA  7
#3 c 3  5  8

just to note : I had to keep the "key" variables as the first column, because after merging they automatically become the first columns in the dataframe — joel.wilson, Dec 03 '16 at 07:21

score 9 · Answer 2 · edited Dec 03 '16 at 07:27

9

Would this work for you?

jnd.tbl <- df1 %>%
    left_join(df2, by='b') %>%
    left_join(df3, by='d')

edited Dec 03 '16 at 07:27

Hong Ooi

56,353
13
134
187

answered Dec 17 '15 at 21:24

elesk01s

131
6

5

Doesn't generalize very well (which seems to be the point of the question). – Gregor Thomas Dec 17 '15 at 21:33
1

This would seem to be the best solution. It generalises quite well, in fact: if you want to join another table, just add one more line to the pipeline. The code would end up looking not too different from a SQL multiple-join query. – Hong Ooi Dec 03 '16 at 07:26
@HongOoi but the by= variable varies right? please check my answer on this – joel.wilson Dec 03 '16 at 08:48

score 4 · Answer 3 · answered Feb 11 '22 at 14:15

Yet another solution:

library(purrr)
library(dplyr)

dfs = list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
  df2 = data.frame(c = 4:6, b = c("a", "c", "d")),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)

purrr::reduce(dfs, dplyr::left_join, by = 'b')

How to join multiple data frames using dplyr?

3 Answers3

Linked