28

I want to left_join multiple data frames:

dfs <- list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
  df2 = data.frame(c = 4:6, b = c("a", "c", "d")),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)
Reduce(left_join, dfs)
#   a b  c  d
# 1 1 a  4 NA
# 2 2 b NA  7
# 3 3 c  5  8

This works because they all have the same b column, but Reduce doesn't let me specify additional arguments that I can pass to left_join. Is there a work around for something like this?

dfs <- list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
  df2 = data.frame(c = 4:6, d = c("a", "c", "d")),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)

Update

This kind of works: Reduce(function(...) left_join(..., by = c("b" = "d")), dfs) but when by is more than one element it gives this error: Error: cannot join on columns 'b' x 'd': index out of bounds

nachocab
  • 13,328
  • 21
  • 91
  • 149
  • 4
    Can't you do `Reduce(function(...) left_join(..., other args here), dfs)`? – Rich Scriven Dec 17 '15 at 21:22
  • I hadn't thought of that. Yes, I can! Feel free to write it as your answer – nachocab Dec 17 '15 at 21:25
  • Hmm, actually, this breaks if `by` has more than one element. See update – nachocab Dec 17 '15 at 21:52
  • 3
    How do you expect the function to know which columns to join on? This is not really what the *_join functions where designed to do. If all of the columns have a common ID column you can use then your update solves the issue. Otherwise you are going to have to specify by hand which columns to use for the join for each pair. – Bishops_Guest Jul 15 '16 at 22:36
  • try purrr:reduce() instead? – crazyhottommy Aug 09 '17 at 20:48

3 Answers3

11

It's been too late i know....today I got introduced to the unanswered questions section. Sorry to bother.

Using left_join()

dfs <- list(
              df1 = data.frame(b = c("a", "b", "c"), a = 1:3),
              df2 = data.frame(d = c("a", "c", "d"), c = 4:6),
              df3 = data.frame(b = c("b", "c", "e"), d = 7:9)
         )

func <- function(...){
  df1 = list(...)[[1]]
  df2 = list(...)[[2]]
  col1 = colnames(df1)[1]
  col2 = colnames(df2)[1]
  xxx = left_join(..., by = setNames(col2,col1))
  return(xxx)
}
Reduce( func, dfs)
#  b a  c  d
#1 a 1  4 NA
#2 b 2 NA  7
#3 c 3  5  8

Using merge() :

func <- function(...){
  df1 = list(...)[[1]]
  df2 = list(...)[[2]]
  col1 = colnames(df1)[1]
  col2 = colnames(df2)[1]
  xxx=merge(..., by.x = col1, by.y = col2, , all.x = T)
  return(xxx)
}

Reduce( func, dfs)
#  b a  c  d
#1 a 1  4 NA
#2 b 2 NA  7
#3 c 3  5  8
joel.wilson
  • 8,243
  • 5
  • 28
  • 48
  • just to note : I had to keep the "key" variables as the first column, because after merging they automatically become the first columns in the dataframe – joel.wilson Dec 03 '16 at 07:21
9

Would this work for you?

jnd.tbl <- df1 %>%
    left_join(df2, by='b') %>%
    left_join(df3, by='d')
Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
elesk01s
  • 131
  • 6
  • 5
    Doesn't generalize very well (which seems to be the point of the question). – Gregor Thomas Dec 17 '15 at 21:33
  • 1
    This would seem to be the best solution. It generalises quite well, in fact: if you want to join another table, just add one more line to the pipeline. The code would end up looking not too different from a SQL multiple-join query. – Hong Ooi Dec 03 '16 at 07:26
  • @HongOoi but the by= variable varies right? please check my answer on this – joel.wilson Dec 03 '16 at 08:48
4

Yet another solution:

library(purrr)
library(dplyr)

dfs = list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
  df2 = data.frame(c = 4:6, b = c("a", "c", "d")),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)

purrr::reduce(dfs, dplyr::left_join, by = 'b')

Matthias Munz
  • 3,583
  • 4
  • 30
  • 47