0

I would like to create a function that merges two different tables, using two different columns from each data set.

df_1 <- data.frame(state_symbol = c('MG', 'SP', 'BA'),
                   city = c('Sao Paulo', 'Sao Paulo', 'Brumado'),
                   collected_data = c('red', 'green', 'blue'))

df_2 <- data.frame(S_Symb = c('MG', 'BA', "SP"),
                   municip = c('Sao Paulo', 'Brumado', 'Sao Paulo'),
                   population = c(123, 456, 789))

left_join(x = df_1,
          y = df_2,
          by = c("state_symbol" = "S_Symb",
                 "city" = "municip"))`

This produces the correct result:

  state_symbol      city collected_data population
1           MG Sao Paulo            red        123
2           SP Sao Paulo          green        789
3           BA   Brumado           blue        456

Now I want to use this code inside a function. I tried the following:

my_join <- function(tab_1,
                    tab_2,
                    df_1_city_col,
                    df_1_state_col,
                    df_2_city_col,
                    df_2_state_col) {
    
    output <- dplyr::left_join(x = tab_1,
                               y = tab_2,
                               by = c({{df_1_city_col}} = {{df_2_city_col}},
                                     {{df_1_state_col}} = {{df_2_state_col}}))
    return(output)
}

But that produced the following error:

> my_join <- function(tab_1,
+                     tab_2,
+                     df_1_city_col,
+                     df_1_state_col,
+                     df_2_city_col,
+                     df_2_state_col) {
+     
+     output <- dplyr::left_join(x = tab_1,
+                                y = tab_2,
+                                by = c({{df_1_city_col}} = {{df_2_city_col}},
Error: unexpected '=' in:
"                               y = tab_2,
                               by = c({{df_1_city_col}} ="

Apparently, functions don't deal well with the notation c(something = something_else). The = inside that parenthesis is too much for it and it instantaneously produces an error no matter what I try.

Just for reference, my function call would look like this:

my_join(tab_1 = df_1,
        tab_2 = df_2,
        df_1_city_col = 'city',
        df_1_state_col = 'state_symbol',
        df_2_city_col = 'municip',
        df_2_state_col = 'S_Symb')

I saw some other similar answered questions, but couldn't find one where the join needs to occur between two columns from each input table.

Any help will be appreciated. Thanks.

EsterRB
  • 15
  • 2

1 Answers1

2

Note that c("state_symbol" = "S_Symb","city" = "municip") actually creates a named vector which you can create with setNames to use it inside a function.

my_join <- function(tab_1,
                    tab_2,
                    df_1_city_col,
                    df_1_state_col,
                    df_2_city_col,
                    df_2_state_col) {
  
  output <- dplyr::left_join(x = tab_1,
                             y = tab_2,
                             by = setNames(c(df_2_city_col, df_2_state_col),
                                            c(df_1_city_col, df_1_state_col)))
  return(output)
}

my_join(tab_1 = df_1,
        tab_2 = df_2,
        df_1_city_col = 'city',
        df_1_state_col = 'state_symbol',
        df_2_city_col = 'municip',
        df_2_state_col = 'S_Symb')

#  state_symbol      city collected_data population
#1           MG Sao Paulo            red        123
#2           SP Sao Paulo          green        789
#3           BA   Brumado           blue        456

In base R, you can use by.x and by.y to specify the columns to merge.

my_join <- function(tab_1,
                    tab_2,
                    df_1_city_col,
                    df_1_state_col,
                    df_2_city_col,
                    df_2_state_col) {
  
  output <- merge(tab_1, tab_2, 
                  by.x = c(df_1_city_col, df_1_state_col), 
                  by.y = c(df_2_city_col, df_2_state_col), 
                  all.x = TRUE)
  
  return(output)
}
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you so much. I did not know that `c("state_symbol" = "S_Symb","city" = "municip")` was creating a named vector. Could you perhaps comment further on that? Also, It is good to know a base R solution. – EsterRB Jun 30 '21 at 15:59
  • If you print `c("state_symbol" = "S_Symb","city" = "municip")` in the console you'll see the structure of the named vector it creates. The goal is to create something similar with `setNames`. – Ronak Shah Jun 30 '21 at 23:06