2

There might be a *_join version for this I'm missing here, but I have two data frames, where

  1. The merging should happen in the first data frame, hence left_join
  2. I not only want to add columns, but also update existing columns in the first data frame, more specifically: replace NA's in the first data frame by values in the second data frame
  3. The second data frame contains more rows than the first one.

Condition #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.

x <- data.frame(id = c(1, 2, 3),
                a  = c("A", "B", NA))

  id    a
1  1    A
2  2    B
3  3 <NA>

y <- data.frame(id = c(1, 2, 3, 4),
                a  = c("A", "B", "C", "D"),
                q  = c("u", "v", "w", "x"))

  id a q
1  1 A u
2  2 B v
3  3 C w
4  4 D x

and the desired output would be:

  id a q
1  1 A u
2  2 B v
3  3 C w

I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?

library(tidyverse)
x %>%
  left_join(., y %>% select(id, q), by = c("id")) %>%
  rows_update(., y %>% filter(id %in% x$id), by = "id")
deschen
  • 10,012
  • 3
  • 27
  • 50

1 Answers1

3

You can left_join and use coalesce to replace missing values.

library(dplyr)

x %>%
  left_join(y, by = 'id') %>%
  transmute(id, a = coalesce(a.x, a.y), q)

#  id a q
#1  1 A u
#2  2 B v
#3  3 C w
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Ah I see. Thanks. Seems I need to learn some more tidyverse verbs. Only thing, as with every toy example I only showed one variable, but in real life I have ~70-100 "a" variables, so I have a bunch of a.x/a.y, b.x, b.y variables. Any suggestion how to automate? (I guess it could work via pivoting_longer), but maybe there's a more direct approach? – deschen Jan 06 '21 at 14:03
  • I think this is what you can use https://stackoverflow.com/questions/54971918/dplyr-mutate-at-coalesce-dynamic-names-of-columns – Ronak Shah Jan 06 '21 at 14:49
  • Thanks, I'll have a look. Although looking at the proposed solution, it seems the combination of left_join and rows_update is not a bad one given that whole split/map/!!! complication. – deschen Jan 06 '21 at 15:47
  • OK, I got it working with this split and map_df solution. However, the issue with that one is that it sorts columns alphabetically, which I don't want. – deschen Jan 06 '21 at 18:33
  • Not sure if I understand. Can you make up a small example to demonstrate that? – Ronak Shah Jan 07 '21 at 01:25
  • Sure. Using the same toy data of `x` and `y`, the order of columns is id, a, q. However, using the split.default and map_df approach via `x %>% left_join(., y, by = "id") %>% split.default(str_remove(names(.), "\\..*")) %>% map_df(reduce, coalesce)` gives the following order: a, id, q. – deschen Jan 07 '21 at 07:00
  • 1
    `x %>% left_join(y, by = 'id') %>% split.default(str_remove(names(.), "\\..*")) %>% map_df(reduce, coalesce) %>% select(names(y))` – Ronak Shah Jan 07 '21 at 07:31
  • True, this would work if all columns of x are part of y and x wouldn't contain columns not part of y. But that's the case in my real life scenario (sorry, I tried to keep the example simple), but let's assume x has columns id, a, test and y has column id, a, q and the combined data is supposed to have columns id, a, test, q. – deschen Jan 07 '21 at 08:33
  • You can do `select(names(x), names(y))`. Anyway, I don't think this is what your actual question was about. – Ronak Shah Jan 07 '21 at 08:36
  • Thanks Ronak. And you are right. Although this new solution would work, it doesn't seem to be much more efficient or easier or shorter than my initial approach. But it's good to know that there are alternatives. – deschen Jan 07 '21 at 09:04