Matching data with duplicates for self-update with different sources in R

Question

I have a set of data with duplicates:

x <- tibble(num=c(1,2,3,2,5,5,8), alph=NA)

And separate sources giving their corresponding values.

y <- tibble(num=1:4, alph=LETTERS[1:4])
z <- tibble(num=5:10, alph=LETTERS[5:10])

Normally, one would use this code to update x$num with data from y.

x$alph <- y$alph[match(x$num,y$num)]

Doing the same for z would nonetheless overwrite what was already in place from y and replace them with NAs.

How can I code so that data can be cumulatively updated? Using:

x$alph[which(x$num %in% z$num)] <- y$alph[which(z$num %in% x$num)]

doesn't work because of the duplicate.

www · Answer 1 · 2017-11-11T22:16:42.240

Here I provided three options using tidyverse. x2, x4, and x5 are the final output.

We can create a combined data frames from y and z, and then perform a join with x.

# Load packages
library(tidyverse)

# Create example data frames
x <- tibble(num=c(1,2,3,2,5,5,8), alph=NA)
y <- tibble(num=1:4, alph=LETTERS[1:4])
z <- tibble(num=5:10, alph=LETTERS[5:10])

# Create combined table from y and z
yz <- bind_rows(y, z)

# Perform join
x2 <- x %>% 
  select(-alph) %>%
  left_join(yz, by = "num")
x2
# # A tibble: 7 x 2
#     num  alph
#   <dbl> <chr>
# 1     1     A
# 2     2     B
# 3     3     C
# 4     2     B
# 5     5     E
# 6     5     E
# 7     8     H

Or use reduce to merge all data frames, then select the one that is not NA to construct a new data frame.

x3 <- reduce(list(x, y, z), left_join, by = "num")
x4 <- tibble(num = x3$num,
             alph = apply(x3[, -1], 1, function(x) x[!is.na(x)]))
x4
# # A tibble: 7 x 2
#     num  alph
#   <dbl> <chr>
# 1     1     A
# 2     2     B
# 3     3     C
# 4     2     B
# 5     5     E
# 6     5     E
# 7     8     H

Or after the reduce and join, Use gather to remove NA values.

x3 <- reduce(list(x, y, z), left_join, by = "num")
x5 <- x3 %>% 
  gather(Type, alph, -num, na.rm = TRUE) %>%
  select(-Type)
x5
# # A tibble: 7 x 2
#     num  alph
#   <dbl> <chr>
# 1     1     A
# 2     2     B
# 3     3     C
# 4     2     B
# 5     5     E
# 6     5     E
# 7     8     H

Matching data with duplicates for self-update with different sources in R

1 Answers1