2

Context

My question is similar to these two:

But I want to keep multiple rows, if there is a 'conflict' between two rows. (Also these questions start AFTER the merging of data frames. For my question it would be possible to provide an answer that modifies the merging process. I can guarantee, that the IDs are unique for each data frame. Of course it is also possible to provide a more general solution, that starts after the merge.)

The Problem

I want to merge two data frames but ignore included NAs.

Example

Given

> my_data1
  ID m n    o    p q
1  A K U <NA> <NA> K
2  B L V    Q    A L
3  C M W <NA>    S M
4  D N X    W    D N
5  E O Y <NA> <NA> O

and

> my_data2
  ID m n    o    p r
1  A K U    E <NA> U
2  B L V <NA>    B V
3  C M W    R <NA> W
4  D N X <NA> <NA> X
5  E O Y    T    F Y

Merging them returns

> my_data
   ID m n    o    p    q    r
1   A K U    E <NA> <NA>    U
2   A K U <NA> <NA>    K <NA>
3   B L V    Q    A    L <NA>
4   B L V <NA>    B <NA>    V
5   C M W    R <NA> <NA>    W
6   C M W <NA>    S    M <NA>
7   D N X    W    D    N <NA>
8   D N X <NA> <NA> <NA>    X
9   E O Y    T    F <NA>    Y
10  E O Y <NA> <NA>    O <NA>

But instead I want to obtain

> my_data
   ID m n    o    p    q    r
1   A K U    E <NA>    K    U
2   B L V    Q    A    L <NA>
3   B L V <NA>    B <NA>    V
4   C M W    R    S    M    W
5   D N X    W    D    N    X
6   E O Y    T    F    O    Y
  • ID == B can not be merged, because of column p.
  • ID == A is quite special because of both NAs in column p.
  • IDs C, D, and E are relatively easy and can be solved by the code below.

Code

my_data1 <- data.frame(ID = LETTERS[1:5],
                       m = LETTERS[11:15],
                       n = LETTERS[21:25],
                       o = c(NA, 'Q', NA, 'W', NA),
                       p = c(NA, 'A', 'S', 'D', NA),
                       q = LETTERS[11:15])

my_data2 <- data.frame(ID = LETTERS[1:5],
                       m = LETTERS[11:15],
                       n = LETTERS[21:25],
                       o = c('E', NA, 'R', NA, 'T'),
                       p = c(NA, 'B', NA, NA, 'F'),
                       r = LETTERS[21:25])

my_data <- merge(my_data1, my_data2, all = TRUE)
# or
my_data <- bind_rows(my_data1, my_data2) |>
 arrange(ID)

# Solution (incomplete)

my_data <- my_data |>
 filter(ID != 'A', ID != 'B') |> # Error without this line
 select(-m, -n) |> # Error without this line
 group_by(ID) |>
 summarise(across(.cols = everything(),
                  .fns = ~ na.omit(.x)))

my_data |>
 view()
user1
  • 404
  • 1
  • 5
  • 18

2 Answers2

1

I think you need a bind_rows, then rearrange the columns to move non-NA to the top (for each group), and then filtering to keep rows with some non-NA in the last four columns.

library(dplyr)
bind_rows(my_data1, my_data2) %>%
  arrange(across(everything())) %>%
  group_by(ID) %>%
  mutate(across(everything(), ~ .[order(is.na(.))])) %>%
  ungroup() %>%
  filter(rowSums(!is.na(cur_data()[c("o","p","q","r")])) > 0)
# # A tibble: 6 × 7
#   ID    m     n     o     p     q     r    
#   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 A     K     U     E     <NA>  K     U    
# 2 B     L     V     Q     A     L     V    
# 3 B     L     V     <NA>  B     <NA>  <NA> 
# 4 C     M     W     R     S     M     W    
# 5 D     N     X     W     D     N     X    
# 6 E     O     Y     T     F     O     Y    
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Wow, this is a great idea! But: (1) the value `V` moves from row 3 to row 2. And (2) I do not want to group by columns `m` and `n`; all rows (`m` to `r`) should also be merged, if the values are identical – user1 Apr 05 '23 at 15:14
  • @user1, point 1: I understand, but there is no clear indication of _why_ we should implement that move for some columns and not for others; point 2: okay, take `m` and `n` out of the `group_by`. Fixed in a recent edit. – r2evans Apr 05 '23 at 15:51
  • @r2evans (1) "In a first step" the rows should be viewed as a whole. In the example there is a conflict for the two rows with ID == B. Thereby these rows should not be altered at all. (2) You now completely ignore columns m and n. What I wish for is a solution, that compares the values in each group and merges if they are identical or NA. Example: For ID == E it should be compared whether `O == O` and `Y == Y`. – user1 Apr 05 '23 at 16:01
0

At the end we probably don't want to end up with duplicated IDs. We could think about a tie-breaking rule or set to NA if disagreement, I use the latter.

We could use pmax with na.rm=TRUE to compare columns. If both are not NA and there's disagreement, result is NA.

(Note: I set my_data1[4, 2] to a different value to demonstrate)

f <- \(data1, data2, id='ID', verbose=TRUE) {
  stopifnot(identical(data1[[id]], data2[[id]]))
  cn <- intersect(names(data1), names(data2))
  m1 <- lapply(setNames(cn, cn), \(x, ...) {
    u <- data1[[x]]; v <- data2[[x]]
    pm <- pmax(u, v, na.rm=TRUE)
    w <- which(u != v)
    if (verbose && length(w) > 0) {
      message(sprintf('Disagreement row %s, column "%s", set to NA', w, x))
    }
    pm[w] <- NA
    pm
  }) |> as.data.frame()
  idn <- which(cn == id)
  m2 <- merge(data1[setdiff(names(data1), cn[-idn])], 
              data2[setdiff(names(data2), cn[-idn])])
  merge(m1, m2)
}

f(data1=my_data1, data2=my_data2, id='ID', verbose=TRUE)
# Disagreement row 4, column "m", set to NA
# Disagreement row 2, column "p", set to NA
#   ID    m n o    p q r
# 1  A    K U E <NA> K U
# 2  B    L V Q <NA> L V
# 3  C    M W R    S M W
# 4  D <NA> X W    D N X
# 5  E    O Y T    F O Y

Data:

my_data1 <- structure(list(ID = c("A", "B", "C", "D", "E"), m = c("K", "L", 
"M", "L", "O"), n = c("U", "V", "W", "X", "Y"), o = c(NA, "Q", 
NA, "W", NA), p = c(NA, "A", "S", "D", NA), q = c("K", "L", "M", 
"N", "O")), class = "data.frame", row.names = c("1", "2", "3", 
"4", "5"))

my_data2 <- structure(list(ID = c("A", "B", "C", "D", "E"), m = c("K", "L", 
"M", "N", "O"), n = c("U", "V", "W", "X", "Y"), o = c("E", NA, 
"R", NA, "T"), p = c(NA, "B", NA, NA, "F"), r = c("U", "V", "W", 
"X", "Y")), class = "data.frame", row.names = c("1", "2", "3", 
"4", "5"))
jay.sf
  • 60,139
  • 8
  • 53
  • 110