2

I'm currently working on a longitudinal data base in R. Therefore, I have a lot of missing values, because the values of the variables which have been unchanged since the last interview are not added in the new database. For example in the first wave the sex is defined as boy or girl and it doesn't change between the first wave and the second wave, so they are not giving the sex in the second wave again.

Basically, what I would like to do is to merge the data I have selected for the second wave and merge it with the data from the first wave, in order to eliminate some NAs. However, I would like to only keep the columns I have selected from the second wave. For the moment, and after looking on the internet, I was only able to merge the two datasets but I'm not able to only keep the data from the second wave.

Here is my code:

library("rqdatatable")

x <- data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, NA,0),
S2 = c(2, NA, 2,2)
)


y <- data.frame(
ID = c(1, 2, 3, 4,5,6,7,8),
S1 = c(1, 2, 5, 1,3,6,8,2),
S3 = c(3, 3, 3, 3,7,1,6,9),
S2 = c(0,0,0,0,0,0,0,0),
S4 = c(0,0,0,0,0,0,0,0)
)

final <- natural_join(x, y, 
                   by = "ID",
                   jointype = "LEFT")

What I would like to get after my merge is:

    z = data.frame(
  ID = c(1,2,3,4),
  S1 = c(1, 3, 5,0),
  S2 = c(2, 0, 2,2)
)

Do you have any idea of how I can solve my problem? It would be very time consuming to merge everything and to select the variables I want again.

Many thanks and best regards!

Jpaete
  • 59
  • 4

2 Answers2

3

We could use inner_join in combination with coalesce

library(dplyr)

x %>% 
  inner_join(y, by="ID") %>% 
  mutate(S1 = coalesce(S1.x, S1.y),
         S2 = coalesce(S2.x, S2.y)) %>% 
  select(ID, S1, S2)
  
  ID S1 S2
1  1  1  2
2  2  3  0
3  3  5  2
4  4  0  2
TarJae
  • 72,363
  • 6
  • 19
  • 66
3

Here is a base r function that joins the data like in the question. It can also be call via a pipe, in this case R's pipe operator introduced in R 4.1.

x <- data.frame(
  ID = c(1,2,3,4),
  S1 = c(1, 3, NA,0),
  S2 = c(2, NA, 2,2)
)

y <- data.frame(
  ID = c(1, 2, 3, 4,5,6,7,8),
  S1 = c(1, 2, 5, 1,3,6,8,2),
  S3 = c(3, 3, 3, 3,7,1,6,9),
  S2 = c(0,0,0,0,0,0,0,0),
  S4 = c(0,0,0,0,0,0,0,0)
)


joinSpecial <- function(x, y, idcol = "ID"){
  idcolx <- which(names(x) == idcol)
  idcoly <- which(names(y) == idcol)
  idx <- which(names(x) %in% names(y))
  idy <- which(names(y) %in% names(x))
  idx <- idx[idx != idcolx]
  idy <- idy[idy != idcoly]
  i <- match(x[[idcolx]], y[[idcoly]])
  x[idx] <- mapply(\(a, b, i){
    na <- is.na(a)
    a[na] <- b[i][na]
    a
  }, x[idx], y[idy], MoreArgs = list(i = i), SIMPLIFY = FALSE)
  x
}

joinSpecial(x, y)
#>   ID S1 S2
#> 1  1  1  2
#> 2  2  3  0
#> 3  3  5  2
#> 4  4  0  2

x |> joinSpecial(y)
#>   ID S1 S2
#> 1  1  1  2
#> 2  2  3  0
#> 3  3  5  2
#> 4  4  0  2

Created on 2022-03-18 by the reprex package (v2.0.1)

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • 1
    Going to `s`tep through this again with `debugonce` (or thrice), but certainly a word or two about the `mapply(\(a,b,i){` truth table construction, whose a & b seem mysterious while effective, and this is the first time I've seen any inner join characterized in `base`, but perhaps I haven't looked carefully. – Chris Mar 19 '22 at 04:15
  • Actually, I think Chris makes a good point. When I try to run the code, I have an error message : Error : '}' unexpected in "}". Furthermore, I have two error signs at the same line as x[idx] = mapply(\(a,b,i){ : "Unexpected token '\' " and "unexpected token ',' " . Do you know how I can solve this? – Jpaete Mar 19 '22 at 07:47
  • 1
    @Jpaete Those errors are probably because your version of R is pre R 4.1 when the new anonymous functions `\(...)` were introduced. try `mapply(function(a, b, i) etc)`. – Rui Barradas Mar 19 '22 at 08:08