1

I have two data frames with a unifying index v1. I need to create a third data frame with no NA values, if a number is available.

I have tried various combinations of the join functions from dplyr and the rbind.fill function in plyr.

# Given
v1 <- c("a", "b", "c", "d")
df1 <- cbind.data.frame(v1, v2 = c(1,NA,3,NA))
df2 <- cbind.data.frame(v1, v2 = c(NA,2,NA,4))

# I would like
df3 <- cbind.data.frame(v1, v2 = c(1,2,3,4))

How can I make this possible?

M--
  • 25,431
  • 8
  • 61
  • 93
Falnésio
  • 125
  • 1
  • 9
  • 1
    Or `setDT(df1)[is.na(v2), v2 := setDT(df2)[.SD, v2, on = .(v1)]]` as described here: [Replace missing values (NA) in one data set with values from another where columns match](https://stackoverflow.com/questions/32638845/replace-missing-values-na-in-one-data-set-with-values-from-another-where-colum) – Henrik May 13 '19 at 19:49

4 Answers4

3

An option would be to do a full_join on 'v1' and then coalesce the 'v2' columns

library(dplyr)
full_join(df1, df2, by = 'v1') %>%
    transmute(v1, v2 = coalesce(v2.x, v2.y))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 2
    @Gainz ```setDT(df1)[df2, v2i := i.v2, on='v1'][is.na(v2), v2 := v2i][,v2i:= NULL][]``` I am pretty sure that I got this from akrun; I had this comment beside this line within one of my scripts saying credit goes to them. – M-- May 13 '19 at 19:36
2

Here's a simple base solution:

> df3 = df2
> df3$v2 = ifelse(is.na(df1$v2),df2$v2,df1$v2)
> df3
  v1 v2
1  a  1
2  b  2
3  c  3
4  d  4

the ifelse picks the number from one column or the other based on the NA nature of one column.

It doesn't cover the case if both or neither of the source v2 columns are NA but there's none in your sample and you don't mention it...

Spacedman
  • 92,590
  • 12
  • 140
  • 224
1

I am sure that I got this from @akrun but could not find the thread. Posting an answer for future reference:

library(data.table)
setDT(df1)[df2, v2i := i.v2, on='v1'][is.na(v2), v2 := v2i][,v2i:= NULL][]
#>    v1 v2
#> 1:  a  1
#> 2:  b  2
#> 3:  c  3
#> 4:  d  4

Created on 2019-05-13 by the reprex package (v0.2.1)



@Henrik's comment also suggested the following:

setDT(df1)[is.na(v2), v2 := setDT(df2)[.SD, v2, on = .(v1)]]

but I personally prefer the first solution over this mostly because of the sequence of getting the answer (joining dataframes first and replacing NA later).

M--
  • 25,431
  • 8
  • 61
  • 93
  • 1
    You find some arguments for the "update join" idiom [here](https://stackoverflow.com/a/54313203/1851712) – Henrik May 13 '19 at 20:39
0

We can use {powerjoin} :

library(powerjoin)
power_left_join(df1, df2, by = "v1", conflict = coalesce_xy)
#   v1 v2
# 1  a  1
# 2  b  2
# 3  c  3
# 4  d  4
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167