0

I'm joining the following two tibbles using full_join:

library(dplyr)
library(tibble)

tibble(id=c(1:2, NA), b = c("mouse", "cat", "fish"), c = 6:8) %>% 
full_join(tibble(id=1:3, b = c("mouse", "", "fish"), c = 6:8))

This will give me:

A tibble: 5 x 3
 id     b     c
 1 mouse     6
 2   cat     7
NA  fish     8
 2           7
 3  fish     8

Ideally though, I would like to get something like that:

A tibble: 5 x 3
 id     b     c
 1 mouse     6
 2   cat     7
 3  fish     8

Where all NAs or missing values are being replaced by a more "superior" row that has more information. How can I do this?

www
  • 38,575
  • 12
  • 48
  • 84
Nina
  • 91
  • 7

3 Answers3

1

Regarding your specific example, I think your question is how to combine two tibbles while each of them has different incomplete observations (rows). Before conducting full_join, you can clear your data a little bit by using filter_all.

filter_all(all_vars( (!is.na(.)) & (.!="") ) )

So it means to select row that meet two conditions (for all variables): is not NA and "".

tb1 <- tibble(id=c(1:2, NA), b = c("mouse", "cat", "fish"), c = 6:8) 
tb1 <- tb1%>% filter_all(all_vars((!is.na(.))&(.!="")))
> tb1
# A tibble: 3 x 3
   id    b         c
  <int> <chr> <int>
1     1 mouse     6
2     2 cat       7

Do the same thing for the anther set and then do the full_join.

tb2 <- tibble(id=1:3, b = c("mouse", "", "fish"), c = 6:8)
tb2 <- tb2 %>% filter_all(all_vars((!is.na(.))&(.!="")))
full_join(tb1,tb2)
Joining, by = c("id", "b", "c")
# A tibble: 3 x 3
     id b         c
  <int> <chr> <int>
1     1 mouse     6
2     2 cat       7
3     3 fish      8
Tony416
  • 596
  • 6
  • 11
0

Assuming the two tibbles are called a and b, we can first replace any character considered as missing values to be NA (in this case, "" is considered to be NA). b[b == ""] <- NA is a quick way to do this.

After that, we can do full_join by the column you know is complete (in this case, column c). Finally, we can use the coalesce function to replace NA with two associated columns. dat is the final output.

library(dplyr)
library(tibble)

# Create example data frame
a <- tibble(id=c(1:2, NA), b = c("mouse", "cat", "fish"), c = 6:8)
b <- tibble(id=1:3, b = c("mouse", "", "fish"), c = 6:8)

# Replace "" with NA
b[b == ""] <- NA

# Conduct full_join by c
dat <- full_join(a, b, by = "c") %>%
  # Use coalesce to merge column and replace NA
  mutate(id = coalesce(.$id.x, .$id.y),
         b = coalesce(.$b.x, .$b.y)) %>%
  # Select columns
  select(id, b, c)

# View the results
dat
# # A tibble: 3 x 3
#      id b         c
#   <int> <chr> <int>
# 1     1 mouse     6
# 2     2 cat       7
# 3     3 fish      8
www
  • 38,575
  • 12
  • 48
  • 84
0

We could use safe_left_join from my package safejoin, and use dplyr::coalesce to handle the column conflicts.

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)

a <- tibble(id=c(1:2, NA), b = c("mouse", "cat", "fish"), c = 6:8)
b <- tibble(id=1:3, b = c("mouse", "", "fish"), c = 6:8)

solution:

b[b==""] <- NA
safe_left_join(a, b, by = "c", conflict = coalesce)
# # A tibble: 3 x 3
#      id b         c
#   <int> <chr> <int>
# 1     1 mouse     6
# 2     2 cat       7
# 3     3 fish      8
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167