How to collapse a dataframe with duplicate IDs and varying missing values per ID, so that NAs are replaced by values from the duplicate ID? (in R)

Question

I have a large dataframe where each value in the ID column represents a person. I want to collapse the dataframe so each ID (person) populates fewer rows (fewer duplicated IDs) but I only want to collapse the IDs if values missing in other columns of a row with ID 4 (for example) are replaced with non-missing values form another row also with ID 4 --- all using R code

Example dataframe below.

ID <- c(1, 1, 2, 4, 4, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou')
gender <- c(NA, 'female', 'male', 'male', NA, 'female')

(df <- data.frame(id, name, gender))

  ID name gender
1  1 kate   <NA>
2  1 <NA> female
3  2  jim   male
4  4 <NA>   male
5  4  dan   <NA>
6  5  lou female

The result would be a dataframe that collapses the missing values by ID, so information from a duplicated ID informs what should exist in missing column values for that same ID in another row.

Desired result:

  ID name gender
1  1 kate female
3  2  jim   male
4  4  dan   male
6  5  lou female

The problem is that sometimes, we have a dataframe like:

ID <- c(1, 1, 2, 4, 4, 5, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou', 'lou smith')
gender <- c(NA, 'female', 'male', 'male', NA, 'female', 'female')
(df2 <- data.frame(ID, name, gender))

  ID      name gender
1  1      kate   <NA>
2  1      <NA> female
3  2       jim   male
4  4      <NA>   male
5  4       dan   <NA>
6  5       lou female
7  5 lou smith female
8  5      <NA> female

And I don't want to remove a duplicated ID row if it has conflicting information to its counterpart. In this case, I'd just want the result to be:

  ID      name gender
1  1      kate female
2  2       jim   male
4  4       dan   male
5  5       lou female
6  5 lou smith female

Try `library(dplyr);df1 %>% group_by(ID) %>% summarise_all(na.omit)` Do you have cases where there are more than one non-NA per group — akrun, Jan 10 '19 at 16:21
Yep I do. The actual dataframe is huge with many columns. Some rows have missing data in many columns and some duplicated IDs/people have conflicting info for the same variable, and I want to preserve those as independent columns for now; I'll probably just concatenate those observations later. — Samantha Karlaina Rhoads, Jan 10 '19 at 16:32
Try if this works `df2 %>% group_by(ID) %>% summarise_all(funs(list(.[complete.cases(.)]))) %>% unnest` — akrun, Jan 10 '19 at 16:32
The problem with that is all the nested columns don't have the same number of elements. I'll edit my post to clarify that too. — Samantha Karlaina Rhoads, Jan 10 '19 at 16:35
In that case, it is unclear how to compensate for the number of elements. May be keep it as a list and remove the `unnest` part — akrun, Jan 10 '19 at 16:37
As I said the rules are not clear for those case. May be `df2 %>% group_by(ID) %>% summarise_all(funs(list(.[complete.cases(.)]))) %>% unnest(gender, .drop = FALSE) %>% unnest %>% distinct` — akrun, Jan 10 '19 at 16:43
I actually might have it! `do.call(rbind, lapply(split(df2, df2$ID), zoo::na.locf)) %>% dplyr::distinct()` seems to potentially work! — Samantha Karlaina Rhoads, Jan 10 '19 at 16:44
Yours works too I believe! Akrun's slightly edited: `df2 %>% group_by(ID) %>% summarise_all(funs(list(.[complete.cases(.)]))) %>% tidyr::unnest(gender, .drop = FALSE) %>% tidyr::unnest() %>% dplyr::distinct()`. Awesome! Thank you! — Samantha Karlaina Rhoads, Jan 10 '19 at 16:46

score 2 · Answer 1 · answered Jan 10 '19 at 16:46

If we are replacing the NA with adjacent non-NA, and get the distinct rows, then with tidyverse, use fill

library(tidyverse)
df2 %>% 
   group_by(ID) %>% 
   fill(name, gender) %>% 
   fill(name, gender, .direction = 'up') %>%
   distinct
# A tibble: 5 x 3
# Groups:   ID [4]
#     ID name      gender
#  <int> <chr>     <chr> 
#1     1 kate      female
#2     2 jim       male  
#3     4 dan       male  
#4     5 lou       female
#5     5 lou smith female

data

df2 <- structure(list(ID = c(1L, 1L, 2L, 4L, 4L, 5L, 5L, 5L), name = c("kate", 
NA, "jim", NA, "dan", "lou", "lou smith", NA), gender = c(NA, 
"female", "male", "male", NA, "female", "female", "female")),
  class = "data.frame", row.names = c("1", 
 "2", "3", "4", "5", "6", "7", "8"))

Sweet! Yeah! It looks like it worked for me: `df2 %>% dplyr::group_by(ID) %>% tidyr::fill(name, gender) %>% tidyr::fill(name, gender, .direction = 'up') %>% dplyr::distinct()` — Samantha Karlaina Rhoads, Jan 10 '19 at 16:50

score 2 · Answer 2 · edited Jan 10 '19 at 17:01

library(dplyr)

ID <- c(1, 1, 2, 4, 4, 5, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou', 'lou smith')
gender <- c(NA, 'female', 'male', 'male', NA, 'female', 'female')
(df2 <- data.frame(ID, name, gender, stringsAsFactors = FALSE))


df2

df2 %>%  
  group_by(ID) %>% 
  mutate(name_max = max(name, na.rm = T), 
         gender_max = max(gender, na.rm = T)) %>% 
ungroup %>% 
mutate(name   = if_else(is.na(name), name_max, name), 
       gender = if_else(is.na(gender), gender_max, gender))   %>% 
  select(ID, name, gender) %>%  
  distinct %>%  
  head(10)

slightly edited:

df2 %>%  
  group_by(ID) %>% 
  mutate(name_max = max(as.character(name), na.rm = T), 
         gender_max = max(as.character(gender), na.rm = T)) %>% 
  ungroup %>% 
  mutate(name   = if_else(is.na(name), name_max, as.character(name)), 
         gender = if_else(is.na(gender), gender_max, 
as.character(gender)))   %>% 
  select(ID, name, gender) %>%  
  distinct()

I just have to make sure the `gender` and `name` vars aren't factors and it seems like we're good! `df2 %>% group_by(ID) %>% mutate(name_max = max(as.character(name), na.rm = T), gender_max = max(as.character(gender), na.rm = T)) %>% ungroup %>% mutate(name = if_else(is.na(name), name_max, as.character(name)), gender = if_else(is.na(gender), gender_max, as.character(gender))) %>% select(ID, name, gender) %>% distinct()` — Samantha Karlaina Rhoads, Jan 10 '19 at 16:55

How to collapse a dataframe with duplicate IDs and varying missing values per ID, so that NAs are replaced by values from the duplicate ID? (in R)

2 Answers2

data