I have a large dataframe where each value in the ID column represents a person. I want to collapse the dataframe so each ID (person) populates fewer rows (fewer duplicated IDs) but I only want to collapse the IDs if values missing in other columns of a row with ID 4 (for example) are replaced with non-missing values form another row also with ID 4 --- all using R
code
Example dataframe below.
ID <- c(1, 1, 2, 4, 4, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou')
gender <- c(NA, 'female', 'male', 'male', NA, 'female')
(df <- data.frame(id, name, gender))
ID name gender
1 1 kate <NA>
2 1 <NA> female
3 2 jim male
4 4 <NA> male
5 4 dan <NA>
6 5 lou female
The result would be a dataframe that collapses the missing values by ID, so information from a duplicated ID informs what should exist in missing column values for that same ID in another row.
Desired result:
ID name gender
1 1 kate female
3 2 jim male
4 4 dan male
6 5 lou female
The problem is that sometimes, we have a dataframe like:
ID <- c(1, 1, 2, 4, 4, 5, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou', 'lou smith')
gender <- c(NA, 'female', 'male', 'male', NA, 'female', 'female')
(df2 <- data.frame(ID, name, gender))
ID name gender
1 1 kate <NA>
2 1 <NA> female
3 2 jim male
4 4 <NA> male
5 4 dan <NA>
6 5 lou female
7 5 lou smith female
8 5 <NA> female
And I don't want to remove a duplicated ID row if it has conflicting information to its counterpart. In this case, I'd just want the result to be:
ID name gender
1 1 kate female
2 2 jim male
4 4 dan male
5 5 lou female
6 5 lou smith female