Last observation carried forward conditional on multiple columns

Question

I have a dataset with this structure:

ID = c(1,1,1,1,2,2,2,3,3,3,3) 
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA) 
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1) 
df = data.frame(ID, L40, K50)
# ID L40 K50
# 1   1   1  NA
# 2   1  NA  NA
# 3   1  NA  NA
# 4   1  NA  NA
# 5   2   1  NA
# 6   2  NA   1
# 7   2  NA  NA
# 8   3  NA  NA
# 9   3   1  NA
# 10  3  NA  NA
# 11  3  NA   1

When missing values occur in columns L40 and K50, I want to carry forward the last non-missing value in that column, conditional on ID being the same as the previous ID and the values in L40 and K50 in the current row being empty. I applied the following code:

    library(tidyr)
    df2 <- df %>% group_by(ID) %>% fill(L40:K50)

This does not achieve what I am looking for. I want the previous non-missing value to be carried forward into the next row only when the other columns (except ID) in that row are empty. This is what I want:

    ID = c(1,1,1,1,2,2,2,3,3,3,3) 
    L40 = c(1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA)
    K50 = c(NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1)  
    df3 = data.frame(ID, L40, K50)
df3
# ID L40 K50
# 1   1   1  NA
# 2   1   1  NA
# 3   1   1  NA
# 4   1   1  NA
# 5   2   1  NA
# 6   2  NA   1
# 7   2  NA   1
# 8   3  NA  NA
# 9   3   1  NA
# 10  3   1  NA
# 11  3  NA   1

akrun · Answer 1 · 2016-07-28T12:42:50.850

0

We can use na.locf

library(data.table)
library(zoo)
setDT(df)[, if(any(is.na(K50[-1]))) lapply(.SD, na.locf) else .SD , by = ID]
#   ID L40 K50
#1:  1   1  NA
#2:  1   1  NA
#3:  1   1  NA
#4:  1   1  NA
#5:  2   1  NA
#6:  2  NA   1
#7:  3  NA   1
#8:  3  NA   1
#9:  3  NA   1

An option using dplyr would be

library(dplyr)
df %>% 
   mutate(ind = rowSums(is.na(.))) %>%
   group_by(ID)  %>%
   mutate_each(funs(if(any(ind>1)) na.locf(., na.rm=FALSE) else .), L40:K50) %>%
   select(-ind)
#      ID   L40   K50
#   <dbl> <dbl> <dbl>
#1     1     1    NA
#2     1     1    NA
#3     1     1    NA
#4     1     1    NA 
#5     2     1    NA
#6     2    NA     1
#7     3    NA     1
#8     3    NA     1
#9     3    NA     1

edited Jul 28 '16 at 12:42

answered Jul 28 '16 at 09:28

akrun

874,273
37
540
662

No, that produces the exact same result that I want to avoid. I don't want the value of L40 in the 5th row to be carried forward into the 6th row. – udden2903 Jul 28 '16 at 09:29
@udden2903 How is id 2 different from id 3? – akrun Jul 28 '16 at 09:37
ID 2 as in ID=2. Sorry about the confusion. – udden2903 Jul 28 '16 at 09:38
@udden2903 Added a possible solution. Please check – akrun Jul 28 '16 at 09:40
Thanks. It works, although I was hoping to find a way of doing it using tidyr or dplyr. – udden2903 Jul 28 '16 at 11:26
@udden2903 I posted an option with `dplyr` – akrun Jul 28 '16 at 12:45
I had another look at this and realized that your code does not solve my problem. I moved around some missing values in my original post which would show you the incorrect result. I suspect that the problem is related to the "any" statement in the "mutate_each" function. As I understand it, this means that the last non-missing value will be carried forward if any of the rows in the by-group have more than one missing value. This causes a problem, as I want these values to be carried forward only if there are no recorded values in any of the columns in the next row. – udden2903 Jul 28 '16 at 15:42
@udden2903 Can you update your post with a new example that mimics the problem? – akrun Jul 29 '16 at 00:37
That's what I have done. Try running your code on the first block of code in my post. – udden2903 Jul 29 '16 at 05:25

score 0 · Answer 2 · answered Jul 29 '16 at 14:21

I played around with this question for a while, and with my limited knowledge of R I came up with the following work-around. I have added a date column to the original data frame for purpose of illustration:

ID = c(1,1,1,1,2,2,2,3,3,3,3)
date = c(1,2,3,4,1,2,3,1,2,3,4)
L40 = c(1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, 1) 
df = data.frame(ID, date, L40, K50)

Here is what I did:

#gather the diagnosis columns in rows and keep only those rows where the patient has the associated diagnosis.
df1 <- df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1) %>% arrange(ID, date)

#concatenate across rows by ID and date to collect all diagnoses of an ID at a particular date.
df2 <- df1 %>% group_by(ID, date) %>% mutate(diag = paste(diagnos, collapse=" ")) %>% select(-diagnos, -dummy)

#convert into data tables in preparation for join
Dt1 <- data.table(df)
Dt2 <- data.table(df2)

setkey(Dt1, ID, date)
setkey(Dt2, ID, date)

#Each observation in Dt1 is matched with the observation in Dt1 with the same date or, if that particular date is not present, 
#by the nearest previous date:
final <- Dt2[Dt1, roll=TRUE] %>% distinct()

This carries forward the name(s) of the diagnosis until the next observed diagnosis.

I can execute `df %>% gather(diagnos, dummy, L40:K50)` but alas when I try execute `df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1)` I get the following error `Error in filter(., dummy == 1) : object 'dummy' not found In addition: Warning message: In data.matrix(data) : NAs introduced by coercion` Any idea what is going wrong? I do not use tidyr so am not au fait with syntax, or not au fait enough to troubleshoot. — Farrel, Jan 16 '19 at 04:33

Last observation carried forward conditional on multiple columns

2 Answers2