0

I have a dataset with this structure:

ID = c(1,1,1,1,2,2,2,3,3,3,3) 
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA) 
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1) 
df = data.frame(ID, L40, K50)
# ID L40 K50
# 1   1   1  NA
# 2   1  NA  NA
# 3   1  NA  NA
# 4   1  NA  NA
# 5   2   1  NA
# 6   2  NA   1
# 7   2  NA  NA
# 8   3  NA  NA
# 9   3   1  NA
# 10  3  NA  NA
# 11  3  NA   1

When missing values occur in columns L40 and K50, I want to carry forward the last non-missing value in that column, conditional on ID being the same as the previous ID and the values in L40 and K50 in the current row being empty. I applied the following code:

    library(tidyr)
    df2 <- df %>% group_by(ID) %>% fill(L40:K50)

This does not achieve what I am looking for. I want the previous non-missing value to be carried forward into the next row only when the other columns (except ID) in that row are empty. This is what I want:

    ID = c(1,1,1,1,2,2,2,3,3,3,3) 
    L40 = c(1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA)
    K50 = c(NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1)  
    df3 = data.frame(ID, L40, K50)
df3
# ID L40 K50
# 1   1   1  NA
# 2   1   1  NA
# 3   1   1  NA
# 4   1   1  NA
# 5   2   1  NA
# 6   2  NA   1
# 7   2  NA   1
# 8   3  NA  NA
# 9   3   1  NA
# 10  3   1  NA
# 11  3  NA   1
Alex
  • 1,207
  • 9
  • 25
udden2903
  • 783
  • 6
  • 15

2 Answers2

0

We can use na.locf

library(data.table)
library(zoo)
setDT(df)[, if(any(is.na(K50[-1]))) lapply(.SD, na.locf) else .SD , by = ID]
#   ID L40 K50
#1:  1   1  NA
#2:  1   1  NA
#3:  1   1  NA
#4:  1   1  NA
#5:  2   1  NA
#6:  2  NA   1
#7:  3  NA   1
#8:  3  NA   1
#9:  3  NA   1

An option using dplyr would be

library(dplyr)
df %>% 
   mutate(ind = rowSums(is.na(.))) %>%
   group_by(ID)  %>%
   mutate_each(funs(if(any(ind>1)) na.locf(., na.rm=FALSE) else .), L40:K50) %>%
   select(-ind)
#      ID   L40   K50
#   <dbl> <dbl> <dbl>
#1     1     1    NA
#2     1     1    NA
#3     1     1    NA
#4     1     1    NA 
#5     2     1    NA
#6     2    NA     1
#7     3    NA     1
#8     3    NA     1
#9     3    NA     1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • No, that produces the exact same result that I want to avoid. I don't want the value of L40 in the 5th row to be carried forward into the 6th row. – udden2903 Jul 28 '16 at 09:29
  • @udden2903 How is id 2 different from id 3? – akrun Jul 28 '16 at 09:37
  • ID 2 as in ID=2. Sorry about the confusion. – udden2903 Jul 28 '16 at 09:38
  • @udden2903 Added a possible solution. Please check – akrun Jul 28 '16 at 09:40
  • Thanks. It works, although I was hoping to find a way of doing it using tidyr or dplyr. – udden2903 Jul 28 '16 at 11:26
  • @udden2903 I posted an option with `dplyr` – akrun Jul 28 '16 at 12:45
  • I had another look at this and realized that your code does not solve my problem. I moved around some missing values in my original post which would show you the incorrect result. I suspect that the problem is related to the "any" statement in the "mutate_each" function. As I understand it, this means that the last non-missing value will be carried forward if any of the rows in the by-group have more than one missing value. This causes a problem, as I want these values to be carried forward only if there are no recorded values in any of the columns in the next row. – udden2903 Jul 28 '16 at 15:42
  • @udden2903 Can you update your post with a new example that mimics the problem? – akrun Jul 29 '16 at 00:37
  • That's what I have done. Try running your code on the first block of code in my post. – udden2903 Jul 29 '16 at 05:25
0

I played around with this question for a while, and with my limited knowledge of R I came up with the following work-around. I have added a date column to the original data frame for purpose of illustration:

ID = c(1,1,1,1,2,2,2,3,3,3,3)
date = c(1,2,3,4,1,2,3,1,2,3,4)
L40 = c(1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, 1) 
df = data.frame(ID, date, L40, K50)

Here is what I did:

#gather the diagnosis columns in rows and keep only those rows where the patient has the associated diagnosis.
df1 <- df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1) %>% arrange(ID, date)

#concatenate across rows by ID and date to collect all diagnoses of an ID at a particular date.
df2 <- df1 %>% group_by(ID, date) %>% mutate(diag = paste(diagnos, collapse=" ")) %>% select(-diagnos, -dummy)

#convert into data tables in preparation for join
Dt1 <- data.table(df)
Dt2 <- data.table(df2)

setkey(Dt1, ID, date)
setkey(Dt2, ID, date)

#Each observation in Dt1 is matched with the observation in Dt1 with the same date or, if that particular date is not present, 
#by the nearest previous date:
final <- Dt2[Dt1, roll=TRUE] %>% distinct()

This carries forward the name(s) of the diagnosis until the next observed diagnosis.

udden2903
  • 783
  • 6
  • 15
  • I can execute `df %>% gather(diagnos, dummy, L40:K50)` but alas when I try execute `df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1)` I get the following error `Error in filter(., dummy == 1) : object 'dummy' not found In addition: Warning message: In data.matrix(data) : NAs introduced by coercion` Any idea what is going wrong? I do not use tidyr so am not au fait with syntax, or not au fait enough to troubleshoot. – Farrel Jan 16 '19 at 04:33