There are similar problems to mine elsewhere on this site, but none of the answers encompass everything I need to do.

I have a dataframe that I'm trying to convert to a time-varying format. Subjects in the study can change from non-treatment to treatment, but not the other way. Each subject has multiple rows of treatment information, and I want to find the first occurrence of treatment, which is simple enough. The snag is that not everyone has an occurrence of the treatment, so whenever I run my algorithm for finding the first occurrence, these people get deleted. To make my question clearer:

ID    treatment    start.date    stop.date  
1        0         01/01/2002    01/02/2002  
1        0         01/02/2002    01/03/2002  
1        1         01/03/2002    01/04/2002  
1        0         01/04/2002    01/05/2002  
2        0         01/01/2002    01/02/2002  
2        0         01/02/2002    01/03/2002  
3        0         01/01/2002    01/02/2002  
3        1         01/02/2002    01/03/2002
3        0         01/03/2002    01/04/2002  

As you can see, ID 2 never has the treatment. When I run the following algorithm, ID 2 is removed.

data$keep <- with(data, 
                     ave(treatment==1, ID ,FUN=function(x) if(1 %in% x) cumsum(x) else 2))
with(data, data[keep==0 | (treatment==1 & keep==1),]) 

Is there any way to extend this code so it keeps those who don't have a first occurrence and keeps every row up until the first occurrence for those who have it?

To summarise I want my data to look like this:

ID    treatment    start.date    stop.date    
1        0         01/01/2002    01/02/2002   
1        0         01/02/2002    01/03/2002    
1        1         01/03/2002    01/04/2002   
2        0         01/01/2002    01/02/2002    
2        0         01/02/2002    01/03/2002  
3        0         01/01/2002    01/02/2002  
3        1         01/02/2002    01/03/2002
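
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the example table and traces the `keep` column from the question. The dates are kept as plain strings, and the final `else 0` variant is only one possible minimal change (an assumption, not something taken from the answers): it lets never-treated IDs pass the existing `keep == 0` branch.

```r
# Rebuild the example table from the question (dates kept as plain strings).
data <- data.frame(
  ID         = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  treatment  = c(0, 0, 1, 0, 0, 0, 0, 1, 0),
  start.date = c("01/01/2002", "01/02/2002", "01/03/2002", "01/04/2002",
                 "01/01/2002", "01/02/2002",
                 "01/01/2002", "01/02/2002", "01/03/2002"),
  stop.date  = c("01/02/2002", "01/03/2002", "01/04/2002", "01/05/2002",
                 "01/02/2002", "01/03/2002",
                 "01/02/2002", "01/03/2002", "01/04/2002"),
  stringsAsFactors = FALSE
)

# Same helper column as in the question: a running count of treated rows,
# or the constant 2 for an ID that is never treated.
data$keep <- with(data,
                  ave(treatment == 1, ID,
                      FUN = function(x) if (1 %in% x) cumsum(x) else 2))

# ID 2 carries keep == 2 on every row, which matches neither filter branch.
print(data[, c("ID", "treatment", "keep")])
kept <- with(data, data[keep == 0 | (treatment == 1 & keep == 1), ])
print(unique(kept$ID))

# Possible minimal change (assumption): use 0 instead of 2 for never-treated IDs.
data$keep0 <- with(data,
                   ave(treatment == 1, ID,
                       FUN = function(x) if (1 %in% x) cumsum(x) else 0))
kept0 <- with(data, data[keep0 == 0 | (treatment == 1 & keep0 == 1), ])
print(kept0[, c("ID", "treatment", "start.date", "stop.date")])
```

The first filter returns rows for IDs 1 and 3 only, because every row of ID 2 has `keep == 2`.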
smci
Lb93
  • Given you're doing aggregations, you really should learn to do split-apply-combine with either `dplyr` or `data.table`. Anything less runs out of steam very quickly and the code is almost write-only; very cryptic to reuse or understand. – smci Jul 15 '15 at 07:55

1 Answer

There are several ways to do this. One option with data.table is an if/else condition on the 'treatment' column, grouped by the 'ID' column. If no value in 'treatment' is equal to 1 (if(!any(treatment==1))), we return the whole Subset of Data.table (.SD) for that ID. Otherwise, i.e. if there are 1 values in 'treatment', we take the position of the first value equal to 1 (which(treatment==1)[1L]), build the sequence from 1 up to that position (seq), and use that numeric index to subset .SD.

library(data.table)#v1.9.5+
setDT(data)[, if(!any(treatment==1)) .SD 
              else .SD[seq(which(treatment==1)[1L])], by = ID]
#     ID treatment start.date  stop.date
#1:  1         0 01/01/2002 01/02/2002
#2:  1         0 01/02/2002 01/03/2002
#3:  1         1 01/03/2002 01/04/2002
#4:  2         0 01/01/2002 01/02/2002
#5:  2         0 01/02/2002 01/03/2002
#6:  3         0 01/01/2002 01/02/2002
#7:  3         1 01/02/2002 01/03/2002
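
To make the index logic concrete, here is a small base R sketch of what the if/else picks for a single ID. `rows_to_keep` is a hypothetical helper written for illustration only; it is not part of data.table.

```r
# Hypothetical helper: the row positions the if/else keeps for one ID.
rows_to_keep <- function(treatment) {
  if (!any(treatment == 1)) {
    seq_along(treatment)             # never treated: keep every row
  } else {
    seq(which(treatment == 1)[1L])   # 1 up to the position of the first 1
  }
}

id1 <- rows_to_keep(c(0, 0, 1, 0))   # treatment values of ID 1
id2 <- rows_to_keep(c(0, 0))         # treatment values of ID 2
id3 <- rows_to_keep(c(0, 1, 0))      # treatment values of ID 3
print(list(id1 = id1, id2 = id2, id3 = id3))
```

ID 1 keeps positions 1 to 3, and IDs 2 and 3 each keep positions 1 and 2, which lines up with the seven rows in the output above.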

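Or, slightly more compactly, we can rely on the difference between the current and previous values of 'treatment' and keep the rows where that difference is greater than or equal to 0. We can use diff or -. Here I take the difference between treatment and its lag (shift gives 'lag' values by default; it is a new function in the devel version of data.table). Note that this only drops a row where treatment falls from 1 to 0, so it reproduces the expected output here, where each treated ID has exactly one later row and that row is 0; a longer history after the first 1 would not be fully removed.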

setDT(data)[, .SD[(treatment-shift(treatment, fill=0))>=0], by = ID]
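
As a rough illustration of what that condition evaluates to, here is the same arithmetic in base R on the treatment values of ID 1. The lag is built by hand with `c(0, head(x, -1))`, which is my stand-in for `shift(x, fill = 0)` and an assumption about equivalence for this simple case.

```r
x <- c(0, 0, 1, 0)                 # treatment values of ID 1

lagged    <- c(0, head(x, -1))     # previous value, with 0 filled in for the first row
step      <- x - lagged            # change relative to the previous row
keep_lag  <- step >= 0             # rows the lag-based condition keeps

keep_diff <- c(0, diff(x)) >= 0    # the diff-based form of the same condition

print(rbind(x, lagged, step, keep_lag, keep_diff))
```

Only the last row, where treatment drops from 1 back to 0, gets a negative step and is filtered out.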

Or a similar approach using dplyr. We group by 'ID' and then filter the rows based on the difference between the current and previous values in the 'treatment'.

library(dplyr)
data %>% 
    group_by(ID) %>% 
    filter(c(0, diff(treatment)) >=0) 
#  ID treatment start.date  stop.date
#1  1         0 01/01/2002 01/02/2002
#2  1         0 01/02/2002 01/03/2002
#3  1         1 01/03/2002 01/04/2002
#4  2         0 01/01/2002 01/02/2002
#5  2         0 01/02/2002 01/03/2002
#6  3         0 01/01/2002 01/02/2002
#7  3         1 01/02/2002 01/03/2002

Or with ave from base R

data[with(data, as.logical(ave(treatment, ID, 
                  FUN=function(x) c(0, diff(x))>=0))),]
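
If an ID can have several rows after its first treated row, the diff-based filters keep more than intended, because they only drop a row where treatment falls from 1 to 0. A minimal sketch of a count-based alternative in base R follows. It assumes rows are already in chronological order within each ID, and the extended data below are made up for illustration.

```r
# Made-up data: ID 1 has a longer history after its first treated row.
data <- data.frame(
  ID        = c(1, 1, 1, 1, 1, 2, 2),
  treatment = c(0, 1, 0, 0, 1, 0, 0)
)

# Diff-based condition from above: drops only the 1 -> 0 step.
diff_keep <- as.logical(ave(data$treatment, data$ID,
                            FUN = function(x) c(0, diff(x)) >= 0))

# Count-based condition: keep a row only if no treated row comes strictly before it.
first_keep <- as.logical(ave(data$treatment, data$ID,
                             FUN = function(x) (cumsum(x == 1) - (x == 1)) == 0))

print(cbind(data, diff_keep, first_keep))
result <- data[first_keep, ]
```

For ID 1 the diff-based condition keeps rows 1, 2, 4 and 5, while the count-based one keeps rows 1 and 2 only. Both keep every row of ID 2.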
akrun
  • out of curiosity, what does the `.SD` do? – Lb93 Jul 15 '15 at 07:57
  • @Lb93 `.SD` means `Subset of Data.table`. As the name suggests, it gets the subset of the dataset's rows/columns for each group, based on the condition provided. – akrun Jul 15 '15 at 07:59
  • @Lb93, check out the *Introduction to data.table vignette* [here](https://github.com/Rdatatable/data.table/wiki/Getting-started). It should take ~10 minutes to go through and should get you up to speed. – Arun Jul 15 '15 at 08:59