Aggregate dataframe by user, keeping rows for each user prior to first occurrence of treatment

Question

There are similar problems to mine elsewhere on this site, but none of the answers encompass everything I need to do.

I have a dataframe that I'm trying to convert to a time-varying format. Subjects in the study can change from non-treatment to treatment, but not the other way. Each subject has multiple rows of treatment information, and I want to find the first occurrence of treatment, which is simple enough. The snag is that not everyone has an occurrence of the treatment, so whenever I run my algorithm for finding the first occurrence, these people get deleted. To make my question clearer:

ID    treatment    start.date    stop.date  
1        0         01/01/2002    01/02/2002  
1        0         01/02/2002    01/03/2002  
1        1         01/03/2002    01/04/2002  
1        0         01/04/2002    01/05/2002  
2        0         01/01/2002    01/02/2002  
2        0         01/02/2002    01/03/2002  
3        0         01/01/2002    01/02/2002  
3        1         01/02/2002    01/03/2002
3        0         01/03/2002    01/04/2002

As you can see, ID 2 never has the treatment. When I run the following code, ID 2 is removed.

data$keep <- with(data, 
                     ave(treatment==1, ID ,FUN=function(x) if(1 %in% x) cumsum(x) else 2))
with(data, data[keep==0 | (treatment==1 & keep==1),])

Is there any way to extend this code so it keeps those who don't have a first occurrence and keeps every row up until the first occurrence for those who have it?

To summarise, I want my data to look like this:

ID    treatment    start.date    stop.date    
1        0         01/01/2002    01/02/2002   
1        0         01/02/2002    01/03/2002    
1        1         01/03/2002    01/04/2002   
2        0         01/01/2002    01/02/2002    
2        0         01/02/2002    01/03/2002  
3        0         01/01/2002    01/02/2002  
3        1         01/02/2002    01/03/2002

Given you're doing aggregations, you really should learn to do split-apply-combine with either `dplyr` or `data.table`. Anything less runs out of steam very quickly and the code is almost write-only; very cryptic to reuse or understand. — smci, Jul 15 '15 at 07:55

akrun · Accepted Answer · 2015-07-15T08:37:48.503

There are several ways to do this. One option with data.table is an if/else condition on the 'treatment' column, grouped by the 'ID' column. If no value of 'treatment' in the group equals 1 (`!any(treatment==1)`), we return the whole Subset of Data.table (`.SD`). Otherwise, we take the position of the first value of 'treatment' equal to 1 (`which(treatment==1)[1L]`), build the sequence from 1 up to that position (`seq`), and use that numeric index to subset `.SD`.

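library(data.table) # v1.9.5+
# Per ID: keep the whole group if it has no treated row,
# otherwise keep rows 1 through the first row with treatment == 1
setDT(data)[, if(!any(treatment==1)) .SD
              else .SD[seq(which(treatment==1)[1L])], by = ID]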
#     ID treatment start.date  stop.date
#1:  1         0 01/01/2002 01/02/2002
#2:  1         0 01/02/2002 01/03/2002
#3:  1         1 01/03/2002 01/04/2002
#4:  2         0 01/01/2002 01/02/2002
#5:  2         0 01/02/2002 01/03/2002
#6:  3         0 01/01/2002 01/02/2002
#7:  3         1 01/02/2002 01/03/2002
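
To see why the if/else guard is needed, here is a minimal base R sketch of the index that gets built for a single group. The vectors `x1` and `x2` are just the 'treatment' values of IDs 1 and 2 from the example, and the variable names are only for illustration.

```r
# Treatment vectors for IDs 1 and 2 from the question's example
x1 <- c(0, 0, 1, 0)
x2 <- c(0, 0)

first1 <- which(x1 == 1)[1L]  # position of the first treated row: 3
idx1 <- seq(first1)           # 1 2 3, i.e. rows up to and including it

first2 <- which(x2 == 1)[1L]  # no treated row, so this is NA
# seq(NA) would fail, which is why the untreated case is handled separately
idx2 <- if (!any(x2 == 1)) seq_along(x2) else seq(first2)
```

For a group with no treated row, `which()` returns an empty vector, so taking its first element gives `NA`, which cannot be turned into a sequence. Returning the whole group in that case is what keeps ID 2.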

Or a slightly more compact method is to rely on the difference between the current and previous values in 'treatment' and keep the rows where that difference is greater than or equal to 0. We can use diff or -. Here, I take the difference between the treatment and the lag of the treatment (shift gives 'lag' values by default; it is a new function in the devel version of data.table). Note that this only drops the row where 'treatment' falls from 1 back to 0, so it reproduces the expected output when at most one row follows the first treatment, as in the example data.

setDT(data)[, .SD[(treatment-shift(treatment, fill=0))>=0], by = ID]
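
As a rough illustration of what this comparison computes, the sketch below works through ID 1 in base R. The `lagged` vector is a stand-in for `shift(treatment, fill=0)`, written by hand so the example does not depend on data.table.

```r
x <- c(0, 0, 1, 0)            # treatment values for ID 1
lagged <- c(0, head(x, -1))   # previous value, with 0 filled in for the first row
x - lagged                    # 0 0 1 -1
keep <- (x - lagged) >= 0     # TRUE TRUE TRUE FALSE
```

Only the last row, where 'treatment' drops from 1 to 0, gets a negative difference and is filtered out.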

Or a similar approach using dplyr. We group by 'ID' and then filter the rows based on the difference between the current and previous values in the 'treatment'.

library(dplyr)
data %>% 
    group_by(ID) %>% 
    filter(c(0, diff(treatment)) >=0) 
#  ID treatment start.date  stop.date
#1  1         0 01/01/2002 01/02/2002
#2  1         0 01/02/2002 01/03/2002
#3  1         1 01/03/2002 01/04/2002
#4  2         0 01/01/2002 01/02/2002
#5  2         0 01/02/2002 01/03/2002
#6  3         0 01/01/2002 01/02/2002
#7  3         1 01/02/2002 01/03/2002

Or with ave from base R

data[with(data, as.logical(ave(treatment, ID, 
                  FUN=function(x) c(0, diff(x))>=0))),]
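
The difference-based filters compare each row only with the row before it, so they drop the single row where 'treatment' falls from 1 back to 0. That is enough for the example data. As a sketch that goes beyond the original answer, a running count of treated rows within each ID keeps everything up to and including the first treatment, however many rows follow. ID 4 below is hypothetical, added only to show the difference.

```r
# Example data from the question (dates omitted), plus a hypothetical
# ID 4 with two rows after its first treatment (assumption for illustration)
data <- data.frame(
  ID        = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  treatment = c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)
)

# Number of treated rows strictly before each row, within ID
treated_before <- with(data, ave(treatment, ID,
                       FUN = function(x) cumsum(x == 1) - (x == 1)))

# Keep rows that have no treated row before them
result <- data[treated_before == 0, ]

# For comparison: the diff-based filter keeps the last row of ID 4 as well
diff_keep <- with(data, as.logical(ave(treatment, ID,
                  FUN = function(x) c(0, diff(x)) >= 0)))
```

Untreated IDs need no special case here, because their running count never rises above 0.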

@Lb93 .SD means `Subset of Datatable`. As the name suggests, it gets the subset of the dataset rows/columns based on the condition provided. — akrun, Jul 15 '15 at 07:59
@Lb93, check out the *Introduction to data.table vignette* [here](https://github.com/Rdatatable/data.table/wiki/Getting-started). It should take ~10 minutes to go through and should get you up to speed. — Arun, Jul 15 '15 at 08:59
