Keep the second occurrence in a column in R

Question

I have quite a simple dataset:

ID    Value     Time  
1    censored    1  
1    censored    2  
1   uncensored   3  
1   uncensored   4  
1    censored    5  
1    censored    6  
2    censored    1  
2   uncensored   2   
2   uncensored   3  
2   uncensored   4  
2    censored    5

I want to keep the first uncensored occurrence and the first censored occurrence after an uncensored one. For example:

ID   Value       Time
1    uncensored   3  
1    censored     5  
2    uncensored   2  
2    censored     5

Not everyone has their first censored date at time 5; that was just an example.
Value is a binary variable: 1 for censored, and 0 for uncensored, but I've labelled them.
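
To try the answers below, the example data can be rebuilt along these lines. This is only a sketch: the object name `df1` is an assumption (the answers variously call it `d`, `df` or `df1`), and `Value` is kept as a label rather than as 0/1.

```r
# Rebuild the example data from the question.
# The name df1 is an assumption; rename it to d or df to match a given answer.
df1 <- data.frame(
  ID    = c(rep(1, 6), rep(2, 5)),
  Value = c("censored", "censored", "uncensored", "uncensored", "censored", "censored",
            "censored", "uncensored", "uncensored", "uncensored", "censored"),
  Time  = c(1:6, 1:5),
  stringsAsFactors = FALSE  # keep Value as character, not factor
)
```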

Thanks for the answers everyone, they were really helpful – Lb93 Jun 22 '15 at 09:27

score 4 · Answer 1 · answered Jun 22 '15 at 09:13

You can do this with the standard split-apply-combine strategy:

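do.call(rbind, lapply(split(d, d$ID), function(x) {
    # position of the first uncensored row within this ID
    u1 <- which(x$Value == "uncensored")[1]
    # position of the first censored row that comes after it
    c1 <- which((x$Value == "censored") & seq_along(x$Value) > u1)[1]
    return(x[c(u1, c1),])
}))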

Result:

     ID      Value Time
1.3   1 uncensored    3
1.5   1   censored    5
2.8   2 uncensored    2
2.11  2   censored    5

Thomas

That's a nice one, you could probably speed it up with `data.table` using something like `setDT(d)[, list({u1 <- which(Value == "uncensored")[1];c1 <- which((Value == "censored") & seq_along(Value) > u1)[1];Value = c(u1, c1)}), by = ID]` – David Arenburg Jun 22 '15 at 09:24

David Arenburg · Accepted Answer · 2015-06-22T09:26:44.000

Here's another possible data.table solution

library(data.table)
setDT(df1)[, list(Value = c("uncensored", "censored"), 
                  Time =  c(Time[match("uncensored", Value)],
                          Time[(.N - match("uncensored", rev(Value))) + 2L])),
                  by = ID]
#    ID      Value Time
# 1:  1 uncensored    3
# 2:  1   censored    5
# 3:  2 uncensored    2
# 4:  2   censored    5

Or similarly, using which instead of match

setDT(df1)[, list(Value = c("uncensored", "censored"), 
                  Time =  c(Time[which(Value == "uncensored")[1L]],
                          Time[(.N - which(rev(Value) == "uncensored")[1L]) + 2L])),
                  by = ID]
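
The index arithmetic in the second `Time` element can be checked in base R on a single group. The sketch below uses the ID 1 rows from the question; `n` stands in for `.N`, and the names `first_unc` and `last_unc` are illustrative only.

```r
# One group from the example (ID 1), as plain vectors
Value <- c("censored", "censored", "uncensored", "uncensored", "censored", "censored")
Time  <- 1:6
n     <- length(Value)  # plays the role of .N inside data.table

# position of the first "uncensored"
first_unc <- match("uncensored", Value)
# position of the last "uncensored": search the reversed vector, then map back
last_unc  <- n - match("uncensored", rev(Value)) + 1L

Time[first_unc]      # time of the first uncensored row
Time[last_unc + 1L]  # time of the row right after the last uncensored row
```

Note that `last_unc + 1L` is the same as `(.N - match(...)) + 2L`, so the expression picks the row after the last uncensored one. That matches "first censored after the first uncensored" when the uncensored rows form a single run, as they do in the example data.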

akrun · Answer 3 · 2015-06-22T19:53:16.783

Try

library(data.table)
indx <- setDT(df1)[, gr:= rleid(Value), ID
][, c(.I[Value=='uncensored'][1L], .I[Value=='censored' & gr>1][1L]) , ID]$V1
df1[indx][,gr:=NULL]
#   ID      Value Time
#1:  1 uncensored    3
#2:  1   censored    5
#3:  2 uncensored    2
#4:  2   censored    5

Or using a similar idea as in @Thomas's post

indx <- setDT(df1)[, {
            i1 <- .I[Value == 'uncensored'][1L]
            i2 <- .I[Value == 'censored']
            list(c(i1, i2[i2 > i1][1L])) }, ID]$V1
df1[indx]
#    ID      Value Time
#1:  1 uncensored    3
#2:  1   censored    5
#3:  2 uncensored    2
#4:  2   censored    5

Or using dplyr

library(dplyr)
df1 %>%
   group_by(ID) %>%
   slice(which(Value=='uncensored')[1L]:n()) %>% 
   slice(match(c('uncensored', 'censored'), Value))
#    ID      Value Time
#1  1 uncensored    3
#2  1   censored    5
#3  2 uncensored    2
#4  2   censored    5

Steven Beaupré · Answer 4 · 2015-06-22T19:45:03.950

Since you mentioned Value is a binary variable, here's another idea using dplyr:

library(dplyr)
df %>% 
  group_by(ID) %>%
  ## convert the labels to binary
  ## 1 for censored, and 0 for uncensored 
  mutate(Value = ifelse(Value == "censored", 1, 0)) %>%
  ## filter first 'uncensored' value in each 'ID' group
  ## or the 'censored' values that have 'uncensored' as a predecessor   
  filter(Value == 0 & row_number(Value) == 1 | Value == 1 & lag(Value) == 0)

Which gives:

#Source: local data frame [4 x 3]
#Groups: ID
#
#  ID Value Time
#1  1     0    3
#2  1     1    5
#3  2     0    2
#4  2     1    5

score 0 · Answer 5 · answered Jun 22 '15 at 09:12

Try

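result = c()
for (i in unique(df$ID)) {
  subdf = df[which(df$ID == i), ]
  # first uncensored row (Value == 0) in this ID
  idx = min(which(subdf$Value == 0))
  result = rbind(result, subdf[idx, ])
  # first censored row (Value == 1) after it; idx is added back because the
  # search runs on the vector with its first idx elements dropped
  idx = idx + min(which(subdf$Value[-(1:idx)] == 1))
  result = rbind(result, subdf[idx, ])
}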

assuming that the desired observations always exist.
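
If that assumption might not hold, one option is to guard each lookup. The following is a minimal sketch, not a definitive implementation: `first_pair` is a hypothetical helper, `Value` is assumed to carry the labels from the question rather than 0/1, and the small data frame `d` is made up to exercise the missing cases.

```r
# Hypothetical helper: returns the first "uncensored" row and the first
# "censored" row after it, leaving out whichever of the two is missing.
first_pair <- function(x) {
  u1 <- which(x$Value == "uncensored")[1]
  if (is.na(u1)) return(x[0, ])     # no uncensored row: return zero rows
  later <- which(x$Value == "censored")
  c1 <- later[later > u1][1]        # NA when no censored row follows
  x[c(u1, if (!is.na(c1)) c1), ]    # c1 is dropped when it is NA
}

# Made-up data: ID 2 never becomes censored again, ID 3 is never uncensored
d <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3),
  Value = c("censored", "uncensored", "censored",
            "uncensored", "uncensored", "censored"),
  Time  = c(1, 2, 3, 1, 2, 1),
  stringsAsFactors = FALSE
)
res <- do.call(rbind, lapply(split(d, d$ID), first_pair))
```

Here ID 1 contributes two rows, ID 2 only its first uncensored row, and ID 3 nothing.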

score 0 · Answer 6 · answered Jun 22 '15 at 09:28

The following can be applied whenever you wish to identify rows that break the inertia of a given column, i.e. rows where the value differs from the previous row (it also works for categorical columns with several levels, or for numeric columns):

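# header = TRUE so the column names are not read as a data row
df <- read.table("clipboard", header = TRUE)
# a[i] is TRUE when row i repeats the previous row's value in column 2;
# the first row has no predecessor and is treated as a repeat
a <- c(TRUE)
for (i in 1:(nrow(df)-1))
{
  a <- c(a,duplicated(df[i:(i+1),2])[2])
}
df[!a,]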
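
The same change detection can be written without a loop and restricted to changes within one ID. This is a sketch under assumptions: the data frame is rebuilt from the question, and `prev_value`, `prev_id` and `changed` are illustrative names.

```r
# Example data from the question
df <- data.frame(
  ID    = c(rep(1, 6), rep(2, 5)),
  Value = c("censored", "censored", "uncensored", "uncensored", "censored", "censored",
            "censored", "uncensored", "uncensored", "uncensored", "censored"),
  Time  = c(1:6, 1:5),
  stringsAsFactors = FALSE
)

# Value and ID of the previous row (NA for the very first row)
prev_value <- c(NA, head(df$Value, -1))
prev_id    <- c(NA, head(df$ID, -1))

# TRUE where Value differs from the previous row of the same ID
changed <- !is.na(prev_id) & df$ID == prev_id & df$Value != prev_value
df[changed, ]
```

This keeps every change of `Value`, so an ID that switches back and forth several times would return more than two rows.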
