7

I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.

I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part for a consistent 'streak' since from the start of the survey (i.e. if they stop, they stop for good).

I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?

In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).

dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), "2000" = c(T, F, T, F, T), "2001" = c(T, T, T, T, T), "2002" = c(F, T, T, T, T), "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
    x <- as.numeric(dat[i,2:6])
    y <- max(grep(1, x))
    finalyear[i] <- y
    streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak

Thanks!

Dan Lewer
  • 871
  • 5
  • 12
  • Lots of answers - anyone want to create a larger dataset and benchmark them? How big is the dataset so its possible to make a representative test set for benchmarking? – Spacedman Sep 04 '15 at 13:15
  • There are about 250,000 cases and 25 years. All of the answers below solve my problem - thanks everyone! If people are interested, I could make a representative dataset for testing different approaches. – Dan Lewer Sep 04 '15 at 16:24

4 Answers4

4

For-loops are not inherently bad in R, but they are slow if you grow vectors iteratively (like you are doing). There are often better ways to do things. Example of a solution with only apply-functions:

dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <-  apply(dat[,2:7],MARGIN=1,function(x){sum(x[1:5])==x[6]})

Or option 2, based on comment by @Spacedman:

dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <-  apply(dat[,2:6],MARGIN=1,function(x){max(which(x))==sum(x)})

> dat
  ids X1999 X2000 X2001 X2002 X2003 finalyear streak
1   1  TRUE  TRUE  TRUE FALSE FALSE         3   TRUE
2   2  TRUE FALSE  TRUE  TRUE  TRUE         5  FALSE
3   3  TRUE  TRUE  TRUE  TRUE  TRUE         5   TRUE
4   4 FALSE FALSE  TRUE  TRUE  TRUE         5  FALSE
5   5  TRUE  TRUE  TRUE  TRUE FALSE         4   TRUE
Heroka
  • 12,889
  • 1
  • 28
  • 38
  • Neat, but beware it depends on `finalyear` being added on directly after the true/false data, in this case in column 7. – Spacedman Sep 04 '15 at 11:48
  • Thanks. I doubted whether I should do it this way, or call max(which(x)) twice. Will edit. – Heroka Sep 04 '15 at 11:50
4

We could use max.col and rowSums as a vectorized approach.

dat$finalyear <- max.col(dat[-1], 'last')

If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. The FALSE will be coerced to 0 and multiplying with 0 returns 0 for that row.

dat$finalyear <- max.col(dat[-1], 'last')*!!rowSums(dat[-1])

Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with that of 'finalyear'

dat$streak <-  rowSums(dat[,2:6])==dat$finalyear
dat
#   ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1   1  TRUE  TRUE  TRUE FALSE FALSE         3   TRUE
#2   2  TRUE FALSE  TRUE  TRUE  TRUE         5  FALSE
#3   3  TRUE  TRUE  TRUE  TRUE  TRUE         5   TRUE
#4   4 FALSE FALSE  TRUE  TRUE  TRUE         5  FALSE
#5   5  TRUE  TRUE  TRUE  TRUE FALSE         4   TRUE

Or a one-line code (it could fit in one-line, but decided to make it obvious by 2-lines ) suggested by @ColonelBeauvel

library(dplyr)
mutate(dat, finalyear=max.col(dat[-1], 'last'), 
            streak=rowSums(dat[-1])==finalyear)
akrun
  • 874,273
  • 37
  • 540
  • 662
3

Here is a solution with dplyr and tidyr.

gather(data = dat,year,value,-ids) %>%
  mutate(year=as.integer(gsub("X","",year))) %>%
  group_by(ids) %>%
  summarize(finalyear=last(year[value]),
            streak=!any(value[first(year):finalyear] == FALSE))

output

  ids finalyear streak
1   1      2001   TRUE
2   2      2003  FALSE
3   3      2003   TRUE
4   4      2003  FALSE
5   5      2002   TRUE
scoa
  • 19,359
  • 5
  • 65
  • 80
1

Here's a base version using apply to loop over rows and rle to see how often the state changes. Your condition seems to be equivalent to the state starting as TRUE and only ever changing to FALSE at most once, so I test the rle as being shorter than 3 and the first value being TRUE:

> dat$streak = apply(dat[,2:6],1,function(r){r[1] & length(rle(r)$length)<=2})
> 
> dat
  ids X1999 X2000 X2001 X2002 X2003 streak
1   1  TRUE  TRUE  TRUE FALSE FALSE   TRUE
2   2  TRUE FALSE  TRUE  TRUE  TRUE  FALSE
3   3  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE
4   4 FALSE FALSE  TRUE  TRUE  TRUE  FALSE
5   5  TRUE  TRUE  TRUE  TRUE FALSE   TRUE

There's probably loads of ways of working out finalyear, this just finds the last element of each row which is TRUE:

> dat$finalyear = apply(dat[,2:6], 1, function(r){max(which(r))})
> dat
  ids X1999 X2000 X2001 X2002 X2003 streak finalyear
1   1  TRUE  TRUE  TRUE FALSE FALSE   TRUE         3
2   2  TRUE FALSE  TRUE  TRUE  TRUE  FALSE         5
3   3  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE         5
4   4 FALSE FALSE  TRUE  TRUE  TRUE  FALSE         5
5   5  TRUE  TRUE  TRUE  TRUE FALSE   TRUE         4
Spacedman
  • 92,590
  • 12
  • 140
  • 224