1

I'm a several year lurker, but I've finally found something I couldn't figure out with just old posts. I have a data frame consisting of hundreds of countries, years, and an event variable with a binary indicator:

library('dplyr')
library('data.table')

country<-c("albania","albania","albania","albania","albania","albania","albania","albania","thailand","thailand","thailand","thailand","thailand","thailand","thailand","thailand")
year<-c(1960,1961,1962,1963,1964,1965,1966,1967,1972,1973,1974,1975,1976,1977,1978,1979)
event<-c(0,1,1,0,0,1,1,1,1,1,0,0,1,0,0,0)
input<-data.frame(country=country, year=year, event=event)

input    

    country year event
1   albania 1960     0
2   albania 1961     1
3   albania 1962     1
4   albania 1963     0
5   albania 1964     0
6   albania 1965     1
7   albania 1966     1
8   albania 1967     1
9  thailand 1972     1
10 thailand 1973     1
11 thailand 1974     0
12 thailand 1975     0
13 thailand 1976     1
14 thailand 1977     0
15 thailand 1978     0
16 thailand 1979     0

I would like to create a new data frame that displays multiple consecutive events for each country with their duration and starting year. For example:

output

   country start duration
1  albania 1961        2
2  albania 1965        3
3 thailand 1972        2
4 thailand 1976        1

I've read over, what I believe to be, most of the relevant posts about counting consecutive events by group with dplyr and data.table using rle() and rleid(), but I can't get them to where I want to be.

Following this example, I can't get a new data frame with more than one event length by country; not just the max, min, etc. and that ignores my need to grab the starting year of the event. Trying to build off this code to get to my desired state left me with lots of errors. The "base code" for the dplyr examples seems to be some starting point of:

output <- input %>%
group_by(country) %>%
do({
tmp <- with(rle(.$event == 1), lengths[values])
data.frame(country= .$country, Max = if (length(tmp) == 0) 0 else max(tmp))
 }) %>%
 slice(1L)

That obviously pulls the max, I struggled trying to alter it to pull each event.

Following the data.table / rleid models creates a newly mutated variable counting the duration of consecutive "events" but I was having trouble extracting the "end" years for multiple events within a country. Maybe some lag differencing function using the mutated variable and then extracting all rows with a negative value? Once the row for the end event is flagged the starting year will just be the current year - length. The base code for this approach is:

sum0 <- function(x) { x[x == 1] = sequence(with(rle(x), lengths[values == 1])); x }
setDT(input)[, duration := sum0(event), by = country]

input

     country year event duration
 1:  albania 1960     0        0
 2:  albania 1961     1        1
 3:  albania 1962     1        2 
 4:  albania 1963     0        0
 5:  albania 1964     0        0
 6:  albania 1965     1        1
 7:  albania 1966     1        2
 8:  albania 1967     1        3
 9: thailand 1972     1        1
10: thailand 1973     1        2
11: thailand 1974     0        0
12: thailand 1975     0        0
13: thailand 1976     1        1
14: thailand 1977     0        0
15: thailand 1978     0        0
16: thailand 1979     0        0

There are another 7-10 posts I looked through but didn't link as they were similar in nature to the two I referenced. I want to thank anyone in advance who has any advice. I hope I followed all protocol for asking a question; I tried to be careful and follow the rules. Thanks for all the great work you all do! You've gotten me through 5-6 years of learning R and JAGS.

Josh Brinks
  • 141
  • 6

2 Answers2

3

Here's what I would do (leaving dplyr out of it):

setDT(input)

input[, 
  if (first(event) == 1) .(year = first(year), N = .N)
, by=.(country, g = rleid(country, event))][, !"g"]

    country year N
1:  albania 1961 2
2:  albania 1965 3
3: thailand 1972 2
4: thailand 1976 1

Not very efficient, but hopefully easy enough to follow.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • Better than my solution. – mt1022 Aug 22 '17 at 14:37
  • @mt1022 I disagree. Yours is more efficient and idiomatic, I think. – Frank Aug 22 '17 at 14:37
  • Looks right, I had to step out for a while. I'll check when I get back. Thanks for your help; I just started learning data.table so it's a bit foreign to me. – Josh Brinks Aug 22 '17 at 14:37
  • What's the differences in efficiency? I didn't get that point. – mt1022 Aug 22 '17 at 14:39
  • 1
    @mt1022 If you add `verbose = TRUE]` inside your first query, you'll see that it is optimized; more info in `?GForce`. Generally, this happens when `j` is a list of simple expressions. The `if (cond) .(x,y,z)` I'm using here cannot benefit from those and is a bit more quirky than necessary for the OP's task, I think. If there's ever a `having=` argument, then my way can translate into something better, though https://github.com/Rdatatable/data.table/issues/788 – Frank Aug 22 '17 at 14:40
  • 1
    I just checked it. You are right. For my solution GForce optimization is on and off for yours. – mt1022 Aug 22 '17 at 14:44
  • 1
    @mt1022 I think it's worth mentioning that benefit in your answer, but will leave it to you to judge/write. – Frank Aug 22 '17 at 14:49
  • Thank you this did what I wanted. Just have to parse through some of these data.table operators and figure out exactly how it works. – Josh Brinks Aug 22 '17 at 19:37
  • @JoshBrinks Ok, glad it works! Fyi, as mt1022 and I were discussing in comments above, his way has some advantages over this one. – Frank Aug 22 '17 at 19:41
2

Is this what you want:

library(data.table)

setDT(input)
input[, .(event = event[1], start = year[1], duration = .N),
      by = .(country, rleidv(event))][event == 1][
          , c('event', 'rleidv') := NULL][]

#     country start duration
# 1:  albania  1961        2
# 2:  albania  1965        3
# 3: thailand  1972        2
# 4: thailand  1976        1

As noted by Frank in comment, this solution is optimized by data.table in computation, which makes it more efficient. The if(cond) ... in j expression won't be optimized.

mt1022
  • 16,834
  • 5
  • 48
  • 71
  • As I said above: Looks right, I had to step out for a while. I'll check when I get back. Thanks for your help; I just started learning data.table so it's a bit foreign to me. – Josh Brinks Aug 22 '17 at 14:38