Subsetting a unbalanced panel dataset to have at least 2 consecutive observations in R

Question

I have an unbalanced panel dataset in R. The following will serve as an example:

dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)), 
                 year=c(2001:2003,2000,2002,2000:2001,2003))

> dt
  name year
1    A 2001
2    A 2002
3    A 2003
4    B 2000
5    B 2002
6    C 2000
7    C 2001
8    C 2003

Now, I need to have at least 2 consecutive year observations for each name. Hence, I would like to remove row 4, 5, and 8. How do I best do that in R?

EDIT: Thanks to the comment below, I can make a bit clearer. If I had an extra observation (row 9) with name=C and year=2004, I would want to keep both row 8 and 9 along with the others.

What if row 9 contained `C 2004` ? Would you then want to keep it and row 8? — Josh O'Brien, Mar 02 '14 at 20:18
What if for some name you had this: `2000,2002,2003,2005,2007,2008`? Should you keep `2002,2003,2007,2008`? If so, that would result in non-consecutive years. Is that correct? — Julián Urbano, Mar 02 '14 at 20:18
@JuliánUrbano Yes, I would want to keep 2002,2003,2007,2008. Sorry that my question weren't clear, but I don't know how to say it better. — Mace, Mar 02 '14 at 20:27
I have tried to come up with a way to do this that combines `plyr::ddply` (to operate chunk-wise on the `name` values) and `rle()` (to select consecutive rows only), but haven't got anything working yet. — Ben Bolker, Mar 02 '14 at 21:08

score 4 · Accepted Answer · edited Mar 03 '14 at 07:52

4

My (hackish) way to do it would be:

is.consecutive = duplicated(rbind(dt,transform(dt, year=year+1), 
                                     transform(dt, year=year-1)),
                            fromLast=TRUE)[1:nrow(dt)]

is.consecutive contains a vector of booleans of the observations to be retained. For your example, this vector would be: TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE

Finally, you can easily use this vector to subset your data.frame, e.g. with:

dt[is.consecutive,]

edited Mar 03 '14 at 07:52

Thomas

43,637
12
109
140

answered Mar 02 '14 at 20:42

Jealie

6,157
2
33
36

Thanks for the answer! It requires the sorted name, year, right? – Mace Mar 02 '14 at 20:53
Not at all: both variables can be randomized along the rows :) – Jealie Mar 02 '14 at 21:04
it seems like the solution here use no information about `name`, this may cause issue when you have data like ( id year 1 A 2000 2 A 2002 3 B 2003), the 2nd and 3rd would be `TRUE` from the solution above, but they are not from the same unit at the first place. any thought on how to solve this? – Jia Gao Oct 09 '17 at 11:42
@JasonGoal try your example, it works fine... The information about name is kept through the time-shift transforms of: `transform(dt, year=year+1)` and `transform(dt, year=year-1)`. – Jealie Oct 11 '17 at 18:07

Henrik · Answer 2 · 2014-03-03T11:45:12.553

Here's a more (far too...?) convoluted alternative, where you can set the minimum length of runs of consecutive observations.

dt <- dt[order(dt$name, dt$year), ]

rl <- 2

do.call(rbind,
        by(dt, dt$name, function(x){
          run <- c(0, cumsum(diff(x$year) > 1))
          x[ave(run, run, FUN = length) >= rl, ]
        })
)
#     name year
# A.1    A 2001
# A.2    A 2002
# A.3    A 2003
# C.6    C 2000
# C.7    C 2001

rl <- 3

do.call(rbind,
        by(dt, dt$name, function(x){
          run <- c(0, cumsum(diff(x$year) > 1))
          x[ave(run, run, FUN = length) >= rl, ]
        })
)
#     name year
# A.1    A 2001
# A.2    A 2002
# A.3    A 2003

score 3 · Answer 3 · answered Mar 03 '14 at 02:06

3

Here a solution using ddply

library(plyr)
ddply(dt,"name",function(x) {
    cons_idx=which(diff(x$year)==1)
    cons_idx=sort(unique(c(cons_idx,cons_idx+1)))
    x[cons_idx,]
})

answered Mar 03 '14 at 02:06

cryo111

4,444
1
15
37

Subsetting a unbalanced panel dataset to have at least 2 consecutive observations in R

3 Answers3

Linked