
I am working on panel data that looks like this:

d <- data.frame(id = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c"),
                time = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                iz = c(0,1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1))
   id time iz
1   a    1  0
2   a    2  1
3   a    3  1
4   a    4  0
5   a    5  0
6   b    1  0
7   b    2  0
8   b    3  0
9   b    4  0
10  b    5  1
11  c    1  0
12  c    2  0
13  c    3  0
14  c    4  1
15  c    5  1

Here iz is an indicator for an event or a treatment (iz = 1). What I need is a variable that counts the periods before and after an event or the distance to and from an event. This variable would look like this:

   id time iz nvar
1   a    1  0   -1
2   a    2  1    0
3   a    3  1    0
4   a    4  0    1
5   a    5  0    2
6   b    1  0   -4
7   b    2  0   -3
8   b    3  0   -2
9   b    4  0   -1
10  b    5  1    0
11  c    1  0   -3
12  c    2  0   -2
13  c    3  0   -1
14  c    4  1    0
15  c    5  1    0

I have tried working with the answers given here and here but can't make it work in my case.

I would really appreciate any ideas how to approach this problem. Thank you in advance for all ideas and suggestions.

Niklas
  • Would there be only one event/treatment for each `id`? If not, and there could be multiple events, how would you want to handle `nvar` in between events? – Ben Jan 29 '21 at 13:28
  • Sorry for not clarifying. Ideally, such observations would be counted as "post" observations. I tried both examples below; Grothendieck's answer does just that, while Wimpel's answer counts them as "pre". – Niklas Jan 29 '21 at 14:06

3 Answers


1) rleid This code applies rleid from data.table to iz within each id to number the runs. A run of zeros that comes before the first run of ones gets a negative reversed sequence (-n, ..., -1); every other run gets a forward positive sequence, i.e. we assume that a forward positive sequence should be used except before the first run of ones. Multiplying by 1-iz then zeroes out the rows where iz is 1. There can be any number of runs in an id and it also supports id's with only 0's or only 1's. It assumes that time has no gaps.

library(data.table)

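# Count backwards (-n, ..., -1) for a leading run of zeros, forwards (1, ..., n) otherwise
Seq <- function(x, s = seq_along(x)) if (x[1] == 1) -rev(s) else s
# Number the runs of iz within one id and apply Seq to each run
nvar <- function(iz, r = rleid(iz)) ave((1-iz) * r, r, FUN = Seq)
# Apply per id, then set nvar to 0 on the event rows
transform(d, nvar = (1-iz) * ave(iz, id, FUN = nvar))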

giving:

   id time iz nvar
1   a    1  0   -1
2   a    2  1    0
3   a    3  1    0
4   a    4  0    1
5   a    5  0    2
6   b    1  0   -4
7   b    2  0   -3
8   b    3  0   -2
9   b    4  0   -1
10  b    5  1    0
11  c    1  0   -3
12  c    2  0   -2
13  c    3  0   -1
14  c    4  1    0
15  c    5  1    0
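
To make the mechanics of (1) easier to follow, here is a minimal sketch of the intermediate values for id a only. It is an illustration rather than part of the answer's code: run_id is a hypothetical base R stand-in for rleid (so the sketch runs without any packages), and it assumes, like (1), that time has no gaps.

```r
# Illustrative helper (not from data.table): numbers consecutive runs of x
run_id <- function(x) with(rle(x), rep(seq_along(lengths), lengths))

iz  <- c(0, 1, 1, 0, 0)     # iz for id "a"
r   <- run_id(iz)           # run numbers: 1 2 2 3 3
key <- (1 - iz) * r         # 1 0 0 3 3; equals 1 only in a leading run of zeros

# Same rule as Seq in (1): count backwards before the first event, forwards otherwise
Seq <- function(x, s = seq_along(x)) if (x[1] == 1) -rev(s) else s
steps <- ave(key, r, FUN = Seq)   # -1 1 2 1 2

nvar_a <- (1 - iz) * steps        # zero out the event rows
nvar_a
# -1  0  0  1  2
```

The last multiplication matters because the run of ones also receives a forward sequence (1, 2) in steps, which has to be reset to 0.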

2) base This code uses only base R. It assumes that every id has at most one run of ones. There is no restriction on whether there are any zeros. Also it supports gaps in time. It applies nvar to the row numbers of each id. First it calculates the range rng of the times of the ones and then calculates the signed distance in the last line of nvar. The output is identical to that shown in (1). If we could assume that every id has exactly one run of 1's the if statement could be omitted.

nvar <- function(ix) with(d[ix, ], {
  if (all(iz == 0)) return(iz)
  rng <- range(time[iz == 1])
  (time < rng[1]) * (time - rng[1]) + (time > rng[2]) * (time - rng[2])
})
transform(d, nvar = ave(1:nrow(d), id, FUN = nvar))
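
As a rough check of the gap handling described for (2), the following sketch applies the same arithmetic to a small made-up panel with missing periods. Both d_gap and signed_dist are hypothetical names introduced only for this illustration; the helper takes time and iz directly instead of row numbers, and the grouping is done with split and unsplit rather than ave.

```r
# Hypothetical panel with gaps in time (e.g. period 3 is missing for id "a")
d_gap <- data.frame(id   = c("a", "a", "a", "a", "b", "b", "b"),
                    time = c(1, 2, 4, 7, 1, 5, 6),
                    iz   = c(0, 0, 1, 0, 1, 0, 0))

# Same formula as the last line of nvar in (2), as a stand-alone helper
signed_dist <- function(time, iz) {
  if (all(iz == 0)) return(rep(0, length(iz)))   # no event in this id
  rng <- range(time[iz == 1])                    # first and last event time
  (time < rng[1]) * (time - rng[1]) + (time > rng[2]) * (time - rng[2])
}

# Apply per id and put the pieces back in the original row order
pieces <- lapply(split(d_gap, d_gap$id), function(g) signed_dist(g$time, g$iz))
d_gap$nvar <- unsplit(pieces, d_gap$id)
d_gap$nvar
# -3 -2  0  3  0  4  5
```

The distances are measured in units of time, not in rows, which is why the row at time 7 for id a gets 3 rather than 1.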

2a) This variation of (2) passes time and iz to nvar by encoding them as the real and imaginary parts of a complex vector in order to avoid having to deal with row numbers but it is otherwise the same as (2). We have omitted the if statement in (2) but it could be added back in if any id's have no ones.

nvar <- function(x, time = Re(x), iz = Im(x), rng = range(time[iz == 1])) 
  (time < rng[1]) * (time - rng[1]) + (time > rng[2]) * (time - rng[2])
transform(d, nvar = Re(ave(time + iz * 1i, id, FUN = nvar)))
G. Grothendieck
  • Note that this will not work correctly (I think) if time is not a regular sequence but has 'gaps' – Wimpel Jan 29 '21 at 13:31
  • Indeed, just pointing it out in case TS oversimplified his sample data – Wimpel Jan 29 '21 at 13:34
  • Thank you both for the great answers. Both work perfectly. And as it happens, I don't have "gaps" in my time variable. But thank you for bringing up the possibility. – Niklas Jan 29 '21 at 14:07
  • Have added a second approach which uses time so gaps are ok. It assumes there is exactly one run of ones in each id. Since it seems you have no gaps and that every id has one run of ones it may be that both (1) and (2) work equally. (2) does not use any packages. – G. Grothendieck Jan 30 '21 at 11:36

Here is a solution that is a (tiny) bit more complex than the one from G. Grothendieck, but it will be able to handle non-sequential times.

library( data.table )
#make d a data.table
setDT(d)

#you can remove the trailing [], they only print the result to the console
#nvar = 0 where iz = 1
d[ iz == 1, nvar := 0 ][]
#calculate nvar for iz == 0 BEFORE iz == 1: roll = -Inf matches the next iz == 1 row
#create subsets for readability
d1 <- d[ iz == 1, ]
d0 <- d[ iz == 0, ]
d[ iz == 0, nvar := time - d1[ d0, x.time, on = .(id, time), roll = -Inf ] ][]
#calculate nvar for iz == 0 AFTER iz == 1: roll = Inf matches the previous iz == 1 row
#create subsets for readability
d1 <- d[ iz == 1, ]
d0 <- d[ iz == 0 & is.na( nvar ), ]
d[ iz == 0 & is.na(nvar) , nvar := time - d1[ d0, x.time, on = .(id, time), roll = Inf ] ][]

#     id time iz nvar
#  1:  a    1  0   -1
#  2:  a    2  1    0
#  3:  a    3  1    0
#  4:  a    4  0    1
#  5:  a    5  0    2
#  6:  b    1  0   -4
#  7:  b    2  0   -3
#  8:  b    3  0   -2
#  9:  b    4  0   -1
# 10:  b    5  1    0
# 11:  c    1  0   -3
# 12:  c    2  0   -2
# 13:  c    3  0   -1
# 14:  c    4  1    0
# 15:  c    5  1    0
Wimpel
  • Thank you for your answer Wimpel! As the other answer was more to the point I accepted it as the answer that solved my question. However, I want to thank you for thinking about possible issues that could come up with the answer above. – Niklas Jan 29 '21 at 14:09

One dplyr and purrr option could be:

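library(dplyr)
library(purrr)

# nvar: distance in rows to the nearest event, negated before the first event
d %>%
 group_by(id) %>%
 mutate(nvar = map_dbl(.x = seq_along(iz), ~ min(abs(.x - which(iz == 1)))),
        nvar = if_else(cumsum(iz) == 0, -nvar, nvar))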

   id     time    iz  nvar
   <fct> <dbl> <dbl> <dbl>
 1 a         1     0    -1
 2 a         2     1     0
 3 a         3     1     0
 4 a         4     0     1
 5 a         5     0     2
 6 b         1     0    -4
 7 b         2     0    -3
 8 b         3     0    -2
 9 b         4     0    -1
10 b         5     1     0
11 c         1     0    -3
12 c         2     0    -2
13 c         3     0    -1
14 c         4     1     0
15 c         5     1     0
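
For readers who want to see what the two mutate steps compute, here is a small base R sketch of the same logic for id c alone. It is only an illustration under the same assumptions as the pipeline above (distances are counted in rows, so time is assumed to have no gaps, and the id has at least one iz == 1); the variable names are made up for this example.

```r
iz <- c(0, 0, 0, 1, 1)                 # iz for id "c"
events <- which(iz == 1)               # row positions of the events: 4 5

# Step 1: distance from each row to the nearest event row
dist <- sapply(seq_along(iz), function(i) min(abs(i - events)))
dist
# 3 2 1 0 0

# Step 2: rows before the first event (cumsum(iz) still 0) get a negative sign
nvar_c <- ifelse(cumsum(iz) == 0, -dist, dist)
nvar_c
# -3 -2 -1  0  0
```

Because step 1 uses the nearest event in either direction, a row that lies between two events is measured against whichever event is closer and keeps a positive sign.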
tmfmnk