7

I want to number certain combinations of row in a dataframe (which is ordered on ID and on Time)

tc <- textConnection('
id              time       end_yn
abc             10         0
abc             11         0
abc             12         1
abc             13         0
def             10         0
def             15         1
def             16         0
def             17         0
def             18         1
')

test <- read.table(tc, header=TRUE)

The goal is to create a new column ("number") that numbers each row per id from 1 to n until end_yn == 1 is hit. After end_yn == 1, the numbering should start over.

Without taking the end_yn == 1 condition into account the rows can be numbered using:

DT <- data.table(test)
DT[, id := seq_len(.N), by = id]

However the expected outcome should be:

id              time       end_yn   number
abc             10         0        1
abc             11         0        2
abc             12         1        3 
abc             13         0        1 
def             10         0        1
def             15         1        2
def             16         0        1
def             17         0        2
def             18         1        3

How to incorporate the end_yn == 1 condition?

Max van der Heijden
  • 1,095
  • 1
  • 8
  • 16

1 Answers1

5

I'm guessing there are different ways to do this, but here's one:

DT[, cEnd := c(0,cumsum(end_yn)[-.N])] # carry the end value forward

DT[, number := seq_len(.N), by = "id,cEnd"] # create your sequence

DT[, cEnd := NULL] # remove the column created above

Setting id as the key for DT might be worth while.

BenBarnes
  • 19,114
  • 6
  • 56
  • 74
  • Very smart to carry the end value forward. This is exactly what I need and works much faster than my initial solution. Thanks! – Max van der Heijden Oct 19 '12 at 10:02
  • What if I would want to give all entries the same number until the end_yn condition is hit? So give number the value 1 until end_yn=1 for the first time, then 2 until end_yn=1 for the second time, then 3 etc. (per coockie_id). seq_len(.N) needs to be replaced I think, but I cannot figure out with what – Max van der Heijden Oct 24 '12 at 10:55
  • 1
    @MaxvanderHeijden, for that, you could try something like `DT[, number := cumsum(end_yn) + 1, by = "id"]`, but that would start numbering at `2` if `end_yn` is `1` for the first entry of an `id`. Try searching the data.table tag on SO, since a similar question may very well have been asked already. – BenBarnes Oct 24 '12 at 12:04
  • Thank you. Of course I could also use the cEnd variable, since that essentially also makes a unique ID per journey. However not on a per ID basis, but that I don't really need after changing some other commands :) – Max van der Heijden Oct 24 '12 at 13:59