I'm using a data.table in R to store a time series. I want to return a subset in which each selected row is at least N seconds after the previously selected row. For example, if I have

library(data.table)
x <- data.table(t=c(0,1,3,4,5,6,7,10,16,17,18,20,21), v=1:13)
x
     t  v
 1:  0  1
 2:  1  2
 3:  3  3
 4:  4  4
 5:  5  5
 6:  6  6
 7:  7  7
 8: 10  8
 9: 16  9
10: 17 10
11: 18 11
12: 20 12
13: 21 13

and I want to sample rows that are at least 5 seconds apart, starting from the first row, then I should get a data.table with time/value pairs:

y <- x[...something...]
y
     t  v
 1:  0  1
 2:  5  5
 3: 10  8
 4: 16  9
 5: 21 13

The time samples aren't necessarily regularly spaced, so I can't just take every Mth row. Of course, I could do this by looping through the data.table rows manually, but I'm wondering if there's a more convenient way to express this using data.table's indexing.

Henrik
Anthony

1 Answer

Here are a couple of ways to use rolling joins to find the set of rows, w, in your subset:

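t_plus <- 5  # minimum gap, in seconds, between selected rows

# Option 1: one rolling join per selected row
w   <- integer(0)
nxt <- 1L  # always start from the first row
while (!is.na(nxt)) {
  w   <- c(w, nxt)
  # roll = -Inf rolls the next observation backward, so this returns the
  # first row with t >= t[nxt] + t_plus, or NA if there is none
  nxt <- x[.(t[nxt] + t_plus), on = .(t), roll = -Inf, which = TRUE]
}

# Option 2: join once for all rows, then follow the chain of "next" rows
# w0[i] is the first row at least t_plus seconds after row i (NA if none)
w0  <- x[.(t + t_plus), on = .(t), roll = -Inf, which = TRUE]

w   <- integer(0)
nxt <- 1L
while (!is.na(nxt)) {
  w   <- c(w, nxt)
  nxt <- w0[nxt]
}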

Then you can subset like x[w].
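
To check the result against the question's expected output, the join-once approach can be wrapped in a small helper. This is only a sketch: `pick_spaced` is a hypothetical name, and it assumes the time column is called `t`, is sorted in increasing order with no duplicates, and that the gap is positive.

```r
library(data.table)

x <- data.table(t = c(0, 1, 3, 4, 5, 6, 7, 10, 16, 17, 18, 20, 21), v = 1:13)

# Hypothetical helper (not part of data.table): row indices of a subset whose
# times are at least `gap` apart, walking forward from the first row.
pick_spaced <- function(x, gap) {
  # For every row, the first row whose t is >= that row's t + gap (NA if none)
  w0 <- x[.(t + gap), on = .(t), roll = -Inf, which = TRUE]
  w   <- integer(0)
  nxt <- 1L
  while (!is.na(nxt)) {
    w   <- c(w, nxt)
    nxt <- w0[nxt]
  }
  w
}

y <- x[pick_spaced(x, 5)]
y$t  # 0 5 10 16 21
y$v  # 1 5 8 9 13
```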


Comments

In principle, there could be other subsets that satisfy the OP's condition "at least 5 seconds apart"; this is just the one found by iterating from the first row forward.
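
As a small illustration of that point, here is a sketch with a hypothetical checker, `is_spaced`, applied to the times from the question. Both the subset found by iterating from the first row and a hand-picked alternative that starts from the second row pass the "at least 5 seconds apart" test.

```r
times <- c(0, 1, 3, 4, 5, 6, 7, 10, 16, 17, 18, 20, 21)

# Hypothetical checker: TRUE if every consecutive pair is at least `gap` apart
is_spaced <- function(selected_times, gap) all(diff(selected_times) >= gap)

from_first  <- c(1L, 5L, 8L, 9L, 13L)  # rows found by iterating from row 1
alternative <- c(2L, 6L, 9L, 13L)      # hand-picked rows starting at row 2

is_spaced(times[from_first], 5)   # TRUE: times 0, 5, 10, 16, 21
is_spaced(times[alternative], 5)  # TRUE: times 1, 6, 16, 21
```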

The second way is based on @DavidArenburg's answer to the Q&A Henrik linked above. Although the question seems the same, I couldn't get that approach to work fully here.

Generally, it's a bad idea to grow things in a loop in R (like I'm doing with w here). If you're running into performance problems, that might be a good area to improve in this code.
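
One way to avoid growing w, sketched here for the join-once variant: the subset can never have more rows than x, so w can be preallocated at that length and trimmed afterwards. The counter `n` is introduced only for this illustration, and the sketch assumes t is sorted without duplicates and t_plus is positive (otherwise the loop would not terminate).

```r
library(data.table)

x <- data.table(t = c(0, 1, 3, 4, 5, 6, 7, 10, 16, 17, 18, 20, 21), v = 1:13)
t_plus <- 5

# First row at least t_plus seconds after each row (NA if none)
w0 <- x[.(t + t_plus), on = .(t), roll = -Inf, which = TRUE]

w   <- integer(nrow(x))  # preallocate: at most nrow(x) rows can be selected
n   <- 0L                # number of rows selected so far
nxt <- 1L
while (!is.na(nxt)) {
  n    <- n + 1L
  w[n] <- nxt            # fill in place instead of c(w, nxt)
  nxt  <- w0[nxt]
}
w <- w[seq_len(n)]       # drop the unused tail
```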

Frank
  • Seems like `findInterval` should also work here, but I can't figure it out. – Frank Jan 23 '17 at 22:59
  • Yikes, I take it the answer is "no", there is not a convenient indexing paradigm for this. I'll probably just outsource it to Rcpp if my R implementation becomes a bottleneck. Thanks for the help. – Anthony Jan 25 '17 at 19:07
  • @Anthony Yeah, I think you're right about there not being a convenient way, but I also think you're underestimating the complexity of the rule "maximal set of rows that are all at least five seconds apart". If you don't care about the "maximal set" part, then it gets a lot easier and doesn't need to be computed iteratively: `x[x[.(t = seq(t[1L], t[.N], by=5*2)), on=.(t), roll=TRUE, which=TRUE, mult="first"]]` Choose up to one number from each 10-second interval et voila -- your condition is satisfied. Anyway, if you find a good Rcpp way, maybe you can post it as an answer so we can see it. – Frank Jan 25 '17 at 19:49
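
On the `findInterval` idea raised in the comments, here is a minimal base R sketch of how it might replace the rolling join for building the "next row" lookup. It assumes the times are sorted in increasing order and the gap is positive; `next_idx` is a name made up for this example. With `left.open = TRUE`, `findInterval` counts the elements strictly below each target, so adding one gives the first index at or above the target.

```r
times <- c(0, 1, 3, 4, 5, 6, 7, 10, 16, 17, 18, 20, 21)
gap   <- 5

# First index whose time is >= times[i] + gap, for every i
next_idx <- findInterval(times + gap, times, left.open = TRUE) + 1L
next_idx[next_idx > length(times)] <- NA_integer_  # no such row

# Follow the chain from the first row, as in the answer
w   <- integer(0)
nxt <- 1L
while (!is.na(nxt)) {
  w   <- c(w, nxt)
  nxt <- next_idx[nxt]
}

times[w]  # 0 5 10 16 21
```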