2

I have the following example data:

library(data.table)
set.seed(42)
t <- data.table(time=1:1000, period=round(runif(100,1,5)))
p <- data.table(id=1:10, cut=sample(1:100,5))


> t[62:71]
    time period
 1:   62      5
 2:   63      4
 3:   64      3
 4:   65      4
 5:   66      2
 6:   67      2
 7:   68      4
 8:   69      4
 9:   70      2
10:   71      1

> head(p)
   id cut
1:  1  63
2:  2  22
3:  3  99
4:  4  38
5:  5  91
6:  6  63

where t gives some vector of periods associated with time points, and p gives for each person a cutoff in time.

For each person in p, I would like to start at the person's cutoff and create a sequence of 4 time points by concatenating the periods. For example, for person 1, starting at time 63, the sequence would be 63, 63+4=67, 67+2=69 and 69+4=73.

Ideally, the output would then be:

> head(res)
   id  t1   t2   t3   t4
    1  63   67   69   73
    2  22   24   29   32
    3  99  103  105  109
    4  38   40   43   44
    5  91   95  100  103
    6  63   67   69   73

I learned before how to create the sequences using accumulate::purrr (iterative cumsum where sum determines the next position to be added). However, I wonder whether something like this can be done simultaneously for different persons using data.table or other packages but avoiding for-loops as the datasets are rather large.


edit: version where time values do not coincide with row indicies

library(data.table)
set.seed(42)
t <- data.table(time=1001:2000, period=round(runif(100,1,5)))
p <- data.table(id=1:10, cut=sample(1:100,5))

is similar as above, except for

> t[62:71]
    time period
 1: 1062      5
 2: 1063      4
 3: 1064      3
 4: 1065      4
 5: 1066      2
 6: 1067      2
 7: 1068      4
 8: 1069      4
 9: 1070      2
10: 1071      1

where t$time[i] does not equal i, which prohibits Jaap's first solution.

bumblebee
  • 1,116
  • 8
  • 20

2 Answers2

3

For-loops aren't necessarily bad or inefficient. When used correctly, they can be an efficient solution for your problem.

For your current problem I would use a for-loop with the -package which is efficient because the data.table is updated by reference:

res <- p[, .(id, t1 = cut)]

for(i in 2:4) {
  res[, paste0("t",i) := t[res[[i]], time + period] ]
}

which gives:

> res
    id t1  t2  t3  t4
 1:  1 63  67  69  73
 2:  2 22  24  29  32
 3:  3 99 103 105 109
 4:  4 38  40  43  44
 5:  5 91  95 100 103
 6:  6 63  67  69  73
 7:  7 22  24  29  32
 8:  8 99 103 105 109
 9:  9 38  40  43  44
10: 10 91  95 100 103

Alternatively, you can choose to update p as follows:

for(i in 2:4) {
  p[, paste0("t",i) := t[p[[i]], time + period]]
}
setnames(p, "cut", "t1")

which gives the same result.


For the updated example data, you should change the above method to:

for(i in 2:4) {
  p[, paste0("t",i) := t[match(p[[i]], t$time), time + period]]
}
setnames(p, "cut", "t1")
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Thank you for the input and this idea! I tried implementing this solution when `time` and `cut` are POSIXct and `period` is difftime, but I keep getting the error that `i has not evaluated to logical, integer or double`. Am I correct that this implies that POSIXct objects cannot be used to key the `t` table? – bumblebee Feb 18 '19 at 16:24
  • Question solved, it was a simple joining-problem. I re-named `setnames(p, "cut", "time")` and added changed the loop to `t[p[[i]], time + period, on=.(time)]]` to get `t` and `p` to join on the joint `time` variable. Am I correct assuming that this costs time compared with directly indexing the `t` table? – bumblebee Feb 18 '19 at 17:00
  • In my solution I could use just the values to filter `t` because they were numeric and referring to rows. When that's not the case, you'll have to use other methods like joining as you showed in your 2nd comment. – Jaap Feb 18 '19 at 20:29
  • I was mistaken — my adjustment does not work (`logical error. i is not a data.table, but 'on' argument is provided.`) and using `p` in place of `p[[i]]` produces the same values for `t2`, `t3`, and `t4`. Somehow, I would need to make the `on=.(time=...)` part interactive and get the (changing) column name in the place of `...`. I have tried `p[, paste0("t",i) := t[p, time + period, on=.(time=as.name(names(p)[i]))]]` but this would not work. Is there any way? – bumblebee Feb 19 '19 at 17:00
  • @bumblebee Could you add example data to your question that mimics the situation you described in the comments above? I will then look at it tomorrow morning – Jaap Feb 19 '19 at 17:06
0

I would use a while() loop.

while (ncol(p) - 1 < 4) {
  p <- cbind(p, p[[ncol(p)]] + t$period[p[[ncol(p)]]])
} 

> head(p)
   id cut  V2  V2  V2
1:  1  63  67  69  73
2:  2  22  24  29  32
3:  3  99 103 105 109
4:  4  38  40  43  44
5:  5  91  95 100 103
6:  6  63  67  69  73
jay.sf
  • 60,139
  • 8
  • 53
  • 110