4
require(data.table)    
set.seed(333)
t <- data.table(old=1002:2001, dif=sample(1:10,1000, replace=TRUE))
t$new <- t$old + t$dif; t$foo <- rnorm(1000); t$dif <- NULL
i <- data.table(id=1:3, start=sample(1000:1990,3))


> i
   id start
1:  1  1002
2:  2  1744
3:  3  1656

> head(t)
    old  new        foo
1: 1002 1007 -0.7889534
2: 1003 1004  0.3901869
3: 1004 1014  0.7907947
4: 1005 1011  2.0964612
5: 1006 1007  1.1834171
6: 1007 1015  1.1397910

I would like to delete time points from points such that only those rows remain where new[i] = old[i-1], giving a continuous sequence of some fixed number of time points. Ideally, this would be done for all id in i simultaneously, where start gives the starting points. For example, if we choose n=5, we should obtain

> head(ans)
    id old  new        foo
1:   1 1002 1007 -0.7889534
2:   1 1007 1015  1.1397910
3:   1 1015 1022 -1.2193670
4:   1 1022 1024  1.2039050
5:   1 1024 1026  0.4388586
6:   2 1744 1750 -0.1368320

where lines 3 to 6 cannot be inferred above and foo is a stand in for other variables that need to be kept.

Can this be done efficiently in data.table, for example, using a clever combination of joins?

PS. This question is somewhat similar to an an earlier one of mine but I have modified the situation to make it clearer.

nicola
  • 24,005
  • 3
  • 35
  • 56
bumblebee
  • 1,116
  • 8
  • 20

1 Answers1

5

It seems to me that you need help from graph algorithms. If you want to start with 1002, you can try:

require(igraph)
g <- graph_from_edgelist(as.matrix(t[,1:2]))
t[old %in% subcomponent(g,"1002","out")]
#  1: 1002 1007 -0.78895338
#  2: 1007 1015  1.13979100
#  3: 1015 1022 -1.21936662
#  4: 1022 1024  1.20390482
#  5: 1024 1026  0.43885860
# ---                      
#191: 1981 1988 -0.22054875
#192: 1988 1989 -0.22812175
#193: 1989 1995 -0.04687776
#194: 1995 2000  2.41349730
#195: 2000 2002 -1.23425666

Of course you can do the above for each start you want and limiting the results for the first n rows. For instance, we can lapply over the i$start positions and then rbindlist all the values together, declaring an id column with the i$id values. Something like:

n <- 5
rbindlist(
    setNames(lapply(i$start, function(x) t[old %in% subcomponent(g,x,"out")[1:n]]), i$id),
    idcol="id")
#    id  old  new        foo
# 1:  1 1002 1007 -0.7889534
# 2:  1 1007 1015  1.1397910
# 3:  1 1015 1022 -1.2193666
# 4:  1 1022 1024  1.2039048
# 5:  1 1024 1026  0.4388586
# 6:  2 1744 1750 -0.1368320
# 7:  2 1750 1758  0.3331686
# 8:  2 1758 1763  1.3040357
# 9:  2 1763 1767 -1.1715528
#10:  2 1767 1775  0.2841251
#11:  3 1656 1659 -0.1556208
#12:  3 1659 1663  0.1663042
#13:  3 1663 1669  0.3781835
#14:  3 1669 1670  0.2760948
#15:  3 1670 1675  0.3745026
nicola
  • 24,005
  • 3
  • 35
  • 56
  • This solution makes sense to me! There is no `t$id` because `t` is valid for all of them and IDs only set the respective starting points. If I don't want to do it by hand, how would I get the `id` in there? – bumblebee Jul 08 '19 at 17:06
  • something like `i[, t[old %in% subcomponent(g, start, "out")], by=.(id)]` – chinsoon12 Jul 09 '19 at 00:36
  • Yes! But is it possible to limit the number of each sequence to `n`? It looks like this also needs to be done in the `t[old %in% ...]` command. – bumblebee Jul 09 '19 at 12:47
  • @bumblebee You want just the first `n` rows to be returned? – nicola Jul 09 '19 at 12:58
  • Yes, thanks for adding it to your solution. I compared your last step `rbindlist(setNames(lapply(...)))` to chinsoon12's suggestion `i[, t[...], by=.(id)]` and the speed was comparable. Thanks to both of you! – bumblebee Jul 10 '19 at 16:19
  • In my actual problem, the values of `new`,`old`, and `start` are POSIXct. It appears that using them with igraph always yields the error that `subcomponent failed, Invalid vertex id`. Does anyone know how to fix this issue? – bumblebee Jul 10 '19 at 16:45