1

dt1 has val of interest in the correct sequence

library(data.table)
dt1 <- data.frame(id = 1,
                  key = c(paste0("a_",1:6),paste0("b_",1:6)), 
                  val = c(122,128,134, rep(NA,3),c(110,112,114),rep(NA,3)),
                  var = c(rep("a",6),rep("b",6)))

   id key val var
1   1 a_1 122   a
2   1 a_2 128   a
3   1 a_3 134   a
4   1 a_4  NA   a
5   1 a_5  NA   a
6   1 a_6  NA   a
7   1 b_1 110   b
8   1 b_2 112   b
9   1 b_3 114   b
10  1 b_4  NA   b
11  1 b_5  NA   b
12  1 b_6  NA   b

dt2 also has the val of interest in the correct sequence but also some extra val

dt2 <- data.frame(id = 1,
                  key = c(paste0("a_",c(1,3:6)),paste0("b_",c(2,4:6))),
                  val = c(122,127,122,128,134,110,110,112,114),
                  var = c(rep("a",5),rep("b",4)))

  id key val var
1  1 a_1 122   a
2  1 a_3 127   a
3  1 a_4 122   a
4  1 a_5 128   a
5  1 a_6 134   a
6  1 b_2 110   b
7  1 b_4 110   b
8  1 b_5 112   b
9  1 b_6 114   b

I want to match the sequence of values from dt2 to the sequence of values in dt1 and ignore extra values that are in dt2. I have tried a rolling join backwards as values of interest in dt2 are culstered towards the end of the sequence.

Current attempt:

setDT(dt1,key = c("id","var","val"))
setDT(dt2,key = c("id","var","val"))
dt1[dt2, roll = -Inf]
   id key val var i.key
1:  1 a_1 122   a   a_1 # wrong
2:  1 a_1 122   a   a_4 
3:  1 a_2 127   a   a_3 # wrong
4:  1 a_2 128   a   a_5 
5:  1 a_3 134   a   a_6 
6:  1 b_1 110   b   b_2 # wrong
7:  1 b_1 110   b   b_4 
8:  1 b_2 112   b   b_5
9:  1 b_3 114   b   b_6

It looks like duplicate values from dt2 that are in dt1 but not in the sequence I am looking for are causing an issue. Also I want to keep i.key to know the original key as it will be used for other processing. I have also tried: merge(dt1,dt2)

Desired output:

id key val var i.key
 1 a_1 122  a  a_4
 1 a_2 128  a  a_5
 1 a_3 134  a  a_6
 1 b_1 110  b  b_4
 1 b_2 112  b  b_5
 1 b_3 114  b  b_6

I would be grateful for some guidance

Community
  • 1
  • 1
e.matt
  • 836
  • 1
  • 5
  • 12
  • is the sequence in consecutive positions? for e.g. do you consider c(122,128,134) from `dt1` to match with c(122, 345, 128, 678, 134) ? – chinsoon12 May 23 '20 at 01:44
  • The sequence in `dt1` is consecutive so effectively trying to match `c(122,128,134)` as a sequence and not value by value in `dt2`by `id` and `var` thanks – e.matt May 23 '20 at 05:47
  • are sequences of any particular length of in dt1? Are they surrounded by NAs? sorry for the multiple qns as this is a very interesting qn – chinsoon12 May 23 '20 at 08:24
  • Seems like this problem is related to Multiple Sequence Alignment: https://stackoverflow.com/questions/4497747/how-to-perform-basic-multiple-sequence-alignments-in-r – chinsoon12 May 23 '20 at 09:15
  • No problem, sequences in dt1 are 2:6 values in length,then NA will always fill to 6, after the seqence never before. Therefore, in dt1 2 values will always be followed by 4NA,3 values followed by 3NA etc. – e.matt May 23 '20 at 09:20

3 Answers3

3

Here's an approach that first joins back onto dt2 to filter out key == key pairs because you can't do a non-equi join and rolling join at the same time in data.table.

Also remember that only the last column in your on = argument can be rolling, so I recommend always explicitly defining it.

library(data.table)
dt2[dt1, on = .(id,var,val),nomatch = 0][
  key != i.key][dt1,on = .(id,var,val,i.key=key), roll = -Inf, nomatch = 0][
    ,.SD[.N],by = .(id,val,var)][,.(id,val,var,key1 = i.key,key2 = key)]
   id val var key1 key2
1:  1 122   a  a_1  a_4
2:  1 128   a  a_2  a_5
3:  1 134   a  a_3  a_6
4:  1 110   b  b_1  b_4
5:  1 112   b  b_2  b_5
6:  1 114   b  b_3  b_6
Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
  • @jangorecki based on your tag history, you love `data.table` even more than me. I wish there was a less messy way to do multi-line `data.table` code. – Ian Campbell May 22 '20 at 18:39
  • `mergelist` is coming to ease some join use cases, but unfortunatelly not non-equi joins (yet): https://github.com/Rdatatable/data.table/pull/4370 – jangorecki May 22 '20 at 18:42
  • This looks promising. Looking forward to having it in the CRAN version. – Ian Campbell May 22 '20 at 18:45
  • Thanks, was really struggling to get anywhere with this – e.matt May 22 '20 at 19:12
  • 1
    @chinsoon12, the column `key` is class `character` and it works OK. From `help("data.table")`: "roll: Applies to the last join column, generally a date but can be **any ordered variable**, irregular and including gaps." – Ian Campbell May 23 '20 at 14:47
  • @IanCampbell, thanks! it was my fault. i was using `setDT` without the `key=` – chinsoon12 May 23 '20 at 21:56
3

Assuming that we need to find the whole sequence from dt1 within dt2, here is another option:

setDT(dt1)
setDT(dt2)
cols <- c("sid", "cnt")

#create sequence index and count of non-NAs
DT1 <- dt1[!is.na(val)][, (cols) := .(seq(.N), .N), .(id, var)]

#inner join to id, var, val that exists in both
DT2 <- DT1[dt2, on=.(id, var, val), nomatch=0L, .(id, var, val, sid, cnt, i.key)]

#identify rows with consecutive seq index and filter for those rows
consec <- DT2[, if(.N == cnt[1L]) .SD, .(cs=cumsum(c(0L, diff(sid)!=1L)))]

#perform join to get desired output
DT1[consec, on=.(id, var, val)][, c(cols, "cs", "i.cnt") := NULL][]

output:

   id key val var i.key
1:  1 a_1 122   a   A_4
2:  1 a_2 128   a   A_5
3:  1 a_3 134   a   A_6
4:  1 b_1 110   b   B_4
5:  1 b_2 112   b   B_5
6:  1 b_3 114   b   B_6
7:  1 c_1 110   c   C_3
8:  1 c_2 112   c   C_5
9:  1 c_3 114   c   C_6

data with additional group based on comment:

library(data.table)
dt1 <- data.frame(id = 1,
  key = c(paste0("a_",1:6),paste0("b_",1:6),paste0("c_",1:6)),
  val = c(122,128,134, rep(NA,3),c(110,112,114),rep(NA,3), c(110,112,114),rep(NA,3)),
  var = c(rep("a",6),rep("b",6),rep("c",6)))
dt2 <- data.frame(id = 1,
  key = c(paste0("A_",c(1,3:6)),paste0("B_",c(2,4:6)),paste0("C_",c(2,3:6))),
  val = c(122,127,122,128,134,110,110,112,114,134,110,200,112,114),
  var = c(rep("a",5),rep("b",4),rep("c",5)))
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • Thanks, I should have also said that in dt2 there can be an incorrect value between the correct sequence that dt1 has. For example if we change the last values in dt2 to ..134,110,200,112,114) is there a condition to pass the 200 and still look for 110,112,114? – e.matt May 23 '20 at 11:48
  • @e.matt i added more data based on your comment and updated the code to address that. But I am assuming that this incorrect value does not exists in dt1 – chinsoon12 May 23 '20 at 12:47
0

There is probably nicer way to do it, but I think this will work:

merge(dt1, dt2, by = c( "val","var")) %>%
  arrange(key.x) %>%
  filter(key.x != key.y) %>%
  mutate (id = id.x, key = key.x, i.key = key.y) %>%
  select (id, key, val, var, i.key)

Result:

  id key val var i.key
1  1 a_1 122   a   a_4
2  1 a_2 128   a   a_5
3  1 a_3 134   a   a_6
4  1 b_1 110   b   b_2
5  1 b_1 110   b   b_4
6  1 b_2 112   b   b_5
7  1 b_3 114   b   b_6
Shan R
  • 521
  • 4
  • 8