5

I have a look-up table of "firsts" in column d. For example, the first time the patient was admitted because of a specific disease. I would like to join this back into the main data frame via data.table on multiple other conditions.

My problem is that, unfortunately, the main data.table could have multiple records with identical joining criteria that results in multiple "firsts" per patient after the join. Real world data is messy, people!

Is it possible to do a {data.table} join on only the first matching record?

This is similar to this question, but the multiple-matches are on the main data table. I think that mult only works on when there are several entries on the table being joined in.

repex:

library(data.table)

set.seed(1724)

d1 <- data.table(a = c(1, 1, 1),
                 b = c(1, 1, 2),
                 c = sample(1:10, 3))

d2 <- data.table(a = 1, b = 1, d = TRUE)

d2[d1, on = c("a", "b")]

   a b    d c
1: 1 1 TRUE 4
2: 1 1 TRUE 8
3: 1 2   NA 2

desired output

   a b    d c
1: 1 1 TRUE 4
2: 1 1   NA 8
3: 1 2   NA 2
Paul
  • 2,877
  • 1
  • 12
  • 28
  • 1
    The easy solution would be to index every row and join on this also (the first table is a filtered version of the main one), but keen to know if there is an elegant solution. – Paul Feb 12 '21 at 07:44
  • Post your comment as an answer, please, still a good alternative. – zx8754 Feb 12 '21 at 08:07
  • 1
    You could find the positions that need to be set to NA with the `duplicated` command. – bdecaf Feb 12 '21 at 08:11
  • 2
    Seems like you want an 'update join': `d1[d2, on = .(a, b), d := d, mult = "first"]` – Henrik Feb 12 '21 at 09:16
  • @Henrik - nice - `d1[d2, on = .(a, b), (names(d2)) := ..d2, mult = "first"]` or something similar maybe to do all variables if there are many? – thelatemail Feb 12 '21 at 09:26
  • @Henrik - I think it does in this instance, but I'm trying (maybe unsuccessfully) to account for the case where there might be a variable/dataset name clash. – thelatemail Feb 12 '21 at 09:46

3 Answers3

4
library(data.table)
set.seed(1724)
d1 = data.table(a = c(1, 1, 1), b = c(1, 1, 2), c = sample(1:10, 3))
d2 = data.table(a = 1, b = 1, d = TRUE)

d1[, i1:=seq_len(.N), by=c("a","b")]
d2[, i2:=seq_len(.N), by=c("a","b")]

d2[d1, on = c("a","b","i2==i1")][, "i2":=NULL][]
#       a     b      d     c
#   <num> <num> <lgcl> <int>
#1:     1     1   TRUE     4
#2:     1     1     NA     8
#3:     1     2     NA     2
jangorecki
  • 16,384
  • 4
  • 79
  • 160
3

One way would be to turn the values to NA after join.

library(data.table)

d3 <- d2[d1, on = c("a", "b")]
d3[, d:= replace(d, seq_len(.N) != 1, NA), .(a, b)]
d3

#   a b    d c
#1: 1 1 TRUE 4
#2: 1 1   NA 8
#3: 1 2   NA 2
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

The easy solution would be to index every row and join on this also (the d2 is a filtered version of d1):

library(data.table)

set.seed(1724)

d1 <- data.table(a = c(1, 1, 1),
                 b = c(1, 1, 2),
                 c = sample(1:10, 3))

d1[, rid := seq(to = .N)]
d2 <- d1[, .SD[1], by = c("a"), .SDcols = c("b", "rid")][, d := TRUE] # UPDATE

d2[d1, on = c("a", "b", "rid")]
Paul
  • 2,877
  • 1
  • 12
  • 28