How to join data to only the first matching row with {data.table} in R

Question

I have a look-up table of "firsts" in column d. For example, the first time the patient was admitted because of a specific disease. I would like to join this back into the main data frame via data.table on multiple other conditions.

My problem is that, unfortunately, the main data.table could have multiple records with identical joining criteria that results in multiple "firsts" per patient after the join. Real world data is messy, people!

Is it possible to do a {data.table} join on only the first matching record?

This is similar to this question, but the multiple-matches are on the main data table. I think that mult only works on when there are several entries on the table being joined in.

repex:

library(data.table)

set.seed(1724)

d1 <- data.table(a = c(1, 1, 1),
                 b = c(1, 1, 2),
                 c = sample(1:10, 3))

d2 <- data.table(a = 1, b = 1, d = TRUE)

d2[d1, on = c("a", "b")]

   a b    d c
1: 1 1 TRUE 4
2: 1 1 TRUE 8
3: 1 2   NA 2

desired output

   a b    d c
1: 1 1 TRUE 4
2: 1 1   NA 8
3: 1 2   NA 2

The easy solution would be to index every row and join on this also (the first table is a filtered version of the main one), but keen to know if there is an elegant solution. — Paul, Feb 12 '21 at 07:44
Post your comment as an answer, please, still a good alternative. — zx8754, Feb 12 '21 at 08:07
You could find the positions that need to be set to NA with the `duplicated` command. — bdecaf, Feb 12 '21 at 08:11
Seems like you want an 'update join': `d1[d2, on = .(a, b), d := d, mult = "first"]` — Henrik, Feb 12 '21 at 09:16
@Henrik - nice - `d1[d2, on = .(a, b), (names(d2)) := ..d2, mult = "first"]` or something similar maybe to do all variables if there are many? — thelatemail, Feb 12 '21 at 09:26
@Henrik - I think it does in this instance, but I'm trying (maybe unsuccessfully) to account for the case where there might be a variable/dataset name clash. — thelatemail, Feb 12 '21 at 09:46

score 4 · Answer 1 · answered Feb 14 '21 at 11:57

4

library(data.table)
set.seed(1724)
d1 = data.table(a = c(1, 1, 1), b = c(1, 1, 2), c = sample(1:10, 3))
d2 = data.table(a = 1, b = 1, d = TRUE)

d1[, i1:=seq_len(.N), by=c("a","b")]
d2[, i2:=seq_len(.N), by=c("a","b")]

d2[d1, on = c("a","b","i2==i1")][, "i2":=NULL][]
#       a     b      d     c
#   <num> <num> <lgcl> <int>
#1:     1     1   TRUE     4
#2:     1     1     NA     8
#3:     1     2     NA     2

answered Feb 14 '21 at 11:57

jangorecki

16,384
4
79
160

1

First time I see this syntax (`"i2==i1"`) for `on`. – s_baldur Feb 14 '21 at 12:03

score 3 · Accepted Answer · answered Feb 12 '21 at 07:46

3

One way would be to turn the values to NA after join.

library(data.table)

d3 <- d2[d1, on = c("a", "b")]
d3[, d:= replace(d, seq_len(.N) != 1, NA), .(a, b)]
d3

#   a b    d c
#1: 1 1 TRUE 4
#2: 1 1   NA 8
#3: 1 2   NA 2

answered Feb 12 '21 at 07:46

Ronak Shah

377,200
20
156
213

score 0 · Answer 3 · answered Feb 16 '21 at 01:45

The easy solution would be to index every row and join on this also (the d2 is a filtered version of d1):

library(data.table)

set.seed(1724)

d1 <- data.table(a = c(1, 1, 1),
                 b = c(1, 1, 2),
                 c = sample(1:10, 3))

d1[, rid := seq(to = .N)]
d2 <- d1[, .SD[1], by = c("a"), .SDcols = c("b", "rid")][, d := TRUE] # UPDATE

d2[d1, on = c("a", "b", "rid")]

How to join data to only the first matching row with {data.table} in R

3 Answers3

Linked