0

I have two data tables. One, let's call it procedures, contains the procedures done to a patient, and has a patient pseudonym in each row.

Pseudonym    procedure 
AAA          CT scan 
AAA          glucose infusion 
AAB          abdominal surgery 

The second one, a lookup table called allo.dt, assigns each patient pseudonym a special date, such as

pseudonym   erstes.tx.datum 
AAA         2014-01-10
AAB         2018-08-17

What I want is to add a column to the first data table, where in each row, I have the value of "erstes.tx.datum" from the second one.

Pseudonym    procedure          tx.datum 
AAA          CT scan            2014-01-10
AAA          glucose infusion   2014-01-10
AAB          abdominal surgery  2018-08-17

I am very new to data tables and failed to do it with a join.

setkey(procedures, Pseudonym)
setkey(allo.dt, pseudonym)
b <- merge(a, allo.dt, all.x=TRUE)
> A non-empty vector of column names for `by` is required. 

This was surprising, when searching for doing a left join on data.table I didn't find any examples that needed a "by". I suspect that maybe the problem is that the pseudonym is obviously not the key in the procedures table, since it is not unique, but this is not something I can change.

Instead, I tried to use sapply. There I had two problems. First, the performance was terrible. Second, a naive access to the date gave me back a table cell instead of the value inside it, which led to a coercion of my date to an integer.

class(allo.dt[allo.dt$pseudonym=="AAA","erstes.tx.datum"])
[1] "data.table" "data.frame" 

I found an ugly workaround by turning that table cell into a single-element matrix, but it is practically illegible, and the performance is even worse. The whole solution I have is

procedures$erstes.tx.datum -> unlist(sapply(ops.procedures$Pseudonym, function(x) {as.matrix(allo.dt[allo.dt$pseudonym==x,"erstes.tx.datum"])[1,1]}))

I am pretty sure there must be a much easier solution here, because that's what data.table is for, but I'm stuck. What would be a good way to do this?

I am aware that this has been asked in Merging a data frame and a lookup table in r, retaining all records from data frame, but there the OP uses data frames and the answers suggest using plyr. I am looking specifically for a solution based on data.table or base r, without using plyr.

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
Rumi P.
  • 1,688
  • 3
  • 23
  • 31
  • Since the column names are different (in capitalization), you need to tell `merge` how to line them up. Add `by.x = "Psuedonym", by.y = "psuedonym"`. Or, more simply, modify the tables so the column to match on has the same name. – Gregor Thomas Dec 19 '19 at 14:08
  • @Gregor--reinstateMonica-- so, you mean that the "setkey" is not doing the "tell merge how to line them up" part? Because that's how I understood the tutorial at https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html and other examples I found. – Rumi P. Dec 19 '19 at 14:13
  • 2
    You could do `dt2[dt1, tx.datum := tx.datum, on = .(Pseudonym = pseudonym)]` (data.table syntax) – David Arenburg Dec 19 '19 at 14:15
  • I mean that the error message you get, *"A non-empty vector of column names for `by` is required."* is pretty clearly saying you need to specify columns in the `by` argument, so I would try doing that. And in the example you link, the columns already have the same name. – Gregor Thomas Dec 19 '19 at 14:20

3 Answers3

2

Here is a solution with base R, where match() is used

df1$"tx.datum" <- df2$erstes.tx.datum[match(df1$Pseudonym,df2$pseudonym)]

such that

> df1
  Pseudonym         procedure   tx.datum
1       AAA           CT scan 2014-01-10
2       AAA  glucose infusion 2014-01-10
3       AAB abdominal surgery 2018-08-17
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
1

When the columns you want to merge by do not have the same name in both data frames, you need to specify how they should line up. In merge, this is done with the by argument. You can also use data table's [ syntax for merging, in which uses an on argument.

Whether or not you set key, either of these will work:

merge(proc, allo, by.x = "Pseudonym", by.y = "pseudonym")
proc[allo, on = .(Pseudonym = pseudonym)]

So, what does setkey do? Most importantly, it will speed up any merges involving key columns. As far as the merge defaults, we can look at ?data.table::merge, which begins:

...by default, it attempts to merge

  • at first based on the shared key columns, and if there are none,

  • then based on key columns of the first argument x, and if there are none,

  • then based on the common columns between the two data.tables.

Set the by, or by.x and by.y arguments explicitly to override this default.

This is different than base::merge, in that base::merge will always try to merge on all shared columns. data.table::merge will prioritize shared columns that are keyed to merge on. None of these will attempt to merge columns with different names.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • 1
    Thank you for the explanation! I was mislead by a comment in a tutorial which I interpreted to say that "setkey" does the job. Now it's clear. – Rumi P. Dec 20 '19 at 08:16
0

Here is tidyverse based solution. It easily allows joining with mismatching column names.

library(tidyverse)
library(dplyr)

procedures%>%
left_join(allo.dt,by=c("Pseudonym"="pseudonym"))
costebk08
  • 1,299
  • 4
  • 17
  • 42
  • Code-only answers are generally frowned upon on this site. Could you please edit your answer to include some comments or explanation of your code? Explanations should answer questions like: What does it do? How does it do it? Where does it go? How does it solve OP's problem? See: [How to anwser](https://stackoverflow.com/help/how-to-answer). Thanks! – Eduardo Baitello Dec 19 '19 at 14:29