I have two data.tables, DT and L:

> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9,key="x")
> L=data.table(yv=c(1L:8L,12L),lu=c(letters[8:1],letters[12]),key="yv")

> DT
   x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9

> L
   yv lu
1:  1  h
2:  2  g
3:  3  f
4:  4  e
5:  5  d
6:  6  c
7:  7  b
8:  8  a
9: 12  l

I would like to independently look up the corresponding value of lu from L for column y and for column v in DT. The following syntax produces the correct result, but it is cumbersome to write and hard to understand at a glance later:

> L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv.1,v=yv,lu.1=lu.1,lu.2=lu)]
   x y v lu.1 lu.2
1: a 1 1    h    h
2: a 3 2    f    g
3: a 6 3    c    f
4: b 1 4    h    e
5: b 3 5    f    d
6: b 6 6    c    c
7: c 1 7    h    b
8: c 3 8    f    a
9: c 6 9    c   NA

(Edit: the original post had `L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv,v=yv.1,lu.1=lu,lu.2=lu.1)]` above, which incorrectly mixed up the y and v columns and their looked-up values.)

In SQL this would be simple/straightforward:

SELECT DT.*, L1.lu AS lu1, L2.lu AS lu2
FROM DT
LEFT JOIN L AS L1 ON DT.y = L1.yv
LEFT JOIN L AS L2 ON DT.v = L2.yv

Is there a more elegant way to use data.table to perform multiple joins? Note that I'm joining one table to another table twice in this example, but I am also interested in joining one table to multiple different tables.

– dnlbrky

1 Answer

Great question. One trick is that `i` doesn't have to be keyed; only `x` must be keyed.

There might be better ways. How about this:

> cbind( L[DT[,list(y)]], L[DT[,list(v)]], DT )
   yv lu yv lu x y v
1:  1  h  1  h a 1 1
2:  3  f  2  g a 3 2
3:  6  c  3  f a 6 3
4:  1  h  4  e b 1 4
5:  3  f  5  d b 3 5
6:  6  c  6  c b 6 6
7:  1  h  7  b c 1 7
8:  3  f  8  a c 3 8
9:  6  c  9 NA c 6 9

or, to illustrate, this is the same:

> cbind( L[J(DT$y)], L[J(DT$v)], DT )
   yv lu yv lu x y v
1:  1  h  1  h a 1 1
2:  3  f  2  g a 3 2
3:  6  c  3  f a 6 3
4:  1  h  4  e b 1 4
5:  3  f  5  d b 3 5
6:  6  c  6  c b 6 6
7:  1  h  7  b c 1 7
8:  3  f  8  a c 3 8
9:  6  c  9 NA c 6 9

merge could also be used, if the following feature request were implemented:

FR#2033 Add by.x and by.y to merge.data.table
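
For reference, a sketch of what that might look like if the FR were implemented, assuming merge.data.table gained by.x/by.y arguments mirroring base R's merge.data.frame (the by.x/by.y interface is the assumption here, not a working call at the time of writing):

# Sketch only: by.x/by.y on merge.data.table is the assumed FR#2033 interface.
# Two successive left joins of DT against L, one per lookup column.
ans = merge(DT, L, by.x="y", by.y="yv", all.x=TRUE)
setnames(ans, "lu", "lu.1")   # lu.1 = lookup of y
ans = merge(ans, L, by.x="v", by.y="yv", all.x=TRUE)
setnames(ans, "lu", "lu.2")   # lu.2 = lookup of v

Note that merge re-sorts the result by the join columns rather than preserving DT's row order, which is one reason the X[i] forms above can be preferable.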

– Matt Dowle
  • Thanks for the cbind tip and the quick response. This method is less nested and easier to read than my original try. To clean up the resulting column names, I'll use something like `cbind(DT, L[J(DT$y)][,list(lu.1=lu)], L[J(DT$v)])[,list(x,y,v,lu.1,lu.2=lu)]`. My background in SQL has conditioned me to never rely on the row sort order, so combining the columns feels wrong, but obviously it works in R. I need to be "thinking in R" more. Huge fan of your work, BTW. – dnlbrky Jan 02 '13 at 20:03
  • Would implementing FR#2033 allow multiple tables to be merged? As I understand it, `merge` is used for two data frames, not many. I see some options like `merge_all` and `merge_recurse` in reshape, but I can't find many examples (especially with several sets of foreign keys), so I'm not sure how to use them, or whether they would even work in this example. – dnlbrky Jan 02 '13 at 20:14
  • Generally in R and in data.table we don't use as many tables. Factors are similar to foreign keys in SQL, and these days, since character is cached automatically by R in its global cache, character is really efficient. Big flat denormalised tables are generally good. data.table only groups the columns that `j` needs, so there is no disadvantage to many columns like there is in (row-stored) SQL. – Matt Dowle Jan 02 '13 at 20:21
  • Actually, thinking about it, X[Y[Z]] or X[Y][Z] are ways to join multiple tables. – Matt Dowle Jan 02 '13 at 20:23
  • Whenever I say "no disadvantage", I don't mean it generally. I mean it within the context of that comment. – Matt Dowle Jan 02 '13 at 20:27
  • But yes, you can rely on row order in R; that's guaranteed, unlike what you're used to in SQL. But the cbind method probably falls over if you have multiple matches or no matches, so you might need to embellish it by setting mult and nomatch appropriately, to ensure 1-1. That's where X[Y[Z]] might be better. – Matt Dowle Jan 02 '13 at 20:34
  • That might be a bit quick to accept. I know I'm the main author, but others often have better answers than I do :). Accepting takes it off the radar. Also, don't forget you can always use SQL as-is in sqldf. – Matt Dowle Jan 02 '13 at 21:02
  • I think my original try was using the nested X[Y[Z]] format, but it looks messy since I was using the setkeys to take care of the factors/foreign keys. As for denormalizing, I have 30M+ rows so I thought it would be more manageable to keep in memory if I normalized it, but maybe there's no advantage. Yeah, I could use sqldf but I think it is slower, right? (Plus, I like learning something new if it is potentially better.) Ok, if you say so I'll un-accept your answer :). – dnlbrky Jan 02 '13 at 21:30
  • Oh yes, your original without setkey is that, isn't it. If you replace the setkey with j=list() as I showed, you'll get fewer columns in the final result, though. To solve that, you can place more columns in j=list(...) than are used in the join, renaming them at the same time, and they should carry through (see the sketch after this thread). Admittedly this area is trickier than SQL, but then you have more control to join in different orders, with guaranteed row order at each stage (no need to self-join and order by). – Matt Dowle Jan 02 '13 at 22:42
  • Btw, 30M rows is fairly small; it depends on the number of columns, of course. It's GB that count. A 10GB data.table on a 64GB RAM machine is standard. People use data.table up to the 2 billion row limit (although that limit has been removed in R3). – Matt Dowle Jan 02 '13 at 23:00
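
To make the comment-thread idea concrete, here is a minimal sketch of the chained X[Y] idiom described above, assuming (as in the answer's L[DT[,list(y)]] form) that the leading column of an unkeyed i is matched against L's key, with the other j=list(...) columns carried through:

# Each join puts the next lookup column first in j=list(...), carries the
# remaining columns along, and renames as it goes. Only L needs a key.
L[ L[DT[, list(y, x, v)]                 # join 1: y matched against L's key yv
    ][, list(v, x, y=yv, lu.1=lu)]       # put v first for join 2; keep lookup of y
 ][, list(x, y, v=yv, lu.1, lu.2=lu)]    # final column order and names

Unmatched rows (v=9 here) come through as NA under the default nomatch=NA; as noted above, setting mult and nomatch explicitly guards the 1-1 assumption.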