4

How can I take a data.table object, and subset it conditionally on two columns by two paired vectors in another table. For instance I have a table of flights:

library(data.table)
library(nycflights13)
flights <- data.table(flights)

And I have another data.table containing my paired columns:

subDT <- data.table(sub_orig = c("EWR", "LGA"), sub_dest = c("IAH", "ATL"))
subDT

I wish to find and subset in flights those rows where the origin is "EWR" and the destination is "IAH" exactly (i.e. like using &) and also in the same query return the rows where origin is "LGA" and where the destination is "ATL".

flights[dest %in% subDT[, sub_dest] & origin %in% subDT[, sub_orig]]

The above contains the data I want i.e. those pairings specified above, however it also contains pairings of "EWR" & "ATL" and "LGA" & "IAH" which I do not want.

I realise I can make a fake helper column (e.g. paste(dest, origin)) and select via %in% by that but I feel like there is a better approach, possibly some magic with setkey that currently escapes me?

N.B. my end use case will actually be using a combination of character vector as one column, and a numeric factor as the other, if that is important.

DaveRGP
  • 1,430
  • 15
  • 34
  • Apologies, what was the reason for marking as duplicate here? Though merge may offer a partial solution, your answers demonstrate there are more specific approaches not evident in the original question (`on`, `J`, multiple setkey) – DaveRGP Nov 16 '15 at 13:14

1 Answers1

5

You can try a binary join on the two columns which both efficient and implements the "and" logic

res <- setkey(flights, dest, origin)[J(subDT$sub_dest, subDT$sub_orig)]
res[, table(dest, origin)]
#      origin
# dest    EWR   LGA
#   ATL     0 10263
#   IAH  3973     0

Alternatively, with v1.9.6+ you could also do

res <- flights[subDT, on = c(dest = "sub_dest", origin = "sub_orig")]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    Or `merge(flights, subDT, by.x = c("origin", "dest"), by.y = c("sub_orig", "sub_dest"))`. – Roland Nov 16 '15 at 12:27
  • 1
    Yeah, it can be done with a `dplyr::right_join` too probably, but they after an efficient `data.table` solution I believe. – David Arenburg Nov 16 '15 at 12:28
  • `merge.data.table` should be rather efficient, no? – Roland Nov 16 '15 at 12:30
  • I'm not sure if it was optimized since, but I believe it is inferior to a binary join. – David Arenburg Nov 16 '15 at 12:31
  • 1
    It has been improved in 1.9.6. See the documentation. – Roland Nov 16 '15 at 12:32
  • 1
    From `help("merge.data.table")`: "In versions < v1.9.6, if the specified columns in by was not the key (or head of the key) of x or y, then a copy is first rekeyed prior to performing the merge. This was less performant and memory inefficient." – Roland Nov 16 '15 at 12:40
  • Yeah, you are right, found this in the readme too just now. – David Arenburg Nov 16 '15 at 12:42
  • Thanks for the discussion, I think I understand most of this, except for the part where you explicitly reference the `J` argument in the middle? – DaveRGP Nov 16 '15 at 13:12
  • Additionally, using the `on` method gives an error when including it for my actual use case: `Error in forderv(x, by = rightcols) : 'by' value -2147483648 out of range [1,7]` I'm pretty sure I have the right version, your example worked for me. The other method is fine though. Thank you. – DaveRGP Nov 16 '15 at 13:26
  • 1
    That error is usually happens when you are specifying the column names incorrectly. It works perfectly fine on the provided data sets. – David Arenburg Nov 16 '15 at 13:29
  • Exactly that, thank you again. – DaveRGP Nov 16 '15 at 13:35
  • 1
    Just to update on the mentioned `forderv` error message, it has been already improved in [#1376](https://github.com/Rdatatable/data.table/issues/1376). @DaveRGP – jangorecki Nov 16 '15 at 15:35
  • You'll need the devel version from GH though. – David Arenburg Nov 16 '15 at 16:25