0

Assume two data sets that are linked by common columns, let's call them "recipient" and "donor". One can combine the data sets using merge (R) or LEFT OUTER JOIN (SQL). Often, in my applications, there is supposed to be one (and exactly one) matching donor record for each recipient record. If this is violated, there are either missing values or duplicate recipient records in the result.

Question

Is there a special name for this merge/join operation that enforces this additional existence+uniqueness constraint? I want the merge/join operation to fail if this constraint is violated, since in this case something's wrong either with the data or with some earlier code that generated the data.

Sketch of R function

enrich <- function(x, y, ...) {
  xd <- deparse(substitute(x))
  yd <- deparse(substitute(y))
  stopifnot(!("..watchdog.." %in% colnames(y)))
  y$..watchdog.. <- TRUE
  res <- merge(x, y, sort = FALSE, all.x = TRUE, ...)
  if (any(is.na(res$..watchdog..)))
    stop("At least one row in x=", xd, " has no corresponding row in y=", yd)
  if (nrow(res) > nrow(x))
    stop("At least one row in x=", xd, " has duplicate corresponding rows in y=", yd)
  res$..watchdog.. <- NULL
  res
}

Results of R function

rec <- data.frame(id=letters[1:3], a=1)
don <- data.frame(id=letters[1:3], b=2)
enrich(rec, don)
##   id a b
## 1  a 1 2
## 2  b 1 2
## 3  c 1 2
enrich(rec, don[1:2,])
## Error in enrich(rec, don[1:2, ]): At least one row in x=rec has no corresponding row in y=don[1:2, ]
enrich(rec, rbind(don, don))
## Error in enrich(rec, rbind(don, don)): At least one row in x=rec has duplicate corresponding rows in y=rbind(don, don)
krlmlr
  • 25,056
  • 14
  • 120
  • 217
  • 2
    in `data.table` joins this type of merge will automatically result in an error unless you will specify `allow.cartesian = TRUE` and I would also recommend using these in general as these are the most optimized type of joins exists in R. – David Arenburg Jan 14 '15 at 09:57
  • @DavidArenburg: I'm not sold on `data.table`, but there's [an issue that seems to be related](https://github.com/hadley/dplyr/issues/1056) in the `dplyr` bug tracker. – krlmlr Apr 12 '15 at 11:49
  • I'm not sure how my comment related to `dplyr` but it seems like `data.table` changed its policy regarding `allow.cartesian = TRUE` (in the dev version >= 1.9.5) and now it allows cartesian left joins as long as *both* data sets don't have duplicated keys – David Arenburg Apr 12 '15 at 12:10
  • @DavidArenburg: Thanks. `data.table` is one of the backends for `dplyr`. I had a tough time adapting myself to `data.table`---despite its impressive performance---and eventually gave up, but I'll be using `dplyr`. -- The NEWS of `data.table` at GitHub mention that `allow.cartesian` is ignored in some special cases, but if I know that my key is unique there's no point in checking uniqueness. Which, by the way, answers my question ;-) (at least part of it) – krlmlr Apr 12 '15 at 12:17
  • I'm not sure what you mean by "*gave up*" as you can use exactly the same syntax for `data.table` joins as base R. Either Way, I'm not sure I understand your question in the first place. Are you looking for "*a special name for this merge/join operation that enforces this additional existence+uniqueness constraint*" and you satisfied with *your* solution or you actully looking for some type of a practical solution? – David Arenburg Apr 12 '15 at 12:24
  • @DavidArenburg: I found the interface of `data.table` impractical for my use cases, and on a few occasions the reference semantics gave me headaches. (I quicky got comfy with R's quasi-"call by value" semantics -- but `data.table` sometimes unexpectedly modifies in-place.) I gave a practical solution in my question. I'm sure there are other solutions, but in order to find them I thought it would be easier if I knew the *name* of the operation I'm looking for. The uniqueness constraint probably can be translated as "left join over unique keys", but what about the existence constraint? – krlmlr Apr 12 '15 at 12:33

0 Answers0