9

Doing a merge between a populated data.table and another one that is empty introduces one NA row in the resulting data.table:

a = data.table(c=c(1,2),key='c')
b = data.table(c=3,key='c')
b=b[c!=3]
b
# Empty data.table (0 rows) of 1 col: c
merge(a,b,all=T)
#     c
# 1: NA
# 2:  1
# 3:  2

Why? I expected that it would return only the rows of data.table a, as it does with merge.data.frame:

> merge.data.frame(a,b,all=T,by='c')
#  c
#1 1
#2 2
vsalmendra
  • 577
  • 1
  • 6
  • 13
  • So you want to merge `a` with an empty data table `b`. Why you introduce `b` in such unusual way? Why not to use `b=data.table()` ? – user974514 Mar 22 '13 at 09:03
  • @user974514: I just wanted to reproduce the problem as it appeared in my code. Generally the `data.table`'s in my code are populated, but sometimes not, and in my case the tables are keyed and the merge naturally uses these keys. A simple `data.table()` does not reproduce exactly the problem I had. – vsalmendra Mar 22 '13 at 09:16
  • @user974514, that would give a NULL data.table (0 rows and columns) and there will be no "key" column. So, merge would not be possible. – Arun Mar 22 '13 at 09:24
  • @Arun, I don't think so. I did the same thing with a `data.frame` and the result was the expected one: a new `data.frame` with just the rows of the populated one. – vsalmendra Mar 22 '13 at 09:33

4 Answers4

7

The example in the question is far too simple to show the problem, hence the confusion and discussion. Using two one-column data.tables isn't enough to show what merge does!

Here's a better example :

> a = data.table(P=1:2,Q=3:4,key='P')
> b = data.table(P=2:3,R=5:6,key='P')
> a
   P Q
1: 1 3
2: 2 4
> b
   P R
1: 2 5
2: 3 6
> merge(a,b)  # correct
   P Q R
1: 2 4 5
> merge(a,b,all=TRUE)  # correct.  
   P  Q  R
1: 1  3 NA
2: 2  4  5
3: 3 NA  6
> merge(a,b[0],all=TRUE)  # incorrect result when y is empty, agreed
    P  Q  R
1: NA NA NA
2: NA NA NA
3:  1  3 NA
4:  2  4 NA
> merge.data.frame(a,b[0],all=TRUE)  # correct
  P Q  R
1 1 3 NA
2 2 4 NA

Ricardo got to the bottom of this and fixed it in v1.8.9. From NEWS :

merge no longer returns spurious NA row(s) when y is empty and all.y=TRUE (or all=TRUE), #2633. Thanks to Vinicius Almendra for reporting. Test added.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
2

all : logical; all = TRUE is shorthand to save setting both all.x = TRUE and all.y = TRUE.

all.x : logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have ’NA’s in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.

all.y : logical; analogous to all.x above.

This is taken from data.table documentation. For more, look at the description of the arguments for merge function there.

I think this answers your question.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • maybe you could try `ifelse(dim(b)[1]==0, merge(a,b,all.x=T), merge(a,b,all.y=T))` or something similar – Venkatramanan P.R. Mar 22 '13 at 10:23
  • 1
    This documentation is the same as of `merge.data.frame`, which behaves differently. I don't understand why a full outer join with an empty table should add an NA row. – vsalmendra Mar 22 '13 at 12:33
0

Given you define a and b in your way. The simple usage of rbind(a,b) would return only the rows of a.

However if you want to merge NULL data table b with some other non-empty data table a, there's different approach. I had similar problem when I had to merge different data tables within different loops. I used this workaround.

  #some loop that returns data.table named a
  #another loop starts
  if(all.equal(a,b<-data.table())==TRUE){
  b<-a
  next
  }
  merge(a,b,c("Factor1","Factor2"))

That helped me, maybe it will help you too.

user974514
  • 552
  • 1
  • 7
  • 19
  • Sorry, I don't quite get your if-statement. If `a` is a `data.table` of dimensions `0 by 0` (like `b`), then you assign `b` to `a`. Why? And you are not using `copy(a)`. This'll just assign by reference and not copy a to b. I don't get it. – Arun Mar 22 '13 at 10:13
  • Suppose you have a loop where you fill data.table `a`. So it's not empty. But at the end of the loop you want to merge it with other data table `b` which can be empty in first iteration, but becomes non-empty in second. That's what this workaround solve. – user974514 Mar 22 '13 at 10:16
  • The down vote is from me, if you could explain (or edit), I'd be happy to up-vote). – Arun Mar 22 '13 at 10:16
  • thanks for the reply. it makes sense now. but it doesn't answer the OP's question, as the data.table is **not** of 0 rows **and columns**, but just 0 rows **and 1 column**. – Arun Mar 22 '13 at 10:17
  • For OP's question, I recommended to use `rbind(a,b)`. The output will be only `a` rows just as OP demanded. – user974514 Mar 22 '13 at 10:20
  • @user97, thanks. `rbind` is not a solution, since it assumes both tables have the same columns. That exactly the reason for using `merge ` instead of rbind. – vsalmendra Mar 22 '13 at 12:40
0

That's to be expected, as for merge.data.frame all=T is a full outer join, so you get all keys of both tables see about merge

Community
  • 1
  • 1
statquant
  • 13,672
  • 21
  • 91
  • 162