merge.data.table with all=True introduces NA row. Is this correct?

Question

Doing a merge between a populated data.table and another one that is empty introduces one NA row in the resulting data.table:

a = data.table(c=c(1,2),key='c')
b = data.table(c=3,key='c')
b=b[c!=3]
b
# Empty data.table (0 rows) of 1 col: c
merge(a,b,all=T)
#     c
# 1: NA
# 2:  1
# 3:  2

Why? I expected that it would return only the rows of data.table a, as it does with merge.data.frame:

> merge.data.frame(a,b,all=T,by='c')
#  c
#1 1
#2 2

So you want to merge `a` with an empty data table `b`. Why you introduce `b` in such unusual way? Why not to use `b=data.table()` ? — user974514, Mar 22 '13 at 09:03
@user974514: I just wanted to reproduce the problem as it appeared in my code. Generally the `data.table`'s in my code are populated, but sometimes not, and in my case the tables are keyed and the merge naturally uses these keys. A simple `data.table()` does not reproduce exactly the problem I had. — vsalmendra, Mar 22 '13 at 09:16
@user974514, that would give a NULL data.table (0 rows and columns) and there will be no "key" column. So, merge would not be possible. — Arun, Mar 22 '13 at 09:24
@Arun, I don't think so. I did the same thing with a `data.frame` and the result was the expected one: a new `data.frame` with just the rows of the populated one. — vsalmendra, Mar 22 '13 at 09:33

score 7 · Accepted Answer · answered Aug 31 '13 at 00:39

The example in the question is far too simple to show the problem, hence the confusion and discussion. Using two one-column data.tables isn't enough to show what merge does!

Here's a better example :

> a = data.table(P=1:2,Q=3:4,key='P')
> b = data.table(P=2:3,R=5:6,key='P')
> a
   P Q
1: 1 3
2: 2 4
> b
   P R
1: 2 5
2: 3 6
> merge(a,b)  # correct
   P Q R
1: 2 4 5
> merge(a,b,all=TRUE)  # correct.  
   P  Q  R
1: 1  3 NA
2: 2  4  5
3: 3 NA  6
> merge(a,b[0],all=TRUE)  # incorrect result when y is empty, agreed
    P  Q  R
1: NA NA NA
2: NA NA NA
3:  1  3 NA
4:  2  4 NA
> merge.data.frame(a,b[0],all=TRUE)  # correct
  P Q  R
1 1 3 NA
2 2 4 NA

Ricardo got to the bottom of this and fixed it in v1.8.9. From NEWS :

merge no longer returns spurious NA row(s) when y is empty and all.y=TRUE (or all=TRUE), #2633. Thanks to Vinicius Almendra for reporting. Test added.

score 2 · Answer 2 · edited Mar 22 '13 at 10:19

2

all : logical; all = TRUE is shorthand to save setting both all.x = TRUE and all.y = TRUE.

all.x : logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have ’NA’s in those columns that are usually ﬁlled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.

all.y : logical; analogous to all.x above.

This is taken from data.table documentation. For more, look at the description of the arguments for merge function there.

I think this answers your question.

edited Mar 22 '13 at 10:19

Arun

116,683
26
284
387

answered Mar 22 '13 at 10:01

Venkatramanan P.R.

184
3

maybe you could try `ifelse(dim(b)[1]==0, merge(a,b,all.x=T), merge(a,b,all.y=T))` or something similar – Venkatramanan P.R. Mar 22 '13 at 10:23
1

This documentation is the same as of `merge.data.frame`, which behaves differently. I don't understand why a full outer join with an empty table should add an NA row. – vsalmendra Mar 22 '13 at 12:33

user974514 · Answer 3 · 2013-03-22T10:22:40.510

0

Given you define a and b in your way. The simple usage of rbind(a,b) would return only the rows of a.

However if you want to merge NULL data table b with some other non-empty data table a, there's different approach. I had similar problem when I had to merge different data tables within different loops. I used this workaround.

  #some loop that returns data.table named a
  #another loop starts
  if(all.equal(a,b<-data.table())==TRUE){
  b<-a
  next
  }
  merge(a,b,c("Factor1","Factor2"))

That helped me, maybe it will help you too.

edited Mar 22 '13 at 10:22

answered Mar 22 '13 at 09:59

user974514

552
1
7
19

Sorry, I don't quite get your if-statement. If `a` is a `data.table` of dimensions `0 by 0` (like `b`), then you assign `b` to `a`. Why? And you are not using `copy(a)`. This'll just assign by reference and not copy a to b. I don't get it. – Arun Mar 22 '13 at 10:13
Suppose you have a loop where you fill data.table `a`. So it's not empty. But at the end of the loop you want to merge it with other data table `b` which can be empty in first iteration, but becomes non-empty in second. That's what this workaround solve. – user974514 Mar 22 '13 at 10:16
The down vote is from me, if you could explain (or edit), I'd be happy to up-vote). – Arun Mar 22 '13 at 10:16
thanks for the reply. it makes sense now. but it doesn't answer the OP's question, as the data.table is **not** of 0 rows **and columns**, but just 0 rows **and 1 column**. – Arun Mar 22 '13 at 10:17
For OP's question, I recommended to use `rbind(a,b)`. The output will be only `a` rows just as OP demanded. – user974514 Mar 22 '13 at 10:20
@user97, thanks. `rbind` is not a solution, since it assumes both tables have the same columns. That exactly the reason for using `merge ` instead of rbind. – vsalmendra Mar 22 '13 at 12:40

score 0 · Answer 4 · edited May 23 '17 at 12:14

0

That's to be expected, as for merge.data.frame all=T is a full outer join, so you get all keys of both tables see about merge

edited May 23 '17 at 12:14

Community

1
1

answered Mar 22 '13 at 10:41

statquant

13,672
21
91
162

Table `b` has an empty key (no rows), which is different from a key with value NA. – vsalmendra Mar 22 '13 at 12:39
2

You might have found a bug, as this is a very unusual case, can you post on data-table list ? – statquant Mar 22 '13 at 12:52
Thanks. I have just submitted it and another bug I found, also related to empty data.table's... – vsalmendra Mar 22 '13 at 13:25
table b does *not* have an empty key. `tables()` shows that the key is set to column `c`. – Arun Mar 23 '13 at 00:06
not sure why we still speak about it, `data.table` should replicate data.frame behaviour, it's a bug let's close ad move on... – statquant Mar 23 '13 at 00:55

merge.data.table with all=True introduces NA row. Is this correct?

4 Answers4