I have several data.tables that I would like to combine with rbindlist. The tables contain factors with (possibly unused) levels. In that case rbindlist(...) behaves differently from do.call(rbind, ...):

    dt1 <- data.table(x=factor(c("a", "b"), levels=letters))

    rbindlist(list(dt1, dt1))[,x]
    ## [1] a b a b
    ## Levels: a b

    do.call(rbind, list(dt1, dt1))[,x]
    ## [1] a b a b
    ## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

If I want to keep the levels, do I have to resort to rbind, or is there a data.table way?

zx8754
shadow
  • You can always grab the levels before you call `rbindlist` and then put em back (see [here](http://stackoverflow.com/questions/14634964/how-does-one-change-the-levels-of-a-factor-column-in-a-data-table)). But I think you're right there should be a `droplevels=TRUE` argument. – Justin Oct 18 '13 at 13:54

2 Answers

I guess rbindlist is faster because it skips the checks that do.call(rbind.data.frame, ...) performs.

Why not set the levels after binding?

    Dt <- rbindlist(list(dt1, dt1)) 
    setattr(Dt$x,"levels",letters)  ## set attribute without a copy

From ?setattr:

setattr() is useful in many situations to set attributes by reference and can be used on any object or part of an object, not just data.tables.
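
When the levels are not known in advance, one option is to collect the union of the inputs' levels first and rebuild the factor after binding. The following is only a sketch under assumptions: `dt2`, `tables` and `all_levels` are illustrative names that do not appear in the question, and the factor is rebuilt by label with `factor()` rather than with `setattr()`, because overwriting the `levels` attribute only relabels the existing integer codes and is safe only when those codes already line up with the new level order.

```r
library(data.table)

# Two tables whose factor columns carry different (partly unused) levels.
# dt2 is a made-up example table, not part of the original question.
dt1 <- data.table(x = factor(c("a", "b"), levels = letters[1:4]))
dt2 <- data.table(x = factor(c("c", "e"), levels = c("c", "e", "f")))
tables <- list(dt1, dt2)

# Union of the levels across all inputs, in order of first appearance
all_levels <- unique(unlist(lapply(tables, function(dt) levels(dt[["x"]]))))

Dt <- rbindlist(tables)

# Rebuild the factor by label, so the result does not depend on how
# rbindlist numbered the codes or which levels it kept
Dt[, x := factor(x, levels = all_levels)]

levels(Dt$x)
```

This copies the column once, which the `setattr()` approach avoids, so it trades a little speed for not depending on the internal code order.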

agstudy
  • Thanks. But where I actually use this, I do not know the levels. If I have 2 `data.tables`, I already need `unique(unlist(lapply(list(dt1, dt2), function(dt) levels(dt[,x]))))` to find the levels, and I am afraid then the `do.call(rbind, ...)` version may be faster. – shadow Oct 18 '13 at 14:36
  • @shadow I'm guessing that'll only be slower if you have a very large number of rows with an even larger number of factor levels, in which case I'd ask - what's the point of having factors? I'd only use factors if I had a small number of elements that are used over and over again in the data with a large degree of repetition – eddi Oct 18 '13 at 15:45
  • fwiw, if you use the internal `c.factor` function you can speed that last step up quite a bit: `do.call(data.table:::c.factor, lapply(list(dt1, dt2), "[[", 'x'))` in your scenario – eddi Oct 18 '13 at 16:00


Thanks for pointing out this problem. As of version 1.8.11 it has been fixed:

    dt1 <- data.table(x=factor(c("a", "b"), levels=letters))

    rbindlist(list(dt1, dt1))[,x]
    #[1] a b a b
    #Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
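
To check which behaviour an installation has, and to get the old result back on purpose, something along these lines should work. It is a sketch that assumes data.table 1.8.11 or later is installed; `kept` is an illustrative variable name, and `droplevels()` is the base R function, not a `rbindlist` argument.

```r
library(data.table)

# Assumption: the installed data.table is at least 1.8.11
stopifnot(packageVersion("data.table") >= "1.8.11")

dt1 <- data.table(x = factor(c("a", "b"), levels = letters))
Dt <- rbindlist(list(dt1, dt1))

# Number of levels after binding; unused levels are expected to survive
kept <- nlevels(Dt$x)

# Drop the unused levels explicitly if the shorter level set is wanted
Dt[, x := droplevels(x)]
levels(Dt$x)
```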
eddi