57

What is the correct way to change the levels of a factor column in a data.table (note: not a data frame)?

  library(data.table)
  mydt <- data.table(id=1:6, value=as.factor(c("A", "A", "B", "B", "B", "C")), key="id")

  mydt[, levels(value)]
  [1] "A" "B" "C"

I am looking for something like:

mydt[, levels(value) <- c("X", "Y", "Z")]

But of course, the above line does not work.

    # Actual               # Expected result
    > mydt                  > mydt
       id value                id value
    1:  1     A             1:  1     X
    2:  2     A             2:  2     X
    3:  3     B             3:  3     Y
    4:  4     B             4:  4     Y
    5:  5     B             5:  5     Y
    6:  6     C             6:  6     Z
Ricardo Saporta
  • 2
    you can still set them the traditional way: `levels(mydt$value) <- c(...)`. This should be plenty fast unless you have many many levels. – Justin Jan 31 '13 at 21:01
  • 1
    I failed to try the obvious :) thanks! Put it as an answer, so that I can accept it? – Ricardo Saporta Jan 31 '13 at 21:07

5 Answers

74

You can still set them the traditional way:

levels(mydt$value) <- c(...)

This should be plenty fast unless mydt is very large, since the traditional syntax copies the entire object. You could also play the un-factoring and refactoring game... but no one likes that game anyway.

To change the levels by reference, with no copy of mydt:

setattr(mydt$value,"levels",c(...))

But be sure to assign a valid levels vector (type character, of sufficient length), otherwise you'll end up with an invalid factor. `levels<-` does some checking as well as copying; setattr does neither.
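
A minimal sketch of doing that checking by hand before calling setattr. The name `new_levels` and the particular checks are assumptions chosen for illustration, not something data.table requires or provides:

```r
library(data.table)

mydt <- data.table(id = 1:6,
                   value = factor(c("A", "A", "B", "B", "B", "C")),
                   key = "id")

# Hypothetical replacement vector, one new name per existing level
new_levels <- c("X", "Y", "Z")

# setattr() does no validation, so check the vector first:
# character type, no duplicated labels, and at least one name per existing level.
stopifnot(is.character(new_levels),
          !anyDuplicated(new_levels),
          length(new_levels) >= nlevels(mydt$value))

# Change the attribute in place; mydt itself is not copied.
setattr(mydt$value, "levels", new_levels)

levels(mydt$value)
```

If any check fails, stopifnot() raises an error before the factor is touched, so the column keeps its original levels.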

Matt Dowle
Justin
  • 1
    @MatthewDowle Thanks for chiming in. `setattr` is exactly what I was looking for. I have gone ahead and wrapped it in a function `changeLevels` with error checking: http://bit.ly/dt_changeLevels – Ricardo Saporta Feb 01 '13 at 18:36
  • 7
    @RicardoSaporta Looks great. Maybe I could add it to data.table? I'd call it `setlevels` and change the interface a little bit: `setlevels(DT$colname, newlevels)`, if that would be ok? People often ask for the `set*` functions to work on `data.frame` too, which they can do. – Matt Dowle Feb 01 '13 at 21:45
  • 6
    @MattDowle, did `setlevels()` get put in in the end? I can't find any other documentation on it. – DaveRGP Feb 03 '15 at 09:41
  • @MattDowle What if I want to use the same levels on several columns at once? Any shortcut to do it? – skan Jun 26 '17 at 20:20
  • setlevels doesn't exist in my data.table – skan Jun 26 '17 at 20:25
  • 2
    @skan Looks like `setlevels()` exists internally at C level but never got exposed. I just filed [FR#2219](https://github.com/Rdatatable/data.table/issues/2219) pointing here. If you'd like to add it pull request will be very welcome. In the meantime can use `setattr`. To assign the same levels to several columns I'm thinking that a `for` loop would be best, fast and most clear for future readers of your code; provided a `set*` function is used inside the loop to avoid copies. – Matt Dowle Jun 27 '17 at 19:00
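
The `for` loop suggested in the last comment could look roughly like this. It is only a sketch: the table `dt`, its columns `a` and `b`, and `new_levels` are made-up names, and it assumes every listed column has the same number of levels as the replacement vector:

```r
library(data.table)

# Hypothetical table with two factor columns sharing the same level set
dt <- data.table(a = factor(c("A", "B", "C")),
                 b = factor(c("C", "A", "B")))

new_levels <- c("X", "Y", "Z")

for (col in c("a", "b")) {
  # Guard: refuse to touch a column whose level count does not match
  stopifnot(is.factor(dt[[col]]),
            nlevels(dt[[col]]) == length(new_levels))
  # set* function inside the loop, so no copy of dt is made
  setattr(dt[[col]], "levels", new_levels)
}

dt
```

The levels are replaced by position, so column `b` (values C, A, B) ends up as Z, X, Y.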
9

I would rather go the traditional way of re-assignment to the factors

> mydt$value # This is what we had originally
[1] A A B B B C
Levels: A B C
> levels(mydt$value) # just checking the levels
[1] "A" "B" "C"
# Meat of the re-assignment
> levels(mydt$value)[levels(mydt$value)=="A"] <- "X"
> levels(mydt$value)[levels(mydt$value)=="B"] <- "Y"
> levels(mydt$value)[levels(mydt$value)=="C"] <- "Z"
> levels(mydt$value)
[1] "X" "Y" "Z"
> mydt # This is what we wanted
   id value
1:  1     X
2:  2     X
3:  3     Y
4:  4     Y
5:  5     Y
6:  6     Z

As you probably noticed, the meat of the re-assignment is very intuitive: it checks for the exact level (use grepl for fuzzy matches, regular expressions or the like).

`levels(mydt$value)[levels(mydt$value)=="A"] <- "X"` explicitly looks up the level named A and reassigns it to X (and so on). The advantage: you explicitly KNOW which level was relabeled to what.

I find renaming levels with `levels(mydt$value) <- c("X","Y","Z")` very non-intuitive, since it assigns the new names purely by position in the existing levels vector (so the order really matters).

PS: If there are too many levels to write out by hand, use a looping construct.
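
One way such a loop might look, as a sketch only. The named vector `lookup` (old level as the name, new level as the value) is a hypothetical helper introduced here, not part of the original answer:

```r
library(data.table)

mydt <- data.table(id = 1:6,
                   value = factor(c("A", "A", "B", "B", "B", "C")),
                   key = "id")

# Hypothetical lookup table: names are the old levels, values the new ones
lookup <- c(A = "X", B = "Y", C = "Z")

for (old in names(lookup)) {
  # Same explicit match-by-name as the single-line version above
  levels(mydt$value)[levels(mydt$value) == old] <- lookup[[old]]
}

levels(mydt$value)
```

This assumes no new name collides with an old name that is still waiting to be renamed (for example A to B followed by B to C), since later iterations would then match the freshly renamed level as well. Each iteration still goes through `levels<-`.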

dpel
ekta
  • 5
    But each line of this meat will copy the _whole_ of `mydt`. If `mydt` is 20GB in RAM, that's 60GB it'll churn through. `data.table` is used for memory efficiency as well as its syntax. Justin's answer _makes no copy at all_ of the 20GB, it just changes the levels directly in-place. The answer to the valid concern expressed here is to wrap that nice logic up into the new function `setlevels` which is similar in spirit to the safety, robustness and intuitiveness of `setnames`. – Matt Dowle Jan 15 '14 at 12:26
  • @MattDowle thank you for mentioning that. Could you please mention which of the levels(mydt$value)[levels(mydt$value)=="A"] <- "X" is causing this high churn ? Is it the three passes(for 3 searches in levels) that this assignment would have to make causing this (un-necessary) bottleneck ? I still use the mydt <- data.table(id=1:6, value=as.factor(c("A", "A", "B", "B", "B", "C")), key="id"), but touch upon ONLY mydt$value - appreciate you elaborating on this ? – ekta Jan 15 '14 at 14:52
  • 3
    Even though you touch only a small part of `mydt`, R will copy the whole of `mydt` when you use `<-`. In this case assigning to `levels` goes via the `levels<-` function and that copies the whole of `mydt`. Hence in `data.table` we provide `:=` and `set*` functions to assign by reference to change only the small parts of `mydt` that you want to change, with no copy. `setattr` is one of the `set*` functions. – Matt Dowle Jan 15 '14 at 16:03
  • 1
    @MattDowle Thank you so much for that elaborate explanation + 1 on both your notes. I now also "exactly" understand Justin's comment. – ekta Jan 15 '14 at 16:13
  • I have been looking for this line forever LOL levels(mydt$value)[levels(mydt$value)=="A"] <- "X" Thanks – Annalisa Sep 20 '17 at 21:04
5

You can also rename and add to your levels using a related approach, which can be very handy, especially if you are making a plot that needs more informative labels in a particular order (as opposed to the default):

f <- factor(c("a","b"))
levels(f) <- list(C = "C", D = "a", B = "b")

(modified from ?levels)
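
To make the effect of the list form concrete, here is the same snippet with the results inspected. In the list, each name is a new level and each value is the old level it replaces; a name with no matching old level (here C) is simply added as an empty level:

```r
f <- factor(c("a", "b"))

# New level name = old level it replaces; "C" matches nothing, so it is added
levels(f) <- list(C = "C", D = "a", B = "b")

levels(f)        # "C" "D" "B", in the order given in the list
as.character(f)  # "D" "B"
table(f)         # C has a count of 0
```

The order of the list controls the order of the levels, which is what makes this handy for plot labels.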

Bryan Hanson
  • 1
    +1 Strictly speaking though the question was about how to change levels _when that factor is a column of a `data.table`_. `data.table` has built in features to allow adding and renaming factors, by reference too to avoid copying the entire object for speed. `:=` on a factor column will auto add the RHS as a level if it isn't already there. And `setattr` can be used to change levels of factor columns by reference (no copy). – Matt Dowle Feb 01 '13 at 08:55
  • @MatthewDowle Ah, you know I didn't even catch `data.table` vs `data.frame`! Thanks for pointing that out. Most of my data sets are only modest in size but I am aware of `data.table` and it has some nice features I know I can use. – Bryan Hanson Feb 01 '13 at 12:20
1

This is safer than Matt Dowle's suggestion (because it uses the checks skipped by setattr) but won't copy the entire data.table. It will replace the entire column vector (whereas Matt's solution only replaces the attributes of the column vector), but that seems like an acceptable trade-off in order to reduce the risk of messing up the factor object.

mydt[, value:=`levels<-`(value, c("X", "Y", "Z"))]
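
A short sketch of what those checks buy you. The too-short vector `c("P", "Q")` is just an illustrative bad input, and tryCatch is only used here to capture the error message:

```r
library(data.table)

mydt <- data.table(id = 1:6,
                   value = factor(c("A", "A", "B", "B", "B", "C")),
                   key = "id")

# Valid replacement: three new names for three existing levels
mydt[, value := `levels<-`(value, c("X", "Y", "Z"))]

# Invalid replacement: only two names for three levels.
# `levels<-` raises an error before := assigns anything.
res <- tryCatch(
  mydt[, value := `levels<-`(value, c("P", "Q"))],
  error = function(e) conditionMessage(e)
)

res                 # the error message from `levels<-`
levels(mydt$value)  # still the levels from the valid call
```

Because the error is raised while evaluating the right-hand side, the column is left as it was after the first, valid call.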
Michael
-1

Simplest way to reset a column's levels (this rebuilds the levels from the values present in the column; it does not rename them):

dat$colname <- as.factor(as.vector(dat$colname))
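
A small illustration of what that line does, using a made-up data frame `dat` whose factor has a custom level order and one unused level:

```r
# Hypothetical data: levels in custom order, plus a level that never occurs
dat <- data.frame(colname = factor(c("A", "B", "C"),
                                   levels = c("C", "B", "A", "unused")))

levels(dat$colname)  # "C" "B" "A" "unused"

# Round-trip through a plain vector: levels are recomputed from the values
dat$colname <- as.factor(as.vector(dat$colname))

levels(dat$colname)  # "A" "B" "C"
```

The values themselves are unchanged; only the custom order and the unused level are lost, so this is closer to dropping and re-sorting levels than to renaming A, B, C as X, Y, Z.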