
Key issue: using setattr to change level names keeps unwanted duplicates.

I am cleaning some data where I have several factor levels, all of which are the same, appearing as two or more distinct levels. (This error is due mostly to typos and file encoding issues.) I have 153K factors, and about 5% need to be corrected.

Example

In the following example, the vector has three levels, two of which need to be collapsed into one.

  incorrect <- factor(c("AOB", "QTX", "A_B"))   # this is how the data were entered
  correct   <- factor(c("AOB", "QTX", "AOB"))   # this is how the data *should* be

  > incorrect
  [1] AOB QTX A_B
  Levels: A_B AOB QTX   <~~ Note that "A_B" should be "AOB"

  > correct
  [1] AOB QTX AOB
  Levels: AOB QTX

The vector is part of a data.table.
Everything works fine when using the levels<- function to change the level names.
However, if using setattr, then unwanted duplicates are preserved.

library(data.table)

mydt1 <- data.table(id=1:3, incorrect, key="id")
mydt2 <- data.table(id=1:3, incorrect, key="id")



# assigning levels, duplicate levels are dropped
levels(mydt1$incorrect) <- gsub("_", "O", levels(mydt1$incorrect))

# using setattr, duplicate levels are not dropped
setattr(mydt2$incorrect, "levels", gsub("_", "O", levels(mydt2$incorrect)))

                # RESULTS
# Assigning Levels       # Using `setattr`
> mydt1$incorrect        >     mydt2$incorrect
[1] AOB QTX AOB          [1] AOB QTX AOB
Levels: AOB QTX          Levels: AOB AOB QTX   <~~~ Notice the duplicate level

Any thoughts on why this happens, and/or any options to change this behavior (e.g. ..., droplevels=TRUE)? Thanks.

Ricardo Saporta
  • Hi. `levels<-()` is clearly doing more than simply overwriting the character string representation of the levels! This might be a case in which it's just better to avoid using `setattr()`. (And I guess `setlevels` could also become a low-priority feature request.) – Josh O'Brien Feb 07 '13 at 18:02
  • ?levels states `Note that for a factor, replacing the levels via levels(x) <- value is not the same as (and is preferred to) attr(x, "levels") <- value.` – mnel Feb 07 '13 at 22:48
  • @mnel, thanks for that reference. It appears that this is an issue with `setattr` and nothing to do with data.table – Ricardo Saporta Feb 08 '13 at 04:50

1 Answer


setattr is a low-level, brute-force way to change attributes by reference. It doesn't know that the "levels" attribute is special. levels<- has more functionality inside it, but I suspect you may have found that levels(DT$col) <- newlevels will copy the whole of DT (base <-), hence for speed you looked to setattr.

I wouldn't say incorrect btw. It's a correct and valid factor, but just happens to have duplicate levels.

To drop the duplicate levels, I think (untested):

mydt1[,factorCol:=factor(factorCol)]

should do it. It's possible to go faster than that by finding which levels you've changed, changing the integers to point to the first of each set of duplicates, and then removing the dups from the levels. The call to factor() basically starts from scratch (i.e. coerces all of the factor to character and rematches).
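A rough sketch of that faster idea, in base R only. The helper name collapse_levels is made up for illustration and is not part of data.table or base R. It assumes the factor still has its original, valid levels and that new_levels holds the corrected names in the same order; it remaps the integer codes instead of going through character. It also drops the "ordered" class, if there was one.

```r
# Hypothetical helper (not a data.table or base R function): merge levels
# that end up with the same name, by remapping the integer codes directly.
collapse_levels <- function(f, new_levels) {
  stopifnot(is.factor(f), length(new_levels) == nlevels(f))
  uniq  <- unique(new_levels)        # keep the first occurrence of each name
  remap <- match(new_levels, uniq)   # old level index -> new level index
  codes <- remap[as.integer(f)]      # NA codes stay NA
  structure(codes, levels = uniq, class = "factor")
}

incorrect <- factor(c("AOB", "QTX", "A_B"))
fixed <- collapse_levels(incorrect, gsub("_", "O", levels(incorrect)))
print(fixed)
```

In a data.table, the result could be assigned with `:=`, for example `DT[, col := collapse_levels(col, gsub("_", "O", levels(col)))]` (DT and col are placeholders), which replaces that one column without copying the rest of the table.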

Matt Dowle
  • Thanks @MD, I have been using `droplevels()` to remove the duplicates, but it gives the following error, as does the method above using `factor()`: `duplicated levels will not be allowed in factors anymore` – Ricardo Saporta Feb 07 '13 at 18:06
  • PS "incorrect" is in reference to the data (i.e. dirty) and not the functionality (i.e. pretty amazing, regardless of this minor nuance) – Ricardo Saporta Feb 07 '13 at 18:08
  • I haven't seen that error message before I'm afraid. "anymore" seems to suggest it's something new. Sorry. – Matt Dowle Feb 07 '13 at 18:11
  • Apparently the issue is with `setattr`. If I apply the same process to a basic vector (ie, correct the levels via `setattr` then call `droplevels`) I get the same warning. – Ricardo Saporta Feb 08 '13 at 04:53