Key issue: using `setattr` to change level names keeps unwanted duplicates.
I am cleaning some data in which several factor levels that should be identical appear as two or more distinct levels. (This error is mostly due to typos and file-encoding issues.) I have 153K factors, and about 5% need to be corrected.
Example
In the following example, the vector has three levels, two of which need to be collapsed into one.
incorrect <- factor(c("AOB", "QTX", "A_B")) # this is how the data were entered
correct <- factor(c("AOB", "QTX", "AOB")) # this is how the data *should* be
> incorrect
[1] AOB QTX A_B
Levels: A_B AOB QTX <~~ Note that "A_B" should be "AOB"
> correct
[1] AOB QTX AOB
Levels: AOB QTX
The vector is part of a `data.table`.
Everything works fine when using the `levels<-` function to change the level names.
However, when using `setattr`, the unwanted duplicates are preserved.
library(data.table)
mydt1 <- data.table(id=1:3, incorrect, key="id")
mydt2 <- data.table(id=1:3, incorrect, key="id")
# assigning levels, duplicate levels are dropped
levels(mydt1$incorrect) <- gsub("_", "O", levels(mydt1$incorrect))
# using setattr, duplicate levels are not dropped
setattr(mydt2$incorrect, "levels", gsub("_", "O", levels(mydt2$incorrect)))
# RESULTS
# Assigning levels
> mydt1$incorrect
[1] AOB QTX AOB
Levels: AOB QTX
# Using `setattr`
> mydt2$incorrect
[1] AOB QTX AOB
Levels: AOB AOB QTX <~~~ Notice the duplicate level
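To make the difference concrete, here is a minimal base-R sketch that looks at the integer codes underneath each factor. It uses `attr<-` as a stand-in for `setattr`, on the assumption that both simply overwrite the attribute without touching the codes. The `levels=` argument is passed explicitly only so that the level order does not depend on the locale's sort order.

```r
# Same data as above, with the level order fixed explicitly
incorrect  <- factor(c("AOB", "QTX", "A_B"), levels = c("A_B", "AOB", "QTX"))
new_levels <- gsub("_", "O", levels(incorrect))   # "AOB" "AOB" "QTX"

# levels<- merges the duplicated names and remaps the integer codes
via_levels <- incorrect
levels(via_levels) <- new_levels
as.integer(via_levels)    # 1 2 1
levels(via_levels)        # "AOB" "QTX"

# attr<- (assumed stand-in for setattr) only overwrites the attribute;
# the codes still refer to three separate level positions
via_attr <- incorrect
attr(via_attr, "levels") <- new_levels
as.integer(via_attr)      # 2 3 1
levels(via_attr)          # "AOB" "AOB" "QTX"
```

If this is what is happening, the first and third elements both print as AOB in the second case but are still stored under different codes (2 and 1).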
Any thoughts on why this happens, and/or any options to change this behavior (i.e., something like `..., droplevels=TRUE`)?
Thanks