
I build a factor whose levels include NA.

my_vec <- factor(c(NA,"a","b"),exclude=NULL)
levels(my_vec)
# [1] "a" "b" NA 

I change one of those levels.

levels(my_vec)[levels(my_vec) == "b"] <- "c"

NA disappears.

levels(my_vec)
# [1] "a" "c"

How can I keep it?


EDIT

@rawr gave a nice solution that works most of the time, including for my original example, but not for the one shown below. @Hack-R had a pragmatic option using addNA; I could make it work with that, but I'd rather have a fully general solution.

Here is the generalized issue:

my_vec <- factor(c(NA,"a","b1","b2"),levels = c("a",NA,"b1","b2"),exclude=NULL)
levels(my_vec)
# [1] "a"  NA   "b1" "b2"
levels(my_vec)[levels(my_vec) %in% c("b1","b2")] <- "c"
levels(my_vec)
# [1] "a" "c"      # NA disappeared

@rawr's solution:

my_vec <- factor(c(NA,"a","b1","b2"),levels = c("a",NA,"b1","b2"),exclude=NULL)
levels(my_vec)
# [1] "a"  NA   "b1" "b2"
attr(my_vec, 'levels')[levels(my_vec) %in% c("b1","b2")] <- "c"
levels(my_vec)
droplevels(my_vec)
# [1] "a" NA  "c" "c"   # "c" is duplicated

@Hack-R's solution:

my_vec <- factor(c(NA,"a","b1","b2"),levels = c("a",NA,"b1","b2"),exclude=NULL)
levels(my_vec)
# [1] "a"  NA   "b1" "b2"
levels(my_vec)[levels(my_vec) %in% c("b1","b2")] <- "c"
my_vec <- addNA(my_vec)
levels(my_vec)
# [1] "a" "c" NA     # NA is at the end

I want levels(my_vec) to be c("a", NA, "c"), with NA kept in its original position.

moodymudskipper
  • Does that really work? I just tried with the explicit statement `levels(my_vec) == "b" & !is.na(levels(my_vec))`, which also makes the NA disappear, even though the statement returns `FALSE TRUE FALSE` – Florian Jul 20 '17 at 13:46
  • `NA` is not a level, just a missing value. Just run `levels(as.factor(NA))` to see what I mean. – F. Privé Jul 20 '17 at 13:48
  • 2
    interesting.. also `?levels` says that your way is preferred, but `attr(my_vec, 'levels')[attr(my_vec, 'levels') == 'b'] <- 'c'` works as expected – rawr Jul 20 '17 at 13:49
  • maybe `levels<-` calls `factor` with `exclude = NA` – moodymudskipper Jul 20 '17 at 13:53
  • 1
    `attr(my_vec, 'levels')[levels(my_vec) == "b"] <- 'c'` works as well, that would solve the how to my question, now a why would be nice :) – moodymudskipper Jul 20 '17 at 13:55
  • 1
    To know that you'll have to dig up primitives : `function (x, value) .Primitive("levels<-")` : [have fun ;-)](https://github.com/jennybc/access-r-source) – Cath Jul 20 '17 at 13:56
  • 1
    so it is based on [this C function](https://github.com/wch/r-source/blob/f42ee5e7ecf89a245afd6619b46483f1e3594ab7/src/main/attrib.c#L1242-L1261) – Cath Jul 20 '17 at 14:00
  • 1
    One work-around is to give a descriptive label to the NA values. For example, `my_vec <- factor(c(NA,"a","b"), levels=c("missing", "a", "b"))` will work with `levels(my_vec)[levels(my_vec) == "b" & !is.na(levels(my_vec))] <- "c"` just fine and still won't affect functions like `is.na(my_vec)`. Or even use "NA" in the `levels=` argument and you are good to go – lmo Jul 20 '17 at 14:18
  • 1
    [Relevant](https://stackoverflow.com/questions/27195956/convert-na-into-a-factor-level) – Sotos Jul 20 '17 at 14:19
  • @Cath The label "missing" will be attached to NA values. For example, `levels(my_vec)[is.na(my_vec)]` will return "missing" with the code in the comment above and the label "missing" will persist with changes to other factor labels. – lmo Jul 20 '17 at 14:22
  • @lmo fair enough but then we're back to "we must know the position of NA if there is any" ;-), because with `vec <- factor(c("a","b", NA), levels=c("missing", "a", "b"))`, NA will be linked to "b" – Cath Jul 20 '17 at 14:24
  • @rawr your solution isn't strictly equivalent, this will give duplicated factors : `attr(my_vec, 'levels')[levels(my_vec) %in% c("b1","b2")] <- "c"` while my initial option would not – moodymudskipper Jul 20 '17 at 14:28
  • 1
    @Cath Yeah, I guess in that instance, alexis-laz, the comment in the link sotos posted would work, `vec <- factor(c("a","b", NA), levels=paste(c("a","b", NA)))`. – lmo Jul 20 '17 at 14:30
  • I didn't say it was equivalent. I did say yours was preferred didn't I? – rawr Jul 20 '17 at 14:36
  • you didn't say it, but I was hoping it would be :), now I'm looking how to make it work without duplicating factors. – moodymudskipper Jul 20 '17 at 14:37
  • duplicate levels used to be okay but has been deprecated, perhaps that is why yours is preferred since it throws a warning if it finds duplicated levels – rawr Jul 20 '17 at 14:41

2 Answers

1

You have to quote NA; otherwise R treats it as a missing value rather than an ordinary factor level. Factor levels sort alphabetically by default, but that's not always useful, so you can specify a different order by passing the levels in the order you want to factor().

require(plyr)
my_vec <- factor(c("NA","a","b1","b2"))
vec2 <- revalue(my_vec,c("b1"="c","b2"="c"))

#now reorder levels

my_vec2 <- factor(vec2, levels(vec2)[c(1,3,2)])

# Levels: a NA c
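The same idea can be sketched in base R without plyr. This is only an illustration, assuming that a quoted "NA" label is acceptable and that the level order is fixed up front through the levels argument, so no reordering step is needed afterwards.

```r
# Assumption: "NA" is an ordinary string label here, not a missing value.
my_vec <- factor(c("NA", "a", "b1", "b2"), levels = c("a", "NA", "b1", "b2"))

# Merge "b1" and "b2" into "c"; the quoted "NA" level keeps its position.
levels(my_vec)[levels(my_vec) %in% c("b1", "b2")] <- "c"

levels(my_vec)
# [1] "a"  "NA" "c"
```

Because "NA" is a plain string here, is.na(my_vec) is FALSE for every element, so any code that relies on real missing values would need adjusting.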
Mako212
  • If you quote it it's not NA anymore :) – moodymudskipper Jul 20 '17 at 17:39
  • @Moody_Mudskipper As far as you're concerned it is. If you want to store NA as a factor level, it shouldn't make any difference whether NA is a string, or a true value, in fact, it's probably better to have NA stored as a string so you don't have to pass a special rule to handle NAs for every operation. – Mako212 Jul 20 '17 at 17:45
  • It's a lot of replacing to do, some is.na to change etc, it's not so trivial. – moodymudskipper Jul 20 '17 at 17:49
  • + whenever I do something like replacing NA by "NA" I feel there's probably a much cleaner way, and I start to doubt NA factors can help, maybe using addNA just before needed is cleanest way after all. – moodymudskipper Jul 20 '17 at 17:52
0

I finally created a function that first replaces the NA value with a temp one (inspired by @lmo), then does the replacement I wanted the standard way, then puts NA back in its place using @rawr's suggestion.

my_vec <- factor(c(NA,"a","b1","b2"),levels = c("a",NA,"b1","b2"),exclude=NULL)
my_vec <- level_sub(my_vec,c("b1","b2"),"c")
my_vec
# [1] <NA> a    c    c
# Levels: a <NA> c

As a bonus, level_sub can be called with na_rep = NULL, which drops the NA level instead of preserving it, and it reads well in pipe chains :).

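level_sub <- function(x, from, to, na_rep = "NA") {
  # Temporarily give the NA level a placeholder label so that `levels<-` keeps it
  if (!is.null(na_rep)) levels(x)[is.na(levels(x))] <- na_rep
  # Rename (and merge) the requested levels the standard way
  levels(x)[levels(x) %in% from] <- to
  # Turn the placeholder back into a real NA level, bypassing `levels<-`
  if (!is.null(na_rep)) attr(x, "levels")[levels(x) == na_rep] <- NA
  x
}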
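For comparison, here is a minimal sketch of another route that avoids the temporary label: relabel the values through the integer codes and rebuild the factor with exclude = NULL, which is how the factor in the question got its NA level in the first place. The name level_sub2 is hypothetical and not from this thread. The sketch assumes an unordered factor with no other attributes worth preserving.

```r
# Hypothetical helper: rename/merge levels while keeping a real NA level in place.
level_sub2 <- function(x, from, to) {
  old <- levels(x)
  new <- old
  new[old %in% from] <- to   # NA %in% from is FALSE, so the NA level is untouched
  # Map each element to its new label, then rebuild; exclude = NULL keeps NA as a level
  factor(new[as.integer(x)], levels = unique(new), exclude = NULL)
}

my_vec <- factor(c(NA, "a", "b1", "b2"), levels = c("a", NA, "b1", "b2"), exclude = NULL)
res <- level_sub2(my_vec, c("b1", "b2"), "c")
levels(res)
# [1] "a" NA  "c"
```

Since the result goes through factor() rather than a direct attribute assignment, merged levels are deduplicated by unique() and the NA level stays where it was in the original order.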

Nevertheless, it seems that R really doesn't want you to have NA as a factor level.

levels(my_vec) <- c(NA,"a") behaves strangely, and it doesn't stop there. While subset keeps NA levels in your columns, rbind quietly removes them! I wouldn't be surprised if further investigation revealed that half of R's functions drop NA levels, making them very unsafe to work with...

moodymudskipper