
I have a data frame with numeric and ordered factor columns. It contains a lot of NA values, and no level is assigned to them. I tried changing NA to "No Answer", but the levels of the factor columns don't include that level. Here is how I started, though I don't know how to finish it elegantly:

addNoAnswer = function(df) {
   factorOrNot = sapply(df, is.factor)
   levelsList = lapply(df[, factorOrNot], levels)
   levelsList = lapply(levelsList, function(x) c(x, "No Answer"))
   ...

Is there a way to directly apply new levels to factor columns, for example, something like this:

df[, factorOrNot] = lapply(df[, factorOrNot], factor, levelsList)

Of course, this doesn't work correctly.

I want the order of the levels preserved and the "No Answer" level added in the last position.

zx8754
enedene

5 Answers


The levels function supports replacement through a levels(x) <- value call, so it's easy to add new levels:

f1 <- factor(c("a", "a", NA, NA, "b", NA, "a", "c", "a", "c", "b"))
str(f1)
 Factor w/ 3 levels "a","b","c": 1 1 NA NA 2 NA 1 3 1 3 ...
levels(f1) <- c(levels(f1),"No Answer")
f1[is.na(f1)] <- "No Answer"
str(f1)
 Factor w/ 4 levels "a","b","c","No Answer": 1 1 4 4 2 4 1 3 1 3 ...
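Since the question mentions ordered factors, here is a minimal sketch of the same two steps on an ordered factor. The factor `sat` and its values are made up for illustration. It suggests that `levels<-` keeps the ordered class and that the appended level ends up last:

```r
# Hypothetical ordered factor with missing answers
sat <- factor(c("low", NA, "high", "mid", NA),
              levels = c("low", "mid", "high"), ordered = TRUE)

levels(sat) <- c(levels(sat), "No Answer")  # append the new level at the end
sat[is.na(sat)] <- "No Answer"              # NAs can now take that level

is.ordered(sat)  # TRUE: the ordered class is kept
levels(sat)      # "low" "mid" "high" "No Answer"
```

Note that "No Answer" then sorts as the highest level in comparisons, which may or may not be what you want.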

You can then loop over all the columns of a data.frame:

f1 <- factor(c("a", "a", NA, NA, "b", NA, "a", "c", "a", "c", "b"))
f2 <- factor(c("c", NA, "b", NA, "b", NA, "c" ,"a", "d", "a", "b"))
f3 <- factor(c(NA, "b", NA, "b", NA, NA, "c", NA, "d" , "e", "a"))
df1 <- data.frame(f1,n1=1:11,f2,f3)

str(df1)
  'data.frame':   11 obs. of  4 variables:
  $ f1: Factor w/ 3 levels "a","b","c": 1 1 NA NA 2 NA 1 3 1 3 ...
  $ n1: int  1 2 3 4 5 6 7 8 9 10 ...
  $ f2: Factor w/ 4 levels "a","b","c","d": 3 NA 2 NA 2 NA 3 1 4 1 ...
  $ f3: Factor w/ 5 levels "a","b","c","d",..: NA 2 NA 2 NA NA 3 NA 4 5 ...    

for (i in which(sapply(df1, is.factor))) {
  levels(df1[[i]]) <- c(levels(df1[[i]]), "No Answer")
  df1[[i]][is.na(df1[[i]])] <- "No Answer"  # factor columns only, so numeric NAs stay NA
}

str(df1)
 'data.frame':   11 obs. of  4 variables:
  $ f1: Factor w/ 4 levels "a","b","c","No Answer": 1 1 4 4 2 4 1 3 1 3 ...
  $ n1: int  1 2 3 4 5 6 7 8 9 10 ...
  $ f2: Factor w/ 5 levels "a","b","c","d",..: 3 5 2 5 2 5 3 1 4 1 ...
  $ f3: Factor w/ 6 levels "a","b","c","d",..: 6 2 6 2 6 6 3 6 4 5 ...
Bastien

You could define a function that adds the level to a factor and returns anything else unchanged:

addNoAnswer <- function(x){
  if(is.factor(x)) return(factor(x, levels=c(levels(x), "No Answer")))
  return(x)
}

Then you just lapply this function over your columns:

df <- as.data.frame(lapply(df, addNoAnswer))

That should return what you want.

ilir
  • A small suggestion to make this function more generic. I've needed to add a new level to a given factor a number of times (e.g., when merging datasets), so others might be in the same situation: `addLevel <- function(x, newlevel=NULL){ if(is.factor(x)) return(factor(x, levels=c(levels(x), newlevel))); return(x) }` – msoftrain Aug 22 '14 at 15:47
  • It's probably better to do something like `df[] <- lapply(df, addNoAnswer)` instead (haven't tested it with your function though). – David Arenburg Jun 06 '17 at 11:38
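The `df[] <- lapply(...)` form suggested in the comment can be sketched as follows. This is an untested-in-production illustration: the helper name `fillNoAnswer`, its `label` argument, and the example data are all assumptions, not part of the answer. Unlike the answer's function, it also fills the NAs, and it uses `union()` so a repeated call does not duplicate the level:

```r
# Hypothetical helper: append the level and fill NAs, for factors only
fillNoAnswer <- function(x, label = "No Answer") {
  if (!is.factor(x)) return(x)                     # numeric etc. pass through untouched
  x <- factor(x, levels = union(levels(x), label)) # existing order kept, label last
  x[is.na(x)] <- label
  x
}

# Made-up data: one numeric column and one ordered factor, both with NAs
df <- data.frame(score = c(1.5, NA, 3))
df$rating <- factor(c("low", NA, "high"),
                    levels = c("low", "high"), ordered = TRUE)

df[] <- lapply(df, fillNoAnswer)  # df[] keeps the data.frame structure and names
str(df)
```

Assigning into `df[]` avoids the round trip through `as.data.frame()`, which can alter column names or types.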

This may not directly address your specific scenario, but it is a simple, general way to add a level:

levels(df$column) <- c(levels(df$column), newFactorLevel)
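As a rough illustration of what that line does, using a made-up factor `f`: the appended level exists but is unused until a value is assigned to it.

```r
f <- factor(c("a", "b", "a"))    # made-up example data
levels(f) <- c(levels(f), "c")   # "c" is now a valid level, though unused
table(f)                         # "c" appears with a count of 0
f[2] <- "c"                      # assignment now succeeds instead of producing NA
```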
Michael L
  • Part of the value of this answer is that it generalizes cleanly and easily to cases beyond the original while still answering the original question well. – EngrStudent Aug 26 '21 at 13:50

Since this question was last answered, this has become possible using fct_explicit_na() from the forcats package. Here is the example given in the documentation.

f1 <- factor(c("a", "a", NA, NA, "a", "b", NA, "c", "a", "c", "b"))
table(f1)

# f1
# a b c 
# 4 2 2 

f2 <- forcats::fct_explicit_na(f1)
table(f2)

# f2
#     a         b         c (Missing) 
#     4         2         2         3 

The default value is (Missing), but this can be changed via the na_level argument.
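A minimal sketch of the na_level argument applied to the question's label, with made-up data. It assumes forcats is installed and is skipped otherwise. As far as I know, forcats 1.0.0 and later deprecate fct_explicit_na() in favour of fct_na_value_to_level(), so the call may emit a deprecation warning:

```r
f1 <- factor(c("a", "a", NA, "b", NA))  # made-up example data

# Guarded so the sketch is skipped when forcats is not installed
if (requireNamespace("forcats", quietly = TRUE)) {
  # na_level sets the label used for missing values (default "(Missing)")
  f2 <- forcats::fct_explicit_na(f1, na_level = "No Answer")
  print(table(f2))
}
```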

Uwe
Joe
  • Good suggestion. Hadley's `forcats` package has turned out to be a great help to me when I had to solve tricky as well as trivial situations with factors. – Uwe Jun 06 '17 at 12:45

Expanding on ilir's answer and its comment, you can check that a column is a factor and that it does not already contain the new level before adding it, which makes the function re-runnable:

addLevel <- function(x, newlevel=NULL) {
  if(is.factor(x)) {
    if (is.na(match(newlevel, levels(x))))
      return(factor(x, levels=c(levels(x), newlevel)))
  }
  return(x)
}

You can then apply it like so:

dataFrame$column <- addLevel(dataFrame$column, "newLevel")
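To see the re-runnable behaviour, here is a small sketch with a made-up factor `f`. The function is copied from above so that the snippet runs on its own:

```r
# Function copied from the answer above so the sketch is self-contained
addLevel <- function(x, newlevel=NULL) {
  if(is.factor(x)) {
    if (is.na(match(newlevel, levels(x))))
      return(factor(x, levels=c(levels(x), newlevel)))
  }
  return(x)
}

f <- factor(c("a", "b", NA))    # made-up example data
f <- addLevel(f, "No Answer")
f <- addLevel(f, "No Answer")   # second call finds the level and changes nothing
levels(f)                       # "a" "b" "No Answer"
addLevel(1:3, "No Answer")      # non-factors are returned unchanged
```

Note that this only adds the level. The NA values still have to be replaced separately, for example with `f[is.na(f)] <- "No Answer"`.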
Danny Varod