
The base::levels help file https://stat.ethz.ch/R-manual/R-devel/library/base/html/levels.html contains the following example of modifying the levels of a variable:

z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z

Suppose that this factor sits inside a data frame:

mydata <- data.frame(z=gl(3, 2, 12, labels = c("apple", "salad", "orange")), n=1:12)

I want to write a function that takes the data frame and the variable name as inputs and converts the levels:

modify_levels <- function(df,varname,from,to) {
  ### MAGIC HAPPENS
}

so that modify_levels(mydata,z,from=c("apple","orange"),to="fruit") does the first part of the transformation, and modify_levels(mydata,z,from=c("salad","broccoli"),to="veg") does the second part, even though the level broccoli may not exist in my data set.

With some non-standard evaluation voodoo I can zoom in on what I need to modify:

where_are_levels <- function(df,varname,from,to,verbose=FALSE) {
  # input checks
  if ( !is.data.frame(df) ) {
    stop("df is not a data frame")
  }
  if ( !is.factor(eval(substitute(varname),df)) ) {
    stop("df$varname is not a factor")
  }
  if (verbose==TRUE) {
    cat("df$varname is",
      paste0(substitute(df),"$",substitute(varname)))
    cat(" which evaluates to:\n")
    print( eval(substitute(varname),df) )
  }
  if (length(to)!=1) {
    stop("Substitution is ambiguous")
  }
  # figure out what the cases are with the supplied source values
  for (val in from) {
    r <- (eval(substitute(varname),df) == val)
    if (verbose==TRUE) {
      print(r)
      cat( paste0(substitute(df),"$",substitute(varname)),"==",val)
      cat(": ",sum(r), "case(s)\n")
    }
  }
}

So far, so good (the to option does nothing though):

> where_are_levels(mydata,z,from=c("apple","orange"),to="",verbose=TRUE)

## df$varname is mydata$z which evaluates to:
## [1] apple  apple  salad  salad  orange orange apple  apple  salad  salad  orange orange
## Levels: apple salad orange
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
## mydata$z == apple:  4 case(s)
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
## mydata$z == orange:  4 case(s)

Now, for the next step, I think I need to append an additional level to the levels of the target variable and then change the values of that variable. In interactive work, I would run

# to <- "fruit" # passed as a function argument
l1 <- levels(mydata$z)
levels(mydata$z) <- union(l1,to)
mydata[r,"z"] <- to

Of these lines, I can only reproduce the first one programmatically, inside the val loop:

l1 <- levels(eval(substitute(varname),df))

Note that I want to keep the existing levels apple and orange rather than replace the whole set of levels (as the example in the help file does).

If the solution is easier to achieve from scratch with dplyr, that is fine with me (although my understanding is that NSE is even more hardcore in dplyr than in base R).

StasK
  • Is there any reason you're using `eval` and `substitute` for strictly `factor`/`character`-based replacement? Seems to me that `replace` could be used with `%in%`, plus a cleanup call to `levels`. – r2evans Jan 02 '18 at 23:28
  • Can't see why this is so complicated. There is a `levels<-` function. Should work on columns inside dataframes. – IRTFM Jan 02 '18 at 23:30
  • This is so complicated because my knowledge of `R` is patchy. Give a newbie some slack, will you please. I also really want to do this by reference, so that I don't have to haul the data frames around and pass them back (a legacy of another programming language that I am more used to), but that is nigh impossible in `R`. – StasK Jan 03 '18 at 14:30

3 Answers

No need for all the substitutions; one should suffice. I'll keep all your messaging:

where_are_levels <- function(df,varname,from,to,verbose=FALSE) {
  # input checks
  varname <- substitute(varname)

  if (!is.data.frame(df)) {
    stop("df is not a data frame")
  }
  if (!is.factor(df[[varname]])) {
    stop("df$varname is not a factor")
  }
  if (verbose) {
    cat("df$varname is", paste0(substitute(df),"$",varname))
    cat(" which evaluates to:\n")
    print(df[[varname]])
  }
  if (length(to) != 1) {
    stop("Substitution is ambiguous")
  }
  # figure out what the cases are with the supplied source values
  r <- df[[varname]] %in% from
  new_levels <- union(levels(df[[varname]]), to)
  df[[varname]] <- factor(df[[varname]], new_levels)
  df[[varname]] <- replace(df[[varname]], r, to)
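  if (verbose) {
    print(r)
    # report the column name and the source values, not the recoded column itself
    cat(as.character(varname), "%in%", paste(from, collapse = ", "))
    cat(": ",sum(r), "case(s)\n")
  }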
  return(df)
}
where_are_levels(mydata,z,from=c("apple","orange"),to="fruit")
       z  n
1  fruit  1
2  fruit  2
3  salad  3
4  salad  4
5  fruit  5
6  fruit  6
7  fruit  7
8  fruit  8
9  salad  9
10 salad 10
11 fruit 11
12 fruit 12
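
Because the function returns a modified copy, the result has to be assigned back, and the two conversions from the question can be chained. Here is a minimal sketch of that usage. It assumes a compact variant of the function, under the hypothetical name recode_levels, that takes the column name as a string, so it is not the exact function above. The final droplevels() call is optional and only matters if the unused old levels should go away.

```r
# Hypothetical helper (name and string-based interface are assumptions):
# same idea as above, but the column name is passed as a character string.
recode_levels <- function(df, varname, from, to) {
  stopifnot(is.data.frame(df), is.factor(df[[varname]]), length(to) == 1)
  r <- df[[varname]] %in% from                      # rows to recode
  # append the target level while keeping all existing levels
  df[[varname]] <- factor(df[[varname]],
                          levels = union(levels(df[[varname]]), to))
  df[[varname]][r] <- to
  df
}

mydata <- data.frame(z = gl(3, 2, 12, labels = c("apple", "salad", "orange")),
                     n = 1:12)

# assign the result back after each call
mydata <- recode_levels(mydata, "z", from = c("apple", "orange"), to = "fruit")
mydata <- recode_levels(mydata, "z", from = c("salad", "broccoli"), to = "veg")

levels(mydata$z)                   # old levels are kept, new ones are appended
mydata$z <- droplevels(mydata$z)   # optional: drop levels that are no longer used
levels(mydata$z)
```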
Axeman

I don't see the need for either nonstandard evaluation or any tidyverse magic. Just use ordinary "[[" and levels<-:

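modify_levels <- function(dfrm, cname, from = NA, to = NA) {
  # positions, within the factor's levels, of the levels listed in `from`
  # (entries of `from` that are not levels, e.g. "broccoli", are simply ignored)
  pos <- which(levels(dfrm[[cname]]) %in% from)
  levels(dfrm[[cname]])[pos] <- to
  dfrm[[cname]]  # returns the modified column; be sure to assign the result back
}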

Use (the output below assumes that apple and orange were already merged into fruit by an earlier call):

> modify_levels(mydata,'z',from=c("salad","broccoli"),to="veg")
 [1] fruit fruit veg   veg   fruit fruit fruit fruit veg   veg   fruit fruit
Levels: fruit veg

But you do need to assign the result:

> mydata$z <- modify_levels(mydata,'z',from=c("salad","broccoli"),to="veg")
> mydata
       z  n
1  fruit  1
2  fruit  2
3    veg  3
4    veg  4
5  fruit  5
6  fruit  6
7  fruit  7
8  fruit  8
9    veg  9
10   veg 10
11 fruit 11
12 fruit 12
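
As a related sketch, `levels<-` on a factor also accepts a named list, which does both conversions in one assignment. This is an alternative to the function above, not part of it. Like the function, it merges the old levels instead of keeping apple and orange as separate levels.

```r
mydata <- data.frame(z = gl(3, 2, 12, labels = c("apple", "salad", "orange")),
                     n = 1:12)

# Each list name is a new level; its elements are the old levels it absorbs.
# "broccoli" is not a level of z, so listing it has no effect.
levels(mydata$z) <- list(fruit = c("apple", "orange"),
                         veg   = c("salad", "broccoli"))
levels(mydata$z)

# Caveat: old levels that are not mentioned in the list become NA.
w <- factor(c("a", "b"))
levels(w) <- list(x = "a")
w
```

The caveat in the last lines is the reason to list every level you want to keep when using this form.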
IRTFM

You could change your function to this:

where_are_levels<-function(mydata,varname,from, to, additional){
  mydata[[varname]]<-plyr::mapvalues(mydata[[varname]], from = from, to = to)
  mydata[[varname]]<-factor(mydata[[varname]],levels=c(levels(mydata[[varname]]),additional))
  return(mydata)
}

example:

varname="z"
from = c("apple", "salad","orange")
to = c("fruit", "veg", "fruit")
additional="Milk"    
a<-where_are_levels(mydata,varname,from, to, additional)
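
If plyr is not available, roughly the same mapping can be sketched in base R. The function name map_levels and the default for additional are assumptions made for illustration; the idea is to rename the levels through match() and then append the extra level.

```r
# Hypothetical base-R counterpart of the function above (names are assumptions).
map_levels <- function(df, varname, from, to, additional = character(0)) {
  stopifnot(is.factor(df[[varname]]), length(from) == length(to))
  lv  <- levels(df[[varname]])
  hit <- match(lv, from)                   # position of each level in `from`, NA if absent
  lv[!is.na(hit)] <- to[hit[!is.na(hit)]]  # rename only the matched levels
  levels(df[[varname]]) <- lv              # duplicated names merge the levels
  # append any extra levels that should exist even if unused
  df[[varname]] <- factor(df[[varname]],
                          levels = union(levels(df[[varname]]), additional))
  df
}

mydata <- data.frame(z = gl(3, 2, 12, labels = c("apple", "salad", "orange")),
                     n = 1:12)
a <- map_levels(mydata, "z",
                from = c("apple", "salad", "orange"),
                to   = c("fruit", "veg", "fruit"),
                additional = "Milk")
levels(a$z)
table(a$z)
```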
JeanVuda