lappy conditional on variable value

Question

I want to lappy two functions on a data set conditional on the value of a specific variable.

first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}

df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)

Someting like:

df <- lapply(df, if(df$Letters=="A") first_function else second_function )

To produce:

df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4))

You may need `ifelse` i.e. `df$Numbers <- with(df, ifelse(Letters == "A", first_function(Numbers), second_function(Numbers)))` — akrun, Oct 26 '18 at 17:03
Your current application of `lapply` is incorrect. Regardless of which function you use, `lapply` will apply the function on each column of the frame. Since you have non-`numeric` columns, they will produce errors (if `character`) or warnings and `NA`s (if `factor`). Based on a naive interpretation of your functions, I think @akrun's solution is more correct. — r2evans, Oct 26 '18 at 17:09

Hanjo Odendaal · Accepted Answer · 2018-10-27T06:44:22.490

You can do it with dplyr and purrr. Obviously this is a basic function, but you should be able to build on it for your needs:

library(dplyr)
library(purrr)
calc <- function(y, x){
  first_function <- function(x) {return (x + 0)}
  second_function <- function(x) {return (x + 1)}

  if(y == "A")
    return(first_function(x))

    return(second_function(x))
}

df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)

df %>% 
  mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x,.y)))

  Letters Numbers
1       A       1
2       B       3
3       B       4

>(df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4)))
      Letters Numbers
1       A       1
2       B       3
3       B       4

BENCHMARKING

I am not a data.table expert (feel free to add), so did not incorporate here. But, @R Yoda is correct. Although it reads nicely and future you will find it easier to read and extend the function, the purrr solution is not that fast. I liked the ifelse approach, so added case_when which is easier to scale when dealing with multiple functions. Here are a couple solutions:

library(dplyr)
library(purrr)
library(microbenchmark)

first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}

calc <- function(y, x){
  if(y == "A")
    return(first_function(x))

    return(second_function(x))
}

df <- data.frame(Letters = rep(c("A","B","B"),1000), Numbers = 1:3)

basic <- function(){
  data.frame(df$Letters, apply(df, 1, function(row) {
    num <- as.numeric(row['Numbers'])
    if (row['Letters'] == 'A') first_function(num) else second_function(num)
  }))
}

dplyr_purrr <- function(){
  df %>% 
    mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x,.y)))
}

dplyr_case_when <- function(){
  df %>% 
    mutate(Numbers = case_when(
        Letters == "A" ~ first_function(Numbers),
        TRUE ~ second_function(Numbers)))
}

map_list <- function(){
   data.frame(df$Letters, map2_dbl(df2$Letters, df2$Numbers, ~calc(.x, .y)))
}

within_mapply <- function(){
  within(df, Numbers <- mapply(Letters, Numbers, 
                               FUN = function(x, y){
    switch(x, 
           "A" = first_function(y),
           "B" = second_function(y))
    }))
}

within_ifelse <- function(){

  within(df, Numbers <- ifelse(Letters == "A",
                               first_function(Numbers),
                               second_function(Numbers)))
}

within_case_when <- function(){

  within(df, Numbers <- case_when(
    Letters == "A" ~ first_function(Numbers),
    TRUE ~ second_function(Numbers)))
}

(mbm <- microbenchmark(
  basic(),
  dplyr_purrr(),
  dplyr_case_when(),
  map_list(),
  within_mapply(),
  within_ifelse(),
  within_case_when(),
  times = 1000
))

Unit: microseconds
               expr       min         lq       mean     median        uq        max neval    cld
            basic() 12816.427 24028.3375 27719.8182 26741.7770 29417.267 277756.650  1000      f
      dplyr_purrr()  9682.884 17817.0475 20072.2752 19736.8445 21767.001  48344.265  1000     e 
  dplyr_case_when()  1098.258  2096.2080  2426.7183  2325.7470  2625.439   9039.601  1000  b    
         map_list()  8764.319 16873.8670 18962.8540 18586.2790 20599.000  41524.564  1000    d  
    within_mapply()  6718.368 12397.1440 13806.1752 13671.8120 14942.583  24958.390  1000   c   
    within_ifelse()   279.796   586.6675   690.1919   653.3345   737.232   8131.292  1000 a     
 within_case_when()   470.155   955.8990  1170.4641  1070.5655  1219.284  46736.879  1000 a

Elegant solution to encapsulate the conditional logic in a function but this seems to impact the performance since the function is called per row (but performance was no requirement in the question) — R Yoda, Oct 26 '18 at 18:01

divibisan · Answer 2 · 2018-10-26T17:27:54.253

The simple way to do this with *apply would be to put the whole logic (with the conditional and the two functions) into another function and use apply with MARGIN=1 to pass the data in row by row (lapply will pass in the data by column):

apply(df, 1, function(row) {
    num <- as.numeric(row['Numbers'])
    if (row['Letters'] == 'A') first_function(num) else second_function(num)
    })

[1] 1 3 4

The problem with this approach, at @r2evans points out in the comment below, is that when you use apply with a heterogeneous data.frame (in this case, Letters is type factor while Numbers is type integer) each row passed into the applied function is passed as a vector which can only have a single type, so everything in the row is coerced to the same type (in this case character). This is why it's necessary to use as.numeric(row['Numbers']), to turn Numbers back into type numeric. Depending on your data, this could be a simple fix (as above) or it could make things much more complicated and bug-prone. Either way @akrun's solution is much better, since it preserves each variable's original data type.

The fact that you have to force `as.numeric(row['Numbers'])` indicates the cost/consequence of using `apply` across all columns of a heterogeneous frame. Larger data and/or other classes of data will have different costs/consequences. — r2evans, Oct 26 '18 at 17:12

score 0 · Answer 3 · answered Oct 26 '18 at 17:58

lapply has difficulty in this case because it's column-based. However you can try transpose your data by t() and use lapply if you persist. Here I provide two ways which use mapply and ifelse :

df$Letters <- as.character(df$Letters)

# Method 1
within(df, Numbers <- mapply(Letters, Numbers, FUN = function(x, y){
                             switch(x, "A" = first_function(y),
                                       "B" = second_function(y))
}))

# Method 2
within(df, Numbers <- ifelse(Letters == "A",
                             first_function(Numbers),
                             second_function(Numbers)))

Both above got the same outputs :

#   Letters Numbers
# 1       A       1
# 2       B       3
# 3       B       4

score 0 · Answer 4 · answered Oct 26 '18 at 18:14

Here a data.table variant for better performance in case of many data rows (but also showing an implicit conversion problem):

library(data.table)

setDT(df)  # fast convertion from data.frame to data.table

df[  Letters == "A",  Numbers := first_function(Numbers) ]
df[!(Letters == "A"), Numbers := second_function(Numbers)]  # issues a warning, see below

df
# Letters Numbers
# 1:       A       1
# 2:       B       3
# 3:       B       4

The issued warning is:

Warning message: In [.data.table(df, !(Letters == "A"), :=(Numbers, second_function(Numbers))) : Coerced 'double' RHS to 'integer' to match the column's type; may have truncated precision. Either change the target column ['Numbers'] to 'double' first (by creating a new 'double' vector length 3 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.

The reason is that the data.frame column Numbers is an integer

> str(df)
'data.frame':   3 obs. of  2 variables:
 $ Letters: Factor w/ 2 levels "A","B": 1 2 2
 $ Numbers: int  1 2 3

but the functions return a double (for whatever reason):

> typeof(first_function(df$Numbers))
[1] "double"

lappy conditional on variable value

4 Answers4