Consider the following:

library(tidyverse)

df <- tibble(x = rnorm(100), y = rnorm(100, 10, 2), z = x * y)

df %>% 
  mutate_all(funs(avg = mean(.), dev = sd(.), scaled = (. - mean(.)) / sd(.)))

Is there a way to avoid calling mean and sd twice by referencing the avg and dev columns? What I have in mind is something like

df %>% 
  mutate_all(funs(avg = mean(.), dev = sd(.), scaled = (. - avg) / dev))

Clearly this won't work, because the new columns are not named avg and dev but x_avg, x_dev, y_avg, y_dev, etc.

Is there a good way, within funs, to use the rlang tools to create those column references programmatically, so that I can refer to columns created by the previous named arguments to funs? For example, when . is x, I would reference x_avg and x_dev to calculate x_scaled, and so forth.

Jonathan Gilligan

3 Answers

I think it would be easier if you converted your data to long format:

library(tidyverse)

set.seed(111)
df <- tibble(x = rnorm(100), y = rnorm(100, 10, 2), z = x * y)

df %>% 
  gather(key, value) %>% 
  group_by(key) %>% 
  mutate(avg    = mean(value),
         sd     = sd(value),
         scaled = (value - avg) / sd)
#> # A tibble: 300 x 5
#> # Groups:   key [3]
#>    key    value     avg    sd scaled
#>    <chr>  <dbl>   <dbl> <dbl>  <dbl>
#>  1 x      0.235 -0.0128  1.07  0.232
#>  2 x     -0.331 -0.0128  1.07 -0.297
#>  3 x     -0.312 -0.0128  1.07 -0.279
#>  4 x     -2.30  -0.0128  1.07 -2.14 
#>  5 x     -0.171 -0.0128  1.07 -0.148
#>  6 x      0.140 -0.0128  1.07  0.143
#>  7 x     -1.50  -0.0128  1.07 -1.39 
#>  8 x     -1.01  -0.0128  1.07 -0.931
#>  9 x     -0.948 -0.0128  1.07 -0.874
#> 10 x     -0.494 -0.0128  1.07 -0.449
#> # ... with 290 more rows

Created on 2018-11-04 by the reprex package (v0.2.1.9000)
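
For readers on a newer tidyr, the same long-format idea can be sketched with pivot_longer(), which supersedes gather(). This is only a sketch, assuming tidyr 1.0.0 or later; the result name `long` and the column name `dev` (used instead of `sd` so that the column does not share a name with the function) are choices made here, not part of the answer above. Within a single mutate() call, later expressions can refer to columns created earlier in the same call, which is what lets `scaled` reuse `avg` and `dev`.

```r
library(dplyr)
library(tidyr)

set.seed(111)
df <- tibble(x = rnorm(100), y = rnorm(100, 10, 2), z = x * y)

long <- df %>%
  # One row per (original column, value) pair
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  # avg and dev are computed once per group and reused for scaled
  mutate(avg    = mean(value),
         dev    = sd(value),
         scaled = (value - avg) / dev) %>%
  ungroup()

long
```

Unlike gather(), pivot_longer() interleaves the rows (x, y, z, x, y, z, ...) rather than stacking all of x first, so the row order differs from the output shown above.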

Tung

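This might work for you. Note that it only saves retyping the expressions; mean(.) and sd(.) are still evaluated again inside scaled: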

avg <- quo(mean(.))
dev <- quo(sd(.))
res <- df %>% 
  mutate_all(funs(avg = !!avg, dev = !!dev, scaled = (. - !!avg) / !!dev))

To confirm that it gives the same result as the original call:

res0 <- df %>% 
  mutate_all(funs(avg = mean(.), dev = sd(.), scaled = (. - mean(.)) / sd(.)))
identical(res, res0)
# [1] TRUE
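
Because !! splices the expression mean(.) or sd(.) into every place it is unquoted, each statistic is still computed once per use. If the goal is to compute each one only once per column, one option is to do all three calculations inside a single helper. The following is a minimal base R sketch under that assumption; `scale_column`, `pieces` and `res` are hypothetical names introduced here, and the data frame is built with data.frame() rather than tibble() to keep the example free of dependencies.

```r
# Hypothetical helper (not from the answers): mean and sd are each
# computed once, then reused for the scaled values.
scale_column <- function(v, name) {
  avg <- mean(v)
  dev <- sd(v)
  # data.frame() recycles the scalars avg and dev to the length of v
  out <- data.frame(avg = avg, dev = dev, scaled = (v - avg) / dev)
  names(out) <- paste(name, names(out), sep = "_")
  out
}

set.seed(111)
df <- data.frame(x = rnorm(100), y = rnorm(100, 10, 2))
df$z <- df$x * df$y

# One three-column data frame per original column
pieces <- Map(scale_column, df, names(df))

# unname() stops cbind() from prefixing the new column names
res <- do.call(cbind, c(list(df), unname(pieces)))
names(res)
```

The new columns come out grouped by source column (x_avg, x_dev, x_scaled, y_avg, ...) rather than by statistic as with mutate_all, so a reordering step may be needed if the exact layout matters.
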
moodymudskipper

This seems a little convoluted, but it works:

scaled <- function(col_name, x, y) {
  col_name <- deparse(substitute(col_name))
  avg <- eval.parent(as.symbol(paste0(col_name, x)))
  dev <- eval.parent(as.symbol(paste0(col_name, y)))
  (eval.parent(as.symbol(col_name)) - avg) / dev
}

df %>%
  mutate_all(funs(avg = mean(.), dev = sd(.), scaled = scaled(., "_avg", "_dev"))) 
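
The helper above appears to rely on funs replacing . with the column's own symbol, so that deparse(substitute(col_name)) yields the column name as a string, which is then pasted to the suffix and looked up as a symbol. The name-building step can be sketched in isolation as follows; `arg_name` and the toy list `cols` (standing in for the data frame's columns, with made-up values) are assumptions for illustration only.

```r
# Capture the text of the expression passed as an argument, without evaluating it
arg_name <- function(arg) deparse(substitute(arg))

# Toy stand-in for the columns of a data frame (values are made up)
cols <- list(x = c(1, 2, 3), x_avg = 2, x_dev = 1)

base_name <- arg_name(x)                          # "x"
avg_sym   <- as.symbol(paste0(base_name, "_avg")) # the symbol x_avg
dev_sym   <- as.symbol(paste0(base_name, "_dev")) # the symbol x_dev

# Evaluate the constructed symbols against the list, as eval.parent() does
# against the calling frame in the helper above
scaled <- (cols[[base_name]] - eval(avg_sym, cols)) / eval(dev_sym, cols)
scaled  # -1 0 1
```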
Weihuang Wong