forloop inside dplyr mutate

Question

I would like to do a few column operations using mutate in more elegant way as I have more than 200 columns in my table that I would like transform using mutate.

here is an example

Sample data:

df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20),c2_y=rnorm(20),c3_z=rnorm(20),
c4_x=rnorm(20),c5_y=rnorm(20),c6_z=rnorm(20),
c7_x=rnorm(20),c8_y=rnorm(20),c9_z=rnorm(20),
c10_x=rnorm(20),c11_y=rnorm(20),c12_z=rnorm(20),
c_n=rnorm(20))

sample code:

dfm<-df %>%
mutate(cx=(c1_x*c4_x/c_n+c7_x*c10_x/c_n),
cy=(c2_y*c5_y/c_n+c8_y*c11_y/c_n),
cz=(c3_z*c6_z/c_n+c9_z*c12_z/c_n))

"to do the following" is awfully vague, requiring readers to trudge through your code, y'know. You could use words to describe it. — Frank, Jun 14 '16 at 21:35
You should `melt` so that you can do a grouped operation on the `x`, `y`, `z` groups. (Actually, from your example, it might be straight-up column arithmetic after melting.) — Gregor Thomas, Jun 14 '16 at 21:36
agree with @Gregor; you could also `tidyr::gather()` (hadleyverse 2) instead of `reshape2:melt()`ing (hadleyverse 1) — Ben Bolker, Jun 14 '16 at 21:37
@BenBolker, there are *two* hadleyverses now? Is this some kind of dimensional rift or many worlds thing? — eipi10, Jun 14 '16 at 22:00
Theoretically, aren't their *infinite* hadleyverses? We've only discovered two of them. We're just a [mere simulation](http://www.extremetech.com/extreme/227126-neil-degrasse-tyson-says-its-very-likely-the-universe-is-a-simulation). — r2evans, Jun 14 '16 at 22:03
Does that mean there's a hadleyverse where you're allowed to make pie charts and second y-axes with ggplot2? — eipi10, Jun 14 '16 at 22:04
in case it's not obvious, hadleyverse 1 = (plyr, reshape2, ggplot2); hadleyverse 2 = (dplyr, tidyr, rvest, readr, ..., ggvis?). I don't know about the original versions of reshape and ggplot (primordial chaos?) — Ben Bolker, Jun 14 '16 at 22:27
Wouldn't it be our luck that the hadleyverse where the entire world uses the metric system is the same hadleyverse that allows pie charts. :( — Benjamin, Jun 14 '16 at 22:56

score 3 · Answer 1 · answered Jun 14 '16 at 23:30

Despite the tangent, the initial recommendations for using tidyr functions is where you need to go. This pipe of functions seems to do the job based on what you've provided.

Your data:

df <- data.frame(treatment=rep(letters[1:2],10),
                 c1_x=rnorm(20), c2_y=rnorm(20), c3_z=rnorm(20),
                 c4_x=rnorm(20), c5_y=rnorm(20), c6_z=rnorm(20),
                 c7_x=rnorm(20), c8_y=rnorm(20), c9_z=rnorm(20),
                 c10_x=rnorm(20), c11_y=rnorm(20), c12_z=rnorm(20),
                 c_n=rnorm(20))
library(dplyr)
library(tidyr)

This first auxiliary data.frame is used to translate your c#_[xyz] variable into a unified one. I'm sure there are other ways to handle this, but it works and is relatively easy to reproduce and extend based on your 200+ columns.

variableTransform <- data_frame(
  cnum = paste0("c", 1:12),
  cvar = rep(paste0("a", 1:4), each = 3)
)
head(variableTransform)
# Source: local data frame [6 x 2]
#    cnum  cvar
#   <chr> <chr>
# 1    c1    a1
# 2    c2    a1
# 3    c3    a1
# 4    c4    a2
# 5    c5    a2
# 6    c6    a2

Here's the pipe all at once. I'll explain the steps in a sec. What you're looking for is likely a combination of the treatment, xyz, and ans columns.

df %>%
  tidyr::gather(cnum, value, -treatment, -c_n) %>%
  tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
  left_join(variableTransform, by = "cnum") %>%
  select(-cnum) %>%
  tidyr::spread(cvar, value) %>%
  mutate(
    ans = a1 * (a2/c_n) + a3 * (a4/c_n)
  ) %>%
  head
#   treatment       c_n xyz         a1          a2         a3          a4         ans
# 1         a -1.535934   x -0.3276474  1.45959746 -1.2650369  1.02795419  1.15801448
# 2         a -1.535934   y -1.3662388 -0.05668467  0.4867865 -0.10138979 -0.01828831
# 3         a -1.535934   z -2.5026018 -0.99797169  0.5181513  1.20321878 -2.03197283
# 4         a -1.363584   x -0.9742016 -0.12650863  1.3612361 -0.24840493  0.15759418
# 5         a -1.363584   y -0.9795871  1.52027017  0.5510857  1.08733839  0.65270681
# 6         a -1.363584   z  0.2985557 -0.22883439  0.1536078 -0.09993095  0.06136036

First, we take the original data and turn all (except two) columns into two columns of "column name" and "column values" pairs:

df %>%
  tidyr::gather(cnum, value, -treatment, -c_n) %>%
#   treatment         c_n cnum      value
# 1         a  0.20745647 c1_x -0.1250222
# 2         b  0.01015871 c1_x -0.4585088
# 3         a  1.65671028 c1_x -0.2455927
# 4         b -0.24037137 c1_x  0.6219516
# 5         a -1.16092349 c1_x -0.3716138
# 6         b  1.61191700 c1_x  1.7605452

It will be helpful to split c1_x into c1 and x in order to translate the first and preserve the latter:

  tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
#   treatment         c_n cnum xyz      value
# 1         a  0.20745647   c1   x -0.1250222
# 2         b  0.01015871   c1   x -0.4585088
# 3         a  1.65671028   c1   x -0.2455927
# 4         b -0.24037137   c1   x  0.6219516
# 5         a -1.16092349   c1   x -0.3716138
# 6         b  1.61191700   c1   x  1.7605452

From here, let's translate the c1, c2, and c3 variables into a1 (repeat for other 9 variables) using variableTransform:

  left_join(variableTransform, by = "cnum") %>%
  select(-cnum) %>%
#   treatment         c_n xyz      value cvar
# 1         a  0.20745647   x -0.1250222   a1
# 2         b  0.01015871   x -0.4585088   a1
# 3         a  1.65671028   x -0.2455927   a1
# 4         b -0.24037137   x  0.6219516   a1
# 5         a -1.16092349   x -0.3716138   a1
# 6         b  1.61191700   x  1.7605452   a1

Since we want to deal with multiple variables simultaneously (with a simple mutate), we need to bring some of the variables back into columns. (The reason we gathered and will now spread helps me with keeping things organized and named well. I'm confident somebody can come up with another way to do it.)

  tidyr::spread(cvar, value) %>% head
#   treatment       c_n xyz         a1          a2         a3          a4
# 1         a -1.535934   x -0.3276474  1.45959746 -1.2650369  1.02795419
# 2         a -1.535934   y -1.3662388 -0.05668467  0.4867865 -0.10138979
# 3         a -1.535934   z -2.5026018 -0.99797169  0.5181513  1.20321878
# 4         a -1.363584   x -0.9742016 -0.12650863  1.3612361 -0.24840493
# 5         a -1.363584   y -0.9795871  1.52027017  0.5510857  1.08733839
# 6         a -1.363584   z  0.2985557 -0.22883439  0.1536078 -0.09993095

From here, we just need to mutate to get the right answer.

Gregor Thomas · Answer 2 · 2016-06-14T23:45:38.313

Similar to r2evans's answer, but with more manipulation instead of the joins (and less explanation).

library(tidyr)
library(stringr)
library(dplyr)

# get it into fully long form
gather(df, key = cc_xyz, value = value, c1_x:c12_z) %>%
    # separate off the xyz and the c123
    separate(col = cc_xyz, into = c("cc", "xyz")) %>%
    # extract the number
    mutate(num = as.numeric(str_replace(cc, pattern = "c", replacement = "")),
           # mod it by 4 for groupings and add a letter so its a good col name
           num_mod = paste0("v", (num %% 4) + 1)) %>%
    # remove unwanted columns
    select(-cc, -num) %>%
    # go into a reasonable data width for calculation
    spread(key = num_mod, value = value) %>%
    # calculate
    mutate(result = v1 + v2/c_n + v3 + v4 / c_n)

#    treatment          c_n xyz           v1           v2            v3          v4        result
# 1          a -1.433858289   x  1.242153708 -0.985482158 -0.0240414692  1.98710285    0.51956295
# 2          a -1.433858289   y -0.019255516  0.074453615 -1.6081599298  1.18228939   -2.50389188
# 3          a -1.433858289   z -0.362785313  2.296744655 -0.0610463292  0.89797526   -2.65188998
# 4          a -0.911463819   x -1.088308527 -0.703388193  0.6308253909  0.22685013    0.06534405
# 5          a -0.911463819   y  1.284513516  1.410276163  0.5066869590 -2.07263912    2.51790289
# 6          a -0.911463819   z  0.957778345 -1.136532104  1.3959561507 -0.50021647    4.14947069
# ...

forloop inside dplyr mutate

2 Answers2