Referencing variable names in loops for dplyr

Question

I know this has been discussed already, but I can't find a solution that works for me. I have several binary (0/1) variables named "indic___1" to "indic___8" and one continuous variable "measure".

I would like to compute summary statistics for "measure" within the groups defined by each binary variable, so I wrote this code:

library(dplyr)
indic___1 <- c(0, 1, 0, 1, 0)
indic___2 <- c(1, 1, 0, 1, 1)
indic___3 <- c(0, 0, 1, 0, 0)
indic___4 <- c(1, 1, 0, 1, 0)
indic___5 <- c(0, 0, 0, 1, 1)
indic___6 <- c(0, 1, 1, 1, 0)
indic___7 <- c(1, 1, 0, 1, 1)
indic___8 <- c(0, 1, 1, 1, 0)
measure <- c(28, 15, 26, 42, 12)

dataset <- data.frame(indic___1, indic___2, indic___3, indic___4, indic___5, indic___6, indic___7, indic___8, measure)

for (i in 1:8) {
  variable <- paste0("indic___", i)
  print(variable)
  dataset %>% group_by(variable) %>% summarise(mean = mean(measure))
}

It returns an error:

Error in `group_by()`:
! Must group by variables found in `.data`.
x Column `variable` is not found.

@h1427096 it doesn't work. The print function correctly prints the variable's name but the second line doesn't produce any output. — DrNumeri, Feb 01 '23 at 10:01
The general solution to this class of problems is not to use a loop but rather to reformulate your problem without them. Unfortunately you only posted a code fragment so giving more precise hints isn't really possible; but one common solution is to reshape the data into long format (using `tidyr::pivot_longer`). — Konrad Rudolph, Feb 01 '23 at 10:08
The second line does not produce a result because you do not print it. Add %>% print at the end of the line and you'll have it printed to the console — h1427096, Feb 01 '23 at 10:20
@h1427096 it works now, but it provides a common mean for all groups — DrNumeri, Feb 01 '23 at 10:32
yes, sorry, actually you should use !!!rlang::parse_exprs(variable) — h1427096, Feb 01 '23 at 10:39
@h1427096 Using `syms()` is simpler than `parse_exprs()`. But at any rate I think OP should use neither. — Konrad Rudolph, Feb 01 '23 at 10:41
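
For reference, the two problems raised in these comments (the missing print and grouping by a column whose name is held in a string) can be sketched as follows. This is a minimal sketch, assuming only two of the indicator columns for brevity. `across(all_of(variable))` is one documented dplyr way to group by a column named in a character string; it is not what the commenters suggested, and it is not the approach the accepted answer recommends.

```r
library(dplyr)

# Two indicator columns from the question, kept short for illustration.
dataset <- data.frame(
  indic___1 = c(0, 1, 0, 1, 0),
  indic___2 = c(1, 1, 0, 1, 1),
  measure   = c(28, 15, 26, 42, 12)
)

# Collect one summary table per indicator column.
summaries <- list()
for (variable in c("indic___1", "indic___2")) {
  summaries[[variable]] <- dataset %>%
    # all_of() selects the column whose name is stored in `variable`.
    group_by(across(all_of(variable))) %>%
    summarise(mean = mean(measure))
  # Inside a loop, results are not auto-printed, so print explicitly.
  print(summaries[[variable]])
}
```

Storing each table in a list, rather than only printing it, keeps the results available after the loop ends.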

Konrad Rudolph · Accepted Answer · 2023-02-01T12:44:00.023


Putting data into long format makes this generally solvable without a loop. You didn’t specify what you wanted to do with the data inside the loop so I had to guess, but the general form of the solution would look as follows:

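library(dplyr)
library(tidyr)  # provides pivot_longer() and nest()

results = dataset |>
    # One row per (observation, indicator); `name` keeps only the number after the prefix.
    pivot_longer(starts_with("indic___"), names_pattern = "indic___(.*)") |>
    group_by(name, value) |>
    summarize(mean = mean(measure), .groups = "drop")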

# # A tibble: 16 × 3
#    name  value  mean
#    <chr> <dbl> <dbl>
#  1 1         0  22
#  2 1         1  28.5
#  3 2         0  26
#  4 2         1  24.2
#  5 3         0  24.2
# …

If you want a separate result table for each name, you can use a combination of `nest` and `pull`:

# Note: the `.by` argument of nest() needs tidyr 1.3.0 or later.
results |>
    nest(data = c(value, mean), .by = name) |>
    pull(data)

# [[1]]
# # A tibble: 2 × 2
#   value  mean
#   <dbl> <dbl>
# 1     0  22
# 2     1  28.5
#
# [[2]]
# # A tibble: 2 × 2
#   value  mean
#   <dbl> <dbl>
# 1     0  26
# 2     1  24.2
# …

… but at this point I’d ask myself why I am using table manipulation in the first place. The following seems a lot easier:

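# Collect the global indic___* vectors into a single unnamed list.
indices = unname(mget(ls(pattern = "^indic___")))
results = indices |>
    # Split `measure` into groups by each indicator, ...
    lapply(split, x = measure) |>
    # ... then compute the mean of each group.
    lapply(vapply, mean, numeric(1L))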

# [[1]]
#    0    1
# 22.0 28.5
#
# [[2]]
#     0     1
# 26.00 24.25
# …

Notably, in real code you shouldn’t need the first line since your data should not be in individual, numbered variables in the first place. The proper way to do this is to have the data in a joint list, as is done here. Also, note that I once again explicitly removed the unreadable indic___X names. You can of course retain them (just remove the unname call) but I don’t recommend it.
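
As an illustration of the point about keeping the data together: a data frame is itself a list of columns, so the same computation can start from `dataset` without `mget`. This is a minimal sketch, assuming the indicator columns are identified by the `indic___` prefix and using only two of them for brevity. Here the column names are retained as list names.

```r
dataset <- data.frame(
  indic___1 = c(0, 1, 0, 1, 0),
  indic___2 = c(1, 1, 0, 1, 1),
  measure   = c(28, 15, 26, 42, 12)
)

# Select the indicator columns; the result is still a list of vectors.
indices <- dataset[startsWith(names(dataset), "indic___")]

# For each indicator, split `measure` by its values and average each group.
results <- lapply(indices, function(f) {
  vapply(split(dataset$measure, f), mean, numeric(1L))
})

print(results)
```

Wrapping the result in `unname()` would drop the column names, as in the answer's version.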

edited Feb 01 '23 at 12:44

answered Feb 01 '23 at 10:14

Konrad Rudolph


Error in summarize(group_by(pivot_longer(dataset, starts_with("indic___"), : argument "by" is missing, with no default – DrNumeri Feb 01 '23 at 10:42
@DrNumeri The code I've posted does not produce this error, you must be executing different code. – Konrad Rudolph Feb 01 '23 at 10:43
I just copied and pasted your code following the code in OP. I must be doing something wrong. Edit: you wrote "summarize" instead of "summarise", but it works now – DrNumeri Feb 01 '23 at 10:45
@DrNumeri Well the code in your question *does not work*, as Maël said (even after your edit!) so I had to modify it slightly to work. But all I did was to remove the last three values from `measure`. Other than that it is identical to your code (before the loop). – Konrad Rudolph Feb 01 '23 at 10:48
@DrNumeri `summarize` is correct. Actually both work because ‘dplyr’ defines both, but code is customarily written in American English, and that's what I use throughout and that is what I recommend using. – Konrad Rudolph Feb 01 '23 at 10:49
However the output is difficult to read compared with using "group_by" on the original "indic___" variables – DrNumeri Feb 01 '23 at 10:52
you are right, the original code didn't work, it should now. Thank you for your help – DrNumeri Feb 01 '23 at 10:54
"difficult to read" in what sense? Your original code doesn't produce any output so I don't know what you are after. – Konrad Rudolph Feb 01 '23 at 10:57
I mean if you use the following: "dataset %>% group_by(indic___1) %>% summarise(mean = mean(measure)) %>% print" you get at least the name of the variable on top of the output – DrNumeri Feb 01 '23 at 11:03
@DrNumeri The name of the variable, in my code, is in the column `name`. I dropped the weird `indic___` prefix precisely *because* they're unreadable, but if you want to retain it you of course can. If you want to split the table you can also do that after computing the result: `results |> nest(data = mean, .by = name) |> pull(data)`. – Konrad Rudolph Feb 01 '23 at 12:34
I agree with you with regard to variable naming. Unfortunately, that is the way REDCap provides variables from multiple-choice questions. It would be just great if it were possible to make the variable label (e.g. "Indication number 1") and value labels (e.g. yes/no) appear in the output rather than the actual variable name and values – DrNumeri Feb 01 '23 at 14:10
@DrNumeri Of course you can do that. You simply have to provide them. In fact, in R there's no need to use numbers instead of proper labels if you have them. Simply use character or factor variables with the appropriate labels. – Konrad Rudolph Feb 01 '23 at 14:59
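
One way the suggestion above might look in practice, as a rough sketch only: the lookup vector `column_labels` and its label texts are hypothetical (modelled on the "Indication number 1" and yes/no examples in the previous comment), and only two indicator columns are used.

```r
library(dplyr)
library(tidyr)

dataset <- data.frame(
  indic___1 = c(0, 1, 0, 1, 0),
  indic___2 = c(1, 1, 0, 1, 1),
  measure   = c(28, 15, 26, 42, 12)
)

# Hypothetical lookup from column name to a readable label.
column_labels <- c(
  indic___1 = "Indication number 1",
  indic___2 = "Indication number 2"
)

results <- dataset |>
  pivot_longer(starts_with("indic___")) |>
  mutate(
    # Replace the raw column name with its label.
    name  = unname(column_labels[name]),
    # Replace 0/1 with labelled factor levels.
    value = factor(value, levels = c(0, 1), labels = c("no", "yes"))
  ) |>
  group_by(name, value) |>
  summarize(mean = mean(measure), .groups = "drop")

print(results)
```
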
Thank you @Konrad Rudolph. A last question: you mentioned that using such a naming convention with numbering should be avoided, but what if I wanted to do the same with variables that are not related by name and just share a binary/categorical nature (say, diabetes, high blood pressure and asthma)? Not possible to predict this in the data input phase, so you would still need to apply a "reshape long" approach, correct? – DrNumeri Feb 01 '23 at 15:59
Basically, whenever you have data that you want to treat in a uniform way this is a sign that this data belongs in a vector, list or table rather than in individual variables. Numbered variable names are a tell-tale sign that this is such a situation, but your example of differently-named categorical variables *might* also fall under this guideline. – Konrad Rudolph Feb 01 '23 at 16:08
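
The long-format approach carries over to columns that do not share a prefix. The following is a sketch under stated assumptions: the column names `diabetes`, `hypertension` and `asthma` are hypothetical stand-ins for the conditions mentioned in the comment above, and the values are simply the first three indicator columns of the question reused for illustration. The columns are listed explicitly instead of being matched with `starts_with`.

```r
library(dplyr)
library(tidyr)

# Hypothetical column names; values reused from the question's data.
dataset <- data.frame(
  diabetes     = c(0, 1, 0, 1, 0),
  hypertension = c(1, 1, 0, 1, 1),
  asthma       = c(0, 0, 1, 0, 0),
  measure      = c(28, 15, 26, 42, 12)
)

results <- dataset |>
  # Name the columns to reshape explicitly since they share no prefix.
  pivot_longer(
    c(diabetes, hypertension, asthma),
    names_to = "condition",
    values_to = "present"
  ) |>
  group_by(condition, present) |>
  summarize(mean = mean(measure), .groups = "drop")

print(results)
```

A character vector of column names could be passed via `all_of()` instead, which may be more convenient when the list of columns is only known at run time.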
