Iterating name of a field with dplyr::summarise function

Question

This is my first time posting here, so I'll try to explain my problem as clearly as possible. I'm working on erosion data stored per farm as pixels (e.g. 1 farm = 10 pixels, so 10 rows in my df). I have 4 df in a list, and I would like to calculate the mean erosion for each farm. I thought about looping over the name of the erosion field, but my problem is that the field doesn't have the same name in every df (it is either ERO13 or ERO17). I don't want to rely on the position of the field, because it could change between the df; I want to use only the name, which varies.

Here's an example:

df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
  cur_df <- df
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}

I tried with

for (df in lst_df){
  cur_df <- df
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(cur_camp = mean(cur_camp))
}

but this doesn't work, first because cur_camp is a character string rather than a reference to the column it names, and also because it relies on the position of the field.

How can I build the current_name_of_erosion_field here?

score 1 · Accepted Answer · answered Jan 14 '22 at 17:38

We may convert the string to a symbol and evaluate it (!!), or pass the string to across. Also, as we are using a for loop, make sure to create a list to store the output. To name the output column from a string stored in an object, use := with !! on the left-hand side.

out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(!!cur_camp := mean(!! sym(cur_camp)))
  out[[i]] <- cur_df
}

-output

> out
[[1]]
# A tibble: 2 × 2
     ID ERO13
  <dbl> <dbl>
1     1     3
2     2     6

[[2]]
# A tibble: 2 × 2
     ID ERO17
  <dbl> <dbl>
1     4   4.5
2     6  12

Or we may use across

out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>% 
    group_by(ID) %>% 
    summarise(across(all_of(cur_camp), mean))
  out[[i]] <- cur_df
}

-output

> out
[[1]]
# A tibble: 2 × 2
     ID ERO13
  <dbl> <dbl>
1     1     3
2     2     6

[[2]]
# A tibble: 2 × 2
     ID ERO17
  <dbl> <dbl>
1     4   4.5
2     6  12
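
As a possible variation on the loop above, the same summary can be written without indexing by position at all. This is only a sketch: it assumes every erosion column starts with "ERO" (true for ERO13 and ERO17 in the example), and `mean_erosion` is a hypothetical helper name, not something from dplyr.

```r
library(dplyr)

df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))
lst_df <- list(df1, df2)

# Hypothetical helper: per ID, average every column whose name starts with "ERO"
# (assumption: the erosion field always follows that naming pattern)
mean_erosion <- function(df) {
  df %>%
    group_by(ID) %>%
    summarise(across(starts_with("ERO"), mean))
}

# lapply() returns a list, so no output list has to be pre-allocated
out <- lapply(lst_df, mean_erosion)
```

Selecting with `starts_with()` keeps the code independent of the column position, which was the concern in the question; if other columns could also start with "ERO", a stricter selector such as `matches()` with a tighter pattern would be needed.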

akrun

@akrun Master, could you please check here if `if_all` is possible? – TarJae Jan 14 '22 at 17:51
@TarJae i guess `test %>% filter(if_all(var2:var3, ~ . == var1))` should work based on the example I tested there – akrun Jan 14 '22 at 17:54
Thanks a lot @akrun, I chose the across solution, which seems the clearest to me. Can you explain a bit about this syntax, which is totally new to me: summarise(!!cur_camp := mean(!! sym(cur_camp)))? You start with a character string and turn it into the name of a variable (a symbol), don't you? – Béranger Jan 17 '22 at 15:42
@Béranger the `=` doesn't evaluate the object on the lhs, thus if you do `cur_camp = mean(`, the column name will be `cur_camp`, whereas the assignment operator in tidyverse (`:=`) does allow for evaluation (`!!`) of object to return the value of it. Similarly on the rhs, we need to get the value of the column name string stored in cur_camp. Thus, convert to `sym`bol and evaluate (`!!`) – akrun Jan 17 '22 at 16:55
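
The difference described in the comment above can be checked on the question's own data. The following is a small sketch, not part of the original answer; the object names `literal`, `injected` and `pronoun` are made up for illustration, and the third variant uses the `.data` pronoun as an alternative way to look up a column by a string.

```r
library(dplyr)

df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
cur_camp <- "ERO13"

# Plain `=`: the left-hand side is taken literally, so the column is named "cur_camp"
literal <- df1 %>%
  group_by(ID) %>%
  summarise(cur_camp = mean(!!rlang::sym(cur_camp)))

# `:=` with `!!`: the left-hand side becomes the string stored in cur_camp
injected <- df1 %>%
  group_by(ID) %>%
  summarise(!!cur_camp := mean(!!rlang::sym(cur_camp)))

# Same result, using .data[[...]] on the right-hand side instead of sym()
pronoun <- df1 %>%
  group_by(ID) %>%
  summarise(!!cur_camp := mean(.data[[cur_camp]]))

names(literal)   # "ID" "cur_camp"
names(injected)  # "ID" "ERO13"
```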

score 0 · Answer 2 · answered Jan 14 '22 at 18:59

A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.

library(tidyverse)

df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))

bind_rows(df1, df2) %>%
  pivot_longer(starts_with('ERO'), 
               names_to = 'ERO',
               values_drop_na = TRUE) %>%
  group_by(ID, ERO) %>%
  summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups:   ID [4]
#>      ID ERO   value
#>   <dbl> <chr> <dbl>
#> 1     1 ERO13   3  
#> 2     2 ERO13   6  
#> 3     4 ERO17   4.5
#> 4     6 ERO17  12

Created on 2022-01-14 by the reprex package (v2.0.0)
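
The message printed in the output above points at the `.groups` argument. As a hedged sketch of one way to use it (the choice of `"drop"` is an assumption about what is wanted, namely an ungrouped result), the same pipeline could be written as:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))

res <- bind_rows(df1, df2) %>%
  # Move the ERO* columns into a name column (ERO) and a value column (value)
  pivot_longer(starts_with("ERO"),
               names_to = "ERO",
               values_drop_na = TRUE) %>%
  group_by(ID, ERO) %>%
  # .groups = "drop" returns an ungrouped tibble and silences the message
  summarise(value = mean(value), .groups = "drop")
```

The values are the same as in the output above; only the grouping of the result changes.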
