I'm a longtime developer but somewhat new to the R language. I'm trying to write clean, maintainable code. I know how to do this in several other languages, but not in R.

I've got an R function that performs the same action for different fields.

library(dplyr)

# working code but not DRY
summarize_dataset_v2 <- function ( data, dataset ) {
    switch ( dataset,
             hp = {
                 data %>%
                     group_by ( cyl ) %>%
                     summarize ( hp  = mean ( hp ),
                                 num = n ( ) ) ->
                     summarized_composite
             },
             wt = {
                 data %>%
                     group_by ( cyl ) %>%
                     summarize ( wt  = mean ( wt ),
                                 num = n ( ) ) ->
                     summarized_composite
             },
             stop ( "BAD THINGS" ) )

    return ( summarized_composite )
}

The actual code has 6-8 variants with more logic. It works, but because it is not DRY it is a bug waiting to happen.

Conceptually what I want looks something like this:

    switch ( dataset,
             hp = { field_name = "hp" },
             wt = { field_name = "wt" },
             stop ( "BAD THINGS" ) )

    data %>%
        group_by ( cyl ) %>%
        summarize ( *field_name = mean( *field_name ),
            num = n( )
        ) ->
        summarized_composite

    return( summarized_composite )
}

The *field_name construct is just there to illustrate that I'd like to parameterize that common code. Maybe currying that summarize statement would work. I'm using the tidyverse stuff but I'm open to using another package to accomplish this.

Edit #1: Thanks for the answers (https://stackoverflow.com/users/12993861/stefan, https://stackoverflow.com/users/12256545/user12256545)

I've applied both answers to my example code and understand (I think) how they work. The one from stefan matches my experience in other languages. The one from user12256545 comes from a different POV and shifts focus to the caller, giving it more power. I haven't done a lot of formula-based code so this is a chance to explore that facet.

I'm going to apply both approaches to my actual problem to see how they feel. I'll respond with the results in a few days.

Thank you both.

Edit #2: When I applied these two approaches to my actual code, I found that stefan's matched my mental model of how this should work, so I accepted it as the answer.

Thanks!

  • You might find the ["Programming with dplyr"](https://dplyr.tidyverse.org/articles/programming.html) vignette useful (including the section on indirection). For a deeper dive, see the [rlang](https://rlang.r-lib.org/) tidy evaluation and metaprogramming vignettes, or the metaprogramming chapters of [*Advanced R*](https://adv-r.hadley.nz/). – zephryl Feb 21 '23 at 20:39

2 Answers

One way to get rid of the duplicated code could look like this. First, switch is not necessary: you can use the .data pronoun to refer to a column by a name held in a string. Additionally, I use glue syntax together with the walrus operator := so that the "mean" column is named after the column name passed as an argument:

library(dplyr)

summarize_dataset_v2 <- function(data, dataset) {
  if (!dataset %in% c("hp", "wt")) stop("BAD THINGS")

  data %>%
    group_by(cyl) %>%
    summarize(
      "{dataset}" := mean(.data[[dataset]]),
      num = n()
    )
}

summarize_dataset_v2(mtcars, "hp")
#> # A tibble: 3 × 3
#>     cyl    hp   num
#>   <dbl> <dbl> <int>
#> 1     4  82.6    11
#> 2     6 122.      7
#> 3     8 209.     14

summarize_dataset_v2(mtcars, "wt")
#> # A tibble: 3 × 3
#>     cyl    wt   num
#>   <dbl> <dbl> <int>
#> 1     4  2.29    11
#> 2     6  3.12     7
#> 3     8  4.00    14

summarize_dataset_v2(mtcars, "disp")
#> Error in summarize_dataset_v2(mtcars, "disp"): BAD THINGS
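
Since the real code has 6-8 variants, the same idea can probably be stretched to several columns in one call. Here is a minimal sketch using across() with all_of(); note that summarize_fields() and its `allowed` argument are hypothetical names made up for illustration, not part of the code above:

```r
library(dplyr)

# Hypothetical helper (not from the original answer): summarizes one or
# more columns whose names are passed as a character vector.
# `allowed` is an assumed whitelist of valid column names.
summarize_fields <- function(data, fields, allowed = c("hp", "wt")) {
  bad <- setdiff(fields, allowed)
  if (length(bad) > 0) stop("BAD THINGS: ", paste(bad, collapse = ", "))

  data %>%
    group_by(cyl) %>%
    summarize(
      across(all_of(fields), mean),  # each mean keeps its column's name
      num = n()
    )
}

res <- summarize_fields(mtcars, c("hp", "wt"))
res
```

With mtcars this should give one row per cyl with the columns cyl, hp, wt and num. Whether a single multi-column call suits the real code depends on how much the 6-8 variants differ beyond the field name.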
stefan

Using formula syntax with aggregate could be an elegant solution in this case:


summarize_dataset <- function(form,data) {
  aggregate(
    form,data,
    FUN=\(x) { setNames(cbind(mean(x),NROW(x)),c("mean","N")) }
  )
}
# simple example:
summarize_dataset(formula(hp~cyl),mtcars)

#>  cyl   hp.mean      hp.N
#> 1   4  82.63636  11.00000
#> 2   6 122.28571   7.00000
#> 3   8 209.21429  14.00000

# more complex selection, group by two factors and three dependent vars:
summarize_dataset(formula(cbind(hp,mpg,disp)~cyl+am),mtcars)

#>   cyl am   hp.mean      hp.N mpg.mean    mpg.N disp.mean   disp.N
#> 1   4  0  84.66667   3.00000 22.90000  3.00000  135.8667   3.0000
#> 2   6  0 115.25000   4.00000 19.12500  4.00000  204.5500   4.0000
#> 3   8  0 194.16667  12.00000 15.05000 12.00000  357.6167  12.0000
#> 4   4  1  81.87500   8.00000 28.07500  8.00000   93.6125   8.0000
#> 5   6  1 131.66667   3.00000 20.56667  3.00000  155.0000   3.0000
#> 6   8  1 299.50000   2.00000 15.40000  2.00000  326.0000   2.0000

# all columns
summarize_dataset(formula(.~ cyl),mtcars)

# iris example
summarize_dataset(formula(Sepal.Length~Species),iris)
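
If the caller only has the column name as a string, as in the question, the formula could be built with base R's reformulate(). This is a rough sketch under that assumption; summarize_by_name() and its `group` argument are invented for illustration, and the summary function returns a named vector rather than the setNames(cbind(...)) form used above:

```r
# Hypothetical wrapper (not from the original answer): builds the formula
# from a column name given as a string, then delegates to aggregate().
summarize_by_name <- function(data, dataset, group = "cyl") {
  if (!dataset %in% c("hp", "wt")) stop("BAD THINGS")

  form <- reformulate(group, response = dataset)  # e.g. hp ~ cyl
  aggregate(form, data,
            FUN = function(x) c(mean = mean(x), N = length(x)))
}

res_hp <- summarize_by_name(mtcars, "hp")
res_hp
```

Note that aggregate() stores the two statistics as a matrix column, so they are reached with res_hp$hp[, "mean"] and res_hp$hp[, "N"] rather than as separate data frame columns.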
user12256545