I've been reading through the "Programming with dplyr" vignette and trying to apply its ideas in my work. I have something that works, but I'm not sure I've done it the "right" way. Is there a more elegant or concise approach?

I have a tibble where rows are scenarios and columns relate to tests that were run in each scenario. There are two types of columns: those that store a test statistic computed in that scenario and those that store the degrees of freedom of that test.

So, here's a small, toy example of the type of data I have:

library(tidyverse)
set.seed(27599)

my_tbl <- tibble(test1_stat = rnorm(12), test1_df = rep(x = c(1, 2, 3), times = 4),
                 test2_stat = rnorm(12), test2_df = rep(x = c(1, 2, 3, 4), times = 3))

I want to compute a summary of each test based on both its stat and its df. In this example, I want the median stat for each group, where groups are defined by df. The groupings are not guaranteed to be the same across tests, nor is the number of groups.

So, here's what I've done:

get_test_median = function(df, test_name) {

  stat_col_name <- paste0(test_name, '_stat')
  df_col_name <- paste0(test_name, '_df')
  median_col_name <- paste0(test_name, '_median')

  df %>%
    dplyr::group_by(rlang::UQ(rlang::sym(df_col_name))) %>%
    dplyr::summarise(rlang::UQ(median_col_name) := median(x = rlang::UQ(rlang::sym(stat_col_name)), na.rm = TRUE))
}

my_tbl %>% get_test_median(test_name = 'test1')
my_tbl %>% get_test_median(test_name = 'test2')

This works. But is it how an experienced rlang user would do it? I am new to NSE, and a bit surprised to be using two nested rlang functions repeatedly (UQ(sym(.))).

I am happy using UQ rather than !!, just because I'm more comfortable with traditional function notation.

Based on the comments, I dropped the rlang:: prefixes, so my function now looks less verbose:

get_test_median = function(df, test_name) {

  stat_col_name <- paste0(test_name, '_stat')
  df_col_name <- paste0(test_name, '_df')
  median_col_name <- paste0(test_name, '_median')

  df %>%
    dplyr::group_by(UQ(sym(df_col_name))) %>%
    dplyr::summarise(UQ(median_col_name) := median(x = UQ(sym(stat_col_name)), na.rm = TRUE))
}
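
One possible alternative, offered as a sketch rather than a definitive answer: assuming dplyr 1.0 or later, the `.data` pronoun and `across(all_of())` accept column names as strings, which avoids the nested UQ(sym(.)) calls. The function name `get_test_median_alt` and the small `toy` data frame are hypothetical, introduced only for illustration.

```r
library(dplyr)

# Hypothetical variant of get_test_median(); assumes dplyr >= 1.0.
get_test_median_alt <- function(df, test_name) {
  stat_col_name <- paste0(test_name, "_stat")
  df_col_name <- paste0(test_name, "_df")
  median_col_name <- paste0(test_name, "_median")

  df %>%
    # all_of() selects the grouping column by its string name
    group_by(across(all_of(df_col_name))) %>%
    # .data[[...]] looks up the statistic column by its string name;
    # !! unquotes the string that names the new column
    summarise(!!median_col_name := median(.data[[stat_col_name]], na.rm = TRUE))
}

# Made-up data, only to show the call
toy <- data.frame(test1_stat = c(1, 2, 3, 10), test1_df = c(1, 1, 2, 2))
get_test_median_alt(toy, "test1")
```

On the data from the question, the call would be `my_tbl %>% get_test_median_alt("test1")`. If you prefer to keep the original pattern, `!!sym(x)` is the operator spelling of `UQ(sym(x))`.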
rcorty
  • I’m a **really big** ([!!!](https://github.com/klmr/modules)) fan of explicitly qualifying identifier names with a namespace. But even I think that in this case it would be better to attach `rlang` to make the code more readable. The package name (“rlang”) is a hint that these are things that should ideally be part of the core language. You wouldn’t use `base::for`, right? – Konrad Rudolph Jun 28 '17 at 18:26
  • I should have stated in my question that this is part of a package. Given that it's part of a package, I believe I have to use the package::function notation. Happy to be corrected on that if I've misunderstood the situation. – rcorty Jun 28 '17 at 18:45
  • Even in a package, you can skip the package::function notation if you declare the corresponding imports in your NAMESPACE file; see http://r-pkgs.had.co.nz/namespace.html#imports – Consistency Jun 28 '17 at 18:53
  • It would be great if someone could answer "yes" or "no", given my edits in response to these comments. – rcorty Jul 02 '17 at 01:01
  • I guess the question boils down to whether you are comfortable deducing the column names from a string, e.g. `get_test_median(test_name = 'test1')`. Column names can have strange edge cases, so if you are confident that you can always locate the columns belonging to the specified test, everything in your example looks good. – dmi3kno Jul 27 '17 at 21:20
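
The NAMESPACE imports mentioned in the comments could look roughly like the sketch below. The list of imported names is an assumption based only on the functions that appear in the question; with roxygen2 these directives would normally be generated from `@importFrom` tags rather than written by hand.

```r
# NAMESPACE (sketch; the set of imported names is an assumption)
importFrom(dplyr, "%>%")
importFrom(dplyr, group_by)
importFrom(dplyr, summarise)
importFrom(rlang, ":=")
importFrom(rlang, sym)
importFrom(rlang, UQ)
importFrom(stats, median)
```

With imports like these in place, the function body can call `group_by()`, `sym()`, and the rest without a package prefix.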

0 Answers