
Motivation

I am creating a custom utility function streak_over(), which is intended to grammatically mimic the dplyr verb group_by(). While streak_over() essentially wraps the grouping functionality of group_by(), this grouping is a precursor to further operations.

Within the context of a given dataset, the purpose of streak_over() is to index each "streak" of consecutive observations sharing a group, where the grouping is preestablished (via a prior group_by()) or specified in streak_over() itself (via tidy evaluation).

Here is an illustration, where the grouping variables are x and y:

      x y     group_id streak_index
  <dbl> <chr>    <int>        <int>
1     1 a            1            1
2     1 a            1            1
3     2 b            4            2
4     2 b            4            2
5     1 a            1            3
6     2 a            3            4
7     1 b            2            5
8     2 b            4            6
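For readers who want to reproduce the two rightmost columns, here is a minimal base-R sketch. It is an assumption about how such an index could be computed, not the actual body of streak_over(), and it ignores the .start and .min parameters mentioned below. The names group_id, starts_streak, and streak_index are illustrative only.

```r
# Data from the illustration above.
df <- data.frame(
  x = c(1, 1, 2, 2, 1, 2, 1, 2),
  y = c("a", "a", "b", "b", "a", "a", "b", "b")
)

# One integer per (x, y) combination; lex.order = TRUE makes x vary slowest,
# so the levels are ordered 1.a, 1.b, 2.a, 2.b.
group_id <- as.integer(interaction(df$x, df$y, lex.order = TRUE))

# A new streak starts wherever the group differs from the previous row.
starts_streak <- c(TRUE, diff(group_id) != 0)

# Counting the starts so far numbers each streak of consecutive rows.
streak_index <- cumsum(starts_streak)

print(data.frame(df, group_id, streak_index))
```

The sketch assumes at least one row; an empty data frame would need separate handling.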

Details

Aside from one small issue, I have everything working exactly as desired. Here is the general form of streak_over, which accepts in ... all the arguments for group_by(), and then returns an integer vector (like streak_index) with the streak indices. Note: .start and .min are my own parameters that define the criteria for a "streak"; aside from their existence as named parameters in the function header, they are otherwise irrelevant to my question.

streak_over <- function(..., .start = 1, .min = 2) {
  return(
    dplyr::group_by(...) # %>%
    # ... %>% Further Operations %>% ...
  )
}

Generally, the usability is ideal. Given a data.frame with the columns x and y from the illustration above

df <- data.frame(x = c(1, 1, 2, 2, 1, 2, 1, 2), y = c("a", "a", "b", "b", "a", "a", "b", "b"))

we can generate our vector of streak indices through an ergonomic workflow:

df %>% group_by(x, y) %>% streak_over(.add = TRUE)
# [1] 1 1 2 2 3 4 5 6

df %>% group_by(x) %>% streak_over(y, .add = TRUE)
# [1] 1 1 2 2 3 4 5 6

df %>% streak_over(x, y)
# [1] 1 1 2 2 3 4 5 6

We can also change the grouping as we would with successive uses of group_by():

df %>% group_by(x) %>% streak_over(y, .add = FALSE)
# [1] 1 1 2 2 3 3 4 4

df %>% group_by(x) %>% streak_over(y)  # .add = FALSE by default
# [1] 1 1 2 2 3 3 4 4

df %>% streak_over(y)
# [1] 1 1 2 2 3 3 4 4

Finally, we can generate the indices with no grouping whatsoever:

df %>% streak_over()  # no grouping given
# [1] 1 1 1 1 1 1 1 1

However, there is one default behavior I would like to change.

Problem

According to the current default (.add = FALSE) of the group_by() function it wraps, streak_over() currently overrides existing groupings when .add is not specified otherwise. While I generally do like this behavior, there is one situation where it is counterintuitive and inconvenient:

df %>% group_by(x, y) %>% streak_over()
# [1] 1 1 1 1 1 1 1 1

Here, a grouping already exists and is likely useful. Furthermore, streak_over() contains no further grouping variables, which would otherwise necessitate disambiguation via .add. It would be very convenient to simply have streak_over() preserve the existing grouping in this particular situation, while deferring to group_by() for all other default settings.

df %>% group_by(x, y) %>% streak_over()
# [1] 1 1 2 2 3 4 5 6

The behavior I desire is tabulated here:

.add      | Grouping Variables Specified in streak_over() Call | Grouping Variables Unspecified
(missing) | Current Default Behavior for group_by()            | Always Add to Existing Grouping
TRUE      | Add to Existing Grouping                           | Add to Existing Grouping
FALSE     | Override Existing Grouping                         | Override Existing Grouping
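The table can be restated as a small decision function. This is only a sketch of the desired rule: resolve_add() and its parameters has_group_vars and default_add are hypothetical names, and default_add stands in for whatever default group_by() currently uses, supplied by the caller rather than hard-coded.

```r
# Hypothetical helper: decide the effective value of '.add' per the table.
resolve_add <- function(.add, has_group_vars, default_add) {
  # Rows 'TRUE' and 'FALSE': an explicit '.add' always wins.
  if (!missing(.add)) {
    return(.add)
  }
  # Row '(missing)', right column: no new variables, so keep existing groups.
  if (!has_group_vars) {
    return(TRUE)
  }
  # Row '(missing)', left column: defer to the wrapped function's default.
  default_add
}

resolve_add(has_group_vars = FALSE, default_add = FALSE)         # keep groups
resolve_add(has_group_vars = TRUE, default_add = FALSE)          # defer
resolve_add(.add = FALSE, has_group_vars = FALSE, default_add = FALSE)
```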

I also wish this "deferral" to be dynamic: if the dplyr maintainers update dplyr::group_by() with new default settings (like .add = TRUE), I want streak_over() to follow suit in alignment with the dplyr workflow, rather than having hard-coded defaults (like .add = TRUE in the streak_over() header) that might go out of date.

Finally, for the sake of aesthetics and professionalism, I wish to keep the function header for streak_over() in roughly canonical form:

streak_over <- function(..., .start = 1, .min = 2) {
  # ...
}

# Or...

streak_over <- function(.data, ..., .add, .drop, .start = 1, .min = 2) {
  # ...
}

Attempts

I have explored many approaches, all without success. My current iteration

streak_over <- function(.data, ..., .add, .drop,
                        .start = 1, .min = 2) {
  if(rlang::is_missing(.add) && !length(list(...))) {
    .add <- TRUE

    if(!rlang::is_empty(dplyr::group_vars(.data))) {
      message("Existing groups will be kept. Discard with '.add = FALSE'.")
    }
  }
  
  
  return(
         dplyr::group_by(.data, ...,
                         .add = rlang::maybe_missing(.add),
                         .drop = rlang::maybe_missing(.drop)) # %>%
         # ... %>% Further Operations %>% ...
  )
}

has failed, along with many others, for what appear to be the following reasons:

  1. Despite my initial instincts as supported by this answer, length(list(...)) cannot effectively test for the presence of masked variables (like y from df in streak_over(y)), and match.call(expand.dots = FALSE) == match.call(expand.dots = TRUE) seems similarly unfeasible. Indeed, the former approach gives me the error below; it seems to interpret y as an object unto itself, rather than merely the symbol for a variable within .data:
df %>% group_by(x) %>% streak_over(y)
# Error in streak_over(., y) : object 'y' not found
  2. It seems that dplyr plays poorly with rlang::maybe_missing(), whose value should simulate a missing argument. Since group_by() unconditionally coerces .add into a logical, I get the following error:
streak_over <- function(.data, ..., .add, .drop,
                      .start = 1, .min = 2) {
  # Condition REMOVED to progress beyond error above.
  return(
    dplyr::group_by(.data, ...,
                    .add = rlang::maybe_missing(.add),
                    .drop = rlang::maybe_missing(.drop)) # %>%
    # ... %>% Further Operations %>% ...
  )
}
df %>% group_by(x) %>% streak_over(y)
# Error in if (.add) { : argument is not interpretable as logical 
  3. Even when I attempted to collapse the function header (streak_over(..., .start = 1, .min = 2)), and let .add "lurk in the shadows" until it is explicitly specified (streak_over(.add = TRUE,), I had to alter its value somehow: new_add <- TRUE conditionally, followed by group_by(..., .add = new_add). Unfortunately, if the user does explicitly specify .add, R will include it within ... next to .add = new_add, and the resulting clash cannot be avoided by assigning new_add <- NULL where appropriate. The result is the (admittedly predictable) error:
streak_over <- function(..., .start = 1, .min = 2) {
  # Check if '.add' is present in '...'; and if no masked variables are present therein.
  if(is.null(list(...)$.add) && all(names(list(...)) %in% c(".data", ".add", ".drop"))) {
    new_add <- TRUE
  } else {
    new_add <- NULL
  }
  
  return(
    dplyr::group_by(..., .add = new_add) # %>%
    # ... %>% Further Operations %>% ...
  )
}
df %>% group_by(x, y) %>% streak_over(.add = FALSE)
# Error in dplyr::group_by(..., .add = new_add) : 
#  formal argument ".add" matched by multiple actual arguments
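The first failure above comes from list(...) evaluating every argument before counting it. As a hedged illustration in base R only, the sketch below contrasts that with two ways of inspecting ... without evaluation: ...length() (available since R 3.5.0) and match.call(expand.dots = FALSE). The helper names count_dots(), dot_names(), and count_forced() are hypothetical, and no_such_column stands in for a masked variable like y.

```r
# Counts the arguments in '...' without evaluating any of them.
count_dots <- function(...) {
  ...length()
}

# Reads the argument names from the unevaluated call.
dot_names <- function(...) {
  dots <- match.call(expand.dots = FALSE)$...
  nms <- names(dots)
  if (is.null(nms)) rep("", length(dots)) else nms
}

# Mirrors the failing attempt: list(...) forces each argument first.
count_forced <- function(...) {
  length(list(...))
}

count_dots(no_such_column)                # works: nothing is evaluated
dot_names(no_such_column, .add = TRUE)    # shows which arguments are named

# The forced version fails because 'no_such_column' is not an object.
outcome <- tryCatch(count_forced(no_such_column), error = function(e) e)
print(class(outcome))
```

Checking ".add" %in% dot_names(...) is one way to tell whether the caller supplied .add inside ..., which is the condition the third attempt needed before adding its own .add.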

Conclusion

I feel like there has to be a way to conditionally override the default value of a formal parameter to a wrapped function, even if it is the enigmatic ... containing masked or tidily evaluated variables. However, I suspect this verges into symbol territory, an area of R with which I am decidedly inexperienced.

As always, I appreciate your consideration, along with any assistance you might be able to render.


Update

Thanks to the hint from ktiu, synthesized with further research on Stack Overflow, I have cobbled together a somewhat "hacky" solution, which seems to satisfy my initial criteria:

streak_over <- function(.data, ..., .add, .drop,
                        .start = 1, .min = 2) {
  # Store the defaults for 'group_by()', in case they are needed.
  gb_formals <- formals(dplyr::group_by)
  
  # If neither '.add' nor masked variables in '...' were supplied to
  # 'streak_over()', yet a grouping already exists in '.data', override the
  # 'group_by()' default to intuitively preserve the grouping.
  if(rlang::is_missing(.add) && !length(rlang::enquos(...)) &&
     !rlang::is_empty(dplyr::group_vars(.data))) {
    .add <- TRUE
    message("Existing groups will be kept. Discard with '.add = FALSE'.")
  }

  return(
    dplyr::group_by(.data, ...,
                    .add = rlang::maybe_missing(.add, gb_formals$.add),
                    .drop = rlang::maybe_missing(.drop, gb_formals$.drop)) # %>%
    # ... %>% Further Operations %>% ...
  )
}
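One caveat about reading defaults with formals(): it returns them unevaluated. A constant default such as FALSE comes back as a usable value, but a default written as an expression comes back as a call object that still has to be evaluated in a suitable environment. The sketch below shows this with a toy function, wrapped(), which is a hypothetical stand-in and not dplyr::group_by(). Whether this affects the .drop lookup above depends on how that default is written, which I have not verified beyond the printout shown later in this question.

```r
# Toy stand-in for a wrapped function: one constant default, one computed.
wrapped <- function(data, keep = FALSE, drop = nrow(data) > 0) {
  list(keep = keep, drop = drop)
}

fm <- formals(wrapped)

# A constant default is stored as the value itself.
print(fm$keep)

# A computed default is stored as an unevaluated call.
print(class(fm$drop))

# It must be evaluated with the names it refers to, here 'data'.
toy_data <- data.frame(a = 1)
drop_value <- eval(fm$drop, list(data = toy_data))
print(drop_value)
```

If the default for .drop is such a call, passing gb_formals$.drop straight through hands the wrapped function an expression rather than TRUE or FALSE, so omitting the argument entirely when it is missing may be the safer route.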

I would welcome any further help that:

  • Offers a more elegant solution; possibly (say) as suggested here with the default package, which seems ideally surgical if implemented safely...though I do wonder if the masking will translate properly. Unfortunately, my first attempt failed: while everything looks clean
streak_over <- function(..., .start = 1, .min = 2) {
  # Condition always TRUE here to illustrate the point.
  if(TRUE) {
    default::default(group_by) <- list(.add = TRUE)
  }
  
  # Print out default, to check if correctly updated.
  default::default(group_by)
  
  return(
    group_by(...) # %>%
    # ... %>% Further Operations %>% ...
  )
}

and the printout indicates that .add now defaults to TRUE for the local group_by(), the output still acts as if .add = FALSE:

df %>% group_by(x, y) %>% streak_over()
#   - .data = [none]
#   - ... = [none]
# * - .add = TRUE
#   - .drop = group_by_drop_default(.data)

# [1] 1 1 1 1 1 1 1 1
  • Improves the technique (tools, structure, syntax, etc.) of my existing solution; possibly (say) by using rlang::fn_fmls() instead of base::formals().
  • Stabilizes the functionality of this solution; especially to be more robust against structural changes to dplyr::group_by(), including but not limited to:
    • the alteration of existing defaults for parameters in group_by()
    • the addition of defaults for parameters in group_by() that previously had no defaults
    • the renaming of existing parameters in group_by()
    • the addition of new parameters to group_by().

In addition to answers meeting the original criteria, any answers that successfully provide this further help (while meeting the original criteria) will receive my upvote and consideration for acceptance.

Bonus

I am also curious as to which of the following streak_over() headers is more canonical:

  • streak_over(..., .start = 1, .min = 2): typical wrapper.
  • streak_over(..., .add, .start = 1, .min = 2): suggested by ktiu.
  • streak_over(.data, ..., .add, .drop, .start = 1, .min = 2): resembles dplyr.

Mind you, this function is intended to mimic the dplyr grammar, in imitation of group_by(.data, ..., .add, .drop). Yet before performing further operations, streak_over() still wraps another function: a situation in which R overwhelmingly represents all arguments passed to the wrapped function as ... in the wrapper header. Then again, for purely functional purposes, .add is the only formal parameter passed to group_by() that need explicitly exist outside ... in the streak_over() header.

Clarity on this point, as supported by authoritative references, might serve as a "tiebreaker" for acceptance of an answer.

Thanks again! — Greg

Greg
  • It would be easier to help if you provided some test cases that could be run to verify possible solutions. It's unclear exactly what command you are running when you get your errors. The code you have provided so far does not return the results shown in the question. We don't need your actual function, just a simple reproducible example that can be used to test the particular problem you are trying to solve. – MrFlick Jun 29 '21 at 05:21
  • Hi @MrFlick, and thanks for reading! The thing about this question is that it is essentially conceptual in nature: "how do I wrap `group_by()` with a function of the form `streak_over(..., .start, .min)` (or `streak_over(.data, ..., .add, .drop, .start, .min)`), such that I can override the default for `.add` when and only when `...` (which may contain **tidily evaluated** and masked variables) and `.add` are both missing, and otherwise preserve the current defaults for `group_by()`?". The particulars of what I do _after_ the grouping are irrelevant; they are merely the motivation. – Greg Jun 29 '21 at 14:05
  • So to that end, the function `streak_over <- function(..., .start = 1, .min = 2){return(dplyr::group_by(...))}` should suffice, or alternatively `streak_over <- function(.data, ..., .add, .drop, .start = 1, .min = 2){return(dplyr::group_by(...))}`. All I really need is a way to conditionally override the current defaults for `group_by()`, while otherwise preserving them, to get the grouping behavior defined in the table and illustrated in the example output. – Greg Jun 29 '21 at 14:18
  • @MrFlick I have updated my question with reproducible errors. To be clear, these are examples of errors I have already diagnosed, and their cause is hardly mysterious. However, since they concretely illustrate the conceptual roadblocks to a solution, I suppose they do serve a conceptual purpose. – Greg Jun 29 '21 at 15:18

1 Answer


Here is an approach that

  1. uses rlang::enquos() to defuse the function arguments in ...,
  2. supplies .add = TRUE in cases where
    • .add is not explicitly passed AND
    • ... contains only one "non-special-dot" argument (the name of the data, or . in a pipe)
  3. calls a custom wrapper for group_by() with those variables:
streak_over <- function(..., .start = 1, .min = 2) {
  defused <- rlang::enquos(...)
  if (any(".add" %in% names(defused),
          sum(! grepl("\\..+", names(defused))) > 1))
    custom_wrapper(...)
  else
    custom_wrapper(..., .add = TRUE)
}

custom_wrapper <- function(...) {
  # add custom logic here
  dplyr::group_by(...)
}

Note that I did not go out of my way to match all the cases you specified, but this might work as a proof of concept that you can roll into a more fully fledged solution.

Trying it out:

Example 1

library(dplyr)

df %>%
   group_by(x) %>%
   streak_over()

keeps the grouping:

# A tibble: 8 x 2
# Groups:   x [2]
      x y    
  <dbl> <chr>
1     1 a    
2     1 a    
3     2 b    
4     2 b    
5     1 a    
6     2 a    
7     1 b    
8     2 b    

Example 2

df %>%
  group_by(x) %>%
  streak_over(y)

overwrites the grouping:

# A tibble: 8 x 2
# Groups:   y [2]
      x y    
  <dbl> <chr>
1     1 a    
2     1 a    
3     2 b    
4     2 b    
5     1 a    
6     2 a    
7     1 b    
8     2 b    

Example 3

df %>%
   group_by(x) %>%
   streak_over(.add = FALSE)

deletes the grouping:

# A tibble: 8 x 2
      x y    
  <dbl> <chr>
1     1 a    
2     1 a    
3     2 b    
4     2 b    
5     1 a    
6     2 a    
7     1 b    
8     2 b    

Example 4

df %>%
  group_by(x) %>%
  streak_over(y, .add = TRUE)

adds the grouping:

# A tibble: 8 x 2
# Groups:   x, y [4]
      x y    
  <dbl> <chr>
1     1 a    
2     1 a    
3     2 b    
4     2 b    
5     1 a    
6     2 a    
7     1 b    
8     2 b    
ktiu
  • Thanks, @ktiu, I am exploring this as we speak! – Greg Jun 29 '21 at 15:40
  • Just FYI, I _do_ want to avoid hard-coding any defaults for the `group_by()` parameters when creating the function header `streak_over <- function(..., .add` **`= F`** `, .start = 1, .min = 2)`. The reason is that the dplyr maintainers might change the defaults for `group_by()` in the future, and I want `streak_over()` to automatically align with these changes. Hence, I want to achieve what `rlang::maybe_missing()` should have done, without throwing the error (#2) in my list of examples. – Greg Jun 29 '21 at 15:46
  • I updated my answer, see if the new approach satisfies your constraints. – ktiu Jun 30 '21 at 18:01
  • Thanks @ktiu, I will check it out! – Greg Jun 30 '21 at 18:04
  • Due to the regex, I think there's an edge case in your new `streak_over()`: for `streak_over(.data = df %>% group_by(x), y)`, it **fails to override existing groups**. As a named parameter, `.data` appears in `names(defused)` as `".data"` rather than `""`, which reduces by 1 the count of nonmatches to the regex `"\\..+"`. As such, when only one masking variable is specified alongside `.data`, the count of nonmatches comes out to only 1. So when `.add` is also left out, the `if` condition is `FALSE` and the `else` statement is executed, which passes `.add = TRUE` to `group_by()`. – Greg Jun 30 '21 at 19:45
  • Likewise `df %>% group_by(x) %>% streak_over_2(.data = ., y)`, all with or without explicitly naming `, .drop =` (anything) `)`. – Greg Jun 30 '21 at 19:52
  • Also, I'm wondering how this version of `streak_over()` would check if a grouping exists in the given `.data`; I do so in my hacked solution, because I don't want to confuse the user with `message("Existing groups will be kept. Discard with '.add = FALSE'.")`, unless there actually _were_ some groups to begin with. If `.data` is not a formal argument in the `streak_over()` function header, then `length(dplyr::group_vars(x = .data)) > 0` cannot check for the presence of grouping variables, as `group_vars()` cannot accept an `rlang_fake_data_pronoun` object, only a `data.frame` (or extension). – Greg Jun 30 '21 at 22:17