Motivation
I am creating a custom utility function streak_over(), which is intended to grammatically mimic the dplyr verb group_by(). While streak_over() essentially wraps the grouping functionality of group_by(), this grouping is a precursor to further operations.
Within the context of a given dataset, the purpose of streak_over() is to index each "streak" of consecutive observations sharing a group, where the grouping is preestablished (via a prior group_by()) or specified in streak_over() itself (via tidy evaluation).
Here is an illustration, where the grouping variables are x and y:
      x y     group_id streak_index
  <dbl> <chr>    <int>        <int>
1     1 a            1            1
2     1 a            1            1
3     2 b            4            2
4     2 b            4            2
5     1 a            1            3
6     2 a            3            4
7     1 b            2            5
8     2 b            4            6
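For concreteness, the relationship between group_id and streak_index in the illustration can be sketched in base R. This is only an assumed way of deriving the index (a run-length pass over the group identifiers), not necessarily what the elided operations inside streak_over() do, and the helper name index_streaks() is hypothetical.

```r
# Hypothetical helper: number each run of consecutive, identical group ids.
index_streaks <- function(group_id) {
  runs <- rle(group_id)                       # lengths of consecutive runs
  rep(seq_along(runs$lengths), runs$lengths)  # one index per run, repeated
}

# Group ids copied from the illustration above.
group_id <- c(1L, 1L, 4L, 4L, 1L, 3L, 2L, 4L)
streak_index <- index_streaks(group_id)
print(streak_index)
# [1] 1 1 2 2 3 4 5 6
```

With a single group throughout (every id equal), the same pass yields one streak, which matches the all-ones output shown for the ungrouped case.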
Details
Aside from one small issue, I have everything working exactly as desired. Here is the general form of streak_over, which accepts in ... all the arguments for group_by(), and then returns an integer vector (like streak_index) with the streak indices. Note: .start and .min are my own parameters that define the criteria for a "streak"; aside from their existence as named parameters in the function header, they are otherwise irrelevant to my question.
streak_over <- function(..., .start = 1, .min = 2) {
  return(
    dplyr::group_by(...) # %>%
    # ... %>% Further Operations %>% ...
  )
}
Generally, the usability is ideal. Given a data.frame like the x and y columns from the above illustration
df <- data.frame(x = c(1, 1, 2, 2, 1, 2, 1, 2), y = c("a", "a", "b", "b", "a", "a", "b", "b"))
we can generate our vector of streak indices through an ergonomic workflow:
df %>% group_by(x, y) %>% streak_over(.add = TRUE)
# [1] 1 1 2 2 3 4 5 6
df %>% group_by(x) %>% streak_over(y, .add = TRUE)
# [1] 1 1 2 2 3 4 5 6
df %>% streak_over(x, y)
# [1] 1 1 2 2 3 4 5 6
We can also change the grouping as we would with successive uses of group_by():
df %>% group_by(x) %>% streak_over(y, .add = FALSE)
# [1] 1 1 2 2 3 3 4 4
df %>% group_by(x) %>% streak_over(y) # .add = FALSE by default
# [1] 1 1 2 2 3 3 4 4
df %>% streak_over(y)
# [1] 1 1 2 2 3 3 4 4
Finally, we can generate the indices with no grouping whatsoever:
df %>% streak_over() # no grouping given
# [1] 1 1 1 1 1 1 1 1
However, there is one default behavior I would like to change.
Problem
According to the current default (.add = FALSE) of the group_by() function it wraps, streak_over() currently overrides existing groupings when .add is not specified otherwise. While I generally do like this behavior, there is one situation where it is counterintuitive and inconvenient:
df %>% group_by(x, y) %>% streak_over()
# [1] 1 1 1 1 1 1 1 1
Here, a grouping already exists and is likely useful. Furthermore, streak_over() contains no further grouping variables, which would otherwise necessitate disambiguation via .add. It would be very convenient to simply have streak_over() preserve the existing grouping in this particular situation, while deferring to group_by() for all other default settings.
df %>% group_by(x, y) %>% streak_over()
# [1] 1 1 2 2 3 4 5 6
The behavior I desire is tabulated here:
.add | Grouping Variables Specified in streak_over() Call | Grouping Variables Unspecified |
---|---|---|
(missing) | Current Default Behavior for group_by() | Always Add to Existing Grouping |
TRUE | Add to Existing Grouping | Add to Existing Grouping |
FALSE | Override Existing Grouping | Override Existing Grouping |
I also wish this "deferral" to be dynamic: if the dplyr team updates dplyr::group_by() with new default settings (like group_by(.add = TRUE)), I want streak_over() to follow suit in alignment with the dplyr workflow, rather than having hard-coded defaults (like streak_over(.add = TRUE)) that might go out of date.
Finally, for the sake of aesthetics and professionalism, I wish to keep the function header for streak_over() in roughly canonical form:
streak_over <- function(..., .start = 1, .min = 2) {
  # ...
}
# Or...
streak_over <- function(.data, ..., .add, .drop, .start = 1, .min = 2) {
  # ...
}
Attempts
I have explored many approaches, all without success. My current iteration
streak_over <- function(.data, ..., .add, .drop,
                        .start = 1, .min = 2) {
  if(rlang::is_missing(.add) && !length(list(...))) {
    .add <- TRUE
    if(!rlang::is_empty(dplyr::group_vars(.data))) {
      message("Existing groups will be kept. Discard with '.add = FALSE'.")
    }
  }
  return(
    dplyr::group_by(.data, ...,
                    .add = rlang::maybe_missing(.add),
                    .drop = rlang::maybe_missing(.drop)) # %>%
    # ... %>% Further Operations %>% ...
  )
}
has failed, along with many others, for what appear to be the following reasons:
- Despite my initial instincts as supported by this answer, length(list(...)) cannot effectively test for the presence of masked variables (like y from df in streak_over(y)), and match.call(expand.dots = FALSE) == match.call(expand.dots = TRUE) seems similarly unfeasible. Indeed, the former approach gives me the error below; it seems to interpret y as an object unto itself, rather than merely the symbol for a variable within .data:
df %>% group_by(x) %>% streak_over(y)
# Error in streak_over(., y) : object 'y' not found
- It seems that dplyr plays poorly with rlang::maybe_missing(), whose value should simulate a missing argument. Since group_by() unconditionally coerces .add into a logical, I get the following error:
streak_over <- function(.data, ..., .add, .drop,
                        .start = 1, .min = 2) {
  # Condition REMOVED to progress beyond error above.
  return(
    dplyr::group_by(.data, ...,
                    .add = rlang::maybe_missing(.add),
                    .drop = rlang::maybe_missing(.drop)) # %>%
    # ... %>% Further Operations %>% ...
  )
}
df %>% group_by(x) %>% streak_over(y)
# Error in if (.add) { : argument is not interpretable as logical
- Even when I attempted to collapse the function header (streak_over(..., .start = 1, .min = 2)), and let .add "lurk in the shadows" until it is explicitly specified (streak_over(.add = TRUE)), I had to alter its value somehow: new_add <- TRUE conditionally, followed by group_by(..., .add = new_add). Unfortunately, if the user does explicitly specify .add, R will include it within ... next to .add = new_add, and the resulting clash cannot be avoided by assigning new_add <- NULL where appropriate. The result is the (admittedly predictable) error:
streak_over <- function(..., .start = 1, .min = 2) {
  # Check if '.add' is present in '...'; and if no masked variables are present therein.
  if(is.null(list(...)$.add) && all(names(list(...)) %in% c(".data", ".add", ".drop"))) {
    new_add <- TRUE
  } else {
    new_add <- NULL
  }
  return(
    dplyr::group_by(..., .add = new_add) # %>%
    # ... %>% Further Operations %>% ...
  )
}
df %>% group_by(x, y) %>% streak_over(.add = FALSE)
# Error in dplyr::group_by(..., .add = new_add) :
# formal argument ".add" matched by multiple actual arguments
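The first failure above comes from list(...) forcing every argument in the dots, so a bare column name such as y is looked up as an ordinary object. A minimal base R sketch of the difference follows; the function names are hypothetical, and ...length() (available in R 3.5.0 and later) stands in for any approach that counts the dots without evaluating them.

```r
# Hypothetical: counts the dots by evaluating them, as length(list(...)) does.
count_dots_eager <- function(...) length(list(...))

# Hypothetical: counts the dots without evaluating any of them.
count_dots_lazy <- function(...) ...length()

# 'undefined_column' is deliberately not defined anywhere.
lazy_n <- count_dots_lazy(undefined_column)
eager_result <- tryCatch(
  count_dots_eager(undefined_column),
  error = function(e) conditionMessage(e)
)

print(lazy_n)        # 1
print(eager_result)  # the "object not found" error message, captured as text
```

Note that either count also includes any named arguments swallowed by the dots, so a header that names .add explicitly keeps it out of the count.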
Conclusion
I feel like there has to be a way to conditionally override the default value of a formal parameter to a wrapped function, even if it is the enigmatic ... containing masked or tidily evaluated variables. However, I suspect this verges into symbol territory, an area of R with which I am decidedly inexperienced.
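To make the goal concrete, here is a minimal base R sketch of the pattern I am after, using toy stand-ins rather than dplyr. Everything in it is an assumption for illustration: wrapped() plays the role of group_by(), wrapper() plays the role of streak_over(), and .has_groups is a hypothetical flag standing in for a check of the existing grouping.

```r
# Toy stand-in for the wrapped function; its default may change over time.
wrapped <- function(x, .add = FALSE) {
  if (.add) "added" else "overridden"
}

# Hypothetical wrapper: '.add' is forwarded only when it was supplied, or in the
# single case where the default is to be overridden. Otherwise the call omits
# '.add' entirely, so the default of wrapped() applies.
# The dots are only counted here, never evaluated or forwarded.
wrapper <- function(x, ..., .add, .has_groups = FALSE) {
  add_supplied <- !missing(.add)
  keep_groups <- !add_supplied && ...length() == 0L && .has_groups
  if (add_supplied) {
    wrapped(x, .add = .add)
  } else if (keep_groups) {
    wrapped(x, .add = TRUE)
  } else {
    wrapped(x)  # defer to the wrapped function's own default
  }
}

print(wrapper(1))                                    # "overridden"
print(wrapper(1, .has_groups = TRUE))                # "added"
print(wrapper(1, .add = FALSE, .has_groups = TRUE))  # "overridden"
```

Because the last branch never mentions .add, a future change to the default in wrapped() would carry through without touching wrapper().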
As always, I appreciate your consideration, along with any assistance you might be able to render.
Update
Thanks to the hint from ktiu, synthesized with further research on Stack Overflow, I have cobbled together a somewhat "hacky" solution, which seems to satisfy my initial criteria:
streak_over <- function(.data, ..., .add, .drop,
                        .start = 1, .min = 2) {
  # Store the defaults for 'group_by()', in case they are needed.
  gb_formals <- formals(dplyr::group_by)
  # If neither '.add' nor masked variables in '...' were supplied to
  # 'streak_over()', yet a grouping already exists in '.data', override the
  # 'group_by()' default to intuitively preserve the grouping.
  if(rlang::is_missing(.add) && !length(rlang::enquos(...)) &&
     !rlang::is_empty(dplyr::group_vars(.data))) {
    .add <- TRUE
    message("Existing groups will be kept. Discard with '.add = FALSE'.")
  }
  return(
    dplyr::group_by(.data, ...,
                    .add = rlang::maybe_missing(.add, gb_formals$.add),
                    .drop = rlang::maybe_missing(.drop, gb_formals$.drop)) # %>%
    # ... %>% Further Operations %>% ...
  )
}
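One caveat I am unsure about in this approach: formals() stores each default as an unevaluated expression, so a default that is itself a call comes back as a call object rather than a value, and may need evaluating before it is passed along. The sketch below shows this with a toy function, not with dplyr itself; toy_group_by() and its defaults are assumptions for illustration only.

```r
# Toy stand-in whose '.drop' default is a call that depends on '.data'.
toy_group_by <- function(.data, ..., .add = FALSE, .drop = nrow(.data) > 0) NULL

toy_formals <- formals(toy_group_by)

add_default <- toy_formals$.add    # a plain logical constant
drop_default <- toy_formals$.drop  # an unevaluated call

print(is.logical(add_default))  # TRUE
print(is.call(drop_default))    # TRUE

# Evaluating the stored call requires a '.data' to be in scope.
drop_value <- eval(drop_default, list(.data = data.frame(x = 1:3)))
print(drop_value)               # TRUE
```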
I would welcome any further help that:
- Offers a more elegant solution; possibly (say) as suggested here with the default package, which seems ideally surgical if implemented safely...though I do wonder if the masking will translate properly. Unfortunately, my first attempt failed: while everything looks clean
streak_over <- function(..., .start = 1, .min = 2) {
  # Condition always TRUE here to illustrate the point.
  if(TRUE) {
    default::default(group_by) <- list(.add = TRUE)
  }
  # Print out default, to check if correctly updated.
  default::default(group_by)
  return(
    group_by(...) # %>%
    # ... %>% Further Operations %>% ...
  )
}
and the printout indicates that .add now defaults to TRUE for the local group_by(), the output still acts as if .add = FALSE:
df %>% group_by(x, y) %>% streak_over()
# - .data = [none]
# - ... = [none]
# * - .add = TRUE
# - .drop = group_by_drop_default(.data)
# [1] 1 1 1 1 1 1 1 1
- Improves the technique (tools, structure, syntax, etc.) of my existing solution; possibly (say) by using rlang::fn_fmls() instead of base::formals().
- Stabilizes the functionality of this solution; especially to be more robust against structural changes to dplyr::group_by(), including but not limited to:
  - the alteration of existing defaults for parameters in group_by()
  - the addition of defaults for parameters in group_by() that previously had no defaults
  - the renaming of existing parameters in group_by()
  - the addition of new parameters to group_by().
In addition to answers meeting the original criteria, any answers that successfully provide this further help (while meeting the original criteria) will receive my upvote and consideration for acceptance.
Bonus
I am also curious as to which of the following streak_over() headers is more canonical:
- streak_over(..., .start = 1, .min = 2): typical wrapper.
- streak_over(..., .add, .start = 1, .min = 2): suggested by ktiu.
- streak_over(.data, ..., .add, .drop, .start = 1, .min = 2): resembles dplyr.
Mind you, this function is intended to mimic the dplyr grammar, in imitation of group_by(.data, ..., .add, .drop). Yet before performing further operations, streak_over() still wraps another function: a situation in which R overwhelmingly represents all arguments passed to the wrapped function as ... in the wrapper header. Then again, for purely functional purposes, .add is the only formal parameter passed to group_by() that need explicitly exist outside ... in the streak_over() header.
Clarity on this point, as supported by authoritative references, might serve as a "tiebreaker" for acceptance of an answer.
Thanks again! — Greg