Similar to dplyr / R cumulative sum with reset, I'd like to calculate groups and cumsums of a series ('a') with a reset based on either the cumsum exceeding some threshold OR when some other threshold between the "current" observation and the first observation since the reset is met or exceeded. It is key to have a handle on either of these conditions being met.
for example:
library(dplyr)
library(tibble)
library(purrr)
tib <- tibble(
t = c(1,2.05,3,3.5,4.5,4.75,7, 7.3),
a = c(1,1,1,1,2,9,3,1)
)
# what I want
## thresh_a = 4
## thresh_t = 9
# A tibble: 6 x 4
# t a g c
# <dbl> <dbl> <int> <dbl>
# 1 1.00 1.00 0 1.00
# 2 2.05 1.00 0 2.00
# 3 3.00 1.00 0 3.00
# 4 3.50 1.00 0 4.00
# 5 4.50 2.00 1 2.00
# 6 4.75 9.00 1 11.00
# 7 7.00 3.00 2 3.00
# 8 7.30 1.00 2 4.00
# what I want
## thresh_a = 4
## thresh_t = 2
# A tibble: 6 x 4
# t a g c
# <dbl> <dbl> <int> <dbl>
# 1 1.00 1.00 0 1.00
# 2 2.05 1.00 0 2.00
# 3 3.00 1.00 0 3.00
# 4 3.50 1.00 1 1.00
# 5 4.50 2.00 1 3.00
# 6 4.75 9.00 1 12.00
# 7 7.00 3.00 2 3.00
# 8 7.30 1.00 3 1.00
In the first case, the grouping variable ('g') changes when the cumsum over 'a' exceeds thresh_a. In the 2nd case, the reset happens at t=3 because 3-1 is >= thresh_t. The reset happens at t=4.75 because a accumulates to over thresh_a. t=7 resets due to 7-4.75 > thresh_t. In reality, 't' is a timestamp.
EDIT: the looping version. a more vectorized, tidy version is sought:
sum_cond_reset <- function(thresh_a, thresh_t, a, t) {
if (length(a) != length(t))
{
stop("length of vectors must be equal")
}
cumsum <- 0
grp <- 0
grp_idx <- 1
result_cumsum <- numeric()
result_grp <- numeric()
for (i in 1:length(a)) {
cumsum <- cumsum + a[i]
delta_t <- 0
if (i > 1) {
delta_t = t[i] - t[grp_idx]
}
if (cumsum >= thresh_a | delta_t >= thresh_t) {
result_grp <- c(result_grp, grp)
result_cumsum <- c(result_cumsum, cumsum)
grp <- grp + 1
grp_idx <- i
cumsum <- 0
} else {
result_cumsum <- c(result_cumsum, cumsum)
result_grp <- c(result_grp, grp)
}
}
return(tibble(g = result_grp,
c = result_cumsum))
}
tib <- tibble(
t = c(1,2.05,3,3.5,4.5,4.75,7, 7.3),
a = c(1,1,1,1,2,9,3,1)
)
bind_cols(tib, sum_cond_reset(4,9, tib$a, tib$t) )
bind_cols(tib, sum_cond_reset(4,2, tib$a, tib$t) )
produces
> bind_cols(tib, sum_cond_reset(4,9, tib$a, tib$t) )
# A tibble: 8 x 4
t a g c
<dbl> <dbl> <dbl> <dbl>
1 1.00 1.00 0 1.00
2 2.05 1.00 0 2.00
3 3.00 1.00 0 3.00
4 3.50 1.00 0 4.00
5 4.50 2.00 1.00 2.00
6 4.75 9.00 1.00 11.0
7 7.00 3.00 2.00 3.00
8 7.30 1.00 2.00 4.00
> bind_cols(tib, sum_cond_reset(4,2, tib$a, tib$t) )
# A tibble: 8 x 4
t a g c
<dbl> <dbl> <dbl> <dbl>
1 1.00 1.00 0 1.00
2 2.05 1.00 0 2.00
3 3.00 1.00 0 3.00
4 3.50 1.00 1.00 1.00
5 4.50 2.00 1.00 3.00
6 4.75 9.00 1.00 12.0
7 7.00 3.00 2.00 3.00
8 7.30 1.00 3.00 1.00