Let's say we have the following dataframe, which is the result of a previous dplyr::sumarise()
call:
df <- data.frame(x = c(1, 2, 3, 4, 5),
n = c(8, 7, 3, 3, 1))
Here df$x
is a value which we later want to group over and df$n
states how many items in the original table had the given df$x
.
However, there are currently too many groups of df$x
, such that the higher values have very low $n
's, so I want to collapse groups into "supergroups", such that each contains a total $n
of at least 4.
I've found many similar questions on SO (this being the most cited), but none quite like this.
The best solution I've found isn't very elegant, but works:
suppressPackageStartupMessages(library(dplyr))
mySplit <- function(x, n) {
a <- seq(n)
count <- 0
j <- 1
for(i in rev(x)) {
a[i] <- j
count <- count + n[i]
if (count >= 4) {
j <- j + 1
count <- 0
}
}
return(a)
}
df <- data.frame(x = c(1, 2, 3, 4, 5),
n = c(8, 7, 3, 3, 1))
df %>%
mutate(
g = mySplit(x, n)
) %>%
group_by(
desc(g)
) %>%
summarise(
x = x[1]
, n = sum(n)
)
#> # A tibble: 3 x 3
#> `desc(g)` x n
#> <dbl> <dbl> <dbl>
#> 1 -3 1 8
#> 2 -2 2 10
#> 3 -1 4 4
Created on 2020-01-13 by the reprex package (v0.3.0)
From that output, we can see that it merged $x %in% (2, 3)
and $x %in% (4, 5)
, such that no group has less than 4 items.
However, this isn't a very good solution for many reasons:
- As already stated, it's inelegant
- It's greedy. This is acceptable since I've made it go from back to front; the latter values of
$x
will almost always have the lower$n
s, so the risk of having poor grouping is reduced. - It's non-optimal. The best split would arguably have been to leave groups
$x = 1, 2, (3, 4, 5)
, which would have created supergroups of almost equal size (8, 7 and 7).
So, is there a better way to do this I'm unaware of?
This is basically a "rolling sum" function which resets the sum whenever a value is crossed and instead returns the number of "previous resets" (though an optimal solution as stated above would need something much more complex). However, I can't figure out how to actually code that.