I've tried to find a solution to this here, but nothing seems to address exactly my case. Sorry if I missed it.
I have a data frame with each row providing a position in one of various categories (e.g., row one corresponds position 2 within "category" 'A', but row 3 corresponds to position 4 within category 'B'). Each of those categories is to be split into a different set of tiles/intervals, and I would like to find a way to assign the positions within the original data frame into their corresponding tile/interval. For instance, given the input data and category breaks that follow:
library(tidyverse)
test_df <- tribble(
~category, ~pos,
'A', 2,
'A', 5,
'B', 4,
'B', 8
)
breaks <- tribble(
~category, ~start, ~end,
'A', 0, 4,
'A', 4, 7,
'A', 7, 10,
'B', 0, 3,
'B', 3, 5,
'B', 5, 10
)
The result I would like to obtain would be something like:
category pos tile
<chr> <dbl> <chr>
1 A 2 (0, 4]
2 A 5 (4, 7]
3 B 4 (3, 5]
4 B 8 (5, 10]
I would normally use cut
for similar tasks, but, as far as I'm aware, there's no way of defining different break points per group. The only way I have found to leverage group_by
to create distinct intervals with cut
, is by fixing the number of cuts to perform (which is not applicable in this case).
The best way I can come up to address my problem is this:
bind_rows(
lapply(
X=unique(test_df$category),
FUN=function(x) {
test_df %>%
filter(category==x) %>%
mutate(tile=cut(
pos,
breaks=c(0, filter(breaks, category==x)$end, Inf)))
} ) )
which provides the expected output, but doesn't feel elegant to me (and I am not sure how it would perform with literally millions of rows on the input).
Any suggestion on how to streamline it? Any way of keeping it "piped"?
Cheers,
Fran