
I've tried to find a solution to this here, but nothing seems to address my exact case. Sorry if I missed it.

I have a data frame in which each row gives a position within one of several categories (e.g., row one corresponds to position 2 within category 'A', while row 3 corresponds to position 4 within category 'B'). Each of those categories is to be split into a different set of tiles/intervals, and I would like to find a way to assign each position in the original data frame to its corresponding tile/interval. For instance, given the input data and category breaks that follow:

library(tidyverse)

test_df <- tribble(
  ~category, ~pos,
  'A', 2,
  'A', 5,
  'B', 4,
  'B', 8
)

breaks <- tribble(
  ~category, ~start, ~end,
  'A', 0, 4,
  'A', 4, 7,
  'A', 7, 10,
  'B', 0, 3,
  'B', 3, 5,
  'B', 5, 10
)

The result I would like to obtain would be something like:

  category   pos tile   
  <chr>    <dbl> <chr>  
1 A            2 (0, 4] 
2 A            5 (4, 7] 
3 B            4 (3, 5] 
4 B            8 (5, 10]

I would normally use cut for similar tasks but, as far as I'm aware, there is no way to define different break points per group. The only way I have found to leverage group_by to create distinct intervals with cut is by fixing the number of cuts to perform, which is not applicable in this case (see the snippet below).
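
For instance, something like this runs, but it produces a fixed number of equal-width tiles per category (three, in this sketch) rather than using my predefined breaks:

test_df %>%
  group_by(category) %>%
  ## with a single number, cut() splits each group's range into that many
  ## equal-width intervals; the number of cuts is fixed, not the break points
  mutate(tile = cut(pos, breaks = 3)) %>%
  ungroup()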

The best way I have come up with to address my problem is this:

bind_rows(
  lapply(
    X = unique(test_df$category),
    FUN = function(x) {
      test_df %>%
        filter(category == x) %>%
        mutate(tile = cut(
          pos,
          breaks = c(0, filter(breaks, category == x)$end, Inf)
        ))
    }
  )
)

which provides the expected output but doesn't feel elegant to me (and I am not sure how it would perform with literally millions of rows of input).

Any suggestion on how to streamline it? Any way of keeping it "piped"?

Cheers,

Fran

  • It sounds like you want a `non-equi` join, and seeing as you are worried about performance, `data.table` sounds appropriate here: https://stackoverflow.com/questions/48256362/efficient-way-to-extract-conditional-data-by-group-from-data-table-during-non-e – user63230 May 29 '22 at 14:59
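
For reference, a minimal sketch of the `data.table` non-equi join suggested in that comment, assuming the original `test_df` and `breaks` tibbles and the half-open (start, end] tiles shown in the expected output; the `sprintf()` call is just one way to rebuild the interval labels:

library(data.table)

dt <- as.data.table(test_df)
bk <- as.data.table(breaks)

## non-equi update join: for each interval in "bk", find the rows of "dt"
## in the same category with start < pos <= end and label them accordingly
dt[bk,
   on = .(category, pos > start, pos <= end),
   tile := sprintf("(%g, %g]", i.start, i.end)]

dt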

1 Answer


One solution:

## rearrange "breaks" into one vector of break points per category:
breaks <-
  breaks %>%
  pivot_longer(cols = start:end) %>%
  distinct(category, value) %>%
  group_by(category) %>%
  summarise(breaks = list(value))

## join and cut:
test_df %>%
  left_join(breaks, by = "category") %>%
  rowwise() %>%
  mutate(tile = cut(pos, unlist(breaks))) %>%
  ungroup() ## ungrouping reduces the memory size of the result object

output:


## # A tibble: 4 x 4
##   category   pos breaks    tile  
##   <chr>    <dbl> <list>    <fct> 
## 1 A            2 <dbl [4]> (0,4] 
## 2 A            5 <dbl [4]> (4,7] 
## 3 B            4 <dbl [4]> (3,5] 
## 4 B            8 <dbl [4]> (5,10]

Edit:

A faster (about 5 times) though less readable approach with a slightly smaller result object:

library(purrr)

## boil the dataframe "breaks" (as defined in the question) down to a list of break vectors:

breaks_list <- 
  breaks %>%
  ## one tibble with columns 'start' and 'end' per 'category':
  nest_by(category) %>%
  ## dataframe into list of tibbles, named with category
  pull(data, category) %>%
  ## tibbles into vector of breaks
  map(~ .x %>% as.matrix %>% c %>% unique %>% sort) 

## get tile by indexing into "breaks_list" via mapping:
test_df %>%
  mutate(
    tile = map2(pos, category,
                ~ cut(.x, breaks_list[[.y]])) %>%
      unlist()
  )
  • This looks really nice! How efficient would it be (memory-wise, primarily) to keep a copy of all breaks (of a category) in each row, though? I assume this might blow up quite quickly – Francisco Rodríguez Algarra May 28 '22 at 12:30
  • You could drop everything but category and tile with the *.keep* argument (`mutate( ... , .keep = "unused")`) or specifically drop the *breaks* column with `... %>% select(-breaks)`. – May 28 '22 at 12:51
  • Yes, definitely. I could drop it at the very end, but that wouldn't help during the rowwise calculation, right? Anyway, I will try and see if it fits in memory :) – Francisco Rodríguez Algarra May 28 '22 at 14:09
  • See the alternative approach in the edited answer, please. I overlooked that the `object.size()` of the first approach's result can be reduced by a final ungrouping. Compared to the ungrouped result, the result returned by the second approach is a few percentage points smaller, but the calculation is about five times as fast (at the cost of readability). – May 29 '22 at 08:16
  • Thanks, I_O! I will give it a try if I can, but in the meantime I realised that changing `cut` to `findInterval` makes the computation insanely more efficient! Using your original approach I needed to request 64 GB of RAM on our HPC servers for a single input file of just a few MB, and I had to kill the job after roughly 12 hours of execution. Using `findInterval` instead of `cut`, my own approach (from the OP) completed in less than 5 minutes using less than 6 GB of RAM! – Francisco Rodríguez Algarra May 29 '22 at 09:51
  • The only difference between `findInterval` and `cut` is that the intervals/tiles are not labelled in the output, but the labels are actually really easy to reconstruct if necessary once all tiles have been assigned (see the sketch below). – Francisco Rodríguez Algarra May 29 '22 at 09:53
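
For reference, a minimal sketch of that `findInterval` variant, applied to the approach from the question and assuming the original `breaks` tibble; the `brks` and `idx` helpers are illustrative names, `left.open = TRUE` reproduces the half-open (start, end] tiles, and `sprintf()` rebuilds the labels afterwards:

bind_rows(
  lapply(
    unique(test_df$category),
    function(x) {
      ## sorted vector of unique break points for this category
      brks <- breaks %>%
        filter(category == x) %>%
        select(start, end) %>%
        unlist() %>%
        unique() %>%
        sort()
      test_df %>%
        filter(category == x) %>%
        mutate(
          ## index of the (start, end] interval containing each position
          idx  = findInterval(pos, brks, left.open = TRUE),
          tile = sprintf("(%g, %g]", brks[idx], brks[idx + 1])
        ) %>%
        select(-idx)
    }
  )
)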