
I've tried to find a solution to this here, but nothing seems to address my exact case. Sorry if I missed it.

I have a data frame in which each row gives a position within one of several categories (e.g., row one corresponds to position 2 within category 'A', while row 3 corresponds to position 4 within category 'B'). Each of those categories is to be split into a different set of tiles/intervals, and I would like to find a way to assign each position in the original data frame to its corresponding tile/interval. For instance, given the input data and category breaks that follow:

library(tidyverse)

test_df <- tribble(
  ~category, ~pos,
  'A', 2,
  'A', 5,
  'B', 4,
  'B', 8
)

breaks <- tribble(
  ~category, ~start, ~end,
  'A', 0, 4,
  'A', 4, 7,
  'A', 7, 10,
  'B', 0, 3,
  'B', 3, 5,
  'B', 5, 10
)

The result I would like to obtain would be something like:

  category   pos tile   
  <chr>    <dbl> <chr>  
1 A            2 (0, 4] 
2 A            5 (4, 7] 
3 B            4 (3, 5] 
4 B            8 (5, 10]

I would normally use cut for similar tasks but, as far as I'm aware, there is no way to define different break points per group. The only way I have found to leverage group_by to create distinct intervals with cut is by fixing the number of cuts to perform, which is not applicable in this case (see the snippet below).
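
For instance, something like this runs, but it produces a fixed number of equal-width tiles per category (three, in this sketch) rather than using my predefined breaks:

test_df %>%
  group_by(category) %>%
  ## with a single number, cut() splits each group's range into that many
  ## equal-width intervals; the number of cuts is fixed, not the break points
  mutate(tile = cut(pos, breaks = 3)) %>%
  ungroup()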

The best way I have come up with to address my problem is this:

bind_rows(
  lapply(
    X = unique(test_df$category),
    FUN = function(x) {
      test_df %>%
        filter(category == x) %>%
        mutate(tile = cut(
          pos,
          breaks = c(0, filter(breaks, category == x)$end, Inf)
        ))
    }
  )
)

which provides the expected output but doesn't feel elegant to me (and I am not sure how it would perform with literally millions of rows of input).

Any suggestion on how to streamline it? Any way of keeping it "piped"?

Cheers,

Fran

  • It sounds like you want a `non-equi` join, and seeing as you are worried about performance, `data.table` sounds appropriate here: https://stackoverflow.com/questions/48256362/efficient-way-to-extract-conditional-data-by-group-from-data-table-during-non-e – user63230 May 29 '22 at 14:59
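
For reference, a minimal sketch of the `data.table` non-equi join suggested in that comment, assuming the original `test_df` and `breaks` tibbles and the half-open (start, end] tiles shown in the expected output; the `sprintf()` call is just one way to rebuild the interval labels:

library(data.table)

dt <- as.data.table(test_df)
bk <- as.data.table(breaks)

## non-equi update join: for each interval in "bk", find the rows of "dt"
## in the same category with start < pos <= end and label them accordingly
dt[bk,
   on = .(category, pos > start, pos <= end),
   tile := sprintf("(%g, %g]", i.start, i.end)]

dt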

1 Answer


One solution:

## rearrange "breaks" into one vector of break points per category:
breaks <-
  breaks %>%
  pivot_longer(cols = start:end) %>%
  distinct(category, value) %>%
  group_by(category) %>%
  summarise(breaks = list(value))

## join and cut:
test_df %>%
  left_join(breaks, by = "category") %>%
  rowwise() %>%
  mutate(tile = cut(pos, unlist(breaks))) %>%
  ungroup() ## ungrouping reduces the memory size of the result object

output:


## # A tibble: 4 x 4
##   category   pos breaks    tile  
##   <chr>    <dbl> <list>    <fct> 
## 1 A            2 <dbl [4]> (0,4] 
## 2 A            5 <dbl [4]> (4,7] 
## 3 B            4 <dbl [4]> (3,5] 
## 4 B            8 <dbl [4]> (5,10]

Edit:

A faster (about 5 times) though less readable approach with a slightly smaller result object:

library(purrr)

## boil the dataframe "breaks" (as defined in the question) down to a list of break vectors:

breaks_list <- 
  breaks %>%
  ## one tibble with columns 'start' and 'end' per 'category':
  nest_by(category) %>%
  ## dataframe into list of tibbles, named with category
  pull(data, category) %>%
  ## tibbles into vector of breaks
  map(~ .x %>% as.matrix %>% c %>% unique %>% sort) 

## get tile by indexing into "breaks_list" via mapping:
test_df %>%
  mutate(
    tile = map2(pos, category,
                ~ cut(.x, breaks_list[[.y]])) %>%
      unlist()
  )
  • This looks really nice! How efficient would it be (memory-wise, primarily) to keep a copy of all breaks (of a category) in each row, though? I assume this might blow up quite quickly – Francisco Rodríguez Algarra May 28 '22 at 12:30
  • You could drop everything but category and tile with the *.keep* argument (`mutate( ... , .keep = "unused")`) or specifically drop the *breaks* column with `... %>% select(-breaks)`. – May 28 '22 at 12:51
  • Yes, definitely. I could drop it at the very end, but that wouldn't help during the rowwise calculation, right? Anyway, I will try and see if it fits in memory :) – Francisco Rodríguez Algarra May 28 '22 at 14:09
  • See the alternative approach in the edited answer, please. I overlooked that the `object.size()` of the first approach's result can be reduced by a final ungrouping. Compared to the ungrouped result, the result returned by the second approach is a few percentage points smaller, but the calculation is about five times as fast (at the cost of readability). – May 29 '22 at 08:16
  • Thanks, I_O! I will give it a try if I can, but in the meantime I realised that changing `cut` to `findInterval` makes the computation insanely more efficient! Using your original approach I needed to request 64 GB of RAM on our HPC servers for a single input file of just a few MB, and I had to kill the job after roughly 12 hours of execution. Using `findInterval` instead of `cut`, my own approach (from the OP) completed in less than 5 minutes using less than 6 GB of RAM! – Francisco Rodríguez Algarra May 29 '22 at 09:51
  • The only difference between `findInterval` and `cut` is that the intervals/tiles are not labelled in the output, but the labels are actually really easy to reconstruct if necessary once all tiles have been assigned (see the sketch below). – Francisco Rodríguez Algarra May 29 '22 at 09:53
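
For reference, a minimal sketch of that `findInterval` variant, applied to the approach from the question and assuming the original `breaks` tibble; the `brks` and `idx` helpers are illustrative names, `left.open = TRUE` reproduces the half-open (start, end] tiles, and `sprintf()` rebuilds the labels afterwards:

bind_rows(
  lapply(
    unique(test_df$category),
    function(x) {
      ## sorted vector of unique break points for this category
      brks <- breaks %>%
        filter(category == x) %>%
        select(start, end) %>%
        unlist() %>%
        unique() %>%
        sort()
      test_df %>%
        filter(category == x) %>%
        mutate(
          ## index of the (start, end] interval containing each position
          idx  = findInterval(pos, brks, left.open = TRUE),
          tile = sprintf("(%g, %g]", brks[idx], brks[idx + 1])
        ) %>%
        select(-idx)
    }
  )
)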