Downsample within a dplyr pipeline

Question

I have a tibble like so:

tibble(a = c(1,2,3,4,5), b = c(1,1,1,2,2))

I want to randomly downsample the data by the "b" column, like so:

tibble(a = c(1,3,4,5), b = c(1,1,2,2))

How can I do this entirely in a dplyr pipeline, without changing the data type of the tibble?

IceCreamToucan · Accepted Answer · 2018-04-13T18:50:28.843

This finds the size of the smallest group (grouping by b) and samples that many rows from each group. It's not clear whether that's what you wanted.

Assuming your tibble is called df:

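df %>%
  group_by(b) %>%
  add_count() %>%                            # n = number of rows in each b group
  # `.` is the whole piped tibble (magrittr %>% only), so min(.$n) is the smallest group size
  slice(sample(row_number(), min(.$n))) %>%  # keep that many random rows per group
  ungroup() %>%                              # return a plain tibble, not a grouped one
  select(-n)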
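
On dplyr 1.0.0 or later, the same idea can probably be written with slice_sample(), which samples per group on a grouped tibble. The sketch below is not part of the original answer. It assumes the smallest group size is computed up front into a hypothetical helper variable n_min, since newer dplyr versions expect n to be a single constant, and it uses set.seed() only so the draw is repeatable.

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3, 4, 5), b = c(1, 1, 1, 2, 2))

# Size of the smallest b group (2 for this example data).
# n_min is an illustrative name, not something from the original answer.
n_min <- min(count(df, b)$n)

set.seed(42)  # only to make the random draw reproducible

downsampled <- df %>%
  group_by(b) %>%
  slice_sample(n = n_min) %>%  # draw n_min rows from every group
  ungroup()                    # back to a plain, ungrouped tibble

count(downsampled, b)  # each value of b now has n_min rows
```

Computing n_min outside the pipeline costs one extra line, but it avoids relying on how `.` is resolved inside a grouped verb, so it should also work with the native |> pipe.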

This is much better than mine. I removed it because the OP's conditions are not very clear – akrun Apr 13 '18 at 18:40
