I have a tibble like so:

tibble(a = c(1,2,3,4,5), b = c(1,1,1,2,2))

I want to randomly downsample the data by the "b" column, like so:

tibble(a = c(1,3,4,5), b = c(1,1,2,2))

How can I do this entirely in a dplyr pipeline, without changing the data type of the tibble?

picciano
Christopher Costello

1 Answer

This finds the size of the smallest group in b and randomly samples that many rows from each group, so every group ends up the same size. It's not clear whether that's what you wanted.

If your tibble is called df:

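library(dplyr)

df %>%
  group_by(b) %>%
  add_count() %>%                              # n = size of each row's group
  # .$n is the whole n column, so min(.$n) is the smallest group size
  slice(sample(row_number(), min(.$n))) %>%
  ungroup() %>%                                # return a plain, ungrouped tibble
  select(-n)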
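
On dplyr 1.0.0 or later, the same idea can probably be written with slice_sample(). This is a minimal sketch, assuming the tibble is called df as in the question; min_n and balanced are names introduced here for illustration, and set.seed() is only there to make the random draw repeatable:

```r
library(dplyr)

# Example data from the question
df <- tibble(a = c(1, 2, 3, 4, 5), b = c(1, 1, 1, 2, 2))

# Size of the smallest group in b (hypothetical helper variable)
min_n <- min(count(df, b)$n)

set.seed(42)  # assumption: fixed seed, only for reproducibility

balanced <- df %>%
  group_by(b) %>%
  slice_sample(n = min_n) %>%  # draw min_n random rows from each group
  ungroup()                    # drop the grouping; result is a regular tibble
```

Computing min_n outside the pipeline keeps the n argument a plain constant, which should work across dplyr versions. The result stays a tibble with the same columns as df.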
IceCreamToucan
  • This is much better than mine. I removed it because the OP's conditions are not very clear – akrun Apr 13 '18 at 18:40