If I take a tibble and try to sample it, it works fine,

library(dplyr)
dft <- tibble(a = rnorm(200), b = seq.int(1, 200), c = sample(LETTERS[1:26], 200, replace = TRUE))
sample_frac(dft, .5)
# A tibble: 100 x 3
         a     b c
     <dbl> <int> <chr>
 1 -0.233     58 S
 2  0.0529    82 Y
 3  0.371     31 S
 4  0.978    136 Z
 5  0.878    106 S
 6  0.253     46 D
 7 -1.07      16 W
 8 -1.98     193 Y
 9 -0.890     51 H
10  0.151     75 A
# ... with 90 more rows

but if I group that tibble and then sample the grouped tibble, it returns an empty tibble:

dft <- dft %>% group_by(c) %>% count()
sample_frac(dft, .5)
# A tibble: 0 x 2
# Groups:   c [0]
# ... with 2 variables: c <chr>, n <int>

If I coerce the tibble to a data.frame, the sampling works. The issue was reported as a bug and closed some time ago, so I am guessing it is not something easily fixed.

What is different about tibbles and data.frames that causes issues like this one?

Kevin Mc
  • There are attributes generated by `group_by`. If you convert to a `data.frame`, those get lost – akrun Jul 09 '18 at 17:57
  • It looks like there is only one row for each value of `c`. If you use `%>% sample_frac(0.5)` on one row it will return a 0-row tibble. For example: `data.frame(a = 1) %>% sample_frac(.5)` returns a 0-row tibble. I don't think this has anything to do with differences between tibbles and data.frames – IceCreamToucan Jul 09 '18 at 18:07
  • This would work: `dft %>% count(c) %>% sample_frac(.5)`. In addition to @Ryan's comment: note that there is only one observation per group that you try to sample from. – markus Jul 09 '18 at 18:15
  • @Ryan It works if you use `sample_frac(as.data.frame(dft),.5)` – Kevin Mc Jul 10 '18 at 13:24
  • That’s because when you do `as.data.frame` you're removing the groups. – IceCreamToucan Jul 10 '18 at 13:25
  • @markus I should say, this is only an example. In the code where I ran into this issue, I was trying to plot a representative sample from a data frame of some 4 million rows, but was grouping by an index that repeated, much like the column c in the example code (though not periodically). – Kevin Mc Jul 10 '18 at 13:34
  • @Ryan That's why it has to do with the difference between a tibble and a data.frame, but what I would like to know is _why_ at a structural level this happens. What is the use for preserving the grouping information? Just so you can call ungroup()? – Kevin Mc Jul 10 '18 at 13:42
  • That's just what they decided the behavior should be. See https://github.com/tidyverse/dplyr/issues/2963 . It would be more accurate to say this is an (intentional) difference between grouped tibbles and ungrouped tibbles/data.frames, rather than a difference between tibbles and data.frames. If you remove the groups and keep it as a tibble you get the result you expect. `sample_frac` is intended to (and does) work on groups if the tibble is grouped. – IceCreamToucan Jul 10 '18 at 13:50
  • @Ryan Thanks for the link, that does clarify quite a bit. But I am curious what you mean when you say sample_frac does work on groups if the tibble is grouped. Can you give me an example? If, in the sample code I instead `%>% group_by(c,b,a) %>% sample_frac(.5)`, thereby preserving all rows from the initial tibble, I still get an empty tibble as a result. If the intention of sample_frac was to preserve the distribution of the grouped column, it should be able to give something of a representative sample (I have tried it with 26,000 rows to be sure). – Kevin Mc Jul 10 '18 at 14:20
  • How big are the groups? If the groups are one row each, an empty tibble is what should be returned. – IceCreamToucan Jul 10 '18 at 14:23
  • @Ryan Oh, I figured it out. In my last example, it was trying to pull representative samples of the continuous variables as well. If I only `group_by(c)` it works as you said. Thanks – Kevin Mc Jul 10 '18 at 14:31
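
To make the conclusion of the comments concrete, here is a minimal sketch of the three cases discussed. It assumes dplyr is attached and uses made-up data with exactly 8 rows per letter (instead of the random letters in the question) so that the row counts are predictable. The names `counts`, `empty`, `pooled` and `per_group` are hypothetical and only used for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Made-up data: 26 groups of exactly 8 rows each (208 rows in total)
dft <- tibble(a = rnorm(208), b = seq_len(208), c = rep(LETTERS, each = 8))

# One row per letter, and the result is still grouped by c
counts <- dft %>% group_by(c) %>% count()

# Sampling happens within each group; half of a one-row group rounds to 0 rows
empty <- sample_frac(counts, 0.5)

# After ungroup() the 26 rows form a single pool, so half of them are drawn
pooled <- counts %>% ungroup() %>% sample_frac(0.5)

# Grouping the original rows samples within each letter: 4 of its 8 rows
per_group <- dft %>% group_by(c) %>% sample_frac(0.5)

nrow(empty)      # 0
nrow(pooled)     # 13
nrow(per_group)  # 104
```

So the empty result seems to come from the grouping (one row per group), not from the tibble class: dropping the groups with `ungroup()` has the same effect as `as.data.frame()` here, while keeping the object a tibble.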
