
I want to group a data.table based on the range a column's value falls into. How can I do this with the dplyr library?

For example, my data.table is created as below:

library(data.table)
library(dplyr)
DT <- data.table(A=1:100, B=runif(100), Amount=runif(100, 0, 100))

Now I want to group DT into 20 groups at 0.05 intervals of column B and count how many rows fall in each group. For example, any row with a column B value in [0, 0.05) forms one group; any row with a column B value in [0.05, 0.1) forms another group, and so on. Is there an efficient way of doing this grouping?

Thank you very much.

----------------------------- Follow-up question on akrun's answer. Thanks akrun for your answer. I have a new question about the "cut" function. If my DT is as below:

DT <- data.table(A=1:10, B=c(0.01, 0.04, 0.06, 0.09, 0.1, 0.13, 0.14, 0.15, 0.17, 0.71)) 

by using the following code:

DT %>% 
  group_by(gr=cut(B, breaks= seq(0, 1, by = 0.05), right=F) ) %>% 
  summarise(n= n()) %>%
  arrange(as.numeric(gr))

I expect to see results like this:

          gr n
1   [0,0.05) 2
2 [0.05,0.1) 2
3 [0.1,0.15) 3
4 [0.15,0.2) 2
5 [0.7,0.75) 1

but the result I got is like this:

          gr n
1   [0,0.05) 2
2 [0.05,0.1) 2
3 [0.1,0.15) 4
4 [0.15,0.2) 1
5 [0.7,0.75) 1 

It looks like the value 0.15 is not allocated correctly. Any thoughts on this?
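A closer look at the breaks suggests a floating-point issue (the exact digits may vary by platform):

seq(0, 1, by = 0.05)[4] == 0.15             # FALSE here, although both print as 0.15
print(seq(0, 1, by = 0.05)[4], digits = 17) # 0.15000000000000002

so the 0.15 stored in column B falls just below the 0.15 break.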

  • If the initial object is a data.table, we can use data.table methods `DT[,.N ,.(gr=cut(B, breaks=seq(0, max(B), by=0.05)))]` – akrun Sep 04 '15 at 03:11
  • Fyi, nice to use `set.seed` when producing random example data, so that we're all looking at the same data. – Frank Sep 04 '15 at 03:15
  • Updated the post, check if that works for you. – akrun Sep 04 '15 at 08:15

2 Answers


We can use cut to do the grouping: create the 'gr' column within group_by, count the rows in each group with summarise (n()), and order the output with arrange based on 'gr'.

library(dplyr)
DT %>% 
    group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05))) %>% 
    summarise(n = n()) %>%
    arrange(as.numeric(gr))

As the initial object is a data.table, this can also be done using data.table methods (incorporating @Frank's suggestion to use keyby):

library(data.table)
DT[,.N , keyby = .(gr=cut(B, breaks=seq(0, 1, by=0.05)))]
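If the left-closed bins from the question ([0, 0.05), [0.05, 0.1), ...) are wanted, the same call takes right = FALSE (a minimal variant of the line above):

DT[, .N, keyby = .(gr = cut(B, breaks = seq(0, 1, by = 0.05), right = FALSE))]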

EDIT:

Based on the update in the OP's post, we can subtract a small number from the seq:

lvls <- levels(cut(DT$B, seq(0, 1, by =0.05)))
DT %>%
   group_by(gr=cut(B, breaks= seq(0, 1, by = 0.05) -
                 .Machine$double.eps, right=FALSE, labels=lvls)) %>% 
   summarise(n=n()) %>% 
   arrange(as.numeric(gr))
#          gr n
#1   (0,0.05] 2
#2 (0.05,0.1] 2
#3 (0.1,0.15] 3
#4 (0.15,0.2] 2
#5 (0.7,0.75] 1
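A related sketch, assuming the underlying values have two decimal places as in the example: rounding the breaks makes them bit-identical to the stored data, so the epsilon adjustment is not needed:

DT %>%
   group_by(gr = cut(B, breaks = round(seq(0, 1, by = 0.05), 2), right = FALSE)) %>% 
   summarise(n = n()) %>% 
   arrange(as.numeric(gr))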
  • @Frank Thanks, I modified it. – akrun Sep 04 '15 at 03:19
  • Hi akrun, I add the new question in this post as it is part of my problem. Once this is addressed I will accept the solution. Thanks. – Carter Sep 04 '15 at 07:50
  • @Carter Thanks, I thought you were asking a completely new question. – akrun Sep 04 '15 at 07:52
  • This is great. One useful value I used is `Inf` for my particular data set, just want to share with everyone. For example, `breaks=c(1,5,10,Inf)` will include from 5 and beyond. – David C. May 09 '17 at 15:57

Adding an alternative data.table solution:

I normally prefer to use round_any (from plyr) over cut:

e.g.

library(plyr)  # provides round_any
DT[, .N, keyby = round_any(B, 0.05, floor)]

This essentially rounds the data to a multiple of a number (here 0.05). The third argument says to use floor when rounding, so 0.04 is rounded down to 0 and lands in the [0, 0.05) bin rather than the [0.05, 0.1) bin. You can also set this third argument to ceiling or to round (the default).
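Under the hood, round_any(x, accuracy, f) is essentially f(x / accuracy) * accuracy, for example:

library(plyr)
round_any(0.04, 0.05, floor)    # 0.00 -> the [0, 0.05) bin
round_any(0.06, 0.05, floor)    # 0.05 -> the [0.05, 0.1) bin
round_any(0.06, 0.05, ceiling)  # 0.10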

For large tables, this solution is faster than akrun's data.table solution (with small tables they are roughly the same speed).

One thing to note is that the output of the two commands differs: the group column from cut is a range (a factor), while with round_any the group column is a single number (the floor of each bin).
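If range-style labels are preferred, they can be rebuilt from the floor values afterwards (a sketch; the label column name is arbitrary):

res <- DT[, .N, keyby = .(gr = round_any(B, 0.05, floor))]
res[, label := sprintf("[%.2f,%.2f)", gr, gr + 0.05)]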

Benchmark on a 10M row dataset:

library(bench)
library(plyr)  # round_any
DT <- data.table(A = 1:10000000, B = runif(10000000), Amount = runif(10000000, 0, 10000000))

bench::mark(
  dplyr = DT %>%
    group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05))) %>%
    summarise(n = n()) %>%
    arrange(as.numeric(gr)),
  data_table_cut = DT[, .N, keyby = .(gr = cut(B, breaks = seq(0, 1, by = 0.05)))],
  data_table_round_any = DT[, .N, keyby = round_any(B, 0.05, floor)],
  check = FALSE  # results have different structures (factor ranges vs. numeric keys), so skip the equality check
)

The output:

# A tibble: 3 × 13
  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 dplyr                654ms    654ms      1.53     445MB        0     1     0      654ms
2 data_table_cut       573ms    573ms      1.75     534MB        0     1     0      573ms
3 data_table_round_any 234ms    236ms      4.21     343MB        0     3     0      712ms

So round_any is roughly 2.5 times faster than the data.table cut solution (and about 2.8 times faster than the dplyr solution)...
