How to create bin frequency table where bin size varies by group

Question

I'm trying to create a bin frequency table, where there are multiple grouping columns, but, more importantly, bin size varies by one of the grouping columns. Let me illustrate:

set.seed(42)
ID <- as.factor(c(rep("A",20),rep("B",22)))
date <- as.factor(c(rep("C",12),rep("D",8),rep("E",10),rep("F",12)))
group <- as.factor(c(rep("G",6),rep("H",6),rep("G",8),rep("G",6),rep("H",4),rep("G",6),rep("H",6)))
val <- round(rnorm(42,20,10),0)

df <- data.frame(ID,date,group,val)

A frequency table for val by ID, date and group can be generated, using some code I have edited from this post:

br <- c(0,10,30,100)
frqtab <- aggregate(val~ID+date+group,df,FUN=function(x) table(cut(x, br)))

However, I would like to have different bin sizes for each factor within group, e.g. group G could remain with brG <- c(0,10,30,100) and group H could be brH <- c(0,10,50,100). I guess, I could write some ifelse function, but that would be very messy, particularly because my real data has many groups. Any help will be much appreciated!

I need three sizes per bin group, e.g. 0-10, 10-20 and >20. This then needs to vary for each factor within group. — M.Teich, Aug 10 '18 at 11:48

AntoniosK · Accepted Answer · 2018-08-10T12:55:46.457

Here is a possible solution:

# example data
set.seed(42)
ID <- as.factor(c(rep("A",20),rep("B",22)))
date <- as.factor(c(rep("C",12),rep("D",8),rep("E",10),rep("F",12)))
group <- as.factor(c(rep("G",6),rep("H",6),rep("G",8),rep("G",6),rep("H",4),rep("G",6),rep("H",6)))
val <- round(rnorm(42,20,10),0)

df <- data.frame(ID,date,group,val)

# using the function you provided
f = function(br, df) {aggregate(val~ID+date+group,df,FUN=function(x) table(cut(x, br)))}

library(tidyverse)

# create a look up table
# (specify the breaks for each group)
look_up = data_frame(group_id = c("G","H"),
                     br = list(c(0,10,30,100), c(0,10,50,100)))

df_upd = df %>%
  group_by(group_id = group) %>%          # duplicate group column and group by it
  nest() %>%                              # nest data
  left_join(look_up, by="group_id") %>%   # join look up table to get corresponding breaks
  mutate(d = map2(br, data, ~f(.x, .y)))  # apply function

# see results
df_upd$d

# [[1]]
#   ID date group val.(0,10] val.(10,30] val.(30,100]
# 1  A    C     G          0           5            1
# 2  A    D     G          1           4            1
# 3  B    E     G          1           3            2
# 4  B    F     G          1           5            0
# 
# [[2]]
#   ID date group val.(0,10] val.(10,50] val.(50,100]
# 1  A    C     H          0           6            0
# 2  B    E     H          1           3            0
# 3  B    F     H          0           5            0

I've decided to use the function you provided, which obviously includes the breaks into the column names. For this reason, when you have different breaks for different groups, the output cannot be included in one data frame as there will be a column name conflict.

The only way to get everything in one data frame is if you change your function to produce a more "tidy" output:

library(tidyverse)

# updated function
f = function(br, df) {
  df %>%
  mutate(g = cut(val, br)) %>%
  na.omit() %>%
  count(g, ID, date, group) %>%
  complete(g, nesting(ID, date, group), fill=list(n=0)) }

# same lookup table
look_up = data_frame(group_id = c("G","H"),
                     br = list(c(0,10,30,100), c(0,10,50,100)))

# apply your function
df %>%
  group_by(group_id = group) %>%          
  nest() %>%                              
  left_join(look_up, by="group_id") %>%   
  mutate(d = map2(br, data, ~f(.x, .y))) %>%
  unnest(d) %>%
  select(-group_id) %>%
  arrange(group, date, ID)   # for visualisation purposes only

# # A tibble: 21 x 5
#   g        ID    date  group     n
#   <chr>    <fct> <fct> <fct> <dbl>
# 1 (0,10]   A     C     G         0
# 2 (10,30]  A     C     G         5
# 3 (30,100] A     C     G         1
# 4 (0,10]   A     D     G         1
# 5 (10,30]  A     D     G         4
# 6 (30,100] A     D     G         1
# 7 (0,10]   B     E     G         1
# 8 (10,30]  B     E     G         3
# 9 (30,100] B     E     G         2
# 10 (0,10]  B     F     G         1
# # ... with 11 more rows

Thanks @AntoniosK! Any way to turn df_upd into a more user friendly format? Currently, it's a 2 x 4 data frame filled with list objects. Ideally, df_upd$d would be a data frame. — M.Teich, Aug 10 '18 at 12:37
As I mentioned in the end of the solution, you'll expect to have different column names for each group. How do you imagine that data frame you said? How should the columns look like? — AntoniosK, Aug 10 '18 at 12:39
Sorry, I got so hung up on your code that I didn't read the end of your post. I can imagine a wide and a long format that the table could look like: `ID date valG.(0,10] valG.(10,30] valG.(30,100] valH.(0,10] valH.(10,50] valH.(50,100]` or `ID date group.bin val`. For the latter, group.bin would contain a factor level for each combination of group and bin size. The wide format is somewhat cumbersome, but would fit previous data I already have and reduce the amount of tweaking I have to do to my existing code. Is this a realistic goal? — M.Teich, Aug 10 '18 at 12:54

score 1 · Answer 2 · answered Aug 10 '18 at 15:08

A data.table version of the "tidy" part of Antonios K's answer:

df[, data.table(table(bin = cut(val, 
  breaks = c(0, 10, if (group == "G") 30 else 50, 100)
))), by=.(ID, date, group)]

    ID date group      bin N
 1:  A    C     G   (0,10] 0
 2:  A    C     G  (10,30] 5
 3:  A    C     G (30,100] 1
 4:  A    C     H   (0,10] 0
 5:  A    C     H  (10,50] 6
 6:  A    C     H (50,100] 0
 7:  A    D     G   (0,10] 1
 8:  A    D     G  (10,30] 4
 9:  A    D     G (30,100] 1
10:  B    E     G   (0,10] 1
11:  B    E     G  (10,30] 3
12:  B    E     G (30,100] 2
13:  B    E     H   (0,10] 1
14:  B    E     H  (10,50] 3
15:  B    E     H (50,100] 0
16:  B    F     G   (0,10] 1
17:  B    F     G  (10,30] 5
18:  B    F     G (30,100] 0
19:  B    F     H   (0,10] 0
20:  B    F     H  (10,50] 5
21:  B    F     H (50,100] 0
    ID date group      bin N

Or write a helper function and a helper table:

library(magrittr)
cut_tab = function(x, br) x %>% cut(br) %>% table(bin = . ) %>% data.table

cutDT = data.table(key="group",
  group = c("G", "H"), 
  br = list(c(0, 10, 30, 100), c(0, 10, 50, 100)))

df[, cut_tab(val, br = cutDT[.BY, on=key(cutDT), unlist(x.br)]), by=.(ID, date, group)]

How to create bin frequency table where bin size varies by group

2 Answers2