I'd like to know the preferred way to frank
subgroups on the count of their appearances by group.
For example, I have customers who belong to segments and who have postal codes. I would like to know the most common 3 postal codes for each segment.
library(data.table)
set.seed(123)
n <- 1e6
df <- data.table( cust_id = 1:n,
cust_segment = sample(LETTERS, size=n, replace=T),
cust_postal = sample(as.character(5e4:7e4),size=n, replace=T)
)
This chain (inside the dcast()
below) produces the desired output but requires two passes, the first to count by group-subgroup and the second to rank the counts by group.
dcast(
df[,.(.N),
by = .(cust_segment, cust_postal)
][,.(cust_postal,
postal_rank = frankv(x=N, order=-1, ties.method = 'first')
), keyby=cust_segment
][postal_rank<=3],
cust_segment ~ paste0('postcode_rank_',postal_rank), value.var = 'cust_postal'
)
# desired output:
# cust_segment postcode_rank_1 postcode_rank_2 postcode_rank_3
# A 51274 64588 59212
# B 63590 69477 50380
# C 60619 66249 53494 ...etc...
Is that the best there is, or is there a single-pass approach?