I am currently dealing with a moderately large dataframe called d.mkt
(> 2M
rows and 12
columns). As dplyr
is too slow when applying summarise()
function combined with group_by_at
, I am trying to write an equivalent statement using data.table
to speed up the summarise
computation part of dplyr
. However, the situation is quite special in the case that the original dataframe is group_by_at
and then summarising
over the same set of columns (e.g. X %>% select(-id) %>% group_by_at(vars(-x,-y,-z,-t) %>% summarise(x = sum(x), y = sum(y), z = sum(z), y = sum(t)) %>% ungroup()
).
With that in mind, below is my current attempt, which kept failing to work because of this error: keyby or by has length (1,1,1,1).
Could someone please help let me know how to fix this error?
dplyr's code
d.mkt <- d.mkt %>%
left_join(codes, by = c('rte_cd', 'cd')) %>%
mutate(is_valid = replace_na(is_valid, FALSE),
rte_cd = ifelse(is_valid, rte_cd, 'RC'),
rte_dsc = ifelse(is_valid, rte_dsc, 'SKIPPED')) %>%
select(-is_valid) %>%
group_by_at(vars(-c_rv, -g_rv, -h_rv, -rn)) %>%
summarise(c_rv = sum(as.numeric(c_rv)), g_rv = sum(as.numeric(g_rv)), h_rv = sum(as.numeric(h_rv)), rn = sum(as.numeric(rn))) %>%
ungroup()
My attempt for translating the above
d.mkt <- as.data.table(d.mkt)
d.mkt <- d.mkt[codes, on = c('rte_cd', 'sb_cd'),
`:=` (is.valid = replace_na(is_valid, FALSE), rte_cd = ifelse(is_valid, rte_cd, 'RC00'),
rte_ds = ifelse(is_valid, rte_ds, 'SKIPPED'))]
d.mkt <- d.mkt[, -"is.valid", with=FALSE]
d.mkt <- d.mkt[, .(c_rv=sum(c_rv), g_rv=sum(g_rv), h_rv = sum(h_rv), rn = sum(rn)), by = .('prop', 'date')] --- Error here already, but how do we ungroup a `data.table` though?