
I am currently dealing with a moderately large data frame called d.mkt (> 2M rows and 12 columns). As dplyr is too slow when applying summarise() combined with group_by_at(), I am trying to write an equivalent data.table statement to speed up the summarise part. However, the situation is special in that the data frame is grouped by all columns except the value columns, and those value columns are then summarised, e.g. `X %>% select(-id) %>% group_by_at(vars(-x, -y, -z, -t)) %>% summarise(x = sum(x), y = sum(y), z = sum(z), t = sum(t)) %>% ungroup()`.

With that in mind, below is my current attempt, which keeps failing with this error: `keyby or by has length (1,1,1,1)`. Could someone please help me figure out how to fix it?

dplyr's code

    d.mkt <- d.mkt %>%
      left_join(codes, by = c('rte_cd', 'cd')) %>%
      mutate(is_valid = replace_na(is_valid, FALSE),
             rte_cd = ifelse(is_valid, rte_cd, 'RC'),
             rte_dsc = ifelse(is_valid, rte_dsc, 'SKIPPED')) %>%
      select(-is_valid) %>%
      group_by_at(vars(-c_rv, -g_rv, -h_rv, -rn)) %>%
      summarise(c_rv = sum(as.numeric(c_rv)),
                g_rv = sum(as.numeric(g_rv)),
                h_rv = sum(as.numeric(h_rv)),
                rn = sum(as.numeric(rn))) %>%
      ungroup()

My attempt at translating the above

    d.mkt <- as.data.table(d.mkt)
    d.mkt <- d.mkt[codes, on = c('rte_cd', 'sb_cd'),
                   `:=`(is_valid = replace_na(is_valid, FALSE),
                        rte_cd = ifelse(is_valid, rte_cd, 'RC00'),
                        rte_ds = ifelse(is_valid, rte_ds, 'SKIPPED'))]
    d.mkt <- d.mkt[, -"is_valid", with = FALSE]
    d.mkt <- d.mkt[, .(c_rv = sum(c_rv), g_rv = sum(g_rv), h_rv = sum(h_rv), rn = sum(rn)),
                   by = .('prop', 'date')]  # error here already -- but how do we "ungroup" a data.table, anyway?
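
(Aside on the error: quoted strings inside `.()` are taken as length-1 list elements rather than column names, which is what the `keyby or by has length (1,1,1,1)` message complains about. `by` wants unquoted names in `.()` or a character vector; a minimal sketch, using the hypothetical grouping columns `prop` and `date`:)

    d.mkt[, .(c_rv = sum(c_rv)), by = .(prop, date)]      # unquoted names inside .()
    d.mkt[, .(c_rv = sum(c_rv)), by = c("prop", "date")]  # or a character vector
    # grouping applies only to this one call, so there is no grouped state to undo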
user177196
  • Please, [edit] your question and provide a [mre]. At least, post the result of `dput(head(d.mkt))` as well as of `dput(head(code))`. This will help to identify further optimisation potential. Thank you. – Uwe May 06 '22 at 18:03
  • 1
    For example, it would be interesting to see if `is_valid` contains only `TRUE` and `NA`. This would allow for specific optimisations. Also, why is it necessary to typecast the value columns to numeric before taking the sum? – Uwe May 06 '22 at 18:09

1 Answer


Close. Some suggestions/answers.

  1. If you're shifting to data.table for speed, I suggest using `fifelse` in lieu of `replace_na` and `ifelse` (minor).
  2. The canonical way to remove `is_valid` is `d.mkt[, is_valid := NULL]`.
  3. Grouping can be done with a `setdiff`. In data.table, there is no need to "ungroup": each `[`-call uses its own grouping. (For this reason, if you have multiple chained `[`-operations that use the same grouping, it can be useful to store that grouping as a variable, perhaps index it, and/or combine the whole `[`-chain into a single call; see the sketch after this list. This is prone to lots of benchmarking discussion outside the scope of what we have here.)
  4. Since all of your summary stats are the same, we can `lapply(.SD, ..)` this for a little readability improvement.
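
A minimal sketch of item 3, reusing a stored grouping across calls (this borrows the `tmp` and `cols` names defined in the code below):

    grp <- setdiff(names(tmp), cols)   # group by everything except the value columns
    setindexv(tmp, grp)                # optional: index the grouping columns for repeated use
    tmp[, lapply(.SD, sum), by = grp, .SDcols = cols]   # grouping is per-call; nothing to ungroup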

This might work:

library(data.table)
library(magrittr)  # for the %>% chaining used below

setDT(codes) # or using `as.data.table(codes)` below instead
setDT(d.mkt) # ditto
tmp <- codes[d.mkt, on = .(rte_cd, cd) ] %>%
  # all three RHS expressions see the pre-update is_valid, so repeating
  # the fcoalesce() condition inside one `:=` call is safe
  .[, c("is_valid", "rte_cd", "rte_dsc") :=
        .(fcoalesce(is_valid, FALSE),
          fifelse(fcoalesce(is_valid, FALSE), rte_cd, "RC"),
          fifelse(fcoalesce(is_valid, FALSE), rte_dsc, "SKIPPED")) ]
tmp[, is_valid := NULL ]
cols <- c("c_rv", "g_rv", "h_rv", "rn")
tmp[, lapply(.SD, function(z) sum(as.numeric(z))),
    by = setdiff(names(tmp), cols), .SDcols = cols ]
r2evans
  • Many thanks for your great help. But shouldn't the very first line be `d.mkt[codes, on = .(rte_cd, cd)] %>% ...` instead, since the original code is `d.mkt %>% left_join(codes, by = ...)`? One more question: what does the `.` in front of `[, c("is_valid", "rte_cd", "rte_dsc")]` do? – user177196 May 05 '22 at 00:26
  • When you're doing a left join without internal assignment, then it's `RHS[LHS, on=.(...)]`. Have you tried that to see what it generates? https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html is a reasonable reference. – r2evans May 05 '22 at 00:31
  • I tried the other way, `LHS[RHS, on = .(...)]`, and found that it matched exactly. But I will try your way. Apologies in advance. – user177196 May 05 '22 at 00:36
  • I suggest a quick comparison: `LHS <- as.data.table(mtcars); RHS <- data.table(cyl=4, quux=1);`, then compare the number of rows between `LHS[RHS, on=.(cyl)]` and `RHS[LHS, on=.(cyl)]`. – r2evans May 05 '22 at 00:39
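
Running that comparison (a quick, self-contained illustration; `quux` is just a dummy column):

    library(data.table)
    LHS <- as.data.table(mtcars)          # 32 rows
    RHS <- data.table(cyl = 4, quux = 1)
    nrow(RHS[LHS, on = .(cyl)])  # 32: every row of LHS is kept (left join onto LHS)
    nrow(LHS[RHS, on = .(cyl)])  # 11: only the LHS rows with cyl == 4 survive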
  • Got you. I saw it now :) Many thanks. Will continue to try out the rest of your code. One more question though: does the final data table `tmp` have any `key`? – user177196 May 05 '22 at 00:41
  • There is `nafill`, even faster than `fifelse`. – jangorecki May 05 '22 at 04:18
  • I think you may be switching off GForce by nesting `as.numeric`. Why not do that separately and keep the grouped call a simple `sum`? – jangorecki May 05 '22 at 04:20
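
A sketch of that suggestion, reusing the `tmp` and `cols` objects from the answer: convert the columns once, so the grouped call is a bare `sum()` that GForce can optimise:

    tmp[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]  # one-time conversion, outside the grouping
    tmp[, lapply(.SD, sum), by = setdiff(names(tmp), cols), .SDcols = cols]  # plain sum() is GForce-eligible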
  • user177196, no key assigned here. @jangorecki, how can we use `nafill` to change one column based on `NA` values in a different column? I thought `nafill` was solely for locf/nocb operations within a vector/column. – r2evans May 05 '22 at 04:34
  • @r2evans: after testing your query, surprisingly the speed was 0.02 seconds slower than the original `dplyr` code, for the dataset of 2.4M rows. Weird :( – user177196 May 05 '22 at 13:41
  • Also, I am surprised that the result obtained from your code does not need to be `order`ed by the `setdiff(names(tmp), cols)` columns. So far, in other cases where I used `data.table`, I had to order the end result to get exactly the same output as dplyr's `group_by() %>% summarise()`. – user177196 May 05 '22 at 13:48
  • There are a lot of likely repetitive calls to `as.numeric`; you could easily take that outside of the grouped operations (if indeed it is still necessary). If you know the fields are not strings or factors, that operation is completely unnecessary. If you do have string fields, then definitely take it out of the grouped operations; you still need to "pay the price", but you pay it once. I don't know that it will make a huge difference in performance, but it can only make it better. – r2evans May 06 '22 at 14:07
  • I forgot to mention that the data types of the 4 columns in `cols` are `chr`, so you were definitely correct to put the `as.numeric()` inside the `sum`. I am just puzzled by the fact that your code automatically `order`s w.r.t. the remaining columns. Also, could you lend a hand with my other question on `parallel computing` for `parameter tuning` of an optimization model: https://stackoverflow.com/questions/72121227/parallel-processing-applied-to-a-nested-for-with-list-of-list? – user177196 May 06 '22 at 15:12
  • `order`: the default of `merge` is `sort=TRUE`, for both `base::merge` and `data.table::merge`, so `codes[d.mkt, on = .(rte_cd, cd)]` will likely reorder rows. `parallel computing`: not able. – r2evans May 06 '22 at 15:15
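
For reference, a toy illustration of that `merge` sorting default (made-up data, not from the thread):

    library(data.table)
    a <- data.table(id = c(3, 1, 2), x = c("a", "b", "c"))
    b <- data.table(id = c(2, 3, 1), y = 1:3)
    merge(a, b, by = "id")                # rows come back sorted by id (sort = TRUE is the default)
    merge(a, b, by = "id", sort = FALSE)  # keeps the unsorted join order instead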