Combine several dcast data.table (which share same key) efficiently

Question

Here is the simple problem I'm trying to solve: I have a data.table like following table, and I'm trying to use dcast.data.table function to calculate number of advancement for each group, but also I'm interested to calculate median of grades in each group:

set.seed(10);
DT = data.table(GROUP = sample(c("a","b","c"),100,replace = T), 
                ADVANCED = sample(c("ADVANCED","DROP"),100,replace = T),
                GRADE = sample(1:10,100, replace=T))

     GROUP ADVANCED GRADE
  1:     b ADVANCED     3
  2:     a ADVANCED     6
  3:     b ADVANCED     7
  4:     c ADVANCED     9

 95:     b     DROP     6
 96:     c ADVANCED     5
 97:     a     DROP    10
 98:     b ADVANCED     1
 99:     c     DROP     6
100:     a     DROP     2
     GROUP ADVANCED GRADE

Essentially here is the result I'm looking for:

result = merge(
  dcast.data.table(DT,.Primitive("~")(GROUP,ADVANCED)),
  dcast.data.table(DT,.Primitive("~")(GROUP,.),
                   value.var="GRADE", 
                   fun.aggregate=median));

setnames(result,".","MEDIAN_GRADE")

   GROUP ADVANCED DROP MEDIAN_GRADE
1:     a       17   19            6
2:     b       20   21            7
3:     c       13   10            6

Now I'm wondering how can I do it without making two separate dcast tables and merge at the end. I'm dealing with many row and column in my tables and grouping by key is a bottleneck. I'm wondering is there a better way to calculate this?

** Since my first question was vague I edit completely (thanks to Frank and Akrun for their feedback).

You should probably use `set.seed` before making random examples. — Frank, Aug 10 '16 at 18:57
There's `reshape2::dcast(DT, GROUP ~ ADVANCED, margins="ADVANCED")` but you lose access to the improvements made in `data.table::dcast`. It looks like `margins` never got implemented: https://github.com/Rdatatable/data.table/issues/1214 — Frank, Aug 10 '16 at 19:04
This looks like a completely different question from the one you posted earlier — akrun, Aug 10 '16 at 19:24
you are right, we could solve previous questions with many different ways . — Mahdi Jadaliha, Aug 10 '16 at 19:27
The `rhs` of `~` is different for both `dcast`, so there is no way to do that in a single `dcast`. I posted an option with `on` — akrun, Aug 10 '16 at 19:34
thanks for the answer, using `on` instead of `merge` make my code 30% faster. :) — Mahdi Jadaliha, Aug 10 '16 at 19:44

akrun · Accepted Answer · 2016-08-10T20:01:39.170

4

For the updated question

setnames(dcast(DT, GROUP~ADVANCED, length)[dcast(DT, GROUP~., median),
            on = "GROUP"], ".", "MEDIAN_GRADE")[]
#   GROUP ADVANCED DROP MEDIAN_GRADE
#1:     a       17   19            6
#2:     b       20   21            7
#3:     c       13   10            6

Or a faster approach would be to group by 'GROUP', get the median of 'GRADE' and then do the join on the dcast output

DT[,.(MEDIAN_GRADE = median(GRADE)) , .(GROUP)][
              dcast(DT, GROUP ~ ADVANCED, length), on = 'GROUP']

edited Aug 10 '16 at 20:01

answered Aug 10 '16 at 18:56

akrun

874,273
37
540
662

2

More generally `dcast(DT, GROUP ~ ADVANCED)[, SIZE := Reduce(\`+\`, .SD), .SDcols = unique(DT$ADVANCED)][]` – Frank Aug 10 '16 at 18:59
the real function that I'm interested is more complex and I cannot sum all columns to get that directly. I added this to the question. thanks – Mahdi Jadaliha Aug 10 '16 at 18:59
@Mahdi That's too vague for me to understand. – Frank Aug 10 '16 at 19:02
@MahdiJadaliha You showed a simple example and expected result for others to code. So how do we know what kind of dataset you have – akrun Aug 10 '16 at 19:08
sorry I will fix it in a minute with more elaborated example. – Mahdi Jadaliha Aug 10 '16 at 19:10
1

Yeah, I think this final approach with `on` is the best way to go. – Frank Aug 10 '16 at 19:40
1

Thanks @akrun, this answer improved the performance of my code considerably. it would be great if you remove the answer to the first question to avoid confusions. :) – Mahdi Jadaliha Aug 10 '16 at 19:54
@MahdiJadaliha Okay, I removed it. – akrun Aug 10 '16 at 20:01
@Akrun, I changed the number of rows in data.table to 50000000 and now the approach your proposed at last is crashing 1 out of 3 times (approx.) on my computer. The first approach still working fine and it is 30% faster than the original merge approach, but it is 20% slower than your latest approach which is crushing sometimes. – Mahdi Jadaliha Aug 11 '16 at 16:38
@MahdiJadaliha Okay, there might be a limit on the number of rows. – akrun Aug 12 '16 at 03:58

Frank · Answer 2 · 2016-08-10T19:56:41.807

Well, you could compute in long form and then reshape:

dcast(DT[, rbind(
  .SD[, .(v = .N), by=.(stat = paste0("n.",ADVANCED))],
  .(stat = "med", v = as.numeric(median(GRADE)))
), by=GROUP], GROUP ~ stat)

   GROUP med n.ADVANCED n.DROP
1:     a   6         17     19
2:     b   7         20     21
3:     c   6         13     10

Obviously, this still involves tons of manual fiddling. It also requires that your stats all be numeric (since they are stacked in the stat column together before the dcast). I think the approach in @akrun's answer -- like DT[, f(...), by=GROUP][dcast(DT, GROUP ~ x), on=GROUP] -- is much better, limiting the dcast to only those calls that need it.

Combine several dcast data.table (which share same key) efficiently

2 Answers2