19

Suppose I've got the following data.table :

dt <- data.table(id = c(rep(1, 5), rep(2, 4)),
                 sex = c(rep("H", 5), rep("F", 4)), 
                 fruit = c("apple", "tomato", "apple", "apple", "orange", "apple", "apple", "tomato", "tomato"),
                 key = "id")

   id sex  fruit
1:  1   H  apple
2:  1   H tomato
3:  1   H  apple
4:  1   H  apple
5:  1   H orange
6:  2   F  apple
7:  2   F  apple
8:  2   F tomato
9:  2   F tomato

Each row represents the fact that someone (identified by it's id and sex) ate a fruit. I want to count the number of times each fruit has been eaten by sex. I can do it with :

dt[ , .N, by = c("fruit", "sex")]

Which gives:

    fruit sex N
1:  apple   H 3
2: tomato   H 1
3: orange   H 1
4:  apple   F 2
5: tomato   F 2

The problem is, by doing it this way I'm losing the count of orange for sex == "F", because this count is 0. Is there a way to do this aggregation without loosing combinations of zero counts?

To be perfectly clear, the desired result would be the following:

   fruit sex N
1:  apple   H 3
2: tomato   H 1
3: orange   H 1
4:  apple   F 2
5: tomato   F 2
6: orange   F 0

Thanks a lot !

Henrik
  • 65,555
  • 14
  • 143
  • 159
juba
  • 47,631
  • 14
  • 113
  • 118

2 Answers2

18

Seems like the most straightforward approach is to explicitly supply all category combos in a data.table passed to i=, setting by=.EACHI to iterate over them:

setkey(dt, sex, fruit)
dt[CJ(sex, fruit, unique = TRUE), .N, by = .EACHI]
#    sex  fruit N
# 1:   F  apple 2
# 2:   F orange 0
# 3:   F tomato 2
# 4:   H  apple 3
# 5:   H orange 1
# 6:   H tomato 1
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
11

One way is to change sex or id to factor (id is redundant here?)

dt[, sex := factor(sex)]
dt[, .(sex=levels(sex), N=c(table(sex))), by=fruit]
#     fruit sex N
# 1:  apple   F 2
# 2:  apple   H 3
# 3: tomato   F 2
# 4: tomato   H 1
# 5: orange   F 0
# 6: orange   H 1

Or you can change fruit to factor and group by sex:

dt[, fruit := factor(fruit)]
dt[, .(fruit = levels(fruit), N=c(table(fruit))),by=sex]
#    sex  fruit N
# 1:   H  apple 3
# 2:   H orange 1
# 3:   H tomato 1
# 4:   F  apple 2
# 5:   F orange 0
# 6:   F tomato 2

Edit:

But I suspect if your data.table is huge, then depending on table may not be a good idea. In this case, using CJ from your earlier question may be the way to go. That is, first do the aggregation and then do a join.

out <- setkey(dt, sex, fruit)[, .N, 
             by="sex,fruit"][CJ(c("H","F"), 
             c("apple","tomato","orange")), 
             allow.cartesian=TRUE][is.na(N), N := 0L]
#    sex  fruit N
# 1:   F  apple 2
# 2:   F orange 0
# 3:   F tomato 2
# 4:   H  apple 3
# 5:   H orange 1
# 6:   H tomato 1
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Arun
  • 116,683
  • 26
  • 284
  • 387
  • 1
    I was testing your answer and thinking "it works great, but seems a bit slow". Then I saw your edit :) Brilliant, thanks a lot. – juba May 13 '13 at 10:25