1

Is it possible to aggregate with a complement in R data.tables. Example below.

library(data.table)
dt <- data.table(a=c("word1","word2","word2","word2"), b=c("cat1","cat1","cat1","cat2"))

To get number of particular words in a category

newdt <- dt[,(.N),by=.(a,b)]
#word1,cat1 - 1
#word2,cat1 - 2
#word2,cat2 - 1

How could I count the number of all other words in the category? Or relatedly, number of other categories that the word is in? Something like the following?

#doesn't work
#newdt2 <- dt[a!=a,(.N),by=.(a,b)]
#the expected answer would be
#word1,cat1 - 2
#word2,cat1 - 1
#word2,cat2 - 0

I can't find any help on this in online tutorials or questions. Is there an easy way to get the complement. Data.table solution would be nice, as working with a 50M row table. Thanks!

Uwe
  • 41,420
  • 11
  • 90
  • 134
puslet88
  • 1,288
  • 15
  • 25
  • 1
    It is not clear about the request – akrun Aug 01 '17 at 12:11
  • 3
    Yeah, I also have no idea what's the question is about. Also, your example is not reproducible. Have you tried running your own code? – David Arenburg Aug 01 '17 at 12:12
  • I don't get it...shouldn't `word2,cat2` be 2? – Sotos Aug 01 '17 at 12:37
  • Sorry, was in a rush, thanks for the help, the answers given below are exactly what was needed! word2,cat2 = 0 because there are no other words in category 2. – puslet88 Aug 01 '17 at 15:40
  • 2
    If this is just to browse the data by eye, you might consider going to wide format, like `dt[, .N, by = .(a, b)][, Nb := sum(N), by=b][, dcast(.SD, b + Nb ~ a, value.var="N", fill=0)]` or `dt[, .N, by = .(a, b)][, Na := sum(N), by=a][, dcast(.SD, a + Na ~ b, value.var="N", fill=0)]` – Frank Aug 01 '17 at 15:54

2 Answers2

3

The following is your code (I've added double quotes so it runs):

library(data.table)
dt <-
 data.table(a=c("word1","word2","word2","word2"),b=c("cat1","cat1","cat1","cat2"))

newdt <- dt[,(.N),by=.(a,b)]
names(newdt) = c("a", "b", "cnt") # rename the count column

The following line will count how many times each category appears

catCnt = dt[,(.N),by=.(b)]
names(catCnt) = c("b", "tot_b")
catCnt

The number of words belonging to each category except the current word is the difference between the number of word belonging to the category and the number of times the couple (category, word) appears. To achive the result I merge the two data.table objects by the category column

aux = merge(newdt, catCnt, by="b")

and then I compute the difference between the total count and the "couple" count:

aux$cnt_not_a = aux$tot_b - aux$cnt

If you want to keep the required column only:

res = aux[, c("b", "a", "cnt_not_a")]
res

I don't know if you can achieve this by using data.table commands only but this looks quite straightforward to me...

Uwe
  • 41,420
  • 11
  • 90
  • 134
Bruno Zamengo
  • 800
  • 1
  • 11
  • 24
  • Thanks a lot! The step-by-step approach is indeed the way to go when counting things and we solved the problem, thanks a lot! I'll give the answer tick to Uwe's even more concise answer for future reference. – puslet88 Aug 01 '17 at 15:44
  • In cases like this it's generally safe to edit the OP's question to fix typos, I think (like adding those double quotes). – Frank Aug 01 '17 at 15:45
3

Following the idea of Bruno to compute the difference of total count per category minus the count of words in each category but using data.table syntax with an update on join this becomes a "one-liner":

library(data.table)
dt <-data.table(a = c("word1", rep("word2", 3L)), b = c(rep("cat1", 3L), "cat2"))
dt[, .N, by = .(a, b)][dt[, .N, by = b], on = "b", Nc := i.N - N][]
       a    b N Nc
1: word1 cat1 1  2
2: word2 cat1 2  1
3: word2 cat2 1  0
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Well, that's even better! By the way, thanks for the formatting improvement ;) – Bruno Zamengo Aug 01 '17 at 13:16
  • Thanks! I've got a lot to learn about data.tables, but this is indeed so very concise and absolutely what was necessary. Thanks a lot! – puslet88 Aug 01 '17 at 15:40
  • 1
    @puslet88 For learning data.table I recommend the vignettes and the FAQ on https://github.com/Rdatatable/data.table/wiki/Getting-started, and chapter 3 of Frank's *Quick R Tutorial* https://franknarf1.github.io/r-tutorial/_book/ – Uwe Aug 01 '17 at 15:48
  • @UweBlock, thanks for the help again! The resources look very useful and like exactly what I need, going straight to the point! Two thumbs up! – puslet88 Aug 01 '17 at 19:25