Sorry for the bad title - I am trying to achieve the following: I have a data.table dt with two categorical variables "a" and "b". As you can see, a has 5 unique values and b has three. Now e.g. the combination of categorical variables ("a = 1" and "b = 3") is not in the data.
library(data.table)
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- rnorm(10)
dt <- data.table(a = a, b = b, y = y)
dt[order(a, b), .N, by = c("a", "b")]
# a b N
#1: 1 1 2
#2: 1 2 1
#3: 2 2 1
#4: 2 3 1
#5: 3 1 1
#6: 3 2 1
#7: 3 3 1
#8: 4 1 1
#9: 5 2 1
If I simply sum "a" and "b", such groups as ("a = 1" and b = 3") will simply be ignored:
group_sum <- dt[, lapply(.SD, sum), by = c("a", "b")]
group_sum
# a b y
#1: 1 1 -0.7702614
#2: 4 1 -0.2894616
#3: 1 2 -0.2992151
#4: 2 2 -0.4115108
#5: 5 2 0.2522234
#6: 3 2 -0.8919211
#7: 2 3 0.4356833
#8: 3 1 -1.2375384
#9: 3 3 -0.2242679
Is there an internal way in data table to "keep" such missing groups and either assign a 0 or NA?
One way to achieve my goal would be to create a grid and merge in a second step:
grid <- unique(expand.grid(a = dt$a, b = dt$b)) # dim
setDT(grid)
res <- merge(grid, group_sum, by = c("a", "b"), all.x = TRUE)
head(res)
# a b y
#1: 1 1 -0.7702614
#2: 1 2 -0.2992151
#3: 1 3 NA
#4: 2 1 NA
#5: 2 2 -0.4115108
#6: 2 3 0.4356833