The drop
argument in dcast
(from "reshape2" or "dplyr") can be useful when going from a "long" to a "wide" dataset and you want to create columns even for combinations that do not exist in the long form.
It turns out that using drop
also affects combinations the left hand side (LHS) of the formula as well as the right hand side (RHS). Thus, it also creates extra rows based on the combinations of LHS values.
Is there a way to override this behavior?
Here's some sample data:
library(data.table)
DT <- data.table(v1 = c(1.105, 1.105, 1.105, 2.012, 2.012, 2.012),
ID = c(1L, 1L, 1L, 2L, 2L, 2L),
v2 = structure(c(2L, 3L, 5L, 1L, 2L, 6L),
.Label = c("1", "2", "3", "4", "5", "6"),
class = "factor"),
v3 = c(3L, 2L, 2L, 5L, 4L, 3L))
Notice that "v2" is a factor
column with 6 levels. I essentially want to go from "long" to wide", but add in columns for any missing factor levels (in this case "4").
reshape
handles the shape, but not the missing columns:
reshape(DT, direction = "wide", idvar = c("ID", "v1"), timevar = "v2")
# v1 ID v3.2 v3.3 v3.5 v3.1 v3.6
# 1: 1.105 1 3 2 2 NA NA
# 2: 2.012 2 4 NA NA 5 3
dcast
handles adding the missing columns, but only if there's one value on the LHS:
dcast(DT, ID ~ v2, value.var = "v3", drop = FALSE)
# ID 1 2 3 4 5 6
# 1: 1 NA 3 2 NA 2 NA
# 2: 2 5 4 NA NA NA 3
If there are multiple values on the LHS, the combinations of the values on the LHS are also expanded out, as if we had used CJ
or expand.grid
, but rows 2 and 3 are not at all of interest to me:
dcast(DT, ... ~ v2, value.var = "v3", drop = FALSE)
# v1 ID 1 2 3 4 5 6
# 1: 1.105 1 NA 3 2 NA 2 NA
# 2: 1.105 2 NA NA NA NA NA NA
# 3: 2.012 1 NA NA NA NA NA NA
# 4: 2.012 2 5 4 NA NA NA 3
This is similar to using xtabs
in base R: ftable(xtabs(v3 ~ ID + v1 + v2, DT))
.
Is there a way to let dcast
know that essentially, "Hey. The combination of values on the LHS are the IDs. Don't try to fill them in for me."
My current approach is to do three steps, one for collapsing down the LHS values, another for spreading out the RHS values, and then one for merging the result.
merge(DT[, list(v1 = unique(v1)), .(ID)], ## or unique(DT[, c("ID", "v1"), with = FALSE])
dcast(DT, ID ~ v2, value.var = "v3", drop = FALSE),
by = "ID")[]
# ID v1 1 2 3 4 5 6
# 1: 1 1.105 NA 3 2 NA 2 NA
# 2: 2 2.012 5 4 NA NA NA 3
Is there a better approach that I'm missing?