19

The drop argument in dcast (from "reshape2" or "dplyr") can be useful when going from a "long" to a "wide" dataset and you want to create columns even for combinations that do not exist in the long form.

It turns out that using drop also affects combinations the left hand side (LHS) of the formula as well as the right hand side (RHS). Thus, it also creates extra rows based on the combinations of LHS values.

Is there a way to override this behavior?


Here's some sample data:

library(data.table)
DT <- data.table(v1 = c(1.105, 1.105, 1.105, 2.012, 2.012, 2.012),
                 ID = c(1L, 1L, 1L, 2L, 2L, 2L), 
                 v2 = structure(c(2L, 3L, 5L, 1L, 2L, 6L), 
                                .Label = c("1", "2", "3", "4", "5", "6"), 
                                class = "factor"),
                 v3 = c(3L, 2L, 2L, 5L, 4L, 3L)) 

Notice that "v2" is a factor column with 6 levels. I essentially want to go from "long" to wide", but add in columns for any missing factor levels (in this case "4").

reshape handles the shape, but not the missing columns:

reshape(DT, direction = "wide", idvar = c("ID", "v1"), timevar = "v2")
#       v1 ID v3.2 v3.3 v3.5 v3.1 v3.6
# 1: 1.105  1    3    2    2   NA   NA
# 2: 2.012  2    4   NA   NA    5    3

dcast handles adding the missing columns, but only if there's one value on the LHS:

dcast(DT, ID ~ v2, value.var = "v3", drop = FALSE)
#    ID  1 2  3  4  5  6
# 1:  1 NA 3  2 NA  2 NA
# 2:  2  5 4 NA NA NA  3

If there are multiple values on the LHS, the combinations of the values on the LHS are also expanded out, as if we had used CJ or expand.grid, but rows 2 and 3 are not at all of interest to me:

dcast(DT, ... ~ v2, value.var = "v3", drop = FALSE)
#       v1 ID  1  2  3  4  5  6
# 1: 1.105  1 NA  3  2 NA  2 NA
# 2: 1.105  2 NA NA NA NA NA NA
# 3: 2.012  1 NA NA NA NA NA NA
# 4: 2.012  2  5  4 NA NA NA  3

This is similar to using xtabs in base R: ftable(xtabs(v3 ~ ID + v1 + v2, DT)).


Is there a way to let dcast know that essentially, "Hey. The combination of values on the LHS are the IDs. Don't try to fill them in for me."

My current approach is to do three steps, one for collapsing down the LHS values, another for spreading out the RHS values, and then one for merging the result.

merge(DT[, list(v1 = unique(v1)), .(ID)],  ## or unique(DT[, c("ID", "v1"), with = FALSE])
      dcast(DT, ID ~ v2, value.var = "v3", drop = FALSE), 
      by = "ID")[]
#    ID    v1  1 2  3  4  5  6
# 1:  1 1.105 NA 3  2 NA  2 NA
# 2:  2 2.012  5 4 NA NA NA  3

Is there a better approach that I'm missing?

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Something close (maybe): `dcast(DT, interaction(v1,ID,drop=TRUE) ~ v2, value.var = "v3", drop = FALSE)`. You have to split the first column afterwise, however. – nicola Jan 16 '16 at 19:53
  • @nicola, I thought about that, but I didn't like the possible loss in fidelity of data, the lack of the option to use `...` on the LHS, and the loss of column names. – A5C1D2H2I1M1N2O1R2T1 Jan 17 '16 at 02:19
  • 1
    A possible solution might to allow a double logical in `drop`. Something like `drop = c(TRUE, FALSE)` where the first applies to RHS and the second to LHS. – Jaap Jan 31 '16 at 09:00
  • Sounds interesting @Jaap. Have a mock implementation? ;-) – A5C1D2H2I1M1N2O1R2T1 Jan 31 '16 at 09:15
  • 1
    Posted on GH as a [feature request for *data.table*](https://github.com/Rdatatable/data.table/issues). – Jaap Jan 31 '16 at 16:54

1 Answers1

9

Just implemented in data.table development version v1.9.7, commit 2113, closes #1512.

require(data.table) # v1.9.7, commit 2113+
dcast(DT, ... ~ v2, value.var = "v3", drop = c(TRUE, FALSE))
#       v1 ID  1 2  3  4  5  6
# 1: 1.105  1 NA 3  2 NA  2 NA
# 2: 2.012  2  5 4 NA NA NA  3
Arun
  • 116,683
  • 26
  • 284
  • 387