why is dcast so impossible to pass a non-aggregate function?

Question

I am using the data.table package for a table like this:

DT <- data.table(id=rep(1:100, each=50),
                 grp=rep(letters[1:4], each=1250),
                 time=rep(1:50,100),
                 outcome=rnorm(5000),
                 seconds=rep(1:500,10),
                 weights=rnorm(5000),
                 response=rep(1:200, each=25),
                 key=c("grp", "time"))

I would like to create a new (possibly rbindlisted) data table of some summary statistics from this table. I first created two intermediary tables a and b,

a <- DT[, list(mean = weighted.mean(outcome, weights), 
               median=median(outcome),seconds), by=c("grp","time")]
b <- DT[, list(mean=weighted.mean(response, seconds),
               median=median(response)), by=c("grp","time")]

and then am trying to rowbind these together across all groups but still preserve the grouping along the rows. This does not work:

  DTfinal <- data.table(DT$grp, DT$time,
   outcomemean=a$mean, responsemean=b$mean, 
   outcomemedian=a$median, responsemedian=b$median)

I don't think a merge works since a and b have different lengths. Rowbinding a and b also mixes up the different means and medians of a and b, ideally I would like a rbindlist that has some kind of suffix for each column like c(".a",".b").

Update: I get an error (since a and b have different dimension) doing

DTfinal <- rbindlist(setNames(list(a[, c("grp", "time", "mean", "median"),
                                     with = FALSE], 
                                   b[, c("grp", "time", "mean", "median"),
                                     with = FALSE]), 
                                    c("a", "b")),
                                    idcol= "id")

dcast(DTfinal, grp + time ~id, value.var = c('mean', 'median'))

where it returns

Aggregate function missing, defaulting to 'length'

When you are rbinding, there is only single column for mean, median, etc. What do you mean by `suffix`. where is it needs to be created? If you need a id column, `idcol` argument from `rbindlist` is useful — akrun, Jul 29 '16 at 02:29
The warning message you posted was yesterday corrected with the brackets at different position and based on your example I am not getting that warning `aggregate function missing,.`` — akrun, Jul 30 '16 at 17:58

akrun · Accepted Answer · 2016-07-29T03:10:46.460

2

We can use rbindlist after placing the datasets in a list

DTfinal <- rbindlist(list(a,b))
dim(DTfinal)
#[1] 400   4
dim(a)
#[1] 200   4
dim(b)
#[1] 200   4

Suppose if both datasets have different number of columns, and we have a vector of column names that we need to keep

nm1 <- intersect(names(a), names(b))
rbindlist(list(a[, nm1, with = FALSE], b[, nm1, with = FALSE]), idcol= "id")

Update

If we need to convert to 'wide' format

DTfinal <-  rbindlist(setNames(list(a,b), c("a", "b")), idcol= "id")
dcast(DTfinal, grp + time ~id, value.var = c('mean', 'median'))
#     grp time       mean_a    mean_b    median_a median_b
#  1:   a    1   0.52171471  25.99502 -0.06558068       25
#  2:   a    2   0.36445108  25.99010  0.13518412       25
#  3:   a    3   0.08993721  25.98522  0.20128790       25
#  4:   a    4 -64.04617391  25.98039  0.40999376       25
#  5:   a    5   0.81730847  25.97561 -0.03481697       25
# ---                                                     
#196:   d   46   1.62818374 176.67568 -0.26695999      176
#197:   d   47  -1.45259871 176.67340  0.14893356      176
#198:   d   48   9.59796683 176.67114 -0.05834959      176
#199:   d   49  -2.74285453 176.66890 -0.22094347      176
#200:   d   50   1.22109043 176.66667 -0.08172928      176

edited Jul 29 '16 at 03:10

answered Jul 29 '16 at 02:12

akrun

874,273
37
540
662

Is this possible when they have non-overlapping columns, and I only want to keep the ones listed? dim(a) = 5000x5, dim(b) = 200x4. – michel Jul 29 '16 at 02:19
@michel I updated the post, instead of the 'nm1', you can use the custom column names that you need to keep. – akrun Jul 29 '16 at 02:21
My problem is that although a and b have the same column names, they refer to different means and medians (the outcome and response). Rbindlisting on these columns does not keep an indicator of which variable the mean or median are computed on. The 'mixing' of the aggregate functions are confusing to me. – michel Jul 29 '16 at 02:23
@michel You can use the `idcol` argument in `rbindlist`. Perhaps you meant that. Updated the post – akrun Jul 29 '16 at 02:26
Is there a way to get the outcome to have 4 columns a.mean, a.median, b.mean, b.median? – michel Jul 29 '16 at 02:29
@michel It is the usual way of subsetting columns in data.table. For detailed info, the description is there in `?data.table` – akrun Jul 29 '16 at 02:33
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/118587/discussion-between-michel-and-akrun). – michel Jul 29 '16 at 02:35

why is dcast so impossible to pass a non-aggregate function?

1 Answers1

Update