2

I am given a large data-table that needs to be aggregated according to the first column:

The problem is the following:

  1. For several columns, one just has to form the sum for each category (given in column 1)
  2. For other columns, one has to calculate the mean
  3. There is a 1-1 correspondence between the entries in the first and second columns. Such that the entries of the second column should be kept.

The following is a possible example of such a data-table. Let's assume that columns 3-9 need to be summed up and columns 10-12 need to be averaged.

library(data.table)
set.seed(1)
a<-matrix(c("cat1","text1","cat2","text2","cat3","text3"),nrow=3,byrow=TRUE)
M<-do.call(rbind, replicate(1000, a, simplify=FALSE)) # where m is your matrix
M<-cbind(M,matrix(sample(c(1:100),3000*10,replace=TRUE ),ncol=10))
M <- as.data.table(M)

The result should be a table of the form

     V1    V2 V3 V4 V5  V6 V7 V8 V9 V10 V11 V12
1: cat1 text1 27 81 78  95 27 22 12  76  18  76
2: cat2 text2 38 48 70 100 11 97  8  53  56  33
3: cat3 text3 58 18 66  24 14 73 18  27  92  70

but with entries the corresponding sums respective averages.

M--
  • 25,431
  • 8
  • 61
  • 93
Strickland
  • 590
  • 4
  • 14
  • 2
    What is `a`? Also, please provide a small reproduicble example and expected output – akrun May 29 '19 at 17:11
  • ```M[, c(lapply(.SD[, 2:8, with=FALSE], sum), lapply(.SD[, 9:11, with=FALSE], mean)), by = "cat"]``` – M-- May 29 '19 at 17:55
  • Using `M[, c(lapply(.SD[, 3:9, with=FALSE], sum), lapply(.SD[, 10:12, with=FALSE], mean)), by = "V1"]` in the above example, yields: `Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument` – Strickland May 29 '19 at 18:46
  • @Strickland that's because your V3:V12 are actually class of `chr` not `dbl` I added a line before that to convert them to numeric class. See my answer. – M-- May 29 '19 at 19:44

1 Answers1

2
M[, names(M)[-c(1,2)] := lapply(.SD, as.numeric), 
    .SDcols = names(M)[-c(1,2)]][, 
                                  c(lapply(.SD[, ((3:9)-2), with=FALSE], sum), 
                                    lapply(.SD[, ((10:12)-2), with=FALSE], mean)), 
                                   by = eval(names(M)[c(1,2)])]


#>      V1    V2    V3    V4    V5    V6    V7    V8    V9    V10    V11    V12
#> 1: cat1 text1 51978 49854 48476 49451 49620 49870 50248 50.193 51.516 49.694
#> 2: cat2 text2 50607 50097 50572 50507 48960 51419 48905 49.700 49.631 48.863
#> 3: cat3 text3 51033 50060 49742 50345 51532 51299 50957 50.192 50.227 50.689      
M--
  • 25,431
  • 8
  • 61
  • 93