Sum multiple variables by group

Question

I have a R data frame like this with 45389 rows

             gene_id     KOIN1    KOIN2     KOIN3     KOIP1    KOIP2       KOIP3
1 ENSMUSG00000000001 6.0056300 4.677550 6.3490400 9.9992300 9.931780 12.56900000
2 ENSMUSG00000000003 0.0000000 0.000000 0.0000000 0.0000000 0.000000  0.00000000
3 ENSMUSG00000000028 0.9988830 0.407537 1.5629300 0.1845460 1.899790  0.85186600
4 ENSMUSG00000000031 0.0000000 0.818696 0.3708190 0.0419544 0.000000  0.02832700
5 ENSMUSG00000000037 0.0160579 0.172857 0.0988266 0.0000000 1.174690  0.00726742
6 ENSMUSG00000000049 0.3923090 0.000000 0.0000000 0.0000000 0.124112  0.01811530

and so on...

There are some duplicates in the gene_id column. For example,

5090  ENSMUSG00000025515 0.00000000 0.00000000 0.1572500 0.000000000 0.000000 0.0000000
5091  ENSMUSG00000025515 0.00000000 0.00000000 0.1572500 0.000000000 0.000000 0.0000000
5095  ENSMUSG00000025515 0.00000000 0.00000000 0.0386388 0.000000000 0.000000 0.0000000
5096  ENSMUSG00000025515 0.00000000 0.00000000 0.0386388 0.000000000 0.000000 0.0000000
5100  ENSMUSG00000025515 0.00000000 0.00000000 0.0000000 0.000000000 0.000000 0.0000000
5101  ENSMUSG00000025515 0.00000000 0.00000000 0.0000000 0.000000000 0.000000 0.0000000
5105  ENSMUSG00000025515 0.33817000 0.06733700 0.4894620 0.000000000 0.000000 0.0000000
5106  ENSMUSG00000025515 0.33817000 0.06733700 0.4894620 0.000000000 0.000000 0.0000000
5110  ENSMUSG00000025515 0.00863568 0.00000000 0.0337577 0.000000000 0.000000 0.0000000
5111  ENSMUSG00000025515 0.00863568 0.00000000 0.0337577 0.000000000 0.000000 0.0000000

What I basically want to do is collapse all the duplicates down into one row with the values for each column being a sum of all the values.

I thought ddply from plyr package would work but it still gives me all the duplicates

newdataframe <- ddply(dataframe,"gene_id",numcolwise(sum))

This is what I ran.

Any suggestions?

maybe you need `summarise` in there somehow? The non-numeric values won't get collapsed ... guessing (not testing), `ddply(dataframe,"gene_id",summarise,numcolwise(sum))` ? — Ben Bolker, Oct 16 '15 at 17:57
Or `library(data.table) ; setDT(df)[, lapply(.SD, sum), by = gene_id]` — David Arenburg, Oct 18 '15 at 15:05

score 7 · Answer 1 · answered Oct 16 '15 at 17:50

Another option:

library(dplyr)
df %>%
  group_by(gene_id) %>%
  summarise_each(funs(sum))

Which gives:

#Source: local data frame [7 x 7]
#
#             gene_id     KOIN1    KOIN2     KOIN3     KOIP1    KOIP2       KOIP3
#              (fctr)     (dbl)    (dbl)     (dbl)     (dbl)    (dbl)       (dbl)
#1 ENSMUSG00000000001 6.0056300 4.677550 6.3490400 9.9992300 9.931780 12.56900000
#2 ENSMUSG00000000003 0.0000000 0.000000 0.0000000 0.0000000 0.000000  0.00000000
#3 ENSMUSG00000000028 0.9988830 0.407537 1.5629300 0.1845460 1.899790  0.85186600
#4 ENSMUSG00000000031 0.0000000 0.818696 0.3708190 0.0419544 0.000000  0.02832700
#5 ENSMUSG00000000037 0.0160579 0.172857 0.0988266 0.0000000 1.174690  0.00726742
#6 ENSMUSG00000000049 0.3923090 0.000000 0.0000000 0.0000000 0.124112  0.01811530
#7 ENSMUSG00000025515 0.6936114 0.134674 1.4382170 0.0000000 0.000000  0.00000000

score 5 · Answer 2 · answered Oct 16 '15 at 17:40

Plain old aggregate would do:

newdataframe <- aggregate(. ~ gene_id, dataframe, sum)

The formula reads everything else aggregated by gene_id, and sum to compute the sum of all values. You could also use mean for instance.

If you just want some of the other columns, you can cbind them:

newdataframe <- aggregate(cbind(col1, col2) ~ gene_id, dataframe, sum)

Sum multiple variables by group

2 Answers2

Linked

Related