2

I have a R data frame like this with 45389 rows

             gene_id     KOIN1    KOIN2     KOIN3     KOIP1    KOIP2       KOIP3
1 ENSMUSG00000000001 6.0056300 4.677550 6.3490400 9.9992300 9.931780 12.56900000
2 ENSMUSG00000000003 0.0000000 0.000000 0.0000000 0.0000000 0.000000  0.00000000
3 ENSMUSG00000000028 0.9988830 0.407537 1.5629300 0.1845460 1.899790  0.85186600
4 ENSMUSG00000000031 0.0000000 0.818696 0.3708190 0.0419544 0.000000  0.02832700
5 ENSMUSG00000000037 0.0160579 0.172857 0.0988266 0.0000000 1.174690  0.00726742
6 ENSMUSG00000000049 0.3923090 0.000000 0.0000000 0.0000000 0.124112  0.01811530

and so on...

There are some duplicates in the gene_id column. For example,

5090  ENSMUSG00000025515 0.00000000 0.00000000 0.1572500 0.000000000 0.000000 0.0000000
5091  ENSMUSG00000025515 0.00000000 0.00000000 0.1572500 0.000000000 0.000000 0.0000000
5095  ENSMUSG00000025515 0.00000000 0.00000000 0.0386388 0.000000000 0.000000 0.0000000
5096  ENSMUSG00000025515 0.00000000 0.00000000 0.0386388 0.000000000 0.000000 0.0000000
5100  ENSMUSG00000025515 0.00000000 0.00000000 0.0000000 0.000000000 0.000000 0.0000000
5101  ENSMUSG00000025515 0.00000000 0.00000000 0.0000000 0.000000000 0.000000 0.0000000
5105  ENSMUSG00000025515 0.33817000 0.06733700 0.4894620 0.000000000 0.000000 0.0000000
5106  ENSMUSG00000025515 0.33817000 0.06733700 0.4894620 0.000000000 0.000000 0.0000000
5110  ENSMUSG00000025515 0.00863568 0.00000000 0.0337577 0.000000000 0.000000 0.0000000
5111  ENSMUSG00000025515 0.00863568 0.00000000 0.0337577 0.000000000 0.000000 0.0000000

What I basically want to do is collapse all the duplicates down into one row with the values for each column being a sum of all the values.

I thought ddply from plyr package would work but it still gives me all the duplicates

newdataframe <- ddply(dataframe,"gene_id",numcolwise(sum))

This is what I ran.

Any suggestions?

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Sang soo Seo
  • 21
  • 1
  • 3
  • maybe you need `summarise` in there somehow? The non-numeric values won't get collapsed ... guessing (not testing), `ddply(dataframe,"gene_id",summarise,numcolwise(sum))` ? – Ben Bolker Oct 16 '15 at 17:57
  • Or `library(data.table) ; setDT(df)[, lapply(.SD, sum), by = gene_id]` – David Arenburg Oct 18 '15 at 15:05

2 Answers2

7

Another option:

library(dplyr)
df %>%
  group_by(gene_id) %>%
  summarise_each(funs(sum))

Which gives:

#Source: local data frame [7 x 7]
#
#             gene_id     KOIN1    KOIN2     KOIN3     KOIP1    KOIP2       KOIP3
#              (fctr)     (dbl)    (dbl)     (dbl)     (dbl)    (dbl)       (dbl)
#1 ENSMUSG00000000001 6.0056300 4.677550 6.3490400 9.9992300 9.931780 12.56900000
#2 ENSMUSG00000000003 0.0000000 0.000000 0.0000000 0.0000000 0.000000  0.00000000
#3 ENSMUSG00000000028 0.9988830 0.407537 1.5629300 0.1845460 1.899790  0.85186600
#4 ENSMUSG00000000031 0.0000000 0.818696 0.3708190 0.0419544 0.000000  0.02832700
#5 ENSMUSG00000000037 0.0160579 0.172857 0.0988266 0.0000000 1.174690  0.00726742
#6 ENSMUSG00000000049 0.3923090 0.000000 0.0000000 0.0000000 0.124112  0.01811530
#7 ENSMUSG00000025515 0.6936114 0.134674 1.4382170 0.0000000 0.000000  0.00000000
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
5

Plain old aggregate would do:

newdataframe <- aggregate(. ~ gene_id, dataframe, sum)

The formula reads everything else aggregated by gene_id, and sum to compute the sum of all values. You could also use mean for instance.

If you just want some of the other columns, you can cbind them:

newdataframe <- aggregate(cbind(col1, col2) ~ gene_id, dataframe, sum)
Julián Urbano
  • 8,378
  • 1
  • 30
  • 52