R: efficiently add up tables whose rows are in a different order

Question

At some point in my code, I get a list of tables that looks much like this:

[[1]]
     cluster_size start end number       p_value
13             2    12  13    131 4.209645e-233
12             1    12  12    100 6.166824e-185
22            11    12  22    132 6.916323e-143
23            12    12  23    133 1.176194e-139
13             1    13  13     31  3.464284e-38
13            68    13 117     34  3.275941e-37
23            78    23 117      2  4.503111e-32

....

[[2]]
      cluster_size start end number       p_value
13             2    12  13    131 4.209645e-233
12             1    12  12    100 6.166824e-185
22            11    12  22    132 6.916323e-143
23            12    12  23    133 1.176194e-139
13             1    13  13     31  3.464284e-38

....

While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row when I look at the tables in different list elements, so I can't just do a simple sum.

The brute force way to do this is to: 1) make a blank table, 2) copy in the appropriate $cluster_size, $start, $end, $number columns from the first table, and 3) pull the correct p-values from all the tables using a which() statement. Is there a more clever way of doing this? Or is this pretty much it?

Edit: I was asked for a dput file of the data. It's located here: http://alrig.com/code/

In the sample case, the order of the rows happens to match. That will not always be the case.

When posting data in R, it's a good idea to provide a reproducible example, for instance by doing `dput` on a reasonable subset of your data and posting the result. — David Robinson, Sep 15 '12 at 04:20

score 3 · Accepted Answer · edited May 23 '17 at 12:09

Seems like you can do this in two steps:

1) Convert your list to a data.frame.
2) Use any of the split-apply-combine approaches to summarize.

Assuming your data was named X, here's what you could do:

library(plyr)
# Stack the list into one data.frame first, since every list element is of class matrix
XDF <- as.data.frame(do.call("rbind", X))
# Sum p_value within each combination of the grouping columns
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
   cluster_size start end number          sump
1             1    12  12    100 5.550142e-184
2             1    13  13     31  3.117856e-37
3             1    22  22      1  9.000000e+00
...
29          105    23 117      2  6.271469e-16
30          106    22 146     13  7.266746e-25
31          107    23 146     12  1.382328e-25

Lots of other aggregation techniques are covered here. I'd look at the data.table package if your data is large.
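
As one dependency-free illustration of those other techniques, the same stack-then-sum idea can be sketched with base R's `aggregate`. The toy list `X` below is made up (two small matrices holding the same rows in a different order) and only stands in for the real data; the names `cols` and `res` are likewise just for this sketch.

```r
# Toy stand-in for X: two matrices with the same rows in a different order.
# All values are invented for illustration.
cols <- c("cluster_size", "start", "end", "number", "p_value")
X <- list(
  matrix(c( 2, 12, 13, 131, 0.01,
            1, 12, 12, 100, 0.02,
           11, 12, 22, 132, 0.03),
         ncol = 5, byrow = TRUE, dimnames = list(NULL, cols)),
  matrix(c(11, 12, 22, 132, 0.30,
            2, 12, 13, 131, 0.10,
            1, 12, 12, 100, 0.20),
         ncol = 5, byrow = TRUE, dimnames = list(NULL, cols))
)

# Stack all tables into one data.frame
XDF <- as.data.frame(do.call(rbind, X))

# Sum p_value within each combination of the four identifying columns
res <- aggregate(p_value ~ cluster_size + start + end + number,
                 data = XDF, FUN = sum)
names(res)[names(res) == "p_value"] <- "sump"
print(res)
```

Because the grouping is done on column values rather than on row position, the order of the rows within each table does not matter.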

answered Sep 15 '12 at 04:35

Chase

Thanks for this. This is very close. The only difference is that I need to have the formula work by collapsing over two variables, not just one. In my example it would be $start and $end. – user1357015 Sep 15 '12 at 04:49
@user1357015 - I think my updated answer does exactly what you want, right? I just updated it once I saw you gave some reproducible data...if not - can you explain what you mean in more detail? – Chase Sep 15 '12 at 04:51
Hi, ok, this seems a ton closer but now there are duplicates. For example, in each element of the list, the table had 21 rows. In the result, there are 31 rows. The row with $start = 13 and $end = 117 is duplicated. The p-values are different too, so it's not just a matter of cutting – user1357015 Sep 15 '12 at 04:58
@user - the second argument to `ddply` is the variable(s) that you want to group on...I think I must have misinterpreted your grouping variables...currently, the code groups on `cluster_size, start, end, number`. If you only want to group on `start,end`, adjust the code accordingly. – Chase Sep 15 '12 at 05:03
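
To make that comment concrete, here is a small sketch with made-up data showing how the choice of grouping variables changes the number of rows in the result. The values and the names `by_all` and `by_span` are only illustrative; the toy `XDF` is built so that the (`start` = 13, `end` = 117) row carries a different `number` in each source table, which is one way the duplicates described above could arise.

```r
library(plyr)

# Toy data (invented values): rows 1-2 mimic one table, rows 3-4 another.
# The (start = 13, end = 117) row has number 34 in one and 35 in the other.
XDF <- data.frame(
  cluster_size = c(1, 68, 1, 68),
  start        = c(12, 13, 12, 13),
  end          = c(12, 117, 12, 117),
  number       = c(100, 34, 100, 35),
  p_value      = c(0.02, 0.01, 0.20, 0.10)
)

# Grouping on all four columns keeps the two (13, 117) rows apart
by_all <- ddply(XDF, .(cluster_size, start, end, number),
                summarize, sump = sum(p_value))

# Grouping on start and end only collapses them into a single row
by_span <- ddply(XDF, .(start, end), summarize, sump = sum(p_value))

print(nrow(by_all))
print(nrow(by_span))
```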
Ah, perfect. I see, still going through `ddply` after your post so didn't catch that right away! – user1357015 Sep 15 '12 at 05:06
Hm, looking at this a final time, is there a way I can keep cluster_size and number in the final matrix, but not "group" on them? E.g., for a particular start, end, whatever was the cluster size, that's what it would be in the final result. No matter what table you look at, for a particular (start,end) combination, cluster_size and number would never change. – user1357015 Sep 15 '12 at 05:26
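
The thread leaves this last question open. One possible approach, sketched here in base R and assuming (as stated in the comment) that `cluster_size` and `number` are constant for each (`start`, `end`) pair, is to sum over the pair and then merge the constant columns back in. The data and the names `sums`, `keys`, and `res` are made up for illustration.

```r
# Toy data (invented values): cluster_size and number are constant
# within each (start, end) pair, as the comment assumes.
XDF <- data.frame(
  cluster_size = c(1, 68, 68, 1),
  start        = c(12, 13, 13, 12),
  end          = c(12, 117, 117, 12),
  number       = c(100, 34, 34, 100),
  p_value      = c(0.02, 0.01, 0.10, 0.20)
)

# Sum p_value per (start, end) pair
sums <- aggregate(p_value ~ start + end, data = XDF, FUN = sum)

# One row of the carried-along columns per (start, end) pair
keys <- unique(XDF[, c("start", "end", "cluster_size", "number")])

# Guard the assumption: each pair must map to exactly one key row
stopifnot(!anyDuplicated(keys[, c("start", "end")]))

# Attach the constant columns to the sums and restore the column order
res <- merge(keys, sums, by = c("start", "end"))
res <- res[, c("cluster_size", "start", "end", "number", "p_value")]
print(res)
```

If the assumption does not hold for the real data, the `stopifnot` check fails rather than silently producing duplicated rows.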
