Aggregate observations across samples by rownames (dplyr) in R

Question

Aloha,

I am trying to get the total counts for each row name in my sample matrix. For some reason, I have tried both rowsum and then converting to a data frame and using dplyr::group_by but they are giving errors. Here is a subset of example data:

mat = matrix(c(0,1,2,3,4), nrow=3, ncol = 5)
rownames(mat) <- c("CHO", "NO", "O")
colnames(mat) <-  c("sample_1", "sample_2", "sample_3", "sample_4", "sample_5")`

I would like to have a resulting data frame with the formula name, then the sum of observations across samples and the percent of samples formula was observed in overall.

It seems easy enough but I have tried all different combinations aggregating the data with no avail and would be very appreciative of some guidance.

YOu have unique row names. Not clear about what you want to aggregate — akrun, Nov 25 '20 at 00:36
What is your expected output for `mat` that you shared? Do you want to sum the data rowwise? — Ronak Shah, Nov 25 '20 at 03:06

score 0 · Accepted Answer · answered Nov 25 '20 at 00:57

We may need only rowSums

rowSums(mat)

If there are duplicate rownames (in the example data, the rownames are unique), then we use rowsum with group specified as the rownames

rowsum(mat, row.names(mat))

and then we use rowSums on top of that

rowSums(rowsum(mat, row.names(mat)))

score 0 · Answer 2 · answered Nov 25 '20 at 18:17

The answer from @akrun does exactly what the poster asks, but I find myself in a similar but slightly different situation a lot, but with a dataframe that has what would be duplicate rownames in a matrix, so these are now values in a column instead (called "compound" below), like this:

set.seed(2347813)
df <- data.frame(matrix(sample(c(0,1,2,3,4,5,6,7,8,9), 
                               size=30, replace=T), nrow=6, ncol=5)) 
colnames(df) <-  c("sample_1", "sample_2", "sample_3", "sample_4", "sample_5")
df$compound <- c("CHO", "NO", "O", "CHO", "NO", "O")

which looks like:

  sample_1 sample_2 sample_3 sample_4 sample_5 compound
1        0        1        4        9        1      CHO
2        3        8        3        0        5       NO
3        8        9        7        1        7        O
4        8        2        9        7        7      CHO
5        3        8        9        0        5       NO
6        6        1        6        7        5        O

then I want to summarize the data in various ways but I want to group it by compound, and use piping (%>% below) from dplyr/tidyverse.
like the original question, if we want the total by compound we would do:

df %>%
  group_by(compound) %>%
  summarize(total=sum(c_across(starts_with("sample"))))

which would give us:

  compound total
  <chr>    <dbl>
1 CHO         48
2 NO          44
3 O           57

but I think the best part is the ability to do multiple summarizing operations at once. let's say we wanted the total, mean, standard deviation, and the mean of only sample_1 through sample_3, we can get that all in one piped command:

df %>%
  group_by(compound) %>%
  summarize(total=sum(c_across(starts_with("sample"))), 
            grand_mean=mean(c_across(starts_with("sample"))),
            sd=sd(c_across(starts_with("sample"))), 
            mean_13=mean(c_across(sample_1:sample_3))) # mean of sample 1-3

which gives us:

# A tibble: 3 x 5
  compound total grand_mean    sd mean_13
  <chr>    <dbl>      <dbl> <dbl>   <dbl>
1 CHO         48        4.8  3.58    4   
2 NO          44        4.4  3.20    5.67
3 O           57        5.7  2.71    6.17

combining piping (%>%), group_by, mutate, and the new versions of across (above I used c_across) you can get a lot done in one go.

Aggregate observations across samples by rownames (dplyr) in R

2 Answers2