The answer from @akrun does exactly what the poster asks, but I find myself in a similar but slightly different situation a lot, but with a dataframe that has what would be duplicate rownames in a matrix, so these are now values in a column instead (called "compound" below), like this:
set.seed(2347813)
df <- data.frame(matrix(sample(c(0,1,2,3,4,5,6,7,8,9),
size=30, replace=T), nrow=6, ncol=5))
colnames(df) <- c("sample_1", "sample_2", "sample_3", "sample_4", "sample_5")
df$compound <- c("CHO", "NO", "O", "CHO", "NO", "O")
which looks like:
sample_1 sample_2 sample_3 sample_4 sample_5 compound
1 0 1 4 9 1 CHO
2 3 8 3 0 5 NO
3 8 9 7 1 7 O
4 8 2 9 7 7 CHO
5 3 8 9 0 5 NO
6 6 1 6 7 5 O
then I want to summarize the data in various ways but I want to group it by compound, and use piping (%>%
below) from dplyr/tidyverse.
like the original question, if we want the total by compound we would do:
df %>%
group_by(compound) %>%
summarize(total=sum(c_across(starts_with("sample"))))
which would give us:
compound total
<chr> <dbl>
1 CHO 48
2 NO 44
3 O 57
but I think the best part is the ability to do multiple summarizing operations at once. let's say we wanted the total, mean, standard deviation, and the mean of only sample_1 through sample_3, we can get that all in one piped command:
df %>%
group_by(compound) %>%
summarize(total=sum(c_across(starts_with("sample"))),
grand_mean=mean(c_across(starts_with("sample"))),
sd=sd(c_across(starts_with("sample"))),
mean_13=mean(c_across(sample_1:sample_3))) # mean of sample 1-3
which gives us:
# A tibble: 3 x 5
compound total grand_mean sd mean_13
<chr> <dbl> <dbl> <dbl> <dbl>
1 CHO 48 4.8 3.58 4
2 NO 44 4.4 3.20 5.67
3 O 57 5.7 2.71 6.17
combining piping (%>%
), group_by
, mutate
, and the new versions of across
(above I used c_across
) you can get a lot done in one go.