3

I have a dataframe structure like below:

No    A    B    C    D    Group
=========================
1    2    3    1    4    GA
2    4    5    3    1    GA
3    8    6    1    3    GA
4    6    1    3    2    GB
5    8    9    1    2    GB
6    8    1    9    1    GB

I want to calculate each cell percentage by their respective group.

Is there any faster way rather than looping? The size is really big so I need a faster method.

My expected result:

No    A      B       C       D    Group
=======================================
1    2/14    3/14    1/5     4/8    GA
2    4/14    5/14    3/5     1/8    GA
3    8/14    6/14    1/5     3/8    GA
4    6/22    1/11    3/13    2/5    GB
5    8/22    9/11    1/13    2/5    GB
6    8/22    1/11    9/13    1/5    GB
asachet
  • 6,620
  • 2
  • 30
  • 74
Bharata
  • 685
  • 3
  • 11
  • 23
  • It is not duplicate because the previous post only calculate 1 column percentage. What I want is multiple column. – Bharata Nov 26 '18 at 09:07
  • @h3rm4n @zx8754 I don't see how this is a duplicate of the suggested other questions. One is about aggregating multiple columns (dplyr's `summarise_at`), the other is about creating percentages for a single column (dplyr's `mutate`). Creating percentages for multiple columns is different and not addressed in either of those two (and requires dplyr's `mutate_at` which is in none of the 7 suggested answers) – asachet Nov 26 '18 at 10:22
  • yes, this is not duplicate because before I write this question, I have gone to those pages and I don't find the answer. – Bharata Nov 26 '18 at 10:45

2 Answers2

7

You can use the dplyr package.

For one column:

df %>%
group_by(Group) %>%
mutate(A_percent = A / sum(A)) # could use `A` instead of `A_percent`

For several columns at the same time, you can do the following which will overwrite the existing columns as you asked:

df %>%
group_by(Group) %>%
mutate_at(vars(A:D), funs(./sum(.)))

Note that if you wanted to create new columns instead of overwriting, you could have done:

df %>%
group_by(Group) %>%
mutate_at(vars(A:D), funs("percent" = ./sum(.)))

This would have created new columns with a "_percent" suffix.

If you have many columns, you may want a more powerful way to select the columns to process. Have a look at the list of select helpers you can use in vars(...).You can also simply use numerical indexes.

asachet
  • 6,620
  • 2
  • 30
  • 74
  • for vars(A:D), how do I choose the columns from the colnames? The number of the columns is big so it is impossible to enumerate. Can I use numerical index? – Bharata Nov 26 '18 at 09:39
  • Yes a numerical index will work just fine as well. If there is any sort of pattern (e.g. `var_A`, `var_B`, ..., `var_ZZZ`) then you can exploit that as well (in this case you could use `vars(starts_with("var_))` to select the columns. You can use regex with `vars(matches(regex))`. – asachet Nov 26 '18 at 09:42
  • Or much more simply, if the only columns you do *not* want to select are `No` and `Group`, you can select by elimination: `vars(everything(), -No, -Group)` – asachet Nov 26 '18 at 09:44
  • thanks, I use numerical index and it works like a charm! I need to learn dplyr from now – Bharata Nov 26 '18 at 10:44
2

With dplyr, we can group_by Group and use mutate_all to find ratio of all columns, column-wise.

library(dplyr)
df %>%
  select(-No) %>%
  group_by(Group) %>%
  mutate_all(funs(./sum(.)))


#     A      B      C     D Group
#  <dbl>  <dbl>  <dbl> <dbl> <fct>
#1 0.143 0.214  0.2    0.5   GA   
#2 0.286 0.357  0.6    0.125 GA   
#3 0.571 0.429  0.2    0.375 GA   
#4 0.273 0.0909 0.231  0.4   GB   
#5 0.364 0.818  0.0769 0.4   GB   
#6 0.364 0.0909 0.692  0.2   GB   
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Using `mutate_all` forces you to drop column `No` which is unnecessary. You can simply select what you want to mutate directly with `mutate_at`. – asachet Nov 26 '18 at 09:20
  • @antoine-sac yes, you are right. Thanks! You have included that version in your post. :) – Ronak Shah Nov 26 '18 at 09:27