R reshape2 or dplyr/tidyr: calculating values' percentage of sub-total

Question

Let's use the airquality dataset as a base.

myaqm <- melt(airquality, id=c("Month", "Day"), na.rm = TRUE)

This gives me a simplification of my real dataset. So far, I've done this:

myaqm_dcast <- dcast(myaqm, Day+variable~Month, value.var = "value", sum, margins=c("Day", "variable", "Month"))

Which gives me this:

> head(myaqm_dcast, n=10)
   Day variable     5     6     7     8     9  (all)
1    1    Ozone  41.0   0.0 135.0  39.0  96.0  311.0
2    1  Solar.R 190.0 286.0 269.0  83.0 167.0  995.0
3    1     Wind   7.4   8.6   4.1   6.9   6.9   33.9
4    1     Temp  67.0  78.0  84.0  81.0  91.0  401.0
5    1    (all) 305.4 372.6 492.1 209.9 360.9 1740.9
6    2    Ozone  36.0   0.0  49.0   9.0  78.0  172.0
7    2  Solar.R 118.0 287.0 248.0  24.0 197.0  874.0
8    2     Wind   8.0   9.7   9.2  13.8   5.1   45.8
9    2     Temp  72.0  74.0  85.0  81.0  92.0  404.0
10   2    (all) 234.0 370.7 391.2 127.8 372.1 1495.8

However, I'm trying to create an additional variables'-percentage-of-days'-subtotal column for each current numeric column. So my goal is something like:

Day     variable        5       5(day %)        6       6(day %)        7       7(day %)        8       8(day %)        9       9(day %)        (all)       (all)(day %)
1       Ozone       41      13.4%       0       0.0%        135     27.4%       39      18.6%       96      26.6%       311     17.9%
1       Solar.R     190     62.2%       286     76.8%       269     54.7%       83      39.5%       167     46.3%       995     57.2%
1       Wind        7.4     2.4%        8.6     2.3%        4.1     0.8%        6.9     3.3%        6.9     1.9%        33.9        1.9%
1       Temp        67      21.9%       78      20.9%       84      17.1%       81      38.6%       91      25.2%       401     23.0%
1       (all)       305.4       100.0%      372.6       100.0%      492.1       100.0%      209.9       100.0%      360.9       100.0%      1740.9      100.0%
2       Ozone       36      15.4%       0       0.0%        49      12.5%       9       7.0%        78      21.0%       172     11.5%
2       Solar.R     118     50.4%       287     77.4%       248     63.4%       24      18.8%       197     52.9%       874     58.4%
2       Wind        8       3.4%        9.7     2.6%        9.2     2.4%        13.8        10.8%       5.1     1.4%        45.8        3.1%
2       Temp        72      30.8%       74      20.0%       85      21.7%       81      63.4%       92      24.7%       404     27.0%
2       (all)       234     100.0%      370.7       100.0%      391.2       100.0%      127.8       100.0%      372.1       100.0%      1495.8      100.0%

Sorry for the terrible formatting! But as you can hopefully see, the new additional columns give a percentage of each variable for that day and that month.

I've found another Stack Overflow helper suggest using tidyr and dplyr but I just couldn't adapt their example to my needs. Would someone please show me what to do?

Ryan Morton · Accepted Answer · 2017-01-24T23:08:07.220

4

I wrote a percentage function and used that with dplyr. Then I join the columns together.

pct <- function(x) {x/sum(x)}

df <- myaqm_dcast %>%
  filter(variable != "(all)") %>%
  group_by(Day) %>%
  mutate_each(funs(pct), 3:8) %>%
  inner_join(myaqm_dcast, by = c("Day", "variable"))

Edit: You can modify the percentage function to print however you need (*100, paste the % symbol).

Edit 2: If you can live without the (all) rows, I've filtered it out. You can always calculate the column sums using the summarise_each() function.

edited Jan 24 '17 at 23:08

answered Jan 24 '17 at 22:33

Ryan Morton

2,605
1
16
19

1

This would be the right answer if one were to divide every column by column 8 - but that is not the question. For each column, you have to loop over every 5 rows and divide by `(all)` . So first you have to melt by column 5, 6, 7, 8, 9, (all) and then divide by row `all` – iskandarblue Jan 24 '17 at 22:46
Thanks the_darkside. Ryan's answer seems to work as I get six new columns with decimals that match the percentages in my example (thanks Ryan!), however I'd love for you to elaborate on your explanation.
The only thing I want to add is that I had to use instead:
pct <- function(x) {x/sum(x)*2}
as I think the inclusion of (all) in the sum was causing the decimals to be half of what they ought to be.
Note: briefly holding off accepting answer for a while in case you wanted to add something extra. – Belinda-Jane Netan Jan 24 '17 at 23:00
1

This is because Ryan's solution adds the values for each of the 4 variables together with the `(all)` row - meaning the total is divided by twice the amount it should be. Instead of multiplying by 2 in the `pct` function you could insert `slice(-n())` in the line before `mutate_each`. – Jan 24 '17 at 23:07
Yeah, I missed the (all) rows in the original answer. Sorry about that. – Ryan Morton Jan 24 '17 at 23:09
@Ryan As a further refinement of the 'pct' function, let's say we want to exclude Wind from the sum of (all), but still wish to calculated it as another %. That is, (all) == Ozone + Solar.R + Temp, but we still provide a percentage for Wind/(all). How do we perform the exclusion within the pct? Also, should I ask this as a new question? – Belinda-Jane Netan Jan 29 '17 at 19:40
I'd change the filter to `filter(variable %in% (c('Ozone', 'Solar.R', 'Temp')))` and leave the function alone. The function is just the rote calculation for the mutate_each() function and any filtering should probably be done separately. – Ryan Morton Jan 30 '17 at 15:24

score 1 · Answer 2 · answered Jan 25 '17 at 12:57

1

Why would you reshape your data? Dataframe myaqm meets the requirements of tidy data (each column is a variable, each row is an observation). You can do your calculations in this format:

library(dplyr)
myaqm %>%
  group_by(Day, variable) %>%
  mutate(all = sum(value),
     perc = paste0(round(100 * value/all, 2), "%")

Even for creating plots (eg. by ggplot) this format is better suited than the reshaped one.

If really necessary you can reshape with tidyr/dplyr:

...
gather(key, val, -c(Month:variable, all)) %>%
unite(temp, Month, key) %>%
spread(temp, val)

answered Jan 25 '17 at 12:57

MarkusN

3,051
1
18
26

Thanks for your help, but it looks like your percentages are based on the sum of all months for a given day and given variable, rather than the sum of variables for a given day in a given month. To your question "Why would you reshape your data?" let me paraphrase the client and say, "Because I'm paying you to give it to me in that format." Thanks again for your help! – Belinda-Jane Netan Jan 25 '17 at 18:44

R reshape2 or dplyr/tidyr: calculating values' percentage of sub-total

2 Answers2