1

I have a dataset in the following format stored in a large tibble in R:

Yr Gender Number Disease Age Population
1901 M 100 A 1-5 1500
1901 F 105 A 1-5 1400
1901 M 23 A 6-10 1300
1901 F 87 A 6-10 1400
1901 M 102 B 1-5 1500
1901 F 103 B 1-5 1400
1901 M 42 B 6-10 1300
1901 F 67 B 6-10 1400

The population is linked to each age bracket and gender within each year. I need to group by year and gender and then calculate (number/population) for each sex for each year (disease type isn't important).

I'm using this code: group_by(Yr,Gender) %>% summarise(rate=sum(Number/sum(unique(Population))))

However I know this is giving me incorrect results as when I summate the M and F figures produced, they come to more than they should do each year (by almost double).

Can anyone point me in the right direction here, I think it's because there are differing population figures for genders and ages but I can't figure out how to overcome this issue?

EDIT

Thanks for the replies - In the output I would like the rate to be calculated using the sum of numbers per gender but over the total population for the year rather than the grouped gender population i.e total number (per gender)/total population(for that year).

Alex
  • 13
  • 3
  • You may need to do this in two steps. First group by yr, gender and are. And summarize the sum of number and mean on population. Then group by yr an gender to summarize sum of number / sum of population – Dave2e Jan 29 '22 at 01:36
  • `group_by(Yr, Gender, Age, Population) %>% summarise(Number = sum(Number)) %>% summarize(rate = sum(Number) / Population)` – Jon Spring Jan 29 '22 at 04:39

1 Answers1

0

With the various different sums, this becomes a multi step process.

First step is to summarize by year, gender and age in order to determine the total number per age group and remove the redundant population values for the group.
The next step is summarize the total population for that year across all ages and genders. The last step is to summarize the intermediate values to determine the tot number by gender/tot pop for each year.

#test data
df<- structure(list(Yr = c(1901L, 1901L, 1901L, 1901L, 1901L, 1901L, 
                         1901L, 1901L, 1902L, 1902L, 1902L, 1902L, 1902L, 1902L, 1902L, 1902L), 
                  Gender = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F"), 
                  Number = c(100L, 105L, 23L, 87L, 102L, 103L, 42L, 67L, 100L, 105L, 23L, 87L, 102L, 103L, 42L, 67L), 
                  Disease = c("A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "A", "A", "B", "B", "B", "B"), 
                  Age = c("1-5", "1-5",   "6-10", "6-10", "1-5", "1-5", "6-10", "6-10", "1-5", "1-5", "6-10", 
                          "6-10", "1-5", "1-5", "6-10", "6-10"), 
                  Population = c(1500L, 1400L, 1300L, 1400L, 1500L, 1400L, 1300L, 1400L, 1000L, 1100L, 
                   1200L, 1300L, 1000L, 1100L, 1200L, 1300L)), 
             class = "data.frame", row.names = c(NA, -16L))

library(dplyr)
#step 1 & step 2 (begins after the ungroup) 
intermediate <- df %>% group_by(Yr,Gender, Age) %>% 
           summarize(Num= sum(Number), Pop = mean(Population)) %>%
           ungroup() %>% 
           group_by(Yr) %>% mutate(YearPop = sum(Pop)) %>% ungroup()

#final step
answer <- intermediate1 %>% group_by(Yr, Gender) %>% summarize(rate = sum(Num) / mean(YearPop))

#final answer  
        Yr Gender   rate
      <int> <chr>   <dbl>
   1  1901 F      0.0646
   2  1901 M      0.0477
   3  1902 F      0.0787
   4  1902 M      0.0580
Dave2e
  • 22,192
  • 18
  • 42
  • 50