I have created a summarized data frame using dplyr's `summarize` function. It contains two factors, "Size of Firm" and "Case Status", plus the number of records (n) for each combination of the two. There are three levels for Size of Firm and four levels for Case Status, so the summarized data frame has 12 rows in total. Here is the script for the summarized data frame (including the preceding `group_by` call):

df <- group_by(df, df$Firm.Size, df$`Case Status`)
summ_firm <- summarize(df, num_records = n())

I want to add a column to the summarized data frame that gives each row's number of records (i.e. the count for a given combination of "Size of Firm" and "Case Status") as a proportion of the total records for that row's Size of Firm.

In other words, if "Small Firms" have a total of 100 records and the row containing records for "Small Firms" that were "Certified" (level of case status) has 20 records, I would want this new column to populate with 0.2 for that row.

Here is the actual output of 'summ_firm' referenced earlier in the post.

  `df_nona_firm$Firm.Size` `df_nona_firm$\`Case Status\`` num_records
   <fct>                    <fct>                                <int>
 1 0-99 Employees           Certified                            32565
 2 0-99 Employees           Certified-Expired                    24493
 3 0-99 Employees           Denied                                6346
 4 0-99 Employees           Withdrawn                             3155
 5 1,000+ Employees         Certified                            63649
 6 1,000+ Employees         Certified-Expired                    51981
 7 1,000+ Employees         Denied                                3532
 8 1,000+ Employees         Withdrawn                             4078
 9 100-999 Employees        Certified                            24752
10 100-999 Employees        Certified-Expired                    19095
11 100-999 Employees        Denied                                2830
12 100-999 Employees        Withdrawn                             2537
J. Staak
  • Please paste the output of `dput(summ_firm)` or a subset into your question so that we can easily access your data. – Djork Feb 28 '18 at 02:34

1 Answer

This should work:

library(dplyr)
summ_firm <- df %>%
  group_by(Firm.Size, Case.Status) %>%
  summarize(records = n()) %>%
  group_by(Firm.Size) %>%
  mutate(proportion = records/sum(records))
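
As a quick check of the approach above, here is a minimal sketch that starts from the counts printed in the question rather than from the raw `df`. The column names `Firm.Size`, `Case.Status` and `records` follow this answer, and the object name `with_prop` is only illustrative. The real column names in the asker's data may differ.

```r
library(dplyr)

# Counts copied from the summ_firm output shown in the question.
# Column names follow the answer's code and are an assumption about the real data.
summ_firm <- tibble(
  Firm.Size = rep(c("0-99 Employees", "1,000+ Employees", "100-999 Employees"),
                  each = 4),
  Case.Status = rep(c("Certified", "Certified-Expired", "Denied", "Withdrawn"),
                    times = 3),
  records = c(32565, 24493, 6346, 3155,
              63649, 51981, 3532, 4078,
              24752, 19095, 2830, 2537)
)

# Divide each row's count by the total for its firm size
with_prop <- summ_firm %>%
  group_by(Firm.Size) %>%
  mutate(proportion = records / sum(records)) %>%
  ungroup()

# Sanity check: the proportions within each firm size should add up to 1
with_prop %>%
  group_by(Firm.Size) %>%
  summarize(total = sum(proportion))
```

The trailing `ungroup()` is optional. It just avoids carrying the grouping into later steps by accident.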
  • Fantastic - thanks so much. This is what I went with, and it worked great: `summ_firm <- df %>% group_by(Firm.Size, Case.Status) %>% summarise(n=n()) %>% mutate(proportion = n/sum(n))`. Wondering if R knows to use the first variable in `group_by` as the denominator in the proportion calculation? – J. Staak Feb 28 '18 at 02:52
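  • Not because it comes first. `summarise()` drops the last grouping variable, so the result is still grouped by Firm.Size and `mutate()` works within those groups. Specifying it explicitly (see the 2nd `group_by` statement) makes the denominator clear and does not rely on that default. – user124543131234523 Feb 28 '18 at 21:56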
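
To see which grouping is in effect after `summarise()`, one option is to inspect the result with `group_vars()`. The sketch below uses made-up toy data (`toy`, `counts` and `props` are hypothetical names) with the same column names as the answer. Recent dplyr versions also print a message about the resulting grouping at this step.

```r
library(dplyr)

# Toy data (hypothetical), using the same column names as the answer
toy <- tibble(
  Firm.Size   = c("Small", "Small", "Small", "Large", "Large"),
  Case.Status = c("Certified", "Certified", "Denied", "Certified", "Denied")
)

counts <- toy %>%
  group_by(Firm.Size, Case.Status) %>%
  summarise(n = n())

# summarise() removes the last grouping variable (Case.Status),
# so the result is still grouped by the remaining one
group_vars(counts)

# mutate() therefore computes sum(n) within each Firm.Size
props <- counts %>%
  mutate(proportion = n / sum(n))
```

Adding an explicit `group_by(Firm.Size)` before `mutate()` gives the same proportions here and keeps the code correct if the order of the grouping variables is ever changed.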