
How do I use dplyr to create proportions of a level of a factor variable for each state? For example, I'd like to add a variable that indicates the percent of females within each state to the data frame.

# gen data
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )  
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)

Here's an attempt that I know is wrong, but gets me access to the information:

 school.data %>%
   group_by(state, gender %in% c("Female")) %>%
   summarise(count = n()) %>%
   mutate(test_count = count)

I have a hard time with the count and mutate functions, which makes it hard to get much further. They don't behave as I'd expect.

bfoste01
  • Do you want a new data frame with one row per state or do you want your old data frame where every row has the percentage of females for that state? – Gregor Thomas Aug 09 '16 at 18:52
  • I need a new column in the original data frame that holds the percent female in that state. For example, the value for Maine would repeat on every row for Maine. – bfoste01 Aug 09 '16 at 19:47

3 Answers


To add a new column to your existing data frame:

school.data %>% 
    group_by(state) %>%
    mutate(pct.female = mean(gender == "Female"))

Use summarize rather than mutate if you just want one row per state rather than adding a column to the original data.

school.data %>%
   group_by(state) %>%
   summarize(pct.female = mean(gender == "Female"))
# # A tibble: 2 x 2
#    state pct.female
#   <fctr>      <dbl>
# 1  Idaho       0.75
# 2  Maine       0.70
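
A minimal sketch of why `mean()` returns a proportion here: `gender == "Female"` yields a logical vector, and R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic, so the mean is the number of females divided by the group size. The `toy` data frame and the `n_female` and `n_total` columns are made up for illustration and are not part of the question's data.

```r
library(dplyr)

# Hypothetical hand-made data (not the question's random sample)
toy <- data.frame(
  state  = c("Idaho", "Idaho", "Idaho", "Idaho", "Maine", "Maine"),
  gender = c("Female", "Female", "Female", "Male", "Female", "Male")
)

by_state <- toy %>%
  group_by(state) %>%
  summarize(
    n_female   = sum(gender == "Female"),   # TRUE counts as 1
    n_total    = n(),                       # rows in the group
    pct.female = mean(gender == "Female")   # equals n_female / n_total
  )

# Idaho: 3 of 4 rows are Female (0.75); Maine: 1 of 2 (0.5)
```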
Gregor Thomas
  • This throws an error in RStudio where a ',' is expected at the end of the mutate expression. Also, the mean function would only work if my categorical variable is binary; some of my real data have multiple levels (i.e. > 2). – bfoste01 Aug 09 '16 at 19:49
  • The code works fine - I suggest you recheck and make sure you have the syntax right. I can add a modification for more levels. I'm surprised that you didn't ask for a percent male column if you want to be able to generalize to more levels. Do you want percentage columns for all levels except the first? – Gregor Thomas Aug 09 '16 at 19:55
  • It wasn't a bug in the line of code. I was trying different things in previous lines and it was a holdover, so that is fixed. I have several factor variables with multiple levels. For some of these factor variables I want to be able to select any given level and get the percent of that level within each state. If the levels were white, hispanic, black... I'd want to be able to calculate percent hispanic within states and have that column added to my data frame. I don't always care about the percent of other factor levels. – bfoste01 Aug 09 '16 at 20:02
  • Well, `mean` would work fine for the percent hispanic as well, `mean(race_ethnicity == "hispanic")`. If you wanted compound groups (e.g., percent white or hispanic) you can still use `mean` and just replace `==` with `%in%`, e.g., `mean(race_ethnicity %in% c("hispanic", "white"))`. Whether your data has 2 or more levels, your *conditions* are binary (hispanic/not hispanic, female/not female, hispanic OR white/not hispanic nor white...) – Gregor Thomas Aug 09 '16 at 20:05
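
The point in the last comment can be sketched on a factor with more than two levels. This is only an illustration: the `toy` data frame, its values, and the `pct.*` column names are hypothetical; `race_ethnicity` is the variable name used in the comment above.

```r
library(dplyr)

# Hypothetical data: race_ethnicity has three levels
toy <- data.frame(
  state = c("Idaho", "Idaho", "Idaho", "Idaho",
            "Maine", "Maine", "Maine", "Maine"),
  race_ethnicity = c("hispanic", "white", "black", "hispanic",
                     "white", "white", "black", "hispanic")
)

toy_pct <- toy %>%
  group_by(state) %>%
  mutate(
    # single level: the condition is hispanic / not hispanic
    pct.hispanic = mean(race_ethnicity == "hispanic"),
    # compound group: the condition is (hispanic or white) / neither
    pct.hispanic.or.white = mean(race_ethnicity %in% c("hispanic", "white"))
  ) %>%
  ungroup()

# Idaho rows get pct.hispanic = 0.5, Maine rows get 0.25;
# pct.hispanic.or.white is 0.75 in both states
```

Because `mutate` is used rather than `summarize`, every original row is kept and the state-level value repeats on each row of that state.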

Gregor's answer gets to the heart of it. Here's a version that would give you counts and proportions for both genders per state:

library(dplyr)

gender.proportions <- school.data %>%
  group_by(state, gender) %>%
  summarize(n = n()) %>%            # count per state and gender
  group_by(state) %>%               # regroup by state only
  mutate(proportion = n / sum(n))   # proportion of each gender within its state

#   state gender     n proportion
#  <fctr> <fctr> <int>      <dbl>
#1  Idaho Female    16       0.80  
#2  Idaho   Male     4       0.20
#3  Maine Female    11       0.55
#4  Maine   Male     9       0.45

Edit:

In reference to OP's comment/request, the code below would repeat the male and female proportions for each individual student in each state:

gender.proportions <- group_by(school.data, state) %>% 
  mutate(prop.female = mean(gender == 'Female'), prop.male = mean(gender == 'Male'))

   student.id  state gender prop.female prop.male
        <int> <fctr> <fctr>       <dbl>     <dbl>
1         479  Idaho   Male         0.8       0.2
2         634  Idaho Female         0.8       0.2
3         175  Idaho Female         0.8       0.2
4         527  Idaho Female         0.8       0.2
5         368  Idaho Female         0.8       0.2
6         423  Idaho   Male         0.8       0.2
7         357  Idaho Female         0.8       0.2
8         994  Idaho Female         0.8       0.2
9         479  Idaho Female         0.8       0.2
10        634  Idaho Female         0.8       0.2
# ... with 30 more rows
jdobres
  • This is very close to what I'm looking for. I'm doing multilevel modeling, which is relevant because what I essentially need is a variable, say prop_female, that repeats .80 for all Idaho rows and .55 for all Maine rows in the master dataset. – bfoste01 Aug 09 '16 at 19:42
  • @bfoste01 Edited my response to produce exactly that. – jdobres Aug 09 '16 at 20:13
  • Our solutions end up being essentially the same, yes. I initially thought you wanted something slightly different. – jdobres Aug 09 '16 at 20:29

Here is one solution using a left_join.

state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )  
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)

library(dplyr)

# Count females per state
school.data %>%
    group_by(state) %>%
    summarise(female_count = sum(gender == "Female")) %>%
    # Join the total number of students per state
    left_join(school.data %>%
                  group_by(state) %>%
                  summarise(state_count = n()),
              by = "state") %>%
    mutate(percent_female = female_count / state_count)
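
The pipeline above returns one row per state. If the goal is the one stated in the question's comments (the state-level value repeated on every student row), one possible variation is to join the per-state summary back onto the original data frame. This is a sketch on hypothetical data: `toy`, `state_pct`, and `toy_joined` are illustrative names, not part of the original answer.

```r
library(dplyr)

# Hypothetical hand-made data standing in for school.data
toy <- data.frame(
  student.id = 1:6,
  state  = c("Idaho", "Idaho", "Idaho", "Idaho", "Maine", "Maine"),
  gender = c("Female", "Female", "Female", "Male", "Female", "Male")
)

# One row per state with the proportion of females
state_pct <- toy %>%
  group_by(state) %>%
  summarise(percent_female = mean(gender == "Female"))

# Join the summary back so every student row carries its state's value
toy_joined <- left_join(toy, state_pct, by = "state")

# toy_joined keeps all 6 rows: Idaho rows get 0.75, Maine rows get 0.5
```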
Nick Becker