0

I have data that looks like this:

dat <- data.frame(county_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
                  income_level = c("income1", "income2", "income3", "income4", 
                  "income5","income1", "income2", "income3", "income4", "income5"),
                  frequency = c(4, 1, 5, 11, 4, 7, 12, 0, 5, 7),
                  total = c(25, 25, 25, 25, 25, 31, 31, 31, 31, 31)
                  )

I would like to identify the median income level for each unique county (in the code above, there are two counties, but the actual dataset I am working with has thousands).

For instance, in County A the median number is the 13th observation when arranged in ascending order. This means the median for County A falls within income level 4.

For County B, the median is the 16th observation, which means the median income level for County B is income level 2.

I would like to use some version of a for loop, lapply, etc. to create a new column that would return the median income level for each unique county in the dataset (e.g., the median income bracket for County A, the median income bracket for County B, etc.)

Any guidance or suggestions would be much appreciated.

Lee
  • 1

3 Answers3

1

Something like this may help!

do.call(rbind.data.frame,lapply(unique(dat$county_id), function(x){
      country<-dat[dat$county_id == x,'frequency']
      cbind.data.frame(Country=x, median=which(cumsum(country)>ceiling(sum(country)/2))[1])
    }))
CRP
  • 405
  • 2
  • 11
1

I think @TTL has a good, if hard to read, answer. Here is another idea.

In general, you should store a variable like income_level as an ordered factor. Then you can find the median of the factor values.

To find the median here, you can 'expand' the dataset so the number of observations in each group matches the value of frequency and find the median the normal way.

One small issue is that base R doesn't define a method for finding the method of an ordered factor. The standard method is to convert the factor to numeric median(as.numeric(income_level)) (and lose the factor names). Another is to use missMethods, which defines a method for finding the median of an ordered factor.

Base R

library(missMethods)  # Defines median for ordered factor

dat$income_level <- factor(dat$income_level, 
                           levels = paste0('income', 1:5),
                           ordered = TRUE)

# Expand dataset
dat_exp <- dat[rep(1:nrow(dat), dat$frequency), ]

# Calculate medians
with(dat_exp, aggregate(list(income_level = income_level), 
                        by = list(county_id = county_id), FUN = median))

dplyr

library(missMethods)
library(dplyr)

# Make income_level an ordered factor
dat <- mutate(dat, income_level = factor(income_level, levels = paste0('income', 1:5),
                               ordered = TRUE))

# Expand and find medians
dat[rep(1:nrow(dat), dat$frequency), ] %>% 
  group_by(county_id) %>% 
  summarize(med_income_level = median(income_level))
# # A tibble: 2 x 2
#   county_id income_level
#   <chr>     <ord>       
# 1 A         income4     
# 2 B         income2 
0

You can repeat income_level frequency number of times and calculate median value.

library(dplyr)

dat %>%
  group_by(county_id) %>%
  summarise(med = median(rep(income_level, frequency)))

# county_id   med    
#  <chr>     <chr>  
#1 A         income4
#2 B         income2
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213