Capturing the range indicated by the use of "to" in a column - tidycensus (U.S. Census API)

Question

How might I capture all the years of age in a column with values like "20 to 24 years" for one group and "22 to 24 years" for another group? This will enable me to confirm I have all the working age (18-64) variable names captured in a tidycensus (R package) U.S. Census API query.

Goal

What I want is, for ages 20-24 in this example, a data frame that extracts the ages from label entries like "22 to 24 years":

MEN  WOMEN ETHNORACE
18   18    BLACK
19   19    BLACK
20   20    BLACK
21   21    BLACK
22   22    BLACK
23   23    BLACK
24         BLACK

I can then easily create a data frame that has all the ages and compare to see if any are missing.

Census variables (tidycensus)

One can see at https://api.census.gov/data/2019/acs/acs5/variables.html that at least the American Community Survey (ACS) by the U.S. Census has age range fields with varying syntax (e.g. "20 years" and "22 to 24 years"):

Example rows from tidycensus package's load_variables function

tidycensus R package version 1.1

## Example rows from tidycensus using:
library(tidycensus)
library(magrittr)
library(stringr)

v19     <- load_variables(2019, "acs5", cache = TRUE)
v19 %>% 
  dplyr::filter(
    str_detect(label, "18|20|24") & 
               concept %in% c("SEX BY AGE",
                              "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
               grepl('FEMALE', toupper(label))
                )

v19_Total_AndBlack_Age18_24 <-
  v19 %>% dplyr::filter(
  str_detect(label, "18|20|24") & 
  concept %in% c("SEX BY AGE",
                 "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
  grepl('FEMALE', toupper(label)))

 print(v19_Total_AndBlack_Age18_24)

  name        label                                      concept                                     
  <chr>       <chr>                                      <chr>                                       
1 B01001_031  Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE                                  
2 B01001_032  Estimate!!Total:!!Female:!!20 years        SEX BY AGE                                  
3 B01001_034  Estimate!!Total:!!Female:!!22 to 24 years  SEX BY AGE                                  
4 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
5 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years  SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
...

In this example, I want to make sure every age from 18-24 for the Total and Black populations is present in a dataframe like the following - notice the use of the Census API names from the above's v19_Total_AndBlack_Age18_24.

v19_Total_AndBlack_Age18_24 <-
  get_acs(
    year = 2019,
    geography = "zcta",
    variables = c(v19_Total_AndBlack_Age18_24$name)
 )

Notice that Total "22 to 24 years" compares to Black "20 to 24 years".

Let's focus on dataframe v19_Total_AndBlack_Age18_24 above, which lists out the Census API names and labels for ages 18 - 24, and aim to confirm all years are present.

I can get all of the numbers in the ages with a regular expression via:

unlist(str_extract_all(v19_Total_AndBlack_Age18_24$label,"\\d{2}"))
[1] "18" "19" "20" "22" "24" "18" "19" "20" "24"

But my attempts to group by the category are failing, and I still need to get a vector that spans the age ranges when the word "to" appears as in "20 to 24".

v19_Total_AndBlack_Age18_24_grp <- 
  v19_Total_AndBlack_Age18_24 %>% 
   mutate(EthnoRace = case_when(
   grepl('BLACK', concept) ~ "BLACK",
   TRUE ~ "TOTAL"))

v19_Total_AndBlack_Age18_24_grp %>% 
  group_by(EthnoRace) %>% 
  mutate(ages = str_extract_all(label, "\\d{2"))

Error

Error: Problem with `mutate()` column `ages`.
i `ages = str_extract_all(label, "\\d{2")`.
x Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`\d{2`)
i The error occurred in group 1: Group = "TOTAL".

I am making progress using this SO solution to concatenate regular expression matches https://stackoverflow.com/a/57059653/2305061. Hope to report full solution soon. — Rick Pack, Feb 03 '22 at 19:07
Hello Rick, I have a much easier solution which includes pre processing to wrangle data. What I would do is to first parse the column label and re create the years in a convenient way for you. That means, the queries will be easier after that. So, using string processing you can create a new column called "estimated_years", where you hold a vector of numeric values. The same for sex, gender, ethnic, etc ... — AugtPelle, Feb 04 '22 at 19:51
@AugtPelle, that is definitely in the spirit of what I want to do. You might provide a more thorough response as a solution. It would likely address my desire to "confirm I have all the working age (18-64) variable names captured in a tidycensus (R package) U.S. Census API query". — Rick Pack, Feb 07 '22 at 17:27

wibeasley · Answer 1 · 2022-02-04T21:55:20.260

I tend to do this in two steps. The first step specifies some characteristic in a metadata file. The second step applies the metadata to the problem.

It looks like you'll need to approximate some, because the levels don't cleanly fit your boundaries. For example, "15 to 19 years old" straddles 18.

(To keep things simpler, I'm assuming you'll never want to keep the "20 to 24 years" level, but exclude the "20 to 24 years old" level.

# Step 1a: create a list of poential age labels
v19 |> 
  dplyr::mutate(
    concept_age = grepl(pattern = "AGE$", concept),       # Concept must end with "AGE"
  ) |> 
  dplyr::filter(concept_age) |> 
  tidyr::separate_rows(label, sep = "!!")  |>             # Isolate the different dimensions of a variable
  dplyr::rename(level = label) |> 
  dplyr::mutate(
    level_year  = grepl(pattern = "\\byears?\\b", level), # Label must contain "year" or "years"
  ) |> 
  dplyr::filter(level_year) |> 
  dplyr::count(level, name = "variable_count") |>         # Reduce to the unique (overlapping) age levels
  dplyr::mutate(
    desired = TRUE                                        # Create variable to manually toggle in Step 2
  ) |> 
  dplyr::arrange(level) |>                                # Careful this is still a string, so "26" precedes "3"
  # View()
  readr::write_csv(path_metadata_age_label)

# Step 1b: Manual edit the 78 `desired` values in the csv & save.

Step 1 output:

level,variable_count,desired
10 to 14 years,2,TRUE
12 to 14 years,5,TRUE
12 to 17 years,5,TRUE
15 to 17 years,45,TRUE
15 to 19 years old,6,TRUE
...

# Step 2a: Read your metadata, retaining only the desired age levels.
pattern_age <-
  path_metadata_age_label |> 
  readr::read_csv() |> 
  dplyr::filter(desired) |> 
  dplyr::mutate(
    level = paste0("\\b", level, "\\b")   # Text starts & stops with a word boundary
  ) |> 
  dplyr::pull(level) |> 
  paste(collapse = "|")

# Step 2b: Apply the age levels to `v19`
v19 |> 
  dplyr::mutate(
     keep = grepl(pattern_age, label, perl = T)
  ) |> 
  dplyr::filter(keep)

Step 2 output:

   name       label                                    concept    keep 
   <chr>      <chr>                                    <chr>      <lgl>
 1 B01001_003 Estimate!!Total:!!Male:!!Under 5 years   SEX BY AGE TRUE 
 2 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years    SEX BY AGE TRUE 
 3 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years  SEX BY AGE TRUE 
 4 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years  SEX BY AGE TRUE 
...

Thank you but this appears to be addressing a different question than what I intended, @wibeasley. "Old" happens not to appear in the label values. Please consider the solution I will post momentarily. — Rick Pack, Feb 14 '22 at 20:07

score 0 · Accepted Answer · answered Feb 15 '22 at 03:06

First, for what ages, genders, and ethnoracial groups does one want data? This could be modified to only choose one gender. Gender_var needs to appear in the label column at https://api.census.gov/data/2019/acs/acs5/variables.html (or can use load_variables() like below when dataframe v19 is created).

Parameters

Only have to set these for code to run.

min_age_desired <- 18
max_age_desired <- 24
Gender_var = c("MALE", "FEMALE")
EthnoRace_var = c("BLACK","TOTAL")

Now, let's create a QC data frame with all the ethnoracial groups and age groups we need.

Load R packages

library(arsenal)
library(dplyr)
library(stringr)
library(tidyr)
library(tidycensus)

options(scipen = 8)

Validation data frame

has all the ages and ethnoracial groups one wants from the Census API

AGE_var  = as.numeric(seq(min_age_desired, max_age_desired, 1))
all_grp_qc_frm <- 
  data.frame(
    ## dupli
    expand.grid(EthnoRace_var,
                Gender_var,
                AGE_var
          )
  )

colnames(all_grp_qc_frm) <- 
  c("EthnoRace", "Gender", "AGE")

all_grp_qc_frm$AGE <- as.numeric(
  all_grp_qc_frm$AGE)
all_grp_qc_frm$EthnoRace <- as.character(
  all_grp_qc_frm$EthnoRace)
all_grp_qc_frm$Gender <- as.character(
  all_grp_qc_frm$Gender)

all_grp_qc_frm <- all_grp_qc_frm %>% 
  arrange(EthnoRace,Gender,AGE)

print(all_grp_qc_frm)

   EthnoRace Gender AGE
1      BLACK   MALE  18
2      BLACK   MALE  19
3      BLACK   MALE  20
4      BLACK   MALE  21
5      BLACK   MALE  22
6      BLACK   MALE  23
7      BLACK   MALE  24
8      BLACK FEMALE  18
9      BLACK FEMALE  19
10     BLACK FEMALE  20
11     BLACK FEMALE  21
12     BLACK FEMALE  22
13     BLACK FEMALE  23
14     BLACK FEMALE  24
15     TOTAL   MALE  18
16     TOTAL   MALE  19
17     TOTAL   MALE  20
18     TOTAL   MALE  21
19     TOTAL   MALE  22
20     TOTAL   MALE  23
21     TOTAL   MALE  24
22     TOTAL FEMALE  18
23     TOTAL FEMALE  19
24     TOTAL FEMALE  20
25     TOTAL FEMALE  21
26     TOTAL FEMALE  22
27     TOTAL FEMALE  23
28     TOTAL FEMALE  24

Load the Census variables with tidycensus

This is for the 2019 American Community Survey 5-year estimates

v19     <- load_variables(2019, "acs5", cache = TRUE)

Subset those variables to those needed

There are many variables available through the Census API.

To subset, let's first get a vector with each age from 18 to 24 separated by a pipe.

working_age_vec <- paste0(seq(18,24,1), collapse = "|")

Notice I need to use the correct concept values to get Black and the Total population across ethnoracial groups.

v19_Total_And_EthnoRace_Age18_24 <-
  v19 %>% dplyr::filter(
  str_detect(label, working_age_vec) & 
  concept %in% c("SEX BY AGE",
                 "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
  grepl('FEMALE|MALE', toupper(label)))

print(v19_Total_And_EthnoRace_Age18_24)

# A tibble: 12 x 3
   name        label                                      concept                                     
   <chr>       <chr>                                      <chr>                                       
 1 B01001_007  Estimate!!Total:!!Male:!!18 and 19 years   SEX BY AGE                                  
 2 B01001_008  Estimate!!Total:!!Male:!!20 years          SEX BY AGE                                  
 3 B01001_009  Estimate!!Total:!!Male:!!21 years          SEX BY AGE                                  
 4 B01001_010  Estimate!!Total:!!Male:!!22 to 24 years    SEX BY AGE                                  
 5 B01001_031  Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE                                  
 6 B01001_032  Estimate!!Total:!!Female:!!20 years        SEX BY AGE                                  
 7 B01001_033  Estimate!!Total:!!Female:!!21 years        SEX BY AGE                                  
 8 B01001_034  Estimate!!Total:!!Female:!!22 to 24 years  SEX BY AGE                                  
 9 B01001B_007 Estimate!!Total:!!Male:!!18 and 19 years   SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
10 B01001B_008 Estimate!!Total:!!Male:!!20 to 24 years    SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
11 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
12 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years  SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)

Pull those variables with the Census API

Census_Total_AndBlack_Age18_24 <-
  get_acs(
    year = 2019,
    geography = "zcta",
    variables = c(v19_Total_AndBlack_Age18_24$name)
  )

Get concept values like SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE) and label values like Estimate!!Total:!!Male:!!10 to 14 years.

Census_Total_AndBlack_Age18_24 <- left_join(
  Census_Total_AndBlack_Age18_24, 
  v19 %>% 
    select(name, concept, label) %>% 
    rename(variable = name)
 )

Regular expression extractions + sequence

Regular expressions to extract ages and create a vector expressing the sequence from the lowest to highest age per range.

Census_Total_AndBlack_Age18_24_grp <- 
  Census_Total_AndBlack_Age18_24 %>%
  distinct(label, concept) %>% 
    ## regular expression to extract all the numbers in labels like
    ## Estimate!!Total:!!Male:!!5 to 9 years
    mutate(ages = sapply(str_extract_all(label,"\\d{2}"),
                            function(x) paste(x,collapse=""))) %>% 
    mutate(start = str_sub(ages, 1, 2),
             end = str_sub(ages, 3, 4)) %>% 
    mutate(
      start = case_when(
            is.na(start) ~ "99",
            TRUE ~ start),
      end = case_when(
            is.na(end) ~ "99",
            TRUE ~ end)) %>% 
    dplyr::filter(grepl('Female|Male', label)) %>% 
    mutate(Gender = case_when(
      grepl('Female', label) ~ "FEMALE",
      grepl('Male', label) ~ "MALE",
      TRUE ~ "MISSING")) %>% 
    mutate(EthnoRace = case_when(
      grepl('BLACK', concept) ~ "BLACK",
      TRUE ~ "TOTAL")) %>% 
    mutate(end = case_when(
      is.na(end) | end == "" ~ start,
      TRUE ~ end))
    
    Census_Total_AndBlack_Age18_24_grp_sum <- Census_Total_AndBlack_Age18_24_grp %>% 
      group_by(EthnoRace, Gender) %>% 
      summarize(AGE = as.numeric(unlist(purrr::map2(start, end, `:`)))) %>% 
      ungroup() %>% 
      distinct(EthnoRace, Gender, AGE)

QC comparison

Remember that the desired ethnoracial groups, gender groups, and ages are printed at the top of this solution.

arsenal::comparedf(all_grp_qc_frm, Census_Total_AndBlack_Age18_24_grp_sum)

[...]

Not shared: 0 variables and 0 observations.

Differences found in 0/3 variables compared.
0 variables compared have non-identical attributes.

A note that I was not aware of the breakdown argument for tidycensus get_estimates() that I believe would make for an easier solution. Example presented at https://stackoverflow.com/questions/70925840/how-to-calculate-age-group-wise-us-population-data-in-percentage/71178524#71178524 — Rick Pack, Feb 20 '22 at 20:34