How might I capture all the years of age in a column with values like "20 to 24 years" for one group and "22 to 24 years" for another group? This will enable me to confirm I have all the working age (18-64) variable names captured in a tidycensus (R package) U.S. Census API query.
Goal
What I want is, for ages 20-24 in this example, a data frame that extracts the ages from label entries like "22 to 24 years"
:
MEN WOMEN ETHNORACE
18 18 BLACK
19 19 BLACK
20 20 BLACK
21 21 BLACK
22 22 BLACK
23 23 BLACK
24 BLACK
I can then easily create a data frame that has all the ages and compare to see if any are missing.
Census variables (tidycensus)
One can see at https://api.census.gov/data/2019/acs/acs5/variables.html that at least the American Community Survey (ACS) by the U.S. Census has age range fields with varying syntax (e.g. "20 years" and "22 to 24 years"):
Example rows from tidycensus package's load_variables function
tidycensus R package version 1.1
## Example rows from tidycensus using:
library(tidycensus)
library(magrittr)
library(stringr)
v19 <- load_variables(2019, "acs5", cache = TRUE)
v19 %>%
dplyr::filter(
str_detect(label, "18|20|24") &
concept %in% c("SEX BY AGE",
"SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
grepl('FEMALE', toupper(label))
)
v19_Total_AndBlack_Age18_24 <-
v19 %>% dplyr::filter(
str_detect(label, "18|20|24") &
concept %in% c("SEX BY AGE",
"SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
grepl('FEMALE', toupper(label)))
print(v19_Total_AndBlack_Age18_24)
name label concept
<chr> <chr> <chr>
1 B01001_031 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE
2 B01001_032 Estimate!!Total:!!Female:!!20 years SEX BY AGE
3 B01001_034 Estimate!!Total:!!Female:!!22 to 24 years SEX BY AGE
4 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
5 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
...
In this example, I want to make sure every age from 18-24 for the Total and Black populations is present in a dataframe like the following - notice the use of the Census API names from the above's v19_Total_AndBlack_Age18_24
.
v19_Total_AndBlack_Age18_24 <-
get_acs(
year = 2019,
geography = "zcta",
variables = c(v19_Total_AndBlack_Age18_24$name)
)
Notice that Total "22 to 24 years" compares to Black "20 to 24 years".
Let's focus on dataframe v19_Total_AndBlack_Age18_24
above, which lists out the Census API names and labels for ages 18 - 24, and aim to confirm all years are present.
I can get all of the numbers in the ages with a regular expression via:
unlist(str_extract_all(v19_Total_AndBlack_Age18_24$label,"\\d{2}"))
[1] "18" "19" "20" "22" "24" "18" "19" "20" "24"
But my attempts to group by the category are failing, and I still need to get a vector that spans the age ranges when the word "to" appears as in "20 to 24".
v19_Total_AndBlack_Age18_24_grp <-
v19_Total_AndBlack_Age18_24 %>%
mutate(EthnoRace = case_when(
grepl('BLACK', concept) ~ "BLACK",
TRUE ~ "TOTAL"))
v19_Total_AndBlack_Age18_24_grp %>%
group_by(EthnoRace) %>%
mutate(ages = str_extract_all(label, "\\d{2"))
Error
Error: Problem with `mutate()` column `ages`.
i `ages = str_extract_all(label, "\\d{2")`.
x Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`\d{2`)
i The error occurred in group 1: Group = "TOTAL".