2

I am cleaning demographic data that has been submitted by 10+ school districts and the submissions are not standardized/uniform. I would like to find patterns and recode them so that the data is clean and simple.

Let's say I have a variable called Race, and one of the categories is Native Hawaiian - Pacific Islander.

School A submits this category as Native Hawaiian or Other Pacific Islander. School B submits this category as Native Hawaiian/Pacific Islander. School C submits this category as Native Hawaiian or Pacific Islander.

How could I recode this such that if R sees the word Pacific anywhere in the variable, it will recode to Native Hawaiian - Pacific Islander?

Here is the original data:

df_original <- data.frame(Race=c("Native Hawaiian or Other Pacific Islander",
                                 "Native Hawaiian/Pacific Islander", "Native Hawaiian or Pacific Islander",
                                 "Black or African American", "Black", "Black/African American"))

Here is the ideal cleaned data:

df_desired <- data.frame(Race=c("Native Hawaiian - Pacific Islander","Native Hawaiian - Pacific Islander",
                                "Native Hawaiian - Pacific Islander","Black - African American",
                                "Black - African American","Black - African American"))
missgwolf
  • 356
  • 1
  • 11

2 Answers2

3

grepl() will return TRUE for strings that contain "Pacific" and False otherwise. Use that to subset your vector and replace with the string you want:

df_original$Race[grepl("Pacific", df_original$Race)] <- "Native Hawaiian - Pacific Islander"
Brigadeiro
  • 2,649
  • 13
  • 30
  • 1
    Thank you! This is particularly simple and useful because of the `ignore.case=TRUE` argument that can be added to `grepl`. – missgwolf Aug 18 '21 at 19:04
2

Using str_detect with case_when

library(dplyr)
library(stringr)
df_original %>% 
    mutate(Race2 = case_when(str_detect(Race, '\\bPacific\\b') ~
      "Native Hawaiian - Pacific Islander", 
            TRUE ~ "Black - African American"))

-output

                                Race                              Race2
1 Native Hawaiian or Other Pacific Islander Native Hawaiian - Pacific Islander
2          Native Hawaiian/Pacific Islander Native Hawaiian - Pacific Islander
3       Native Hawaiian or Pacific Islander Native Hawaiian - Pacific Islander
4                 Black or African American           Black - African American
5                                     Black           Black - African American
6                    Black/African American           Black - African American

Another option is to create a key/value dataset with the pattern and its corresponding value to replace and then do a regex_left_join (from fuzzyjoin) with the original data

library(fuzzyjoin)
keydat <- tibble(Race = c("Pacific", "Black"), 
   Race2 = c("Native Hawaiian - Pacific Islander", "Black - African American"))
regex_left_join(df_original, keydat) %>%
        transmute(Race = Race2)
#Joining by: "Race"
#                                Race
#1 Native Hawaiian - Pacific Islander
#2 Native Hawaiian - Pacific Islander
#3 Native Hawaiian - Pacific Islander
#4           Black - African American
#5           Black - African American
#6           Black - African American
akrun
  • 874,273
  • 37
  • 540
  • 662