Best way to replace a variety of write-in answers with numbers using R

Question

I am using R to clean a dataset. Part of my dataset looks like:

record_id | organization | other_work_loc
1               12            CCC
2               12            AMG
3               12            TAO
4                1
5                2
6                7

other_work_loc is a free response column with highly variable entries. It only has data if organization = 12. I would like to re-categorize the organization and other_work_loc data into one column (org_cat) with three categories (1, 2, 3). Most of the other_work_loc data will be recategorized to '3.'

dataset<- dataset %>% mutate(org_cat = case_when (organization == 1 | organization == 2 ~ '1',
                                                            organization >= 3 & organization <12 ~ '2',
                                                            other_work_loc == 'CCC' | other_work_loc == AMG ~ '3'))

This code works, but there are 100 free responses in 'other_work_loc.' The majority will be recategorized as '3.' However, 22 need to be categorized as '1' or '2' and I'm wondering if there's a more elegant way than writing out how to recode each individual response?

`other_work_loc %in% c("CCC","AMG","TAO","FOO","BAR") ~ "3"`? — r2evans, May 30 '23 at 20:16

C. Murtaugh · Answer 1 · 2023-05-31T19:15:54.790

Use Excel or the like to create a dataframe with columns organization, other_work_loc and newvar, where the last two are your free response answers and their corresponding numeric replacement values - basically a look-up table. I named mine lut.csv, and it looks like this:

organization    other_work_loc  newvar
12              CCC             3
12              AMG             3
12              TAO             2
1                               1

I named your dataframe df.csv, and after loading tidyverse, used left_join to carry out the replacement:

df <- read_csv('df.csv') %>% print()
lut <- read_csv('lut.csv') %>% print()

left_join(df, lut)

Joining with `by = join_by(organization, other_work_loc)`
# A tibble: 6 x 4
  record_id organization other_work_loc newvar
      <dbl>        <dbl> <chr>           <dbl>
1         1           12 CCC                 3
2         2           12 AMG                 3
3         3           12 TAO                 2
4         4            1 NA                  1
5         5            2 NA                 NA
6         6            7 NA                 NA

Key points:

Even though I left other_work_loc blank in the LUT for organization #1, it was able to successfully match to that line of your original file, just based on organization.
I didn't fill out the entire LUT, so organizations #2 and #7 ended up with NA for newvar.
For organization #12, you can much more easily edit the LUT file to add additional free responses and their corresponding newvar entries, than write additional lines of case_when code.

Thank you both for answering my question! I ended up using the join_by option, very helpful! — Phoebe Long, Jun 05 '23 at 14:25

Best way to replace a variety of write-in answers with numbers using R

1 Answers1