Need solution for recoding large number of variables

Question

I have a dataset with the English and Spanish versions of a questionnaire. The questionnaires ask whether individuals have ever received each of a large number of different diagnoses. Each variable takes the form `prev_dx_major_depression` for the English data and `prev_dx_major_depression_span` for the Spanish data.

I would like to combine the two into a single variable. I am currently using the following code to achieve this purpose:

mutate(
  prev_dx_major_depression = if_else(
    prev_dx_major_depression == 1 | prev_dx_major_depression_span == 1,
    1, 0
  )
)

However, I know this is highly inefficient for such a large number of variables. My hunch is that I'll need to use some combination of mutate_at, recode, starts_with and ends_with. However, I am a bit stuck at this point and am not sure how to match up the corresponding variables together.

Here is some sample data:

sample_data <- 
  structure(
    list(
      id = 1:5,
      prev_dx_major_depression = c(0, 1, 1,
                                   0, 0),
      prev_dx_bipolar = c(0, 0, 0, 0, 0),
      prev_dx_generalized_anxiety = c(1,
                                      1, 0, 0, 0),
      prev_dx_major_depression_span = c(NA, NA, NA, NA,
                                        1),
      prev_dx_bipolar_span = c(NA, NA, NA, NA, NA),
      prev_dx_generalized_anxiety_span = c(NA,
                                           NA, NA, NA, 1)
    ),
    class = "data.frame",
    row.names = c(NA,-5L)
  )

It would be helpful if you could provide a sample dataset using `dput(x)`. — Nad Pat, Nov 26 '21 at 20:07
Strongly agree. If you could provide some reproducible data with 2 or 3 variable pairs and about 5 rows of data that would illustrate the problem nicely and give us something to work with. `dput(your_data[1:5, c("name_of_id_column", "prev_dx_major_depression", "prev_dx_major_depression_span", "example_column2", "example_column2_span")])` would be perfect. — Gregor Thomas, Nov 26 '21 at 20:26
Thanks for the tip - I updated the question with some sample data. — runlikeagirl, Nov 26 '21 at 20:45

score 2 · Accepted Answer · answered Nov 26 '21 at 21:41

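One option would be to:

1. Rename your variables to add the suffix `_engl` to the English data columns.
2. Convert to long format, so that we end up with one column containing the variable names and two columns holding the Spanish and English data.
3. Compute the combined value for each variable: 1 if either the English or the Spanish column is 1, otherwise 0. Using `%in% 1` instead of `== 1` treats `NA` as "not 1".
4. Convert back to wide format.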

library(dplyr)
library(tidyr)

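# 1. Add the "_engl" suffix to every column that is neither "_span" nor id
rename_with(sample_data, ~ paste0(.x, "_engl"), .cols = !c(ends_with("_span"), id)) %>% 
  # 2. Long format: split each name at its last "_" into var + engl/span columns
  pivot_longer(-id, names_to = c("var", ".value"), names_pattern = "^(.*)_(.*)$") %>% 
  # 3. Combined indicator; %in% treats NA as FALSE
  mutate(value = if_else(span %in% 1 | engl %in% 1, 1, 0)) %>% 
  select(-engl, -span) %>% 
  # 4. Back to wide format, one column per diagnosis
  pivot_wider(names_from = var, values_from = value)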
#> # A tibble: 5 × 4
#>      id prev_dx_major_depression prev_dx_bipolar prev_dx_generalized_anxiety
#>   <int>                    <dbl>           <dbl>                       <dbl>
#> 1     1                        0               0                           1
#> 2     2                        1               0                           1
#> 3     3                        1               0                           0
#> 4     4                        0               0                           0
#> 5     5                        1               0                           1
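
If reshaping feels heavy, the same pairing can be sketched in base R by matching each `_span` column to its English counterpart by name. This is only an alternative sketch, not the approach above: it assumes every Spanish column is named exactly like its English column plus `_span`, and the helper names `span_cols`, `engl_cols` and `combined` are made up for illustration.

```r
# Sample data from the question, written out compactly
sample_data <- data.frame(
  id = 1:5,
  prev_dx_major_depression = c(0, 1, 1, 0, 0),
  prev_dx_bipolar = c(0, 0, 0, 0, 0),
  prev_dx_generalized_anxiety = c(1, 1, 0, 0, 0),
  prev_dx_major_depression_span = c(NA, NA, NA, NA, 1),
  prev_dx_bipolar_span = c(NA, NA, NA, NA, NA),
  prev_dx_generalized_anxiety_span = c(NA, NA, NA, NA, 1)
)

# Assumption: each Spanish column is "<english name>_span"
span_cols <- grep("_span$", names(sample_data), value = TRUE)
engl_cols <- sub("_span$", "", span_cols)

# Start from id plus the English columns, then overwrite each English
# column with 1 if either version is 1 (NA counts as "not 1"), else 0
combined <- sample_data[c("id", engl_cols)]
combined[engl_cols] <- Map(
  function(engl, span) as.numeric(engl %in% 1 | span %in% 1),
  sample_data[engl_cols],
  sample_data[span_cols]
)

combined
```

On the sample data this gives the same values as the output above, as a plain data frame rather than a tibble.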
