I have data from a survey, where several questions are in the format

"Do you think that [xxxxxxx]"

The possible answers to the questions are in the format

"I am certain that [xxxxxxx]" "I think it is possible that [xxxxxx]" "I don't know if [xxxxxx]"

and so on.

I would now like to recode these factors so that "I am certain" = 1, "I think it is possible" = 2 and so on. I have been playing with dplyr::recode but it does not seem to work with regular expressions.

For example:

set.seed(12345)

possible_answers <- c(
    "I am certain that", "I think it is possible that",
    "I don't know if is possible that", "I think it is not possible that",
    "I am certain that it is not possible that", "It is impossible for me to know if"
)

num_answers <- 10
survey <- data.frame(
    Q1 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 1"
    ),
    Q2 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 2"
    ),
    Q3 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 3"
    ),
    Q4 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 4"
    ),
    Q5 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 5"
    )
)

I can do something like

survey %>% 
    mutate_at("Q1", recode,
                "I am certain that topic 1" = 1,
                "I think it is possible that topic 1" = 2,
                "I don't know if is possible that topic 1" = 3,
                "I think it is not possible that topic 1" = 4,
                "I am certain that it is not possible that topic 1" = 5,
                "It is impossible for me to know if topic 1" = 6)

but doing it for all questions would be cumbersome.

I would like to do

survey %>% 
    mutate_at(vars(starts_with("Q")), recode,
                "I am certain that (.*)" = 1,
                "I think it is possible that (.*)" = 2,
                "I don't know if is possible that (.*)" = 3,
                "I think it is not possible that (.*)" = 4,
                "I am certain that it is not possible that (.*)" = 5,
                "It is impossible for me to know if (.*)" = 6)

But this changes everything to NA, because it does not see the strings as regular expressions.

user438383
nico
  • Why not use [`base::startsWith()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html)? – SamR Jun 23 '23 at 12:17
  • @SamR I am not sure how that would solve the problem, could you provide an example? – nico Jun 23 '23 at 12:29
  • Exactly the same way as the answer by Dave Armstrong, just with a fixed string rather than regex, which should be slightly faster. The important caveat about order still applies. – SamR Jun 23 '23 at 12:36
  • The problem with the fixed string is that I have to create a different case for each question and with >20 questions it becomes too messy. – nico Jun 23 '23 at 12:51
  • It's only the start of the string that is fixed. See the docs linked above: `startsWith()` is equivalent to but much faster than `substring(x, 1, nchar(prefix)) == prefix` – SamR Jun 23 '23 at 12:52
  • OK, I see it would work in that case. Unfortunately, in some cases, I still need regular expression because the strings are a bit more complex than the example I gave, and it's not just the start that is common. Thanks anyway for pointing that out – nico Jun 23 '23 at 13:54
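
The `startsWith()` suggestion from the comments could be sketched roughly as follows, for the simple case where only the start of each answer is fixed. The helper name `recode_prefix` and the `prefixes` lookup vector are made up for illustration and are not part of any package. The longer "certain ... not possible" prefix is listed first so that the shorter "I am certain that" prefix does not claim it.

```r
# Hypothetical lookup: names are the fixed prefixes, values are the codes.
# Order matters: the more specific prefix must come before the general one.
prefixes <- c(
  "I am certain that it is not possible that" = 5,
  "I am certain that" = 1,
  "I think it is possible that" = 2,
  "I don't know if is possible that" = 3,
  "I think it is not possible that" = 4,
  "It is impossible for me to know if" = 6
)

# Hypothetical helper: assign the code of the first matching prefix.
recode_prefix <- function(x, prefixes) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(prefixes)) {
    # Only fill entries that have not been matched by an earlier prefix
    hit <- which(is.na(out) & startsWith(x, names(prefixes)[i]))
    out[hit] <- prefixes[[i]]
  }
  out
}

codes <- recode_prefix(
  c("I am certain that topic 1",
    "I am certain that it is not possible that topic 2",
    "I don't know if is possible that topic 3"),
  prefixes
)
codes
#> [1] 1 5 3
```

Answers that match none of the prefixes stay `NA`. Applying it to every question column would be something like `survey[] <- lapply(survey, recode_prefix, prefixes)`, assuming the columns are character vectors.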

2 Answers

Without the data I can't test, but you should be able to use mutate(across(...)) with case_when() to do this. Note that since "I am certain that" will also match "I am certain that it is not possible", you need to do the latter first so that the search for "I am certain" only catches the positive cases.

library(dplyr)

survey %>% 
  mutate(across(starts_with("Q"), 
                ~case_when(
                  grepl("I am certain that it is not possible that", .x) ~ 5,
                  grepl("I am certain that", .x) ~ 1, 
                  grepl("I think it is possible that", .x) ~ 2, 
                  grepl("I don't know if is possible that", .x) ~ 3, 
                  grepl("I think it is not possible that", .x) ~ 4,
                  grepl("It is impossible for me to know if", .x) ~ 6)))
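
To see why the order of the conditions matters, here is a small base R check on two made-up answers (the `answers` vector is only an illustration). The shorter pattern matches both strings, so it has to be tested after the longer one.

```r
answers <- c(
  "I am certain that topic 1",
  "I am certain that it is not possible that topic 1"
)

# The short pattern also matches the negative answer ...
broad <- grepl("I am certain that", answers)
broad
#> [1] TRUE TRUE

# ... while the long pattern matches only the negative answer
narrow <- grepl("I am certain that it is not possible that", answers)
narrow
#> [1] FALSE  TRUE
```

Since `case_when()` uses the first condition that is `TRUE`, putting the long pattern first gives the negative answer its own code before the short pattern gets a chance to match it.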
nico
DaveArmstrong

Another option is to first cut the "topic X" off the end of each string and then recode all variables in one go with `recode()`:

library(dplyr)
library(stringr)


recode_vec <- setNames(as.character(1:6), possible_answers)

survey |> 
  mutate(across(starts_with("Q"),
                \(x) {
                  # \\d+ also handles question numbers with more than one digit
                  str_replace_all(x,
                                  "(.*)\\stopic\\s\\d+$",
                                  "\\1") |> 
                  recode(!!! recode_vec)
                }
                )
         )
#>    Q1 Q2 Q3 Q4 Q5
#> 1   6  6  4  3  1
#> 2   3  6  3  4  1
#> 3   2  2  1  1  5
#> 4   4  1  6  1  2
#> 5   2  6  5  4  2
#> 6   5  6  4  3  3
#> 7   3  1  2  6  3
#> 8   2  4  6  1  5
#> 9   6  4  2  5  3
#> 10  3  2  4  3  1
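
As a variation on the same idea, a base R sketch: strip the suffix with `sub()` and look up the position of what remains with `match()`. This assumes, as in the example, that the desired code is simply the position of the answer in `possible_answers`, and that every answer ends in " topic <number>". The vector `x` is made up for illustration.

```r
possible_answers <- c(
  "I am certain that", "I think it is possible that",
  "I don't know if is possible that", "I think it is not possible that",
  "I am certain that it is not possible that", "It is impossible for me to know if"
)

x <- c(
  "I think it is possible that topic 2",
  "It is impossible for me to know if topic 12"
)

# Drop the trailing " topic <number>", then look up the answer's position
stem <- sub("\\stopic\\s\\d+$", "", x)
codes <- match(stem, possible_answers)
codes
#> [1] 2 6
```

Because the whole remaining string is compared, "I am certain that" and "I am certain that it is not possible that" cannot be confused, so the order of `possible_answers` only determines the codes.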

Data from OP

set.seed(12345)

possible_answers <- c(
  "I am certain that", "I think it is possible that",
  "I don't know if is possible that", "I think it is not possible that",
  "I am certain that it is not possible that", "It is impossible for me to know if"
)

num_answers <- 10
survey <- data.frame(
  Q1 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 1"
  ),
  Q2 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 2"
  ),
  Q3 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 3"
  ),
  Q4 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 4"
  ),
  Q5 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 5"
  )
)

Created on 2023-06-23 by the reprex package (v2.0.1)

TimTeaFan