I have a set of strings I need to manipulate. For each string, if it ends with one of a given set of substrings, I want to keep only that substring; otherwise the string should be left untouched.

Here follows an example:

keep <- c("USA","UNITED STATES")
keep <- paste0(paste0(" ",keep,"$"),collapse="|")

data <- c("DETROIT","DETROIT USA","DETROIT UNITED STATES")
expected_result <- c("DETROIT","USA","UNITED STATES")

MCS

2 Answers

You can use

data <- c("DETROIT","DETROIT USA","DETROIT UNITED STATES")
keep <- c("USA","UNITED STATES")

regex <- paste0(".*\\s*\\b(",paste0(keep,collapse="|"), ")\\b")
sub(regex, "\\1", data)
## => [1] "DETROIT"       "USA"           "UNITED STATES"

See the R demo online.

The regex is .*\s*\b(USA|UNITED STATES)\b, see its online demo.

Details:

  • .* - zero or more characters, as many as possible (greedy)
  • \s* - zero or more whitespace characters
  • \b(USA|UNITED STATES)\b - the whole word USA or UNITED STATES, captured into Group 1 (referenced as \1 in the replacement pattern).
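
The pattern above is not anchored to the end of the string, whereas the question's own pattern uses $. If the keyword should only count when it is the last thing in the string, a variant along the following lines might be closer to the intent. This is a minimal sketch, not the answer's original code: the helper name keep_trailing and the extra test string "DETROIT USA TODAY" are made up for illustration, and it assumes the keywords contain no regex metacharacters.

```r
# Hypothetical helper (not from the original answer): keep only a trailing
# keyword when one is present, otherwise return the string unchanged.
keep_trailing <- function(x, keep) {
  # Builds e.g. ".*\\b(USA|UNITED STATES)$":
  # a greedy prefix, a word boundary, then one keyword at the very end.
  regex <- paste0(".*\\b(", paste0(keep, collapse = "|"), ")$")
  sub(regex, "\\1", x)
}

data <- c("DETROIT", "DETROIT USA", "DETROIT UNITED STATES", "DETROIT USA TODAY")
keep_trailing(data, c("USA", "UNITED STATES"))
## => [1] "DETROIT"           "USA"               "UNITED STATES"     "DETROIT USA TODAY"
```

With the unanchored pattern, the last element would instead become "USA TODAY", because the match stops after USA and the rest of the string is left in place.
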
Wiktor Stribiżew

You could use str_extract to extract the pattern if it is present. It returns NA where the pattern is missing, and you can then replace those NA values with the original data.

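data <- c("DETROIT","DETROIT USA","DETROIT UNITED STATES")
keep <- c("USA","UNITED STATES")
# Builds " USA$| UNITED STATES$": a keyword preceded by a space at the end of the string
keep <- paste0(paste0(" ",keep,"$"),collapse="|")

result <- stringr::str_extract(data, keep)    # NA where nothing matched
result[is.na(result)] <- data[is.na(result)]  # fall back to the original string
trimws(result)                                # drop the leading space matched by the pattern
#[1] "DETROIT"       "USA"           "UNITED STATES"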
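
If adding stringr as a dependency is not desirable, the same extract-then-fall-back idea can probably be expressed with base R's regexpr and regmatches. The following is a sketch under that assumption rather than part of the answer above; the variable names pattern and m are illustrative, and a word boundary is used instead of a leading space so that no trimws call is needed.

```r
data <- c("DETROIT", "DETROIT USA", "DETROIT UNITED STATES")
keep <- c("USA", "UNITED STATES")

# "\\b(USA|UNITED STATES)$": a whole keyword at the end of the string
pattern <- paste0("\\b(", paste0(keep, collapse = "|"), ")$")

result <- data                 # start from the original strings
m <- regexpr(pattern, data)    # match positions, -1 where there is no match
# regmatches() returns only the substrings that matched, in order,
# so they line up with the elements where m != -1.
result[m != -1] <- regmatches(data, m)
result
#[1] "DETROIT"       "USA"           "UNITED STATES"
```
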
Ronak Shah