Remove text before an array of subtexts

Question

I have a set of strings I need to manipulate. If a string ends with one of a given set of substrings, I want to keep only that substring; otherwise the string should be left untouched.

Here follows an example:

keep <- c("USA","UNITED STATES")
keep <- paste0(paste0(" ",keep,"$"),collapse="|")

data <- c("DETROIT","DETROIT USA","DETROIT UNITED STATES")
expected_result <- c("DETROIT","USA","UNITED STATES")

score 2 · Accepted Answer · answered Feb 16 '21 at 10:52

You can use sub with a capturing group:

data <- c("DETROIT","DETROIT USA","DETROIT UNITED STATES")
keep <- c("USA","UNITED STATES")

regex <- paste0(".*\\s*\\b(",paste0(keep,collapse="|"), ")\\b")
sub(regex, "\\1", data)
## => [1] "DETROIT"       "USA"           "UNITED STATES"

The regex is .*\s*\b(USA|UNITED STATES)\b

Details:

.* - zero or more characters, as many as possible (greedy)
\s* - zero or more whitespace characters
\b(USA|UNITED STATES)\b - the whole word USA or UNITED STATES, captured into Group 1 (\1 in the replacement pattern).
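To make the details above concrete, here is a minimal sketch that applies the same pattern to two extra inputs that are not in the question ("USA" on its own and "DETROIT USAF"). Both inputs are assumptions added only to illustrate the empty prefix match and the effect of the trailing \b.

```r
keep <- c("USA", "UNITED STATES")

# Same construction as in the answer: greedy prefix, optional whitespace,
# then one of the keep words as a whole word in capture group 1
regex <- paste0(".*\\s*\\b(", paste0(keep, collapse = "|"), ")\\b")

# The last two elements are hypothetical inputs added for illustration
data <- c("DETROIT", "DETROIT USA", "DETROIT UNITED STATES", "USA", "DETROIT USAF")

# sub() replaces the matched part with group 1; strings with no match are returned unchanged
result <- sub(regex, "\\1", data)
print(result)
## => [1] "DETROIT"       "USA"           "UNITED STATES" "USA"           "DETROIT USAF"
```

"USA" alone still matches because .* and \s* can both match nothing, while "DETROIT USAF" is left as is because there is no word boundary between "USA" and "F".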

score 1 · Answer 2 · answered Feb 16 '21 at 10:35

You could use str_extract to extract the pattern if it is present. It returns NA when the pattern is missing, and you can then replace those NA values with the original data.

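data <- c("DETROIT","DETROIT USA","DETROIT UNITED STATES")
keep <- c("USA","UNITED STATES")
# Build " USA$| UNITED STATES$": each substring preceded by a space, anchored at the end
keep <- paste0(paste0(" ",keep,"$"),collapse="|")

# str_extract returns the matched text, or NA where nothing matches
result <- stringr::str_extract(data, keep)
# Fall back to the original string where there was no match
result[is.na(result)] <- data[is.na(result)]
# Drop the leading space that is part of the match
trimws(result)
#[1] "DETROIT"       "USA"           "UNITED STATES"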
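If you would rather not depend on stringr, the same extract-then-fall-back idea can be sketched in base R with regexpr and regmatches. This is an alternative of my own under that assumption, not part of the answer above; the variable names pattern, m, and hit are illustrative.

```r
data <- c("DETROIT", "DETROIT USA", "DETROIT UNITED STATES")
keep <- c("USA", "UNITED STATES")

# Same pattern as above: " USA$| UNITED STATES$"
pattern <- paste0(paste0(" ", keep, "$"), collapse = "|")

m <- regexpr(pattern, data)    # match positions, -1 where nothing matches
hit <- grepl(pattern, data)    # TRUE for the elements that matched

result <- data                         # start from the untouched input
result[hit] <- regmatches(data, m)     # regmatches returns only the matched pieces
result <- trimws(result)               # drop the leading space from each match
print(result)
#[1] "DETROIT"       "USA"           "UNITED STATES"
```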
