How does one do str_replace with a "starting with" ^ and a vector?

I am trying to remove prefixes (Mr., Ms., Dr., Capt., etc.) from a vector of names, but only where they appear at the beginning of a name. I have tried str_replace(name, prefix, ''). This removes only a few of the prefixes; most are still present. At the same time, I don't want the Dr in, say, Dr. Drake to be replaced so that the name becomes ake: Dr. Drake should become Drake.

library(stringr)  # provides str_replace()

name <- c('Mrs. Emily S', 'Dr. Richard L', 'Dr. Drake D', 'Mr. Mrdrmsmrs', 'Test Name')
prefix <- c('Dr.', 'Mr.', 'Ms.', 'Mrs.', 'Capt.')
# Wiktor Stribiżew's code
str_replace(name, paste0("^(?:", paste(prefix, collapse="|"), ")(?!\\.)"), '')

The result still has leading whitespace, but that can be removed with trimws() or stringr::str_trim().

Highland
  • Can you include multiple string samples in your question and what the expected outcomes should be? It's unclear what you're asking without this (e.g. is `Dr. Drake` supposed to be `Dr. Drake` or `Drake`?). Also, what have you tried? Please [create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) – ctwheels Nov 21 '17 at 18:44
  • If your `prefix` is something like `c("Dr", "Ms", "Mr")` you may try `str_replace(name, paste0("^(?:",paste(prefix, collapse="|"), ")(?!\\.)"), '')` – Wiktor Stribiżew Nov 21 '17 at 18:58
  • @ctwheels The result should be `Drake`. Updated the description. Sorry about that. – Highland Nov 21 '17 at 19:25
  • @WiktorStribiżew Looks like it worked! Thank you so much!!! How can I give you reputation? – Highland Nov 21 '17 at 19:31
  • @Highland can you post other string samples to test against as well? – ctwheels Nov 21 '17 at 19:31
  • @ctwheels Added to the question. – Highland Nov 21 '17 at 19:42
  • Please see the answer adjusted to the data you added to the question. Note that I added `sort.by.length.desc` related stuff just in case there is a more generic issue when you have overlapping items in the prefix character vector and when you cannot rely on any boundaries be it a `\b` word boundary or `.`. – Wiktor Stribiżew Nov 21 '17 at 20:03

1 Answer

You want to remove the strings defined in your prefix character vector only when they appear at the start of the string. Each of them contains a literal . that must be escaped in the pattern, because an unescaped . matches any character rather than a literal dot.
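To make the escaping point concrete, here is a minimal sketch using base R's sub() with hand-written patterns. The input vector `x` is a hypothetical example, not data from the question.

```r
# Hypothetical inputs: only the second one actually starts with the title "Dr."
x <- c('Drake D', 'Dr. Drake D')

# Unescaped dot: "." matches any character, so "Dra" in "Drake" is removed too
unescaped <- sub("^Dr.", "", x, perl = TRUE)
unescaped
## => [1] "ke D"     " Drake D"

# Escaped dot: only a literal "Dr." at the start of the string is removed
escaped <- sub("^Dr\\.", "", x, perl = TRUE)
escaped
## => [1] "Drake D"  " Drake D"
```

The leading space left in the second element is what trimws() cleans up afterwards.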

Use

regex.escape <- function(string) {
  gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]  ## Just in case you have overlapping items in prefix char vector

name <- c('Mrs. Emily S', 'Dr. Richard L', 'Dr. Drake D', 'Mr. Mrdrmsmrs', 'Test Name')
prefix <- c('Dr.', 'Mr.', 'Ms.', 'Mrs.', 'Capt.')
prefix <- sort.by.length.desc(prefix) ## Only matters when one prefix is the start of another (e.g. 'Mr.' and 'Mr..'); you may remove this line for the current problem
res <- trimws(gsub(paste0("^(?:",paste(regex.escape(prefix), collapse="|"), ")"), '', name, perl=TRUE))
res
## => [1] "Emily S"   "Richard L" "Drake D"   "Mrdrmsmrs" "Test Name"
## OR
## res <- trimws(str_replace(name, paste0("^(?:",paste(regex.escape(prefix), collapse="|"), ")"), ''))

See the online R demo.

Here, paste0("^(?:",paste(regex.escape(prefix), collapse="|"), ")") dynamically builds a pattern like ^(?:Mr\.|Ms\.|Dr\.|Capt\.), which breaks down as follows:

  • ^ - start of string
  • (?:Mr\.|Ms\.|Dr\.|Capt\.) - Mr., Ms., Dr., Capt., etc.
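As a rough illustration of why the prefixes are sorted by length before being joined, the sketch below builds the pattern twice for a hypothetical pair of overlapping prefixes ('Mr.' and 'Mr..'). The helper `build.pattern` and the test string are assumptions made for this example; `regex.escape` and `sort.by.length.desc` are repeated from the code above so the snippet runs on its own.

```r
# Helpers repeated from the answer so this snippet is self-contained
regex.escape <- function(string) {
  gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
sort.by.length.desc <- function(v) v[order(-nchar(v))]

# Hypothetical helper: join escaped prefixes into one start-anchored alternation
build.pattern <- function(p) {
  paste0("^(?:", paste(regex.escape(p), collapse = "|"), ")")
}

# Hypothetical overlapping prefixes: 'Mr.' is the start of 'Mr..'
overlapping <- c('Mr.', 'Mr..')

# Unsorted: the first alternative that matches wins, so one dot is left behind
unsorted.res <- sub(build.pattern(overlapping), '', 'Mr.. Smith', perl = TRUE)
unsorted.res
## => [1] ". Smith"

# Sorted by length, descending: the longer alternative is tried first
sorted.res <- sub(build.pattern(sort.by.length.desc(overlapping)), '', 'Mr.. Smith', perl = TRUE)
sorted.res
## => [1] " Smith"
```

For the prefixes in the question, none is the start of another once the dots are escaped, so both orders give the same result there.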
Wiktor Stribiżew
  • What's the purpose of the 4 and 6 backslashes, to escape other backslashes? The first `str_replace(name, paste0("^(?:", paste(prefix, collapse="|"), ")(?!\\.)"), '')` seems to work fine. Is the updated version more robust? – Highland Nov 21 '17 at 20:50
  • @Highland The point is that the `(?!\.)` negative lookahead was pointless. You have `Dr.` in the `prefix`. If you use `Dr.` unescaped in the regex pattern, it will match `Dr` followed by any char, e.g. `DrD` in `DrDrake`, because a `.` matches any char (except a line break char in PCRE/ICU regex flavors). So, you need 1) `regex.escape` and 2) `sort.by.length.desc`: if you have `Mr.` and `Mr..` and the first appears before the second, the second `.` will remain, because the first alternative inside the alternation group (`(...|...|....)`) that matches "wins" and the rest of the alternatives are not even tried. – Wiktor Stribiżew Nov 21 '17 at 20:53
  • Let's say there were suffixes in the name. Seems like adjusting the last line to `trimws(gsub(paste0("(?:",paste(regex.escape(prefix), collapse="|"), ")$"), '', name, perl="TRUE"))` works. Is that a proper method? – Highland Nov 21 '17 at 21:18
  • @Highland Yes, [it is good](https://regex101.com/r/OW08Cy/1). It will remove the substrings at the end of the string. Make sure you either add `\\b` before `(?:` or sort as I showed, by length in the descending order. – Wiktor Stribiżew Nov 21 '17 at 21:20
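
The suffix variant discussed in the last two comments could be sketched as follows. The `suffix` vector and the sample names are hypothetical and chosen only for illustration; the helpers are repeated from the answer so the snippet is self-contained.

```r
# Helpers repeated from the answer
regex.escape <- function(string) {
  gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
sort.by.length.desc <- function(v) v[order(-nchar(v))]

# Hypothetical suffixes and names, not taken from the question
suffix <- c('Jr.', 'Sr.', 'PhD')
name.with.suffix <- c('Emily S Jr.', 'Richard L PhD', 'Test Name')

# Same construction as for prefixes, but anchored at the end with $ instead of ^
suffix.pattern <- paste0("(?:", paste(regex.escape(sort.by.length.desc(suffix)), collapse = "|"), ")$")

res.suffix <- trimws(gsub(suffix.pattern, '', name.with.suffix, perl = TRUE))
res.suffix
## => [1] "Emily S"   "Richard L" "Test Name"
```

This sketch omits the `\\b` word boundary mentioned in the last comment; adding it before `(?:` would additionally require the suffix to start at a word boundary.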