1

A particular string can contain multiple instances of a pattern that I'm trying to match. For example, if my pattern is <N(.+?)N> and my string is "My name is <N Timon N> and his name is <N Pumba N>", then there are two matches. I want to replace each match with a replacement that includes an index for which match is being replaced.

So in my string "My name is <N Timon N> and his name is <N Pumba N>", I want to change the string to read "My name is [Name #1] and his name is [Name #2]".

How do I accomplish this, preferably with a single function? And preferably using functions from stringr or stringi?

acylam
bschneidr

4 Answers

3

You can do this with gregexpr and regmatches in Base R:

my_string = "My name is <N Timon N> and his name is <N Pumba N>"

# Get the positions of the matches in the string
m = gregexpr("<N(.+?)N>", my_string, perl = TRUE)

# Index each match and replace text using the indices
match_indices = 1:length(unlist(m))

regmatches(my_string, m) = list(paste0("[Name #", match_indices, "]"))

Result:

> my_string
# [1] "My name is [Name #1] and his name is [Name #2]"

Note:

This solution treats the same match as a different "Name" if it appears more than once. For example the following:

my_string = "My name is <N Timon N> and his name is <N Pumba N>, <N Timon N> again"


m = gregexpr("<N(.+?)N>", my_string, perl = TRUE)

match_indices = 1:length(unlist(m))

regmatches(my_string, m) = list(paste0("[Name #", match_indices, "]"))

outputs:

> my_string
[1] "My name is [Name #1] and his name is [Name #2], [Name #3] again"
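Since the question asks for a single function, the steps above can be wrapped as follows. This is a minimal sketch, not part of the original answer: the name `number_matches` and its `label` argument are hypothetical, and it assumes a single input string. It returns the input unchanged when nothing matches.

```r
# Hypothetical helper: wraps the gregexpr + regmatches steps shown above.
# Assumes `x` is a single string; `label` is an illustrative argument.
number_matches <- function(x, pattern, label = "Name") {
  stopifnot(is.character(x), length(x) == 1)

  # Positions of all matches in the string
  m <- gregexpr(pattern, x, perl = TRUE)

  # Number of actual matches (zero when the pattern is not found)
  n <- length(regmatches(x, m)[[1]])
  if (n == 0) return(x)

  # Replace the i-th match with "[<label> #i]"
  regmatches(x, m) <- list(paste0("[", label, " #", seq_len(n), "]"))
  x
}

number_matches("My name is <N Timon N> and his name is <N Pumba N>", "<N(.+?)N>")
# [1] "My name is [Name #1] and his name is [Name #2]"
```

As in the answer, a match that appears more than once gets a new number each time.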
acylam
  • @BIQS Thanks for the edits, but I prefer to keep things simpler in these types of questions. – acylam Nov 02 '17 at 16:20
  • I think we disagree about what's simplest, but it's of course your right to choose in your answer. I like your approach and will probably accept it as the answer depending on what else comes in. – bschneidr Nov 02 '17 at 16:24
  • @BIQS Simpler as in creating fewer intermediate variables that would clutter my workspace. I agree that your edits might be more readable, but not necessarily simpler – acylam Nov 02 '17 at 16:29
  • 1
    Fair enough. I think the final answer is simple and readable, so I'm happy with it. – bschneidr Nov 02 '17 at 16:35
  • @BIQS check out this answer with `dplyr` and `stringr` which I think you prefer – acylam Nov 02 '17 at 16:47
  • 1
    I like it. Could you submit this as a separate answer? It's a fairly different approach and it will give a different result from your original answer in common use cases. For example, the two approaches would yield different results for the string "My name is and his name is , too." The base R approach would yield "My name is [Name #1] and his name is [Name #2], too." while the tidyverse approach would yield "My name is [Name #1] and his name is [Name #1], too." – bschneidr Nov 02 '17 at 17:57
  • @BIQS Very good point, I didn't notice myself. See my updates. – acylam Nov 02 '17 at 18:02
2

Here's a solution that relies on the gsubfn and proto packages.

# Define the string to which the function will be applied
my_string <- "My name is <N Timon N> and his name is <N Pumba N>"

# Define the replacement function
replacement_fn <- function(x) {

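  # gsubfn maintains `count` inside the proto object; it holds the
  # number of the match currently being replaced
  replacement_proto_fn <- proto::proto(fun = function(this, x) {
      paste0("[Name #", count, "]")
  })

  gsubfn::gsubfn(pattern = "<N(.+?)N>",
                 replacement = replacement_proto_fn,
                 x = x)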
}

# Use the function on the string
replacement_fn(my_string)
bschneidr
  • 1
    You might be interested in the glue package: https://github.com/tidyverse/glue which has similar syntax like `"Hiya, I'm {Timon}, yo"` if I understand the vignette correctly (never used it myself). – Frank Nov 02 '17 at 15:40
  • Thanks- I think this is a good suggestion. I'd love to be able to use glue for this, but I haven't figured it out. – bschneidr Nov 02 '17 at 15:46
  • 1
    I think the count part would be a bit difficult `glue(gsub("", "{\\1}", my_string), Timon = "[Name #1]", Pumba = "[Name #2]")# My name is [Name #1] and his name is [Name #2]` – akrun Nov 02 '17 at 15:52
1

Here is a different approach with dplyr + stringr:

library(dplyr)
library(stringr)

string <- "My name is <N Timon N> and his name is <N Pumba N>"

string %>%
  str_extract_all("<N(.+?)N>") %>%
  unlist() %>%
  setNames(paste0("[Name #", 1:length(.), "]"), .) %>%
  str_replace_all(string, .)

# [1] "My name is [Name #1] and his name is [Name #2]"

Note:

This solution extracts the matches with str_extract_all, builds a named vector that maps each match to its replacement, and passes that vector to str_replace_all to search and replace accordingly.

As pointed out by OP, this solution yields different results than the gregexpr + regmatches approach in some cases. For example the following:

string = "My name is <N Timon N> and his name is <N Pumba N>, <N Timon N> again"

string %>%
  str_extract_all("<N(.+?)N>") %>%
  unlist() %>%
  setNames(paste0("[Name #", 1:length(.), "]"), .) %>%
  str_replace_all(string, .)

outputs:

[1] "My name is [Name #1] and his name is [Name #2], [Name #1] again"
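If the goal is for repeated names to share a number while distinct names are still numbered consecutively, one option is to number the unique matches and look each match up with `match()`. The sketch below is an illustration in base R rather than part of the answer above; the helper name `number_unique_matches` and its `label` argument are hypothetical, and it assumes a single input string. Because it replaces by position instead of reusing the matched text as a regex, names containing regex metacharacters should not be a concern.

```r
# Hypothetical helper: identical matches share one index.
# Assumes `x` is a single string.
number_unique_matches <- function(x, pattern, label = "Name") {
  m <- gregexpr(pattern, x, perl = TRUE)
  found <- regmatches(x, m)[[1]]
  if (length(found) == 0) return(x)

  # Index of each match among the distinct matches, in order of first appearance
  idx <- match(found, unique(found))

  regmatches(x, m) <- list(paste0("[", label, " #", idx, "]"))
  x
}

number_unique_matches(
  "My name is <N Timon N>, his is <N Pumba N>, <N Timon N> again, then <N Simba N>",
  "<N(.+?)N>"
)
# [1] "My name is [Name #1], his is [Name #2], [Name #1] again, then [Name #3]"
```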
acylam
0

Simple, maybe slow, but should work:

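library(stringi)

ct <- 1
while (TRUE) {
  old_string <- my_string
  # Replace only the first remaining match, numbering it with the counter
  my_string <- stri_replace_first_regex(my_string, '<N.*?N>',
                                        paste0('[name', ct, ']'))
  # Stop once a pass leaves the string unchanged (no matches left)
  if (old_string == my_string) break
  ct <- ct + 1
}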
  • @useR 's solution is better! –  Nov 02 '17 at 16:17
  • 2
    This approach works for the specific example in the question, but whether it works depends on the regex and the replacement. For example, if the regex is something like a space or word boundary (e.g. "\\w+") and the replacement doesn't delete the match, then the user will be stuck in an unending loop. – bschneidr Nov 02 '17 at 16:21