Problem description: I'm currently extracting names from a book series. Many characters will go by nicknames, parts of names, or titles. I have a list of names that I'm using as a pattern on all of the data. The problem is that I'm getting multiple matches for full names and the parts of names. There are a total of 3000 names and variations of names that I'm running through a lot of text. The names are currently extracted in order from longest strings to shortest.

Question:

How can I ensure that, after a pattern is extracted, the text it matched is removed from the string?

What I get:

str_extract("Mr Bean and friends", pattern = fixed(c("Mr Bean", "Bean", "Mr")))  
[1] "Mr Bean" "Bean"    "Mr"     

What I want: (I know that I can't achieve this only using str_extract() or one line of code)

str_extract("Mr Bean and friends", pattern = fixed(c("Mr Bean", "Bean", "Mr")))
[1] "Mr Bean" NA NA    
  • Using `str_extract("Mr Bean and friends", pattern = "Mr Bean|Bean|Mr")` would return just "Mr Bean". Would that be a solution for your case? – Julius Vainora Feb 03 '19 at 16:22
  • The problem with that is that there are multiple things I want to match in a single string, along with multiple things that I don't want to match. That will work for all variations of Mr. Bean, but if I add in Mrs Bean, then I also want it to match her name which won't work if I compile the name list into one giant regex filled with "|"s. – Christopher Peralta Feb 03 '19 at 16:31

2 Answers

One option is to update the string iteratively. Since we want an output vector whose length equals that of the pattern vector, create an output vector to store the values. Then, after each pattern is tried, remove that pattern from the string so that later, shorter patterns cannot match the same text. The objects `pat`, `out`, and `str1` are defined under "data" below.

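library(stringr)
for (i in seq_along(pat)) {
  # extract the literal pattern if it is present (NA otherwise)
  out[i] <- str_extract(str1, pattern = fixed(pat[i]))
  # remove it as a literal string so shorter variants cannot match the same text
  str1 <- str_remove(str1, fixed(pat[i]))
}
out
#[1] "Mr Bean" NA        NA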

Or the same method with vapply, updating the copy of the initial string (str2) with <<-

unname(vapply(pat, function(p) {
  out <- str_extract(str2, fixed(p))
  # <<- modifies str2 in the enclosing environment
  str2 <<- str_remove(str2, fixed(p))
  out
}, character(1)))
#[1] "Mr Bean" NA        NA

data

# pattern vector (longest names first)
pat <- c("Mr Bean", "Bean", "Mr")
# initialize an output vector
out <- character(length(pat))
# initial string
str1 <- "Mr Bean and friends"
# copy of the initial string
str2 <- str1
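
For running the same idea over many strings, the loop above could be wrapped in a function. This is only a minimal sketch in base R rather than stringr: the name `extract_and_remove` and its arguments are hypothetical, and it assumes the patterns are literal strings already ordered from longest to shortest.

```r
# Hypothetical helper: try each literal pattern in order and remove the first
# match from the working string before trying the next pattern.
extract_and_remove <- function(string, patterns) {
  out <- rep(NA_character_, length(patterns))
  for (i in seq_along(patterns)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    if (grepl(patterns[i], string, fixed = TRUE)) {
      out[i] <- patterns[i]
      string <- sub(patterns[i], "", string, fixed = TRUE)
    }
  }
  out
}

extract_and_remove("Mr Bean and friends", c("Mr Bean", "Bean", "Mr"))
#[1] "Mr Bean" NA        NA
extract_and_remove("Mrs Bean met Mr Bean", c("Mrs Bean", "Mr Bean", "Bean", "Mr"))
#[1] "Mrs Bean" "Mr Bean"  NA         NA
```

Because the string is only modified inside the function, the caller's original string is left untouched and the function can be applied to each text in turn, for example with `lapply()`.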
akrun
  • This assumes that the strings are ordered such that if `a` is a substring of `b`, then `a` does not come before `b`. – G. Grothendieck Feb 03 '19 at 17:07
  • The strings are ordered that way, and the solution works well. The only issue is that it is computationally intensive, but I've already accepted that I'll have to run some code for a few hours. – Christopher Peralta Feb 04 '19 at 00:18

Would using pmatch work?

my_string <- "Mr Bean and friends"
my_pattern <- c("Mr Bean", "Bean", "Mr")

out <- my_pattern[pmatch(my_pattern, my_string)]
out
[1] "Mr Bean" NA        NA
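
One caveat worth checking before relying on this: `pmatch()` does partial matching from the start of each table entry, so it seems to work here only because the string begins with "Mr Bean". A small sketch (the variable name `names_to_find` is made up for illustration):

```r
names_to_find <- c("Mr Bean", "Bean", "Mr")

# The name is a prefix of the string, so the first pattern matches;
# the table entry is then used up, so "Mr" gets NA.
pmatch(names_to_find, "Mr Bean and friends")
#[1]  1 NA NA

# The name appears later in the string, so nothing matches.
pmatch(names_to_find, "Friends of Mr Bean")
#[1] NA NA NA
```

For names that can appear anywhere in the text, a substring search is probably the safer choice.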
twb10